scala – Spark SQL nested withColumn
I have a DataFrame with several columns, some of which are structs. Something like this:
root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

I want to apply a UserDefinedFunction to the column baz, replacing baz with a function of baz, but I cannot figure out how to do that. Here is an example of the desired output (note that baz is now an int):

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: int (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

It looks like DataFrame.withColumn only works on top-level columns, not on nested ones. I'm using Scala for this. Can someone help me out with this?

Thanks
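For reference, here is a minimal sketch (not part of the original question) that builds a DataFrame with this shape. The case class names are made up, and it assumes a spark-shell / SparkSession with spark.implicits._ in scope:

// Hypothetical case classes that mirror the schema above (nullability may differ slightly)
case class Foo(bar: String, baz: String)
case class Def(a: String, b: Int, c: String)
case class Elem(`def`: Def)
case class Sample(foo: Foo, abc: Seq[Elem])

val sample = Seq(
  Sample(Foo("Hi", "There"), Seq(Elem(Def("a", 1, "c"))))
).toDF()

sample.printSchema   // prints a tree like the one shown above (baz still a string)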
Solution

That is easy: just use a dot to select a nested field, e.g. $"foo.baz":
// assumes a SparkSession named spark (e.g. in spark-shell)
import org.apache.spark.sql.functions.{col, struct, udf}
import spark.implicits._

case class Foo(bar: String, baz: String)
case class Record(foo: Foo)

val df = Seq(
  Record(Foo("Hi", "There"))
).toDF()

df.printSchema

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)

val myUDF = udf((s: String) => {
  // do something with s
  s.toUpperCase
})

df
  .withColumn("udfResult", myUDF($"foo.baz"))
  .show

+----------+---------+
|       foo|udfResult|
+----------+---------+
|[Hi,There]|    THERE|
+----------+---------+

If you want to add the result of the UDF to the existing struct foo, i.e. to get:

root
 |-- foo: struct (nullable = false)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |    |-- udfResult: string (nullable = true)

there are two options.

With withColumn:

df
  .withColumn("udfResult", myUDF($"foo.baz"))
  .withColumn("foo", struct($"foo.*", $"udfResult"))
  .drop($"udfResult")

With select:

df
  .select(struct($"foo.*", myUDF($"foo.baz").as("udfResult")).as("foo"))

EDIT:

Replacing an existing field inside the struct (i.e. overwriting foo.baz with the result of the UDF) is harder, because withColumn does not address nested fields, so the following does not do what you want:

df
  .withColumn("foo.baz", myUDF($"foo.baz"))

But it can be done like this:

// get all columns of foo except foo.baz
val structCols = df.select($"foo.*")
  .columns
  .filter(_ != "baz")
  .map(name => col("foo." + name))

df.withColumn(
  "foo",
  struct((structCols :+ myUDF($"foo.baz").as("baz")): _*)
)
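As a side note, on Spark 3.1 and later the same replacement can be expressed more directly with Column.withField. This is not part of the original answer, just a sketch reusing the df and myUDF defined above:

// Spark 3.1+ only: rebuild foo with baz replaced, without listing the other fields by hand
val replaced = df.withColumn("foo", col("foo").withField("baz", myUDF(col("foo.baz"))))

replaced.select($"foo.baz").show()   // shows the uppercased value

This avoids having to enumerate the untouched fields of foo yourself, which is the main inconvenience of the struct-rebuilding workaround shown above.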