scala – Dropping a nested column from a Spark DataFrame
I have a DataFrame with the schema
root
 |-- label: string (nullable = true)
 |-- features: struct (nullable = true)
 |    |-- feat1: string (nullable = true)
 |    |-- feat2: string (nullable = true)
 |    |-- feat3: string (nullable = true)

While I can filter the DataFrame with

val data = rawData
  .filter(!(rawData("features.feat1") <=> "100"))

I am unable to drop the column:

val data = rawData
  .drop("features.feat1")

Is it something I am doing wrong here? I also tried (unsuccessfully) drop(rawData("features.feat1")), though it does not make much sense to do so.

Thanks in advance,

Nikhil

Solution
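For context, here is a minimal, hypothetical sketch (not part of the original answer) of why the drop call appears to do nothing: DataFrame.drop only resolves top-level column names, so a nested path like "features.feat1" is silently ignored and the schema comes back unchanged. The rawData built below is a stand-in with the same shape as the schema above, assuming a spark-shell session where spark.implicits._ is already in scope.

import org.apache.spark.sql.functions.struct

// Hypothetical stand-in for the rawData DataFrame from the question.
val rawData = Seq(("a_label", "100", "f2", "f3"))
  .toDF("label", "feat1", "feat2", "feat3")
  .select($"label", struct($"feat1", $"feat2", $"feat3").alias("features"))

// drop matches only top-level column names; the nested path is ignored,
// so the result still contains features.feat1.
rawData.drop("features.feat1").schema == rawData.schema   // true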
This is just a programming exercise, but you can try something like this:
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.types.{StructType, StructField}
import org.apache.spark.sql.{functions => f}
import scala.util.Try

case class DFWithDropFrom(df: DataFrame) {

  // Find the top-level field with the given name, if it exists.
  def getSourceField(source: String): Try[StructField] = {
    Try(df.schema.fields.filter(_.name == source).head)
  }

  // Succeeds only if that field is a struct.
  def getType(sourceField: StructField): Try[StructType] = {
    Try(sourceField.dataType.asInstanceOf[StructType])
  }

  // Rebuild the struct column from the remaining nested field names.
  def genOutputCol(names: Array[String], source: String): Column = {
    f.struct(names.map(x => f.col(source).getItem(x).alias(x)): _*)
  }

  // Replace the struct column with a copy that omits the dropped fields;
  // if the column is missing or not a struct, return the DataFrame unchanged.
  def dropFrom(source: String, toDrop: Array[String]): DataFrame = {
    getSourceField(source)
      .flatMap(getType)
      .map(_.fieldNames.diff(toDrop))
      .map(genOutputCol(_, source))
      .map(df.withColumn(source, _))
      .getOrElse(df)
  }
}

Example usage:

scala> case class features(feat1: String, feat2: String, feat3: String)
defined class features

scala> case class record(label: String, features: features)
defined class record

scala> val df = sc.parallelize(Seq(record("a_label", features("f1", "f2", "f3")))).toDF
df: org.apache.spark.sql.DataFrame = [label: string, features: struct<feat1:string,feat2:string,feat3:string>]

scala> DFWithDropFrom(df).dropFrom("features", Array("feat1")).show
+-------+--------+
|  label|features|
+-------+--------+
|a_label| [f2,f3]|
+-------+--------+

scala> DFWithDropFrom(df).dropFrom("foobar", Array("feat1")).show
+-------+----------+
|  label|  features|
+-------+----------+
|a_label|[f1,f2,f3]|
+-------+----------+

scala> DFWithDropFrom(df).dropFrom("features", Array("foobar")).show
+-------+----------+
|  label|  features|
+-------+----------+
|a_label|[f1,f2,f3]|
+-------+----------+

Add an implicit conversion and you're good to go.
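As a sketch of that last step (the conversion name below is made up, not from the original post), an implicit conversion from DataFrame to DFWithDropFrom lets dropFrom be called as if it were a DataFrame method:

import scala.language.implicitConversions
import org.apache.spark.sql.DataFrame

// Hypothetical helper: wraps any DataFrame in DFWithDropFrom so that
// dropFrom can be called directly on it.
implicit def toDFWithDropFrom(df: DataFrame): DFWithDropFrom = DFWithDropFrom(df)

With that conversion in scope, df.dropFrom("features", Array("feat1")).show behaves exactly like the explicit DFWithDropFrom(df) calls above.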