scala – How to detect whether a Spark DataFrame has a column
When I create a DataFrame from a JSON file in Spark SQL, how can I tell whether a given column exists before calling .select?
For example, given the JSON schema

{ "a": { "b": 1, "c": 2 } }

this is what I would like to do:

potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))

However, I cannot find a good function for hasColumn. The closest I have come is testing whether the column is in this somewhat awkward array:

scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)

Solution
Just assume the column exists and let it fail with Try. Plain and simple, and it supports arbitrary nesting:
import scala.util.Try
import org.apache.spark.sql.DataFrame

// Resolving the path with df(path) throws when the column does not exist,
// so a successful Try means the column is present.
def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess

val df = sqlContext.read.json(sc.parallelize(
  """{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))

hasColumn(df, "foobar") // Boolean = false
hasColumn(df, "foo") // Boolean = true
hasColumn(df, "foo.bar") // Boolean = true
hasColumn(df, "foo.bar.foobar") // Boolean = true
hasColumn(df, "foo.bar.foobaz") // Boolean = false

Or even simpler:

val columns = Seq(
  "foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")

// Keep only the columns that resolve successfully.
columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
//   foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)

Python equivalent:

from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row

# Same idea: column resolution raises AnalysisException for a missing path.
def has_column(df, col):
    try:
        df[col]
        return True
    except AnalysisException:
        return False

df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()

has_column(df, "foobar")
## False
has_column(df, "foo")
## True
has_column(df, "foo.bar")
## True
has_column(df, "foo.bar.foobar")
## True
has_column(df, "foo.bar.foobaz")
## False
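To tie this back to the original question, here is a minimal sketch of how hasColumn could be used to select only the nested fields that actually exist. The hasColumn helper and the potential_columns list come from the question and answer above; the filter/select wiring and the inline JSON literal (which mirrors the question's a.b / a.c schema) are illustrative assumptions, written for a spark-shell session where sc and sqlContext are available, as in the answer.

import scala.util.Try
import org.apache.spark.sql.DataFrame

// Same helper as in the answer: resolution succeeds only if the path exists.
def hasColumn(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess

// Hypothetical input mirroring the question's schema {"a": {"b": 1, "c": 2}}.
val df = sqlContext.read.json(sc.parallelize("""{"a": {"b": 1, "c": 2}}""" :: Nil))

val potential_columns = Seq("b", "c", "d")

// Keep only the candidates that resolve under "a", then select them in one call.
val existing = potential_columns.filter(c => hasColumn(df, s"a.$c"))
// existing: Seq[String] = List(b, c) -- "d" is silently dropped

val selected = df.select(existing.map(c => df(s"a.$c")): _*)
// selected.columns: Array[String] = Array(b, c)

Filtering the column list first and issuing a single select keeps the plan simple, instead of calling select once per candidate as in the question's original map.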