scala – How do I read the output of the show operator back into a Dataset?
Published: 2020-12-16 18:14:42
Suppose we have the following text file (the output of a df.show() command):
```
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+
```

Now I want to parse it back into a DataFrame / Dataset. What is the most "Spark-ish" way to do this?

P.S. I am interested in solutions for both Scala and PySpark, which is why both tags are used.

Solution
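Before the Spark-based solutions, it helps to see why a "|"-delimited reader needs post-processing: every row in the table above starts and ends with "|", so a CSV parse yields empty fields at both edges, and Spark names such header-less columns _c0, _c1, and so on, which is why the code below filters out columns whose name starts with "_c". A small pure-Python illustration of the edge-field effect:

```python
import csv
import io

# One data line from the table above, read as a '|'-delimited CSV record:
line = "|   1|pi number|3.141592|"
fields = next(csv.reader(io.StringIO(line), delimiter="|"))

# The leading and trailing '|' produce empty fields at both edges; these
# become Spark's unnamed placeholder columns (_c0, _cN).
print(fields)  # ['', '   1', 'pi number', '3.141592', '']
```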
Update: using the "univocity" parser library I can drop one line, the one that stripped whitespace from the column names:
Scala:

```scala
// Read a fixed-width table printed by df.show():
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
  val t = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("parserLib", "UNIVOCITY")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("comment", "+")
    .csv(filePath)

  // drop the empty placeholder columns (_c0, _cN) created by the edge "|" delimiters
  t.select(t.columns.filterNot(_.startsWith("_c")).map(t(_)): _*)
}
```

PySpark:

```python
def read_spark_output(file_path):
    t = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", "|") \
        .option("parserLib", "UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("comment", "+") \
        .csv(file_path)
    # drop the empty placeholder columns created by the edge "|" delimiters
    return t.select([c for c in t.columns if not c.startswith("_")])
```

Usage example:

```
scala> val df = readSparkOutput("file:///tmp/spark.out")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.show
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+

scala> df.printSchema
root
 |-- col1: integer (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)
```

Old answer:

Here is my attempt in Scala (Spark 2.2):

```scala
// Read a fixed-width table printed by df.show():
val t = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "|")
  .option("comment", "+")
  .csv("file:///tmp/spark.out")

// drop the empty placeholder columns (_c0, _cN) created by the edge "|" delimiters
val cols = t.columns.filterNot(c => c.startsWith("_c")).map(a => t(a))

// strip whitespace from the column names
val colsTrimmed = t.columns.filterNot(c => c.startsWith("_c")).map(c => c.replaceAll("\\s+", ""))

// rename the columns using 'colsTrimmed'
val df = t.select(cols: _*).toDF(colsTrimmed: _*)
```

It works, but I have a feeling there must be a more elegant way to do this.
```
scala> df.show
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
| 1.0|pi number|3.141592|
| 2.0| e number| 2.71828|
+----+---------+--------+

scala> df.printSchema
root
 |-- col1: double (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)
```