scala - Spark saveAsTextFile() writes multiple files instead of one

Published: 2020-12-16 18:37:40 | Category: Security | Source: compiled from the web


See the English answers > how to make saveAsTextFile NOT split output into multiple file? (9 answers)
I am currently using Spark with Scala on my laptop.

When I write an RDD to a file, the output is written to two files, "part-00000" and "part-00001". How can I force Spark / Scala to write a single file?

My code is currently:

myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")

Here the map removes the parentheses so that each key-value pair is written out as "key,value".

Solution

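The "problem" is actually a feature. It follows from how the RDD is partitioned: the output is split into n parts, where n is the number of partitions. To get a single file, change the number of partitions to 1 by calling repartition on the RDD. The documentation states: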

repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are
decreasing the number of partitions in this RDD, consider using
coalesce, which can avoid performing a shuffle.

For example, this change should work:

myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
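To confirm that the number of part files follows the number of partitions, the partition count can be inspected before and after the call. This is only a minimal sketch, assuming a local SparkContext; the sample data, the application name, and the names PartitionCountCheck and countsBeforeAfter are illustrative and do not come from the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountCheck {
  // Returns the partition count of the mapped RDD before and after repartition(1).
  def countsBeforeAfter(sc: SparkContext): (Int, Int) = {
    // Hypothetical sample data standing in for myRDD, split into two partitions.
    val myRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 2)
    val lines = myRDD.map(x => x._1 + "," + x._2)
    // saveAsTextFile writes one part file per partition, so these counts
    // correspond to the number of part-NNNNN files each variant would produce.
    (lines.getNumPartitions, lines.repartition(1).getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, similar to running on a laptop.
    val conf = new SparkConf().setAppName("partition-count-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = countsBeforeAfter(sc)
      println(s"partitions before: $before, after repartition(1): $after")
    } finally {
      sc.stop()
    }
  }
}
```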

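As the documentation says, you can also use coalesce, which is actually the recommended option when reducing the number of partitions. However, reducing the number of partitions to one is generally considered a bad idea, because it forces all the data onto a single node and gives up parallelism.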
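The coalesce variant might look like the sketch below. It assumes a local SparkContext and an output path on the local filesystem; the sample data and the names SingleFileOutput and writeSingle are hypothetical and only serve the illustration.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object SingleFileOutput {
  // Writes "key,value" lines through a single partition and returns the
  // names of the part files found in the output directory.
  def writeSingle(sc: SparkContext, outputDir: String): Seq[String] = {
    // Hypothetical sample data standing in for myRDD, split into two partitions.
    val myRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 2)
    myRDD.map(x => x._1 + "," + x._2)
      .coalesce(1) // merge into one partition without a shuffle
      .saveAsTextFile(outputDir)
    // Assumption: outputDir is a local directory, so it can be listed with java.io.File.
    new File(outputDir).list().filter(_.startsWith("part-")).sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-file-output").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // saveAsTextFile fails if the target directory already exists, so use a fresh path.
      val out = Files.createTempDirectory("spark-out").resolve("result").toString
      println(writeSingle(sc, out).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Because coalesce(1) adds no shuffle boundary, the preceding map may also run in a single task. If the upstream work is expensive, coalesce(1, shuffle = true) keeps it parallel and only merges the results at the end, at the cost of a shuffle.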

(Editor: 李大同)
