scala – Spark：阅读文本文件后重新分配策略

发布时间：2020-12-16 19:01:08 所属栏目：安全来源：网络整理

导读：我以这种方式推出了我的群集： /usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster--num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar 我做的第一件事是读一个大文本文件,并计数： val

我以这种方式推出了我的群集：

/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster--num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar

我做的第一件事是读一个大文本文件,并计数：

val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())

当这样做时,我看到只有我的一个节点正在读取文件并执行计数(因为我只看到一个任务).这是预期的吗我应该重新分配我的RDD,或者当我使用地图缩小功能时,Spark会为我做吗？

解决方法

看起来你正在使用一个gzip压缩文件.

报价my answer here：

I think you’ve hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically,a single gzipped file cannot be loaded in parallel by multiple tasks,so Spark will load it with 1 task and thus give you an RDD with 1 partition.

您需要在加载RDD之后明确地重新分配RDD,以便更多的任务可以并行运行.

例如：

val file = sc.textFile("/path/to/file.txt.gz").repartition(sc.defaultParallelism * 3)
println(file.count())

关于你的问题的意见,设置minPartitions在这里没有帮助的原因是因为a gzipped file is not splittable,所以Spark将始终使用1个任务来读取文件.

如果您在阅读常规文本文件时设置minPartition,或者使用像bzip2这样的可压缩压缩格式压缩的文件,您将看到Spark将实际部署并行执行的任务数量(最多可达到集群中可用的核心数量)读取文件.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!