scala – Is Spark's zipWithIndex safe with parallel execution?

Published: 2020-12-16 09:56:51 Category: Security Source: compiled from the web

Suppose I have a file, load it into an RDD with one element per line, and call zipWithIndex on it:

([row1,id1001,name,address],0)
([row2,1)
...
([row100000,100000)

If I reload the file, will I get the same index order? Since it runs in parallel, could the lines be partitioned differently?

Solution

RDDs can be sorted, so they do have an order. That order is what .zipWithIndex() uses to create the index.

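Whether you get the same order every time depends on what the earlier calls in your program do. The docs mention that .groupBy() can destroy the order or produce a different ordering. There may be other calls that do this as well.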

If you need a guaranteed ordering, you can call .sortBy() before calling .zipWithIndex().
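
As a minimal sketch of the sort-then-index approach described above: the object name StableZipWithIndex, the helper indexSorted, the sample rows, and the local[*] master are all assumptions made for illustration, and parallelize stands in for loading the real file with sc.textFile. The sketch also assumes every line is unique, since rows with equal sort keys could still swap places between runs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object StableZipWithIndex {
  // Hypothetical helper: impose a total order on the lines, then index them.
  // Assumes each line is unique, so the sort fixes one position per row.
  def indexSorted(lines: RDD[String]): RDD[(String, Long)] =
    lines
      .sortBy(line => line) // order no longer depends on how the input was split
      .zipWithIndex()       // index 0 goes to the smallest line

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stable-zipWithIndex")
      .master("local[*]") // assumption: local run for demonstration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for sc.textFile(path); three partitions to force parallelism.
    val lines = sc.parallelize(
      Seq("row3,id1003", "row1,id1001", "row2,id1002"), numSlices = 3)

    indexSorted(lines).collect().foreach(println)

    spark.stop()
  }
}
```

Sorting costs a shuffle, so it is only worth paying for when the indices must be reproducible across reloads.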

This is explained in the API docs for .zipWithIndex():

public RDD<scala.Tuple2<T,Object>> zipWithIndex()

Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala’s zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.

Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
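
The indexing rule quoted above can be modelled without Spark. The sketch below is not Spark's implementation: it treats an RDD as a plain sequence of partitions, and the names ZipWithIndexModel and zipWithIndexModel are hypothetical. Each partition's starting index is the total size of the partitions before it, which is presumably why Spark needs an extra job to count elements when there is more than one partition.

```scala
object ZipWithIndexModel {
  // Model an RDD as an ordered sequence of partitions and assign indices
  // by partition order first, then by position inside each partition.
  def zipWithIndexModel[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    // Starting index of each partition = combined size of all earlier ones.
    val startOffsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(startOffsets).flatMap { case (part, start) =>
      part.zipWithIndex.map { case (elem, i) => (elem, start + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Same elements, but the first partition arrives in a different order,
    // as can happen after groupBy() or a re-evaluation.
    val run1 = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))
    val run2 = Seq(Seq("b", "a"), Seq("c"), Seq("d", "e"))

    println(zipWithIndexModel(run1)) // "a" gets 0, "b" gets 1
    println(zipWithIndexModel(run2)) // "b" gets 0, "a" gets 1
  }
}
```

The two runs hold the same data, yet "a" and "b" trade indices. That is the situation the note warns about when the order inside a partition is not fixed.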

(Editor: 李大同)
