Scala: converting a list of key/value pairs to a list of values per key in Spark

Published: 2020-12-16 18:39:21


We need to efficiently convert a large list of key/value pairs, like this:

val providedData = List(
        (new Key("1"), new Val("one")),
        (new Key("1"), new Val("un")),
        (new Key("1"), new Val("ein")),
        (new Key("2"), new Val("two")),
        (new Key("2"), new Val("deux")),
        (new Key("2"), new Val("zwei"))
)

into a list of values per key, like this:

val expectedData = List(
  (new Key("1"), List(
    new Val("one"), new Val("un"), new Val("ein"))),
  (new Key("2"), List(
    new Val("two"), new Val("deux"), new Val("zwei")))
)

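The key/value pairs come from a large key/value store (Accumulo), so the keys will be sorted, but a run of pairs with the same key will often cross Spark partition boundaries. There can be millions of keys and hundreds of values per key.

I think the right tool for this job is Spark's combineByKey operation, but I have only been able to find terse examples with generic types (such as Int), and I have not managed to generalize them to user-defined types like the ones above.

Since I suspect many others will have the same question, I am hoping someone can provide both a fully specified (verbose) and a terse example of the Scala syntax for using combineByKey with the user-defined types above, or perhaps point out a better tool that I have missed.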

Solution

I'm not a Spark expert, but based on this question, I think you can do the following:

val rdd = sc.parallelize(providedData)

rdd.combineByKey(
    // createCombiner: put the first value for a key into a new list
    (x: Val) => List(x),
    // mergeValue: prepend a new value to the existing list
    (acc: List[Val], x: Val) => x :: acc,
    // mergeCombiners: concatenate the lists built on two partitions
    (acc1: List[Val], acc2: List[Val]) => acc1 ::: acc2
)
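For readers who want something they can paste into a project, here is a minimal self-contained sketch of the combineByKey call above. The question never shows how Key and Val are defined, so the case classes below are assumptions made for illustration; case classes are used because they supply the equals and hashCode that key-based grouping depends on. The object name, the groupValues helper, and the local[2] master are likewise hypothetical choices, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed definitions: the question does not show Key or Val.
// Case classes give value-based equals/hashCode, which grouping by key needs.
case class Key(id: String)
case class Val(text: String)

object CombineByKeyExample {

  // Collects every Val that shares a Key into one List[Val].
  def groupValues(pairs: RDD[(Key, Val)]): RDD[(Key, List[Val])] =
    pairs.combineByKey[List[Val]](
      // createCombiner: first value seen for a key on a partition
      (v: Val) => List(v),
      // mergeValue: further values for that key on the same partition
      (acc: List[Val], v: Val) => v :: acc,
      // mergeCombiners: lists for the same key built on different partitions
      (a: List[Val], b: List[Val]) => a ::: b
    )

  def main(args: Array[String]): Unit = {
    // A local master with two threads, purely for trying the code out.
    val conf = new SparkConf().setAppName("combineByKey-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val providedData = List(
        (Key("1"), Val("one")), (Key("1"), Val("un")), (Key("1"), Val("ein")),
        (Key("2"), Val("two")), (Key("2"), Val("deux")), (Key("2"), Val("zwei"))
      )
      val grouped = groupValues(sc.parallelize(providedData)).collect()
      grouped.foreach { case (k, vs) => println(s"$k -> $vs") }
    } finally {
      sc.stop()
    }
  }
}
```

Because mergeValue prepends with ::, the values for a key come out in reverse order within a partition, and the order in which per-partition lists are concatenated should not be relied on. If the order of values matters, it is safer to sort each list afterwards than to depend on the combine order.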

Using aggregateByKey:

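rdd.aggregateByKey(List[Val]())(
    // seqOp: prepend a value to the list for its key within a partition
    (acc, x) => x :: acc,
    // combOp: concatenate the lists built on two partitions
    (acc1, acc2) => acc1 ::: acc2
)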
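To make the role of each function argument concrete, the following is a plain Scala sketch, with no Spark involved, that imitates the two phases: values are first folded per key inside each "partition", then the per-partition results are merged. Everything here (the CombineSemantics object, combinePartition, combineByKeyLocal, and the use of String keys and values) is hypothetical and only illustrates the idea; it is not how Spark implements these operations internally.

```scala
object CombineSemantics {

  // Phase 1: fold one partition into a per-key map.
  // create is used for the first value of a key, mergeValue for later ones.
  def combinePartition[K, V, C](part: Seq[(K, V)])(
      create: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    part.foldLeft(Map.empty[K, C]) { case (m, (k, v)) =>
      val combined: C = m.get(k) match {
        case Some(acc) => mergeValue(acc, v)
        case None      => create(v)
      }
      m.updated(k, combined)
    }

  // Phase 2: merge the per-partition maps with mergeCombiners.
  def combineByKeyLocal[K, V, C](partitions: Seq[Seq[(K, V)]])(
      create: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] =
    partitions
      .map(p => combinePartition(p)(create, mergeValue))
      .foldLeft(Map.empty[K, C]) { (merged, partial) =>
        partial.foldLeft(merged) { case (m, (k, c)) =>
          val combined: C = m.get(k) match {
            case Some(existing) => mergeCombiners(existing, c)
            case None           => c
          }
          m.updated(k, combined)
        }
      }

  // Sorted keys whose runs cross "partition" boundaries, as in the question.
  val partitions: Seq[Seq[(String, String)]] = Seq(
    Seq("1" -> "one", "1" -> "un"),
    Seq("1" -> "ein", "2" -> "two"),
    Seq("2" -> "deux", "2" -> "zwei")
  )

  val result: Map[String, List[String]] =
    combineByKeyLocal[String, String, List[String]](partitions)(
      (v: String) => List(v),
      (acc: List[String], v: String) => v :: acc,
      (a: List[String], b: List[String]) => a ::: b
    )
}
```

In this sketch the partitions are merged left to right, so the result is deterministic; a real cluster gives no such guarantee about merge order. aggregateByKey follows the same pattern, except that the zero value (List[Val]() above) stands in for createCombiner.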

(Editor: 李大同)
