Understanding Cubert Concepts (2): Co-Partitioned Blocks
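Continuing from the previous post on Cubert partitioned blocks: there we introduced the partitioned block, one of Cubert's core block concepts, which is a kind of block built according to … This post focuses on another kind of Cubert block, the co-partitioned block.

Co-partitioned Blocks
Let's look at another way of creating blocks. For example: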
Say the blocks of dataset P cover these memberId ranges:
memberIds from 0 to 1000 => block 0
memberIds from 1001 to 1500 => block 1
and so on until block N
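
The range-to-block mapping above can be pictured as a lookup over sorted range boundaries. The following is only an illustrative Python sketch, not how Cubert implements it: the names `UPPER_BOUNDS`, `block_for` and `assign_blocks` are hypothetical, and only the two ranges listed in the example are filled in.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds of each block's memberId range,
# taken from the example: block 0 covers 0..1000, block 1 covers 1001..1500.
UPPER_BOUNDS = [1000, 1500]


def block_for(member_id, upper_bounds=UPPER_BOUNDS):
    """Return the id of the block whose range contains member_id."""
    block_id = bisect_left(upper_bounds, member_id)
    if member_id < 0 or block_id == len(upper_bounds):
        raise ValueError(f"memberId {member_id} is outside the indexed ranges")
    return block_id


def assign_blocks(rows, key="memberId"):
    """Group rows (dicts) by the block their key falls into."""
    blocks = {}
    for row in rows:
        blocks.setdefault(block_for(row[key]), []).append(row)
    return blocks
```

Running the same lookup over a second dataset places every memberId in the block with the same number as in dataset P, which is the property the rest of this post relies on.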
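So far, all we have done is …
We want to generate blocks with the same … as dataset P. Specifically, … for … This process of creating consistently partitioned blocks from another set of already partitioned blocks is called BLOCKGEN BY INDEX.

Checklist
To develop with Cubert, we must follow these three rules: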
Creating Co-Partitioned Blocks
To create co-partitioned blocks, we still need …
For example:
// the primary dataset
JOB "our first BLOCKGEN"
    REDUCERS 10;
    MAP {
        data = LOAD "/path/to/data" USING AVRO();
    }
    // partition on memberId and sort on timestamp
    BLOCKGEN data BY ROW 1000 PARTITIONED ON memberId SORTED ON timestamp;
    // note: the output must be stored in the RUBIX file format
    STORE data INTO "/path/to/output" USING RUBIX();
END
JOB "our first blockgen by index"
    REDUCERS 20;
    MAP {
        data = LOAD "/path/to/other/data" USING AVRO();
    }
    // note: the INDEX path is the output directory of the previous job
    BLOCKGEN data BY INDEX "/path/to/output" PARTITIONED ON memberId SORTED ON some_column;
    STORE data INTO "/path/to/other/output" USING RUBIX();
END
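
To make the idea behind the two jobs concrete, here is a rough Python sketch of deriving an index from the blocks of a primary dataset and using it to place the rows of a second dataset. It is a conceptual model under stated assumptions, not Cubert's actual index format or algorithm: `build_index` and `co_partition` are hypothetical helpers, block ids are assumed to be 0..N-1, and keys above the last boundary are clamped into the last block purely for illustration.

```python
from bisect import bisect_left


def build_index(primary_blocks, key="memberId"):
    """Record the largest partition key in each block, ordered by block id.

    Assumes block ids are the contiguous integers 0..N-1.
    """
    return [max(row[key] for row in primary_blocks[block_id])
            for block_id in sorted(primary_blocks)]


def co_partition(rows, index, key="memberId"):
    """Place each row in the block whose key range (per the index) contains it."""
    blocks = {block_id: [] for block_id in range(len(index))}
    for row in rows:
        # Assumption: keys beyond the last boundary go into the last block.
        block_id = min(bisect_left(index, row[key]), len(index) - 1)
        blocks[block_id].append(row)
    return blocks


# Hypothetical data: the primary dataset is already split into two blocks.
primary = {
    0: [{"memberId": 10}, {"memberId": 1000}],
    1: [{"memberId": 1001}, {"memberId": 1500}],
}
other = [
    {"memberId": 1200, "some_column": "x"},
    {"memberId": 7, "some_column": "y"},
]
index = build_index(primary)
co_blocks = co_partition(other, index)
```

In this sketch, a memberId found in block k of the primary dataset is also found in block k of the second dataset, so matching blocks can be processed together.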
Idiom of Resorting Blocks
In the previous example, the …
JOB "resorting blocks"
    REDUCERS 10;
    MAP {
        data = LOAD "/path/to/output" USING RUBIX();
    }
    BLOCKGEN data BY INDEX "/path/to/output" PARTITIONED ON memberId SORTED ON pagekey;
    STORE data INTO "/path/to/resorted-output" USING RUBIX();
END
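
The effect of this job can be pictured as follows: each row stays in the block it was already in, and only the order inside each block changes. The Python sketch below is an illustration under that reading, with a hypothetical `resort_blocks` helper and made-up rows; it does not model how Cubert shuffles or stores the data.

```python
def resort_blocks(blocks, sort_key):
    """Return new blocks with the same membership, each sorted on sort_key."""
    return {
        block_id: sorted(rows, key=lambda row: row[sort_key])
        for block_id, rows in blocks.items()
    }


# Hypothetical rows: two blocks partitioned on memberId, sorted on timestamp.
blocks = {
    0: [
        {"memberId": 5, "timestamp": 1, "pagekey": "b"},
        {"memberId": 900, "timestamp": 2, "pagekey": "a"},
    ],
    1: [
        {"memberId": 1200, "timestamp": 3, "pagekey": "z"},
    ],
}
resorted = resort_blocks(blocks, "pagekey")
```

Because the partitioning is untouched, the resorted blocks remain co-partitioned with every other dataset built from the same index.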
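Note:
Similarly, datasets B and C can both be re-sorted using index A, because they are all …

Reference: the official Cubert documentation, blocks

P.S. This post combines a translation of the official Cubert documentation with my own understanding of Cubert :)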
(Editor: 李大同)