scala – DBSCAN on Spark: which implementation?

Published: 2020-12-16 18:59:43 | Category: Security | Source: compiled from the web


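I would like to run DBSCAN on Spark. So far I have found two implementations:

> https://github.com/irvingc/dbscan-on-spark
> https://github.com/alitouka/spark_dbscan

I tested the first one using the sbt configuration given on its GitHub page, but:

> The functions in the jar differ from those in the documentation and in the source on GitHub. For example, I cannot find the train function in the jar.
> I managed to run a test with the fit function (which I found in the jar), but a misconfigured epsilon (from small to large) sent the code into an infinite loop.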

Code:

val model = DBSCAN.fit(eps, minPoints, values, parallelism)

Has anyone managed to get the first library working?

Has anyone tested the second one?

Solution

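Please try ELKI. Since it is written in Java, calling it from Scala should be easy.

ELKI is very well optimized, and with indexes it scales to quite large data sets.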
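To make the roles of `eps` and `minPts` concrete, and to show the quadratic neighbor search that an index is meant to avoid, here is a minimal brute-force sketch in plain Java. It is not ELKI's implementation and is not meant for large data; the class and method names (`BruteForceDbscan`, `cluster`) are made up for this illustration, and it assumes Euclidean distance with a point counting as its own neighbor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper for illustration only; O(n^2) because every
// neighborhood query scans the whole array (no index).
class BruteForceDbscan {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Returns one label per row: a cluster id >= 0, or NOISE. */
    static int[] cluster(double[][] data, double eps, int minPts) {
        int[] labels = new int[data.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < data.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> seeds = neighbors(data, i, eps);
            if (seeds.size() < minPts) {
                labels[i] = NOISE; // may later be relabeled as a border point
                continue;
            }
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = clusterId; // border point
                if (labels[j] != UNVISITED) continue;          // already handled
                labels[j] = clusterId;
                List<Integer> next = neighbors(data, j, eps);
                if (next.size() >= minPts) queue.addAll(next); // j is a core point
            }
            clusterId++;
        }
        return labels;
    }

    /** Indexes of all rows within eps of row i (including i itself). */
    static List<Integer> neighbors(double[][] data, int i, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < data.length; j++) {
            if (distance(data[i], data[j]) <= eps) result.add(j);
        }
        return result;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Each point is expanded at most once, so the loop always terminates regardless of `eps`. A sketch like this can also serve as a reference on a small sample when checking whether a distributed implementation returns plausible clusters.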

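We tried to include one of the Spark implementations in our benchmark study, but it ran out of memory (and it was the only implementation that did so... the k-means of Spark and Mahout were also among the slowest):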

Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek.
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
In: Knowledge and Information Systems (KAIS). 2016, 1–38

Prof. Neukirchen benchmarked parallel implementations of DBSCAN in this technical report:

Helmut Neukirchen
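Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms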

Apparently he got some of the Spark implementations working, but notes:

> The result is devastating: none of the implementations for Apache Spark is anywhere near to the HPC implementations. In particular on bigger (but still rather small) data sets, most of them fail completely and do not even deliver correct results.

And earlier in the report:

> When running any of the “Spark DBSCAN” implementations while making use of all available cores of our cluster, we experienced out-of-memory exceptions.

(Also, “Spark DBSCAN” took 2406 seconds on 928 cores, while ELKI took 997 seconds on 1 core for the smaller benchmark. The other Spark implementation did not fare much better either; in particular, it did not return the correct result...)

> “DBSCAN on Spark” did not crash, but returned completely wrong clusters.

> While “DBSCAN on Spark” finishes faster, it delivered completely wrong clustering results. Due to the hopelessly long run-times of the DBSCAN implementations for Spark already with the maximum number of cores, we did not perform measurements with a lower number of cores.
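To put the run times mentioned above in perspective, the figures can be compared in core-seconds. This is only back-of-the-envelope arithmetic, and it assumes that both timings (2406 s on 928 cores, 997 s on 1 core) refer to the same smaller data set, as the parenthetical remark suggests:

```latex
\[
\frac{2406\,\mathrm{s} \times 928\ \text{cores}}{997\,\mathrm{s} \times 1\ \text{core}}
  = \frac{2\,232\,768}{997} \approx 2239,
\qquad
\frac{2406\,\mathrm{s}}{997\,\mathrm{s}} \approx 2.4
\]
```

Under that assumption, the Spark run took about 2.4 times longer in wall-clock time while consuming roughly 2,200 times more core-seconds.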

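You can wrap a double[][] array as an ELKI database:

// Imports assume the ELKI 0.7.x package layout (de.lmu.ifi.dbs.elki).
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN;
import de.lmu.ifi.dbs.elki.data.Cluster;
import de.lmu.ifi.dbs.elki.data.Clustering;
import de.lmu.ifi.dbs.elki.data.NumberVector;
import de.lmu.ifi.dbs.elki.data.model.Model;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection;
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection;
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.EuclideanDistanceFunction;

// Adapter to load data from an existing array.
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();

Clustering<Model> c = new DBSCAN<NumberVector>(
  EuclideanDistanceFunction.STATIC, eps, minpts).run(db);

// DBSCAN yields Cluster<Model>, not Cluster<KMeansModel>.
for (Cluster<Model> clu : c.getAllClusters()) {
  // Process clusters
}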

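See also: Java API example (in particular, how to map DBIDs back to row indexes). For better performance, pass an index factory (for example new CoverTree.Factory(…)) as the second argument to the StaticArrayDatabase constructor.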
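Since the question mentions trouble with the epsilon setting, one common heuristic, suggested in the original DBSCAN paper, is to look at the sorted distances to each point's k-th nearest neighbor and pick `eps` near the "knee" of that curve. The sketch below computes those distances by brute force in plain Java. The names (`KDistance`, `sortedKDistances`) are hypothetical, and how `k` relates to `minPts` (for instance, whether a point counts itself) is an assumption you should check against the implementation you use.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; O(n^2 log n), so run it on a sample.
class KDistance {
    /**
     * For each row, computes the Euclidean distance to its k-th nearest
     * other row, and returns these values sorted in descending order.
     */
    static double[] sortedKDistances(double[][] data, int k) {
        int n = data.length;
        if (k < 1 || k >= n) {
            throw new IllegalArgumentException("need 1 <= k < number of rows");
        }
        double[] kDist = new double[n];
        for (int i = 0; i < n; i++) {
            double[] d = new double[n - 1];
            int m = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) d[m++] = euclidean(data[i], data[j]);
            }
            Arrays.sort(d);      // ascending
            kDist[i] = d[k - 1]; // k-th nearest neighbor distance
        }
        Arrays.sort(kDist);
        // Reverse in place so the largest k-distance comes first.
        for (int lo = 0, hi = n - 1; lo < hi; lo++, hi--) {
            double tmp = kDist[lo];
            kDist[lo] = kDist[hi];
            kDist[hi] = tmp;
        }
        return kDist;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Plotting the returned array (or simply printing it) shows a few large values for outliers followed by a flatter region; an `eps` taken from around the transition is a reasonable starting point, not a guarantee.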

(Editor: 李大同)
