scala – 与简单方法相比,min散列的Spark Jaccard相似度计算速度
发布时间:2020-12-16 18:28:45 所属栏目:安全 来源:网络整理
导读:鉴于2个巨大的值列表,我试图使用 Scala在Spark中计算它们之间的 jaccard similarity. 假设colHashed1包含第一个值列表,colHashed2包含第二个列表. 方法1(琐碎的方法): val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.u
鉴于2个巨大的值列表,我试图使用
Scala在Spark中计算它们之间的
jaccard similarity.
假设colHashed1包含第一个值列表,colHashed2包含第二个列表. 方法1(琐碎的方法): val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble) 方法2(使用minHashing): 我使用了here解释的方法. import java.util.zip.CRC32 def getCRC32 (s : String) : Int = { val crc=new CRC32 crc.update(s.getBytes) return crc.getValue.toInt & 0xffffffff } val maxShingleID = Math.pow(2,32)-1 def pickRandomCoeffs(kIn : Int) : Array[Int] = { var k = kIn val randList = Array.fill(k){0} while(k > 0) { // Get a random shingle ID. var randIndex = (Math.random()*maxShingleID).toInt // Ensure that each random number is unique. while(randList.contains(randIndex)) { randIndex = (Math.random()*maxShingleID).toInt } // Add the random number to the list. k = k - 1 randList(k) = randIndex } return randList } val colHashed1 = list1Values.map(a => getCRC32(a)) val colHashed2 = list2Values.map(a => getCRC32(a)) val nextPrime = 4294967311L val numHashes = 10 val coeffA = pickRandomCoeffs(numHashes) val coeffB = pickRandomCoeffs(numHashes) var signature1 = Array.fill(numHashes){0} for (i <- 0 to numHashes-1) { // Evaluate the hash function. val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) // Track the lowest hash code seen. signature1(i) = hashCodeRDD.min.toInt } var signature2 = Array.fill(numHashes){0} for (i <- 0 to numHashes-1) { // Evaluate the hash function. val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) // Track the lowest hash code seen. signature2(i) = hashCodeRDD.min.toInt } var count = 0 // Count the number of positions in the minhash signature which are equal. for(k <- 0 to numHashes-1) { if(signature1(k) == signature2(k)) count = count + 1 } val jSimilarity = count/numHashes.toDouble 方法1似乎总是在时间方面优于方法2.当我分析代码时,方法2中的RDD上的min()函数调用需要很长时间,并且该函数被调用多次,具体取决于使用了多少个散列函数. 与重复的min()函数调用相比,方法1中使用的交集和并集操作似乎更快. 我不明白为什么minHashing在这里没有帮助.我预计minHashing与普通方法相比工作得更快.我在这里做错了吗? 可以查看样本数据here 解决方法
Jaccard与MinHash的相似性并未给出一致的结果:
import java.util.zip.CRC32 object Jaccard { def getCRC32(s: String): Int = { val crc = new CRC32 crc.update(s.getBytes) return crc.getValue.toInt & 0xffffffff } def pickRandomCoeffs(kIn: Int,maxShingleID: Double): Array[Int] = { var k = kIn val randList = Array.ofDim[Int](k) while (k > 0) { // Get a random shingle ID. var randIndex = (Math.random() * maxShingleID).toInt // Ensure that each random number is unique. while (randList.contains(randIndex)) { randIndex = (Math.random() * maxShingleID).toInt } // Add the random number to the list. k = k - 1 randList(k) = randIndex } return randList } def approach2(list1Values: List[String],list2Values: List[String]) = { val maxShingleID = Math.pow(2,32) - 1 val colHashed1 = list1Values.map(a => getCRC32(a)) val colHashed2 = list2Values.map(a => getCRC32(a)) val nextPrime = 4294967311L val numHashes = 10 val coeffA = pickRandomCoeffs(numHashes,maxShingleID) val coeffB = pickRandomCoeffs(numHashes,maxShingleID) val signature1 = for (i <- 0 until numHashes) yield { val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime) hashCodeRDD.min.toInt // Track the lowest hash code seen. } val signature2 = for (i <- 0 until numHashes) yield { val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime) hashCodeRDD.min.toInt // Track the lowest hash code seen } val count = (0 until numHashes) .map(k => if (signature1(k) == signature2(k)) 1 else 0) .fold(0)(_ + _) val jSimilarity = count / numHashes.toDouble jSimilarity } // def approach1(list1Values: List[String],list2Values: List[String]) = { // val colHashed1 = list1Values.toSet // val colHashed2 = list2Values.toSet // // val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble) // jSimilarity // } def approach1(list1Values: List[String],list2Values: List[String]) = { val colHashed1 = list1Values.toSet val colHashed2 = list2Values.toSet val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble jSimilarity } def main(args: Array[String]) { val list1Values = List("a","b","c") val list2Values = List("a","d") for (i <- 0 until 5) { println(s"Iteration ${i}") println(s" - Approach 1: ${approach1(list1Values,list2Values)}") println(s" - Approach 2: ${approach2(list1Values,list2Values)}") } } } OUTPUT: Iteration 0 - Approach 1: 0.5 - Approach 2: 0.5 Iteration 1 - Approach 1: 0.5 - Approach 2: 0.5 Iteration 2 - Approach 1: 0.5 - Approach 2: 0.8 Iteration 3 - Approach 1: 0.5 - Approach 2: 0.8 Iteration 4 - Approach 1: 0.5 - Approach 2: 0.4 你为什么用它? (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |