How can I efficiently retrieve the top-K most similar vectors by cosine similarity in R?
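I'm working on a high-dimensional problem (about 4k terms) and would like to retrieve the top-k most similar items (by cosine similarity), but I can't afford a full pairwise computation.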
My training set is a 6 million x 4k matrix, and I'd like to predict for a 600k x 4k matrix. What is the most efficient way to retrieve the k most similar items for each item in the 600k x 4k matrix? Ideally, I'd like to get a 600k x 10 matrix (i.e. the top 10 similar items for each of the 600k items).

PS: I've searched SO and found that almost all "cosine similarity in R" questions refer to cosine_sim(vector1, vector2), whereas this question is about cosine_sim(matrix1, matrix2).

Update

    set.seed(123)
    train <- matrix(round(runif(30), 0), nrow = 6, ncol = 5)
    set.seed(987)
    test <- matrix(round(runif(20), 0), nrow = 4, ncol = 5)

    train
         [,1] [,2] [,3] [,4] [,5]
    [1,]    0    1    1    0    1
    [2,]    1    1    1    1    1
    [3,]    0    1    0    1    1
    [4,]    1    0    1    1    1
    [5,]    1    1    0    1    0
    [6,]    0    0    0    1    0

    test
         [,1] [,2] [,3] [,4] [,5]
    [1,]    0    1    1    0    0
    [2,]    1    0    1    0    1
    [3,]    1    0    0    0    0
    [4,]    1    0    0    1    1

    coSim <- function(mat1, mat2, topK) {
      require(plyr)
      # mat2: the test set
      # mat1: the training set. We find the cosine similarity between each
      #       row in the test set and every row in the training set.
      # topK: user input. For each row in the test set we return the 'topK'
      #       most similar rows (indices) from the training set.

      # Set up an empty result matrix. nrow(result) equals the size of the
      # cartesian product between the rows of mat1 and mat2.
      result <- matrix(rep(NA, nrow(mat1) * nrow(mat2)),
                       nrow = nrow(mat1) * nrow(mat2), ncol = 3)
      k <- 1
      for (i in 1:nrow(mat2)) {
        for (j in 1:nrow(mat1)) {
          result[k, 1] <- i
          result[k, 2] <- j
          result[k, 3] <- crossprod(mat1[j, ], mat2[i, ]) /
            sqrt(crossprod(mat1[j, ]) * crossprod(mat2[i, ]))
          k <- k + 1
        }
      }
      # Sort the result matrix by cosine similarity for each row in the test set.
      # Not sure how to keep topK from each group, so convert to a data frame.
      result <- as.data.frame(result)
      colnames(result) <- c("testRowId", "trainRowId", "CosineSimilarity")
      result <- ddply(result, "testRowId", function(x)
        head(x[order(x$CosineSimilarity, decreasing = TRUE), ], topK))
      resultMat <- matrix(result$trainRowId, nrow = nrow(mat2), ncol = topK, byrow = TRUE)
      finalResult <- list(similarity = result, index = resultMat)
    }

    system.time(cosineSim <- coSim(train, test, topK = 2)) # 0.12 secs

    cosineSim
    $similarity
      testRowId trainRowId CosineSimilarity
    1         1          1        0.8164966
    2         1          2        0.6324555
    3         2          4        0.8660254
    4         2          2        0.7745967
    5         3          5        0.5773503
    6         3          4        0.5000000
    7         4          4        0.8660254
    8         4          2        0.7745967

    $index
         [,1] [,2]
    [1,]    1    2
    [2,]    4    2
    [3,]    5    4
    [4,]    4    2

    set.seed(123)
    train <- matrix(round(runif(1000000), 0), nrow = 5000, ncol = 200)
    set.seed(987)
    test <- matrix(round(runif(400000), 0), nrow = 2000, ncol = 200)
    system.time(cosineSim <- coSim(train, test, topK = 50)) # 380 secs

When I run the same function with a 5000 x 200 training matrix and a 2000 x 200 test matrix, it takes over 380 seconds.

Ideally, I'd like to see ideas that avoid computing the similarity between every pair of rows. If that isn't possible, some pointers on how to vectorise the code above would be helpful.

Solution
There's no need to compute the similarity one pair of rows at a time. You can use this instead:
    coSim2 <- function(mat1, mat2, topK) {
      # similarity computation:
      xy <- tcrossprod(mat1, mat2)   # all dot products, train rows x test rows
      xx <- rowSums(mat1^2)          # squared norms of the train rows
      yy <- rowSums(mat2^2)          # squared norms of the test rows
      result <- xy / sqrt(outer(xx, yy))

      # top similar rows from train (per row in test);
      # drop = FALSE keeps a matrix even when topK is 1
      top <- apply(result, 2, order, decreasing = TRUE)[1:topK, , drop = FALSE]
      result_df <- data.frame(testRowId = c(col(top)), trainRowId = c(top))
      result_df$CosineSimilarity <- result[as.matrix(result_df[, 2:1])]
      list(similarity = result_df, index = t(top))
    }

Test data (I've reduced your train matrix):

    set.seed(123)
    train <- matrix(round(runif(100000), 0), nrow = 500, ncol = 200)

Results:

    > system.time(cosineSim <- coSim(train, test, topK = 50))
       user  system elapsed
      41.71    1.59   43.72
    > system.time(cosineSim2 <- coSim2(train, test, topK = 50))
       user  system elapsed
       0.46    0.02    0.49

With the full 5000 x 200 train matrix, coSim2 runs in 7.8 seconds.

Also note:

    > any(cosineSim$similarity != cosineSim2$similarity)
    [1] FALSE
    > any(cosineSim$index != cosineSim2$index)
    [1] FALSE

You can't use identical() for this comparison because my function returns integers rather than doubles for the row IDs.
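One practical limit of the matrix approach is that it builds the whole nrow(train) x nrow(test) similarity matrix at once, which for the sizes in the question (6 million x 600k) would be about 3.6e12 entries. A possible workaround, sketched below, is to process the test rows in blocks and keep only the top-K indices from each block. This is an illustration rather than part of the original answer: the function name coSimChunked and the chunkSize parameter are made up here, and it assumes no row is all zeros (such a row would produce NaN similarities). It does the same number of dot products as coSim2 and only bounds the size of the intermediate matrix.

```r
# Hypothetical helper (not from the original answer): same cosine similarity
# as coSim2, but the test matrix is processed in blocks of rows so that only
# an nrow(train) x chunkSize similarity block exists at any time.
coSimChunked <- function(train, test, topK, chunkSize = 1000) {
  # Scale every row to unit length, so a plain cross product gives the cosine.
  # Assumption: no all-zero rows (they would give NaN).
  trainUnit <- train / sqrt(rowSums(train^2))
  testUnit  <- test / sqrt(rowSums(test^2))

  index <- matrix(NA_integer_, nrow = nrow(test), ncol = topK)
  for (s in seq(1, nrow(test), by = chunkSize)) {
    rows <- s:min(s + chunkSize - 1, nrow(test))
    # nrow(train) x length(rows) block of cosine similarities
    sim <- tcrossprod(trainUnit, testUnit[rows, , drop = FALSE])
    # for each test row (column of sim), indices of the topK largest values
    top <- apply(sim, 2, function(x) order(x, decreasing = TRUE)[seq_len(topK)])
    # matrix() handles both the topK == 1 (vector) and topK > 1 (matrix) case
    index[rows, ] <- matrix(top, nrow = length(rows), ncol = topK, byrow = TRUE)
  }
  index
}

# Usage on the small example from the question, entered explicitly
train <- matrix(c(0, 1, 1, 0, 1,
                  1, 1, 1, 1, 1,
                  0, 1, 0, 1, 1,
                  1, 0, 1, 1, 1,
                  1, 1, 0, 1, 0,
                  0, 0, 0, 1, 0), nrow = 6, byrow = TRUE)
test <- matrix(c(0, 1, 1, 0, 0,
                 1, 0, 1, 0, 1,
                 1, 0, 0, 0, 0,
                 1, 0, 0, 1, 1), nrow = 4, byrow = TRUE)
idx <- coSimChunked(train, test, topK = 2, chunkSize = 3)
```

On this small example idx should match the $index matrix shown in the question. On larger data, near-ties may be ordered differently from coSim2 because the normalisation is applied in a different order and floating-point rounding can differ.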