根据语义相似性/相关性从列表中删除重复项

发布时间：2020-12-14 04:32:04 所属栏目：大数据来源：网络整理

导读：R tm：如何基于语义相似性去除列表中的项目？ ?v -c(“银行”,“银行”,“ford_suv”,“toyota_suv”,“nissan_suv”).我期望的解决方案是c(“bank”,“ nissan_suv“).也就是说,银行,银行和银行业务将减少到一个任期“银行”. SnowBall :: stemming不是一种

R tm：如何基于语义相似性去除列表中的项目？
?v< -c(“银行”,“银行”,“ford_suv”,“toyota_suv”,“nissan_suv”).我期望的解决方案是c(“bank”,“ nissan_suv“).也就是说,银行,银行和银行业务将减少到一个任期“银行”. SnowBall :: stemming不是一种选择,因为我必须保留各个国家的报纸风格.任何帮助或指导都会有用.

解决方法

我们可以使用adist计算单词之间的Levenshtein距离,并使用hclust将它们重新组合成簇

d <- adist(v)
rownames(d) <- v

这给出了术语之间的距离矩阵：

#              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#bank              0    1    3    8    9    8    2   13    6     5     3     4
#banks             1    0    3    7    9    7    2   13    6     6     2     5
#banking           3    3    0    8   10    8    3   13    7     6     3     7
#ford_suv          8    7    8    0    5    6    8   12    7     7     8     4
#toyota_suv        9    9   10    5    0    6    9    7    4     9     9     9
#nissan_suv        8    7    8    6    6    0    8   13   10     4     8    10
#banker            2    2    3    8    9    8    0   12    6     6     1     6
#toyota_corolla   13   13   13   12    7   13   12    0    8    13    12    12
#toyota            6    6    7    7    4   10    6    8    0     6     7     5
#nissan            5    6    6    7    9    4    6   13    6     0     7     6
#bankers           3    2    3    8    9    8    1   12    7     7     0     6
#ford              4    5    7    4    9   10    6   12    5     6     6     0

然后我们可以使用method = ward.D将它传递给hclust

cl <- hclust(as.dist(d),method  = "ward.D")
plot(cl)

这使：

enter image description here

我们注意到4个不同的簇(我们可以使用rect.hclust(cl,4)来说明)

enter image description here

现在,我们可以将此结果转换为data.frame并使用它的最短项标记每个集群：

library(dplyr)
data.frame(group = cutree(cl,4)) %>%
  tibble::rownames_to_column("term") %>%
  group_by(group) %>%
  mutate(tag = term[nchar(term) == min(nchar(term))])

这使：

#Source: local data frame [12 x 3]
#Groups: group [4]
#
#             term group      tag
#            <chr> <int>    <chr>
#1            bank     1     bank
#2           banks     1     bank
#3         banking     1     bank
#4        ford_suv     2     ford
#5      toyota_suv     3   toyota
#6      nissan_suv     4   nissan
#7          banker     1     bank
#8  toyota_corolla     3   toyota
#9          toyota     3   toyota
#10         nissan     4   nissan
#11        bankers     1     bank
#12           ford     2     ford

如果我们只想为每个群集提取唯一标记,我们可以将?％>％distinct(tag)％>％.$tag添加到管道中,这将给出：

#[1] "bank"   "ford"   "toyota" "nissan"

参考

？adist

The (generalized) Levenshtein (or edit) distance between two strings s
and t is the minimal possibly weighted number of insertions,deletions
and substitutions needed to transform s into t (so that the
transformation exactly matches t).

？hclust

This function performs a hierarchical cluster analysis using a set of
dissimilarities for the n objects being clustered. Initially,each
object is assigned to its own cluster and then the algorithm proceeds
iteratively,at each stage joining the two most similar clusters,
continuing until there is just a single cluster.

注意：我在评论中使用了@Abdou提供的数据,因为它代表了一个更完整的用例

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!