How to calculate cosine similarity given 2 sentence strings? – Python
Published: 2020-12-14 05:01:07 | Category: Big Data | Source: compiled from the web
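From "Python: tf-idf-cosine: to find document similarity", it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, is there any way to calculate the cosine similarity between 2 strings?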
    s1 = "This is a foo bar sentence ."
    s2 = "This sentence is similar to a foo bar sentence ."
    s3 = "What is this string ? Totally not related to the other two lines ."

    cosine_sim(s1, s2)  # Should give high cosine similarity
    cosine_sim(s1, s3)  # Shouldn't give high cosine similarity value
    cosine_sim(s2, s3)  # Shouldn't give high cosine similarity value

Solution
A simple pure-Python implementation would be:
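    import math
    import re
    from collections import Counter

    # Match runs of word characters; punctuation such as "." is skipped.
    WORD = re.compile(r'\w+')

    def get_cosine(vec1, vec2):
        # Only words present in both vectors contribute to the dot product.
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)

        sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
        sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
        denominator = math.sqrt(sum1) * math.sqrt(sum2)

        # An empty text has a zero-length vector; avoid dividing by zero.
        if not denominator:
            return 0.0
        return float(numerator) / denominator

    def text_to_vector(text):
        # Bag-of-words vector: word -> occurrence count.
        words = WORD.findall(text)
        return Counter(words)

    text1 = 'This is a foo bar sentence .'
    text2 = 'This sentence is similar to a foo bar sentence .'

    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)

    cosine = get_cosine(vector1, vector2)

    print('Cosine:', cosine)

This prints (to 12 decimal places):

    Cosine: 0.861640436855

The formula used is the standard cosine similarity: the dot product of the two word-count vectors divided by the product of their Euclidean norms.

This does not include tf-idf weighting of the words. To use tf-idf, you need a reasonably large corpus from which to estimate the tf-idf weights.

You can also develop it further by using a more sophisticated way to extract words from a piece of text, by stemming or lemmatising the words, and so on.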
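To connect the answer back to the `cosine_sim(s1, s2)` call in the question, here is a minimal sketch that wraps the same idea in a single function and optionally applies idf weights. The names `tokenize`, `idf_weights`, and the `idf` parameter are hypothetical, introduced only for illustration. Lowercasing the text and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` are assumptions (one common choice among several), and the three example sentences are used as a toy corpus that is far too small to give meaningful weights.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tokenize(text):
    # Assumption: lowercase so that "This" and "this" count as the same word.
    return WORD.findall(text.lower())


def idf_weights(corpus):
    # Hypothetical helper: smoothed inverse document frequency per word,
    # computed as log((1 + N) / (1 + df)) + 1 over the given documents.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def cosine_sim(a, b, idf=None):
    # With idf=None this is plain word-count cosine similarity.
    va, vb = dict(Counter(tokenize(a))), dict(Counter(tokenize(b)))
    if idf is not None:
        # Words missing from the idf table get weight 1.0 (an arbitrary choice).
        va = {w: c * idf.get(w, 1.0) for w, c in va.items()}
        vb = {w: c * idf.get(w, 1.0) for w, c in vb.items()}
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

idf = idf_weights([s1, s2, s3])  # toy corpus, for illustration only
for x, y in [(s1, s2), (s1, s3), (s2, s3)]:
    print(round(cosine_sim(x, y), 3), round(cosine_sim(x, y, idf), 3))
```

In both modes the pair `s1`, `s2` should score well above the pairs involving `s3`. With idf weights, words that occur in every document (such as "this" and "is") contribute less, which pushes the unrelated pairs further down.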