Counting word and phrase frequencies with Python nltk

Published: 2020-12-20 12:33:41 | Category: Python | Source: compiled from the web

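I am using NLTK and trying to count the word phrases in a given document up to a certain length, along with the frequency of each phrase. I tokenize the string to get a list of tokens.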

from nltk.util import ngrams
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.collocations import *


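data = ["this", "is", "not", "a", "test", "this", "real", "test"]

# ngrams() yields each run of 2 consecutive tokens as a tuple
bigrams = ngrams(data, 2)

# Count how many times each bigram occurs
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1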

The code above gives output like this:

(('is','this'),1)
(('test',2)
(('a','test'),3)
(('this','is'),4)
(('is','not'),1)
(('real',2)
(('is','real'),2)
(('not','a'),3)

This is part of what I am looking for.

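My question is: is there a more convenient way to handle phrases of length 4 or 5 without duplicating this code and changing only the count variable?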

Solution

Since you tagged this nltk, here is how to do it with nltk's own methods, which offer more functionality than those in the standard Python collections.

from nltk import ngrams, FreqDist

# Build one frequency distribution per phrase length (2 through 5)
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))

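Each element of the dictionary all_counts is itself a dictionary of n-gram frequencies, keyed by the n-gram tuple. For example, you can get the five most common trigrams: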

all_counts[3].most_common(5)
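
For comparison with the standard Python collections mentioned above, here is a minimal sketch of the same multi-length count using only collections.Counter. The helper name ngram_counts is hypothetical (it is not part of nltk or the standard library), and the token list is the one from the question.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n staggered slices of the list yields each n-gram as a tuple;
    # zip stops at the shortest slice, so no partial n-grams are produced.
    return Counter(zip(*(tokens[i:] for i in range(n))))


data = ["this", "is", "not", "a", "test", "this", "real", "test"]

# One Counter per phrase length, mirroring the all_counts dictionary
all_counts = {n: ngram_counts(data, n) for n in (2, 3, 4, 5)}

# Counter also provides most_common(), so the lookup reads the same way
for n, counts in all_counts.items():
    print(n, counts.most_common(2))
```

This covers plain counting and most_common(). The nltk version may still be preferable when the extra FreqDist functionality is needed.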

(Editor: 李大同)
