
Python: How do I get bigram probabilities from text made up of separate sentences?

Published: 2020-12-17 17:38:38 | Category: Python | Source: compiled from the web

My text contains many sentences. How can I process it with nltk.ngrams?

Here is my code:

   import nltk
   from nltk.util import ngrams

   # raw holds the whole text as a single string
   sequence = nltk.tokenize.word_tokenize(raw)
   bigram = ngrams(sequence, 2)
   freq_dist = nltk.FreqDist(bigram)
   prob_dist = nltk.MLEProbDist(freq_dist)
   number_of_bigrams = freq_dist.N()

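However, the code above treats all the sentences as one sequence. The sentences are separate, and I assume the last word of one sentence has nothing to do with the first word of the next. How can I create bigrams for a text like this? I also need prob_dist and number_of_bigrams, both of which are derived from freq_dist.

There are similar questions, such as "What are ngram counts and how to implement using nltk?", but they mostly deal with a single sequence of words.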

Best answer
You can use the new nltk.lm module. Here is an example. First, get some data and tokenize it:

import os
import requests
import io #codecs

from nltk import word_tokenize,sent_tokenize 

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt',encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt','w',encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text.
tokenized_text = [list(map(str.lower,word_tokenize(sent))) 
              for sent in sent_tokenize(text)]

Then do the language modelling:

# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

n = 3
train_data,padded_sents = padded_everygram_pipeline(n,tokenized_text)

model = MLE(n)  # Train a 3-gram maximum likelihood estimation (MLE) model.
model.fit(train_data,padded_sents)
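
To see why this keeps sentences apart, here is a minimal plain-Python sketch of the idea behind the padding step. It is not NLTK's implementation: the helper name padded_ngrams and the toy sentences are made up for illustration, and it produces n-grams of a single order, whereas padded_everygram_pipeline yields every order from 1 up to n. Each sentence is padded with n-1 boundary symbols on each side and windowed on its own, so no n-gram can span two sentences.

```python
def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Pad one sentence with n-1 boundary symbols per side, then slide a window of size n."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


# Toy, already-tokenized sentences (illustrative data only).
sentences = [["language", "is", "never", "random"],
             ["it", "is", "not"]]

# Bigrams are built sentence by sentence, never across a boundary.
all_bigrams = [bg for sent in sentences for bg in padded_ngrams(sent, 2)]

for sent in sentences:
    print(padded_ngrams(sent, 2))
```

The pair ("random", "it") never appears in all_bigrams; the sentence boundary shows up as ("random", "</s>") and ("<s>", "it") instead.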

Get the counts:

model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language','is']]['never'] # i.e. Count('never'|'language is')

Get the probabilities:

model.score('is','language'.split())  # P('is'|'language')
model.score('never','language is'.split())  # P('never'|'language is')
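
For reference, a maximum likelihood score of this kind is a relative frequency over the stored counts. As a sketch in my own notation (not taken from the answer), with $C(h, w)$ the number of times word $w$ was seen after context $h$, and assuming out-of-vocabulary words have already been mapped to the unknown token:

```latex
P_{\mathrm{MLE}}(w \mid h) = \frac{C(h, w)}{\sum_{w'} C(h, w')}
```

For example, P('is'|'language') is the count of 'language is' divided by the total count of all bigrams that start with 'language'. A context that was never seen has no counts to normalize by, so an unsmoothed model should not be expected to give a useful estimate there.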

The Kaggle platform has some kinks when loading the notebook, but once it loads, this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
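
If you want to stay with the FreqDist-based variables from the question (freq_dist, prob_dist, number_of_bigrams), one possible sketch is to build the bigrams inside each sentence and chain them into a single stream before counting. The toy sentences below are illustrative and assumed to be already tokenized; with real text you would get them from sent_tokenize and word_tokenize as shown earlier.

```python
from itertools import chain

from nltk.probability import FreqDist, MLEProbDist
from nltk.util import ngrams

# Toy, already-tokenized sentences (illustrative data, not from the question).
tokenized_sents = [["language", "is", "never", "random"],
                   ["language", "is", "fun"]]

# Build bigrams within each sentence, then chain them into one stream,
# so no bigram spans a sentence boundary.
sentence_bigrams = chain.from_iterable(ngrams(sent, 2) for sent in tokenized_sents)

freq_dist = FreqDist(sentence_bigrams)
prob_dist = MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

print(number_of_bigrams)                     # 5 bigrams in total (3 + 2)
print(freq_dist[("language", "is")])         # 2
print(prob_dist.prob(("language", "is")))    # 2 / 5 = 0.4
print(freq_dist[("random", "language")])     # 0, the boundary pair is never counted
```

Note that prob_dist here gives the relative frequency of a whole bigram among all bigrams, not the conditional probability of a word given the previous word; for the latter, the nltk.lm model above is the more direct tool. If you also want sentence-boundary symbols, ngrams accepts padding arguments such as pad_left, pad_right, left_pad_symbol and right_pad_symbol.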
