Exactly replicating R text preprocessing in Python

Published: 2020-12-16 23:53:48 | Category: Python | Source: compiled from the web

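I would like to preprocess a corpus of documents in Python the same way I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus that matches the one produced by the following R code: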

library(tm)
library(SnowballC)

corpus = tm_map(corpus,tolower)
corpus = tm_map(corpus,removePunctuation)
corpus = tm_map(corpus,removeWords,c("myword",stopwords("english")))
corpus = tm_map(corpus,stemDocument)

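Is there a simple or direct way, preferably pre-built, to do this in Python? Is there any way to guarantee exactly the same results?

For example, I would like to preprocess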

@Apple ear pods are AMAZING! Best sound from in-ear headphones I’ve
ever had!

into

ear pod amaz best sound inear headphon ive ever
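
To make the individual steps concrete, here is a minimal pure-Python sketch of the first three stages (lowercasing, punctuation removal, word removal). It is not a drop-in replacement for tm: the `preprocess` helper and the tiny `STOPWORDS` set are hypothetical stand-ins (tm's `stopwords("english")` list is much longer), curly quotes are treated as punctuation by assumption, and stemming is left out because it needs a separate stemmer.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# tm's stopwords("english") contains many more entries.
STOPWORDS = {"are", "from", "in", "i", "had", "the", "a"}

def preprocess(text, extra_words=("apple",)):
    """Lowercase, strip punctuation, and drop unwanted words (no stemming)."""
    text = text.lower()
    # Assumption: remove ASCII punctuation plus curly quotes.
    drop = string.punctuation + "\u2018\u2019\u201c\u201d"
    text = text.translate(str.maketrans("", "", drop))
    remove = STOPWORDS | set(extra_words)
    return " ".join(w for w in text.split() if w not in remove)

sample = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I’ve ever had!"
print(preprocess(sample))
# ear pods amazing best sound inear headphones ive ever
```

Applying a Snowball-style English stemmer to each remaining word would be the last step needed to get from this output to the target line shown above, but whether a given Python stemmer agrees with SnowballC on every word is exactly the part that is hard to guarantee.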

Best answer

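Getting nltk and tm to behave exactly the same across the preprocessing steps seems tricky, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python: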

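import rpy2.robjects as ro

# Run the whole tm pipeline inside R and pull the processed documents back as Python strings.
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]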

You can then load the result into scikit-learn. The only thing needed to make CountVectorizer and DocumentTermMatrix agree is to drop terms shorter than 3 characters:

from sklearn.feature_extraction.text import CountVectorizer

def mytokenizer(x):
    # Split on whitespace and keep only tokens of 3 or more characters.
    return [y for y in x.split() if len(y) > 2]

# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix ...>

Let's verify that this matches R:

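tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)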
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents,3289 terms)
# 
# Non-/sparse entries: 8980/3875329
# Sparsity           : 100%
# Maximal term length: 115 
# Weighting          : term frequency (tf)

sparse = removeSparseTerms(dtm,0.995)
sparse
# A document-term matrix (1181 documents,309 terms)
# 
# Non-/sparse entries: 4669/360260
# Sparsity           : 99%
# Maximal term length: 20 
# Weighting          : term frequency (tf)

As you can see, the number of stored elements and the number of terms now match exactly between the two approaches.
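
For the `removeSparseTerms(dtm, 0.995)` step, a rough pure-Python sketch of the filtering rule may help. It assumes that tm keeps a term only when the number of documents containing it is strictly greater than `n_docs * (1 - sparse)`; that reading of tm's behaviour is an assumption here, not something confirmed by the text above. The `tokenize` and `frequent_terms` helpers and the toy `docs` list are hypothetical names for illustration.

```python
from collections import Counter

def tokenize(doc):
    # Same rule as mytokenizer: whitespace split, keep tokens longer than 2 characters.
    return [t for t in doc.split() if len(t) > 2]

def frequent_terms(docs, sparse=0.995):
    """Return the terms whose document frequency exceeds len(docs) * (1 - sparse)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term at most once per document
    threshold = len(docs) * (1 - sparse)
    return sorted(t for t, n in df.items() if n > threshold)

docs = ["ear pod amaz", "pod sound", "best sound ever", "pod"]
print(frequent_terms(docs, sparse=0.5))
# ['pod']
```

In scikit-learn, the closest built-in knob is the `min_df` parameter of `CountVectorizer` (a float is read as a proportion of documents). It keeps terms at or above the cutoff rather than strictly above it, so the two rules can differ when a term sits exactly on the boundary.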


(Editor: 李大同)