Big text corpus breaks tm_map

Published: 2020-12-14 18:48:34 | Category: Resources | Source: compiled from the web

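I have been breaking my head over this for the last few days. I searched all the SO archives and tried the suggested solutions, but I just can't get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99, etc., and I want to run some basic text mining operations on them, such as creating a document-term matrix and a term-document matrix and doing some analysis based on word co-location. My script works on a smaller corpus, but it fails when I try it on a bigger one. I have pasted the code for one such folder below.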

library(tm) # Framework for text mining.
library(SnowballC) # Provides wordStem() for stemming.
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(magrittr)
library(Rgraphviz)
library(directlabels)

setwd("/ConvertedText")
txt <- file.path("2000 -06")

docs<-VCorpus(DirSource(txt,encoding = "UTF-8"),readerControl = list(language = "UTF-8"))
docs <- tm_map(docs,content_transformer(tolower),mc.cores=1)
docs <- tm_map(docs,removeNumbers,removePunctuation,stripWhitespace,removeWords,stopwords("SMART"),stopwords("en"),mc.cores=1)
#corpus creation complete

setwd("/ConvertedText/output")
dtm<-DocumentTermMatrix(docs)
tdm<-TermDocumentMatrix(docs)
m<-as.matrix(dtm)
write.csv(m,file="dtm.csv")
dtms<-removeSparseTerms(dtm,0.2)
m1<-as.matrix(dtms)
write.csv(m1,file="dtms.csv")
# matrix creation/storage complete

freq <- sort(colSums(as.matrix(dtm)),decreasing=TRUE)
wf <- data.frame(word=names(freq),freq=freq)
freq[1:50]
#adjust freq score in next line
p <- ggplot(subset(wf,freq>100),aes(word,freq))+ geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))
ggsave("frequency2000-06.png",height=12,width=17,dpi=72)
# frequency graph generated


x<-as.matrix(findFreqTerms(dtm,lowfreq=1000))
write.csv(x,file="freqterms00-06.csv")
png("correlation2000-06.png",width=12,units="in",res=900)
graph.par(list(edges=list(col="lightblue",lty="solid",lwd=0.3)))
graph.par(list(nodes=list(col="darkgreen",lty="dotted",lwd=2,fontsize=50)))
plot(dtm,terms=findFreqTerms(dtm,lowfreq=1000)[1:50],corThreshold=0.7)
dev.off()
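
A side note on the cleaning step in the script above: as far as I can tell from the tm documentation, tm_map applies one transformation function per call, and any further arguments are passed on to that function rather than treated as additional transformations. The sketch below shows the one-call-per-step form on a small in-memory corpus. The two sample sentences are made up for illustration and stand in for the files read by DirSource(); this is not a confirmed fix for the error described in this question.

```r
library(tm)

# Hypothetical two-document corpus (illustration only).
docs <- VCorpus(VectorSource(c("The 3 Cats, and the 2 DOGS!",
                               "Text   mining in 2006: a TEST.")))

# One transformation per tm_map() call; extra arguments go to that function.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("SMART"))
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, stripWhitespace)

# Collect the cleaned text of both documents for inspection.
cleaned <- c(content(docs[[1]]), content(docs[[2]]))
print(cleaned)
```

Whether an mc.cores argument is accepted by tm_map depends on the tm version, so it is left out of this sketch.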

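When I use the mc.cores = 1 argument in tm_map, the operation runs indefinitely. If I use the lazy = TRUE argument in tm_map instead, it appears to run fine, but subsequent operations fail with this error: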

Error in UseMethod("meta",x) : 
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(x$content[i],function(d) tm_reduce(d,x$lazy$maps)) :
  all scheduled cores encountered errors in user code
2: In mclapply(unname(content(x)),termFreq,control) :
  all scheduled cores encountered errors in user code

I have been looking everywhere for a solution but have failed every time. Any help would be greatly appreciated!

Best!

Answer

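I found a solution that works.

Background / debugging steps

I tried several things that did not work:

>Wrapping functions in content_transformer: in some of the tm_map calls, in all of them, and in just one (tolower)
>Adding lazy = TRUE to tm_map
>Trying some parallel computing packages

It failed for two of my scripts but worked every time for a third. The code in all three scripts is identical; only the size of the .rda file I load differs. The data structure is also the same in all three.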

>Dataset 1: size 493.3 KB = error
>Dataset 2: size 630.6 KB = error
>Dataset 3: size 300.2 KB = works!

Very strange.
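
One possible explanation, consistent with the fix further down, is that the failing datasets contain bytes that are not valid UTF-8, so what matters is the content of the files rather than their size. This is an assumption; the post does not confirm it. A minimal base-R sketch for locating such documents is shown here. The three sample strings and their names are hypothetical, and validUTF8() requires R 3.3.0 or later, which is newer than the R 3.1.2 session listed below.

```r
# Hypothetical documents: "b" contains a lone Latin-1 byte (0xEF),
# while "c" is a valid UTF-8 encoding of non-ASCII text.
texts <- c(a = "plain ascii",
           b = "na\xefve",
           c = "gr\xc3\xbc\xc3\x9fe")

# validUTF8() returns one logical per string; FALSE marks invalid byte sequences.
is_valid <- validUTF8(texts)

# Names of the documents that would need re-encoding before text processing.
bad_docs <- names(texts)[!is_valid]
print(bad_docs)
```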

My sessionInfo() output:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] snowfall_1.84-6    snow_0.3-13        Snowball_0.0-11    RWekajars_3.7.11-1 rJava_0.9-6              RWeka_0.4-23      
[7] slam_0.1-32        SnowballC_0.5.1    tm_0.6             NLP_0.1-5          twitteR_1.1.8      devtools_1.6      

loaded via a namespace (and not attached):
[1] bit_1.1-12     bit64_0.9-4    grid_3.1.2     httr_0.5       parallel_3.1.2 RCurl_1.95-4.3    rjson_0.2.14   stringr_0.6.2 
[9] tools_3.1.2

Solution

I added this line right after loading the data, and now everything works:

MyCorpus <- tm_map(MyCorpus,content_transformer(function(x) iconv(x,to='UTF-8-MAC',sub='byte')),mc.cores=1)

I found the hint here: http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/ (the author updated his code on 26 November 2014 because of this error).
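
To illustrate what the iconv call in that line does, here is a small base-R sketch on a single string instead of a corpus. It assumes the input is meant to be UTF-8 and uses the portable target name 'UTF-8'; to my understanding, 'UTF-8-MAC' is specific to the iconv shipped with macOS and may not be available on other platforms. The sample string is made up.

```r
# Hypothetical input: "caf" followed by the Latin-1 byte 0xE9,
# which is not a valid UTF-8 sequence on its own.
raw_text <- "caf\xe9"
print(validUTF8(raw_text))

# sub = "byte" replaces each byte that cannot be converted with its
# hex code in angle brackets instead of returning NA for the string.
clean_text <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "byte")
print(clean_text)
print(validUTF8(clean_text))
```

After this step the string contains only valid UTF-8, so functions such as tolower should no longer stop on an invalid multibyte sequence.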
