R text mining: counting how many times specific words appear in a corpus?
Published: 2020-12-14 03:47:31 | Category: Big Data | Source: aggregated from the web
I have seen this question answered in other languages, but not in R.

[Specifically for R text mining] I have a set of frequent phrases obtained from one corpus. Now I would like to search for the number of times these phrases appear in another corpus. Is there a way to do this in the tm package (or another related package)? For example, say I have an array of phrases, "tags", obtained from CorpusA. Another corpus, CorpusB, has a couple of thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB. As always, I appreciate your help!

Solution
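To the direct question of whether tm itself can do this: for single-word terms, `DocumentTermMatrix` accepts a `dictionary` entry in its `control` list that restricts counting to a supplied term list. A minimal sketch (the `tags` and documents below are illustrative stand-ins, not data from the thread; multi-word phrases would additionally need a custom n-gram tokenizer):

```r
library(tm)  # install.packages("tm") if needed

# Toy stand-ins for "tags" from CorpusA and the documents of CorpusB
tags    <- c("cat", "sat", "dog")
corpusB <- VCorpus(VectorSource(c("the cat sat",
                                  "the cat ran and the dog sat")))

# Count only the dictionary terms across the documents of CorpusB
dtm <- DocumentTermMatrix(corpusB, control = list(dictionary = tags))
colSums(as.matrix(dtm))   # total occurrences of each tag in CorpusB
```

This keeps everything inside tm's own pipeline, so the usual preprocessing (lowercasing, minimum word length) applies before the dictionary filter.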
Not perfect, but this should get you started.
    # User-defined function: lowercase, strip punctuation/digits, trim whitespace.
    # The original answer relied on Trim() from the qdap package; a base-R
    # equivalent is defined here.
    Trim <- function(x) gsub("^\\s+|\\s+$", "", x)

    strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE) {
        strp <- function(x, digit.remove, apostrophe.remove) {
            x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
            x2 <- if (apostrophe.remove) gsub("'", "", x2) else x2
            ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2)
        }
        unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
            apostrophe.remove = apostrophe.remove))))
    }

    #==================================================================
    # Create 2 'corpus' documents (you'd have to actually do all this in tm)
    corpus1 <- 'I have seen this question answered in other languages but not
        in R. [Specifically for R text mining] I have a set of frequent phrases
        that is obtained from a Corpus. Now I would like to search for the number
        of times these phrases have appeared in another corpus. Is there a way to
        do this in TM package? (Or another related package) For example, say I
        have an array of phrases, "tags" obtained from CorpusA. And another
        Corpus, of couple thousand sub texts. I want to find out how many times
        each phrase in tags have appeared in CorpusB. As always, I appreciate
        all your help!'
    corpus2 <- "What have you tried? If you have seen it answered in another
        language, why don't you try translating that language into R?
        – Eric Strom 2 hours ago
        I am not a coder, otherwise would do. I just do not know a way to do
        this. – appletree 1 hour ago
        Could you provide some example? or show what you have in mind for input
        and output? or a pseudo code? As it is I find the question a bit too
        general. As it sounds I think you could use regular expressions with
        grep to find your 'tags'. – AndresT 15 mins ago"

    #==================================================================
    # Clean up the text: collapse newlines/tabs, then repeated spaces
    corpus1 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus1))
    corpus2 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus2))
    corpus1.wrds <- as.vector(unlist(strsplit(strip(corpus1), " ")))
    corpus2.wrds <- as.vector(unlist(strsplit(strip(corpus2), " ")))

    # Create frequency tables for each corpus
    corpus1.Freq <- data.frame(table(corpus1.wrds))
    corpus1.Freq$corpus1.wrds <- as.character(corpus1.Freq$corpus1.wrds)
    corpus1.Freq <- corpus1.Freq[order(-corpus1.Freq$Freq), ]
    rownames(corpus1.Freq) <- 1:nrow(corpus1.Freq)

    # Key words (those appearing more than twice in corpus 1) to match on corpus 2
    key.terms <- corpus1.Freq[corpus1.Freq$Freq > 2, 'corpus1.wrds']

    corpus2.Freq <- data.frame(table(corpus2.wrds))
    corpus2.Freq$corpus2.wrds <- as.character(corpus2.Freq$corpus2.wrds)
    corpus2.Freq <- corpus2.Freq[order(-corpus2.Freq$Freq), ]
    rownames(corpus2.Freq) <- 1:nrow(corpus2.Freq)

    # Match key words to the words in corpus 2
    corpus2.Freq[corpus2.Freq$corpus2.wrds %in% key.terms, ]
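Note that the snippet above compares word-frequency tables, so it matches single words only, while the question asked about multi-word phrases. A minimal base-R sketch of direct phrase counting (`tags` and `corpusB` here are illustrative stand-ins; matching is by substring, so add word boundaries via regex if you need exact tokens):

```r
# Hypothetical inputs: phrases mined from CorpusA, documents of CorpusB
tags    <- c("text mining", "corpus")
corpusB <- c("R text mining uses a corpus.",
             "A corpus is a collection of documents for text mining.")

# Count fixed-string occurrences of one phrase across all documents;
# gregexpr returns -1 for documents with no match, so count positives only
count_phrase <- function(phrase, docs) {
  hits <- gregexpr(phrase, tolower(docs), fixed = TRUE)
  sum(vapply(hits, function(m) sum(m > 0), integer(1)))
}

sapply(tags, count_phrase, docs = corpusB)
```

Because this works on the raw strings rather than a tokenized term table, it handles phrases of any length without changing the tokenizer.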