R text mining – how to change texts in an R data frame column into several columns with bigram frequencies

Published: 2020-12-14 03:46:43 | Category: Big Data | Source: compiled from the web


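Following up on the question "R Text mining – how to change texts in R data frame column into several columns with word frequencies?", I would like to know how to create columns with bigram frequencies instead of word frequencies.
Again, many thanks in advance!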

Here is the sample data frame (thanks to Tyler Rinker).

       person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

The data set above:

library(qdap); DATA

Solution

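The development version of qdap (which should go to CRAN within the next few days) generates ngrams. For now you need to use the dev version. On the toy data set this is fast, but on a larger data set, such as qdap's mraja1 data set, it takes about 5 minutes to complete. You could: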

> Be more selective about the bigrams (i.e., don't use all of them, as there will be a ton)
> Wait
> Run it in parallel
> Find another way to do this
> Get a faster computer
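As one rough illustration of the "find another way" option, per-row bigram counts can also be sketched in base R without qdap. This is not the qdap approach. The helper name `bigrams_of` is hypothetical, the three strings are copied from the sample table, and the sketch assumes bigrams may span sentence boundaries and that punctuation other than apostrophes can be dropped.

```r
# Stand-in for DATA$state: the first three rows of the sample table
state <- c("Computer is fun. Not too fun.",
           "No it's not, it's dumb.",
           "What should we do?")

# Hypothetical helper: split one string into lower-case word bigrams.
# Assumption: punctuation (except apostrophes) is replaced by spaces,
# so bigrams can cross sentence boundaries.
bigrams_of <- function(x) {
  cleaned <- tolower(gsub("[^[:alpha:]' ]", " ", x))
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

per_row <- lapply(state, bigrams_of)
vocab   <- sort(unique(unlist(per_row)))

# One row per text, one column per bigram, cells hold raw counts
counts <- t(vapply(per_row,
                   function(b) tabulate(match(b, vocab), nbins = length(vocab)),
                   integer(length(vocab))))
colnames(counts) <- vocab

counts
```

The resulting matrix can be attached to the original data frame with `cbind()` if the counts are wanted as extra columns.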

Here is the code to get the dev version of qdap and run the bigram search:

library(devtools)
## recent devtools expects "user/repo" (the two-argument form is from older releases)
install_github("trinker/qdap")
library(qdap)

## this gets the bigrams
bigrams <- sapply(ngrams(DATA$state)[[c("all_n", "n_2")]], paste, collapse = " ")

## This searches by grouping variable for bigram use
termco(DATA$state, DATA$person, bigrams)


## To get raw values (termco's second argument is the grouping variable)
termco(DATA$state, DATA$person, bigrams)[["raw"]]
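
The "be more selective about the bigrams" option from the list above could look roughly like the following: keep only bigrams that occur at least some minimum number of times before searching for them. This is a sketch, not part of qdap. The vector `all_bigrams` and the threshold `min_count` are hypothetical, and the input is assumed to hold one entry per bigram occurrence.

```r
# Hypothetical input: one entry per bigram occurrence in the corpus
all_bigrams <- c("it is", "is fun", "it is", "we do", "is fun", "it is")

# Assumed cut-off; tune it to the size of the data set
min_count <- 2

freq <- table(all_bigrams)
keep <- names(freq)[freq >= min_count]

keep
```

The shorter `keep` vector could then be passed to `termco()` in place of the full `bigrams` vector, which should reduce the number of terms searched on large data sets.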

(Editor: 李大同)
