R: Text Mining Study Notes 1 - the tm Package

Published: 2020-12-14 02:50:55 | Category: Big Data | Source: collected from the web


1. Reading and Inspecting Files

DirSource()

Corpus()

inspect()

tm provides five source types for reading data:

> getSources()
[1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"    "XMLSource"

For .txt files, the most common approach when starting out is to read directly from a folder with DirSource().

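DirSource() collects the paths of all files in a folder; Corpus() then reads those paths together with the content behind them and builds the corpus.

The result of Corpus() is a list-like Corpus collection in which each file name maps to one document's content, so documents can be accessed by index.

After assigning the result of Corpus() to a variable, say "docs", typing docs or a subset such as docs[1] does not display the document content; you must use inspect() to view the text.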
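The reading step described above can be sketched as follows. The folder name tm_demo and the two sample files are hypothetical and exist only for this illustration; VCorpus() is used here as an explicit variant of Corpus(), and as.character() is shown as one additional way to get at the raw text.

```r
library(tm)

# Hypothetical demo folder with two small .txt files (not from the original notes)
demo_dir <- file.path(tempdir(), "tm_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("The quick brown fox.", file.path(demo_dir, "a.txt"))
writeLines("Jumps over the lazy dog.", file.path(demo_dir, "b.txt"))

# DirSource() collects the file paths; VCorpus() reads them into a corpus
docs <- VCorpus(DirSource(demo_dir, pattern = "\\.txt$"))

length(docs)              # number of documents in the corpus
meta(docs[[1]], "id")     # the file name is stored as the document id
inspect(docs[1])          # inspect() view of a one-document subset
as.character(docs[[1]])   # raw text of the first document
```

Single brackets (docs[1]) return a smaller corpus, while double brackets (docs[[1]]) return one document.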

-+-

2. Preprocessing the Corpus

removeNumbers()

removePunctuation()

removeWords()

stemDocument()

stripWhitespace()

tm_map()

content_transformer()

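2.1 Built-in Corpus Transformations

The tm package ships with five transformation functions:

> getTransformations()
[1] "removeNumbers"     "removePunctuation" "removeWords"       "stemDocument"
[5] "stripWhitespace"

removeNumbers	removes all digits
removePunctuation	removes all punctuation
removeWords	removes the specified words; supply your own list or use the built-in stopwords()
stemDocument	reduces words to their stems
stripWhitespace	collapses extra whitespace

These five transformations can be applied to a corpus directly with tm_map(), for example: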

> docs <- tm_map(docs,stemDocument)

If the transformation takes arguments, pass them to tm_map() right after the function name, for example:

> docs <- tm_map(docs,removeWords,c("can","finally"))
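As a small sketch of the built-in transformations in action, the two sentences below are made-up sample text, and VCorpus(VectorSource(...)) is used only to get a corpus without needing files on disk. stemDocument is left out because it additionally relies on the SnowballC package.

```r
library(tm)

# Made-up sample text, for illustration only
docs <- VCorpus(VectorSource(c("I can finally read 3 books!",
                               "Finally,   2020 is over.")))

docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeWords, c("can", "finally"))   # extra argument follows the function
docs <- tm_map(docs, stripWhitespace)                    # collapse runs of spaces

as.character(docs[[1]])
as.character(docs[[2]])
```

removeWords matches case-sensitively, so the capitalized "Finally" in the second sentence survives; this is one reason lowercasing is often applied before removing words.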

2.2 Transformations Not Built into tm

To use a function from outside tm, or one you wrote yourself, wrap it with content_transformer() and then pass it to tm_map(). For example:

> toSpace <- content_transformer(function(x,pattern) gsub(pattern," ",x))
> docs <- tm_map(docs,toSpace,"/")

Or:

> docs <- tm_map(docs,content_transformer(tolower))

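tm_map() can also apply transformations that take several arguments:

> toStr <- content_transformer(function(x,from,to) gsub(from,to,x))
> docs <- tm_map(docs,toStr,"directory","folder")
> docs <- tm_map(docs,toStr,"topics","subject")

Cleaning a corpus does not require all of the transformations above. Usually several of them are combined, or you build your own word list for excluding and filtering terms.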
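One way to combine several of these steps is to wrap them in a helper. The function name clean_corpus, its extra_stopwords parameter, and the sample sentence are all hypothetical; this is a sketch of the idea rather than a recommended pipeline.

```r
library(tm)

# Hypothetical helper: chains a few transformations and a custom stop-word list
clean_corpus <- function(corpus, extra_stopwords = character(0)) {
  to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  corpus <- tm_map(corpus, to_space, "[/@|]")               # separators become spaces
  corpus <- tm_map(corpus, content_transformer(tolower))    # base R function, wrapped
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("english"), extra_stopwords))
  tm_map(corpus, stripWhitespace)
}

# Made-up sample sentence
docs <- VCorpus(VectorSource("The Man/Woman said: one of THE topics."))
docs <- clean_corpus(docs, extra_stopwords = c("said", "one"))
trimws(as.character(docs[[1]]))
```

Lowercasing happens before removeWords so that "The" and "THE" are both matched by the lowercase stop-word list.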

-+-

3. Term Frequency Matrices

DocumentTermMatrix()

TermDocumentMatrix()

removeSparseTerms()

findFreqTerms()

findAssocs()

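3.1 Creating a Term Frequency Matrix

From the preprocessed corpus, build a matrix whose two dimensions are the individual terms and the file names. Either the file names or the terms can serve as row names, and each cell holds the frequency of one term in one file. The two layouts correspond to two constructors: DocumentTermMatrix() puts documents in rows, and TermDocumentMatrix() puts terms in rows. I personally prefer the second.

> dtm <- DocumentTermMatrix(docs)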

> tdm <- TermDocumentMatrix(docs)
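To see that the two constructors differ only in orientation, here is a minimal sketch on a made-up two-document corpus (the sample strings are assumptions for illustration):

```r
library(tm)

# Made-up two-document corpus
docs <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))

dtm <- DocumentTermMatrix(docs)   # rows = documents, columns = terms
tdm <- TermDocumentMatrix(docs)   # rows = terms, columns = documents

dim(dtm)
dim(tdm)

# Transposing one layout gives the other
same <- all(as.matrix(t(dtm)) == as.matrix(tdm))
same
```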

3.2 Exploring the Term Frequency Matrix

After creating the matrix, first check its class.

> class(tdm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

As you can see, tdm is not a plain R matrix, so viewing its content still requires inspect():

> inspect(tdm[1:10,1:3])
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 8/22
Sparsity           : 73%
Maximal term length: 9
Weighting          : term frequency (tf)

           Docs
Terms       a.txt b.txt c.txt
  aback         0     0     3
  abaht         0     0     0
  abaiss        0     0     0
  abandon      21     0    11
  abarbarea     0     0     0
  abas          2     0     1
  abash         3     0     4
  abat          1     0     0
  abati         0     0     0
  abattoir      0     0     0

If you would rather not use that function, tdm must first be converted to an ordinary R matrix:

> as.matrix(tdm)[1:10,1:3]
           Docs
Terms       a.txt b.txt c.txt
  aback         0     0     3
  abaht         0     0     0
  abaiss        0     0     0
  abandon      21     0    11
  abarbarea     0     0     0
  abas          2     0     1
  abash         3     0     4
  abat          1     0     0
  abati         0     0     0
  abattoir      0     0     0

Likewise, to do arithmetic on tdm, convert it to an ordinary matrix first.

To see how often each word occurs, sum the frequencies in each row and sort:

> freq <- rowSums(as.matrix(tdm))
> head(sort(freq,decreasing = TRUE))
said  one  man will like look 
9231 8485 4565 4518 4392 4172 

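The ranking shows that said and one are the most frequent words, which has to do with the source data being novels. These two words are of little use to us, though; in fact none of the top six are, so further filtering is needed. That is a topic for later.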
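For a large matrix, as.matrix() produces a dense copy that can take a lot of memory. As a hedged alternative, the slam package (which tm builds on for its sparse matrices) provides row_sums(), which works on the sparse object directly. The three-document corpus below is made up for illustration.

```r
library(tm)
library(slam)

# Made-up corpus in which "said" is the most frequent term
docs <- VCorpus(VectorSource(c("said one man", "said one", "said")))
tdm <- TermDocumentMatrix(docs)

freq_dense  <- rowSums(as.matrix(tdm))   # approach from the notes: dense conversion first
freq_sparse <- row_sums(tdm)             # slam: sums computed on the sparse matrix

head(sort(freq_sparse, decreasing = TRUE))
```

Both vectors hold the same totals; only the intermediate memory use differs.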

You can also look at the distribution of frequencies:

> head(table(freq),10)
freq
    1     2     3     4     5     6     7     8     9    10 
12638  3447  1798  1305   874   721   597   499   374   330 

This result shows that 12638 terms occur only once, 3447 occur twice, and so on.

Such terms are not much use to us either, so we can remove the words that occur rarely.

> dim(tdm)
[1] 29821    10
> tdms <- removeSparseTerms(tdm,0.1)
> dim(tdms)
[1] 1309   10

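After trimming, the number of terms in the matrix dropped from 29821 to 1309. The second argument of removeSparseTerms() is a number in the open interval (0,1); the smaller the value, the fewer terms are kept. Note that the threshold applies to sparsity, that is, the share of documents in which a term does not occur, rather than to total frequency: with 10 documents and a threshold of 0.1, only terms that occur in all 10 documents remain.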
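The effect of the threshold is easier to see on a tiny example. The four-document corpus below is made up so that each term appears in a different number of documents: alpha in 4, beta in 2, gamma and delta in 1 each.

```r
library(tm)

# Made-up corpus: alpha is in 4 documents, beta in 2, gamma and delta in 1
docs <- VCorpus(VectorSource(c("alpha beta gamma",
                               "alpha beta",
                               "alpha",
                               "alpha delta")))
tdm <- TermDocumentMatrix(docs)

strict <- Terms(removeSparseTerms(tdm, 0.1))   # very low tolerance for sparsity
medium <- Terms(removeSparseTerms(tdm, 0.6))
loose  <- Terms(removeSparseTerms(tdm, 0.9))   # tolerates terms missing from most documents

strict
medium
loose
```

A smaller threshold keeps only terms that are present in nearly every document; a larger one keeps progressively rarer terms.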

The corpus can also be mined through term correlations and term frequencies:

> findFreqTerms(tdms,lowfreq = 3500)
 [1] "come"  "know"  "like"  "littl" "look"  "man"   "now"   "one"   "said"  "say"   "time" 
[12] "will" 
> findAssocs(tdms,"ill",corlimit = 0.9)
         ill
behav   0.92
sake    0.92
pretend 0.91

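As shown, 12 terms occur at least 3500 times in the corpus, and 3 terms have a correlation with ill above 0.9: behav, sake, and pretend. behav is a stem left over from stemDocument() and presumably corresponds to behave.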
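Both functions can be tried on a corpus small enough to check by hand. The sample documents are made up; in them, umbrella only occurs together with rain, so the two terms should be positively correlated across documents.

```r
library(tm)

# Made-up corpus: "umbrella" only appears in documents that also contain "rain"
docs <- VCorpus(VectorSource(c("rain umbrella rain",
                               "rain umbrella",
                               "sun",
                               "sun hat rain")))
tdm <- TermDocumentMatrix(docs)

frequent <- findFreqTerms(tdm, lowfreq = 3)               # terms with total count >= 3
assoc    <- findAssocs(tdm, "umbrella", corlimit = 0.5)   # correlation of per-document counts

frequent
assoc
```

findAssocs() correlates the per-document count vectors of two terms, so a high value means the terms tend to rise and fall together across documents, not that they are adjacent in the text.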

A dictionary can also be supplied when building a matrix from the Corpus, so that only the listed terms are counted:

> inspect(DocumentTermMatrix(docs,list(dictionary = c("behav","ill","pretend"))))
<<DocumentTermMatrix (documents: 10, terms: 3)>>
Non-/sparse entries: 30/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)

       Terms
Docs    behav ill pretend
  a.txt    20  84      21
  b.txt    15  60      13
  c.txt    11  61      22
  d.txt    11  58      18
  e.txt     1  10       2
  f.txt     3  41       4
  g.txt    12  56      12
  h.txt     3  14       2
  i.txt     3   2       1
  j.txt    14  56      17

Of course, the same result can be obtained another way:

> dtms <- t(as.matrix(tdms))
> dtms[,which(colnames(dtms) %in% c("behav","ill","pretend"))]
       Terms
Docs    behav ill pretend
  a.txt    20  84      21
  b.txt    15  60      13
  c.txt    11  61      22
  d.txt    11  58      18
  e.txt     1  10       2
  f.txt     3  41       4
  g.txt    12  56      12
  h.txt     3  14       2
  i.txt     3   2       1
  j.txt    14  56      17
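The equivalence of the two approaches can be checked on a made-up corpus. Here the dictionary is passed through the named control argument, which is the same list given positionally above; the sample documents are assumptions for illustration.

```r
library(tm)

# Made-up corpus for comparing the two ways of selecting terms
docs <- VCorpus(VectorSource(c("ill behaved", "pretend ill", "nothing here")))
wanted <- c("ill", "pretend")

# Way 1: restrict the matrix to a dictionary while building it
by_dict <- as.matrix(DocumentTermMatrix(docs, control = list(dictionary = wanted)))

# Way 2: build the full matrix, then select columns by name
full      <- as.matrix(DocumentTermMatrix(docs))
by_subset <- full[, wanted, drop = FALSE]

by_dict
by_subset
```

The dictionary route avoids building the full vocabulary, which matters more as the corpus grows.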

(Editor: 李大同)
