Using R to convert PDF files to text for text mining
Published: 2020-12-14 04:15:25 | Category: Big Data | Source: compiled from the web
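I have nearly a thousand PDF journal articles in a folder. I need to text-mine the abstracts of all the articles in that folder. Right now I am doing the following: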
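    dest <- "~/A1.pdf"

    # set path to pdftotext.exe and convert pdf to text
    exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
    system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = FALSE)

    # get txt-file name and open it
    filetxt <- sub(".pdf", ".txt", dest, fixed = TRUE)
    shell.exec(filetxt)

This converts one PDF file into one .txt file. I then copy the abstract into another .txt file and compile the results by hand, which is tedious.

I need code that reads every article in the folder and converts each one into a .txt file containing only that article's abstract. This could be done by keeping only the content between "Abstract" and "Introduction" in each article, but I have not been able to make that work. Any help is appreciated.

Solution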
Yes, this is not really an R question, as IShouldBuyABoat notes, but it is something R can do with only minor contortions...
Use R to convert the PDF files to txt files...

    # folder with 1000s of PDFs
    dest <- "C:\\Users\\Desktop"

    # make a vector of PDF file names
    # (pattern is a regex; anchor it so only names ending in .pdf match)
    myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

    # convert each PDF file that is named in the vector into a text file
    # text file is created in the same directory as the PDFs
    # note that my pdftotext.exe is in a different location to yours
    # wait = TRUE so every .txt file exists before the next step reads it
    lapply(myfiles, function(i)
      system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
                   paste0('"', i, '"')),
             wait = TRUE))

Extract only the abstracts from the txt files...

    # if you just want the abstracts, we can use regex to extract that part of
    # each txt file. Assumes that the abstract is always between the words
    # 'Abstract' and 'Introduction'
    mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
    abstracts <- lapply(mytxtfiles, function(i) {
      # read the file as whitespace-separated words and join into one string
      j <- paste0(scan(i, what = character()), collapse = " ")
      regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
    })

Write the abstracts to separate txt files...

    # write abstracts as txt files
    # (or use them in the list for whatever you want to do next)
    lapply(seq_along(abstracts), function(i)
      write.table(abstracts[i],
                  file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
                  quote = FALSE, row.names = FALSE, col.names = FALSE,
                  eol = " "))

Now you are ready to do some text mining on the abstracts. (Editor: 李大同)
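The abstract-extraction step can be tried without any PDFs or pdftotext installed. The sketch below is only an illustration and not part of the original answer: `extract_abstract`, the sample strings, and the output file names are all hypothetical. It applies the same lookaround regex to in-memory text, returns `NA` when the "Abstract" or "Introduction" marker is missing, and writes each abstract it finds to a temporary directory.

```r
# Hypothetical helper (not from the original answer): return the text between
# the first "Abstract" and the next "Introduction", or NA if nothing matches.
extract_abstract <- function(txt) {
  txt <- paste(txt, collapse = " ")
  m <- regmatches(txt, regexpr("(?<=Abstract).*?(?=Introduction)", txt, perl = TRUE))
  if (length(m) == 0) NA_character_ else trimws(m)
}

# Made-up stand-ins for the text that pdftotext would produce
sample_docs <- c(
  paper1 = "Some Title Abstract We study text mining of PDFs. Introduction Text mining is ...",
  paper2 = "Another Title with no marked sections"
)

# One abstract (or NA) per document, keeping the document names
abstracts <- vapply(sample_docs, extract_abstract, character(1))

# Write each abstract that was found to its own file in a temporary directory
out_dir <- tempfile("abstracts_")
dir.create(out_dir)
for (nm in names(abstracts)) {
  if (!is.na(abstracts[[nm]])) {
    writeLines(abstracts[[nm]], file.path(out_dir, paste0(nm, ".abstract.txt")))
  }
}

print(abstracts)
print(list.files(out_dir))
```

Returning `NA` rather than an empty result makes it easy to list the articles whose layout does not follow the assumed "Abstract ... Introduction" pattern (for example, headings in capitals) so they can be checked by hand.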