Using R to accept cookies in order to download a PDF file

Published: 2020-12-14 16:34:15 | Category: Resources | Source: compiled from the web


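I'm having trouble downloading a PDF.

For example, if I have a DOI for a PDF document on the Archaeology Data Service, it resolves to a landing page that contains an embedded link to the PDF, but that link in turn redirects to yet another URL.

library(httr) handles resolving the DOI, and we can use library(XML) to extract the PDF URL from the landing page, but I'm still stuck on getting the PDF itself.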
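The two steps that already work can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: `resolve_doi()` and `extract_pdf_links()` are hypothetical helper names, and the sketch assumes the PDF link on the landing page is an ordinary `<a href>` whose target ends in `.pdf`.

```r
# Hypothetical helper: resolve a DOI to its landing page URL.
# httr follows redirects, so resp$url holds the final address.
resolve_doi <- function(doi) {
  resp <- httr::GET(paste0("https://doi.org/", doi))
  httr::stop_for_status(resp)
  resp$url
}

# Hypothetical helper: return every href in an HTML string that ends in ".pdf".
extract_pdf_links <- function(html) {
  doc <- XML::htmlParse(html, asText = TRUE)
  hrefs <- XML::xpathSApply(doc, "//a[@href]", XML::xmlGetAttr, "href")
  hrefs <- as.character(unlist(hrefs))
  hrefs[grepl("\\.pdf$", hrefs, ignore.case = TRUE)]
}

# Offline example with a made-up page fragment:
page <- '<html><body>
  <a href="/archiveDS/archiveDownload?t=some/report.pdf">Download</a>
  <a href="/about/Cookies">Cookies</a>
</body></html>'
pdf_links <- extract_pdf_links(page)
print(pdf_links)
```

Extracting the link is the easy part; as the rest of the question shows, requesting that link directly does not return the PDF.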

If I do this:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

then I receive an HTML file identical to http://archaeologydataservice.ac.uk/myads/
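One quick way to tell whether a download really produced a PDF, rather than an HTML page saved under a `.pdf` name, is to look at the first bytes of the file: PDF files begin with the signature `%PDF`. The helper below is a small sketch of that check; `is_pdf()` is a hypothetical name, not part of any package.

```r
# Hypothetical helper: TRUE if the file starts with the "%PDF" signature.
is_pdf <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 4L)
  identical(magic, charToRaw("%PDF"))
}

# Offline demonstration with two temporary files.
real_pdf <- tempfile(fileext = ".pdf")
fake_pdf <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), real_pdf)    # starts like a PDF
writeLines("<?xml version='1.0'?>", fake_pdf)  # HTML/XML saved as .pdf

print(is_pdf(real_pdf))
print(is_pdf(fake_pdf))
```

Applied to the `tmp.pdf` from the `download.file()` call, a check like this would flag the file as not being a PDF. On Windows, passing `mode = "wb"` to `download.file()` is also worth doing so that binary files are not corrupted.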

Trying the answer from "How to use R to download a zipped file from a SSL page that requires cookies" led me to this:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes",t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms,body = values)
GET(download,query = values)

# Actually download the file (this will take a while)

resp <- GET(download,query = values)

# write the content of the download to a binary file

writeBin(content(resp,"raw"),"c:/temp/thefile.zip")

But after the POST and GET calls, I just get the HTML of the same cookie page that download.file returned:

> GET(download,query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
  Date: 2016-01-06 00:35
  Status: 200
  Content-Type: text/html;charset=UTF-8
  Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html; c...


            <title>Archaeology Data Service:  myADS</title>

            <link href="http://archaeologydataservice.ac.uk/css/u...
...
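The long `from=` value in the response URL above looks like a hex-encoded string. Decoding it is a cheap way to see what the server did with the request. The sketch below uses only base R; `hex_to_string()` is a hypothetical helper name, and reading the value as "the original request, to be resumed after the terms are accepted" is an inference from the decoded text, not something the site documents.

```r
# Hypothetical helper: turn a string of hex digit pairs into text.
hex_to_string <- function(hex) {
  starts <- seq(1, nchar(hex), by = 2)
  bytes <- strtoi(substring(hex, starts, starts + 1), base = 16L)
  rawToChar(as.raw(bytes))
}

from <- paste0(
  "2f6172636869766544532f61726368697665446f776e6c6f61643f6167726565",
  "3d79657326743d617263682d313335322d3125324664697373656d696e617469",
  "6f6e2532467064662532464479666564253246474c34343030342e706466"
)

decoded <- hex_to_string(from)
# The decoded text still contains percent-escapes such as %2F.
readable <- utils::URLdecode(decoded)
print(readable)
```

The decoded value starts with `/archiveDS/archiveDownload?agree=yes&t=arch-1352-1`, which suggests the GET request was bounced back to the copyright page with the original request attached, so the earlier POST did not register as an acceptance.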

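Looking at http://archaeologydataservice.ac.uk/about/Cookies, the cookie setup on this site seems complicated. This kind of cookie complexity does not appear to be unusual among UK data providers; see "automating the login to the uk data service website in R with RCurl or httr".

How can I use R to get past the cookies on this website?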

Solution

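Your plea on rOpenSci has been heard!

There is a lot of JavaScript between those pages, which makes trying to decipher the flow via httr + rvest somewhat annoying. Try RSelenium instead. The following worked on OS X 10.11.2 with R 3.2.3 and Firefox installed.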

library(RSelenium)

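# check if a server is present; if not, get a server
# (later RSelenium releases retired checkForServer()/startServer()
# in favour of rsDriver(); this code targets the older API)
checkForServer()

# get the server going
startServer()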

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector","a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector","a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

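Now wait for the download to finish. The R console is not blocked while the download runs, so it is easy to accidentally close the session before the download completes.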
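Rather than waiting by eye, the script can poll the download directory. The sketch below is one possible approach, not part of the original answer: `wait_for_download()` is a hypothetical helper, and it assumes Firefox keeps an in-progress download in a temporary file ending in `.part` next to the target file.

```r
# Hypothetical helper: block until a file matching `pattern` exists in `dir`
# and no ".part" temporary files remain, or stop after `timeout` seconds.
wait_for_download <- function(dir, pattern = "\\.pdf$", timeout = 300, interval = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    partial <- list.files(dir, pattern = "\\.part$")
    done <- list.files(dir, pattern = pattern, full.names = TRUE)
    if (length(done) > 0 && length(partial) == 0) {
      return(done)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for a download in ", dir)
    }
    Sys.sleep(interval)
  }
}

# Offline demonstration: a directory that already holds a finished file.
demo_dir <- file.path(tempdir(), "download_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(charToRaw("%PDF-1.4\n"), file.path(demo_dir, "report.pdf"))
finished <- wait_for_download(demo_dir, timeout = 5)
print(basename(finished))
```

In the session above, calling `wait_for_download(getwd())` before closing the browser would hold the script until the PDF has landed in the working directory.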

# close the session
dr$close()

(Editor: 李大同)
