Parsing an RSS feed using the XML package in R

Published: 2020-12-15 23:56:13 | Category: Encyclopedia | Source: compiled from the web

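I am trying to scrape and parse the following RSS feed: http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml. I have looked at other questions about R and XML, but have not been able to make any progress on my problem. The XML for each entry looks like this: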

<item>
     <title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title>
     <link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link>
     <description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total,there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description>
     <guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid>
     <pubDate>2012-11-15T12:56:09-05:00</pubDate>
     <source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source>
  </item>

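For each entry/post, I would like to record the date (pubDate), the title (title) and the description (the full text, cleaned). I have tried using the XML package in R, but I admit I am a bit of a novice (little to no experience with XML, but some experience with R). The code I am working with, which is getting me nowhere, is: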

library(XML)

 xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"

 # Use the xmlTreeParse function to parse the XML file directly from the web

 xmlfile <- xmlTreeParse(xml.url)

# Use the xmlRoot-function to access the top node

xmltop = xmlRoot(xmlfile)

xmlName(xmltop)

names( xmltop[[ 1 ]] )

  title          link   description      language     copyright 
  "title"        "link" "description"    "language"   "copyright" 
 category     generator          docs          item          item 
  "category"   "generator"        "docs"        "item"        "item"

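However, whenever I try to work with the "title" or "description" information, I keep getting errors. Any help with fixing this code would be much appreciated.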

Thanks,
Thomas

I use the excellent RCurl library together with xpathSApply.

This script gives you three vectors (titles, pubdates and descriptions):

library(RCurl)
library(XML)
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"
# Download the feed as text, then parse it into an internal XML document
script  <- getURL(xml.url)
doc     <- xmlParse(script)
# Extract the text of each field from every <item> node with XPath
titles       <- xpathSApply(doc, '//item/title', xmlValue)
descriptions <- xpathSApply(doc, '//item/description', xmlValue)
pubdates     <- xpathSApply(doc, '//item/pubDate', xmlValue)
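
To get from these three vectors to the one-row-per-post table the question asks for, something like the following might work. It is a minimal sketch rather than a tested solution for the live feed. The inline `rss` string is a made-up sample that mimics the item structure shown in the question, so the code runs offline. The `strip_tags` helper is hypothetical, and its regular expression is a crude assumption that will not handle every kind of markup. The colon is removed from the UTC offset because `%z` may not accept the `-05:00` form on every R version.

```r
library(XML)

# Hypothetical sample feed (not the real Huffington Post data), shaped like
# the <item> block in the question.
rss <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title><![CDATA[First post]]></title>
      <description><![CDATA[<a href="http://example.com/a">Example reports</a> that five rockets were intercepted.]]></description>
      <pubDate>2012-11-15T12:56:09-05:00</pubDate>
    </item>
    <item>
      <title><![CDATA[Second post]]></title>
      <description><![CDATA[Plain text with <b>bold</b> markup.]]></description>
      <pubDate>2012-11-15T13:10:00-05:00</pubDate>
    </item>
  </channel>
</rss>'

# asText = TRUE tells xmlParse that the argument is XML content, not a file name
doc <- xmlParse(rss, asText = TRUE)

titles       <- xpathSApply(doc, "//item/title", xmlValue)
descriptions <- xpathSApply(doc, "//item/description", xmlValue)
pubdates     <- xpathSApply(doc, "//item/pubDate", xmlValue)

# Hypothetical helper: drop anything that looks like an HTML tag.
# This is an assumption about what "cleaned" means; it is not a full HTML parser.
strip_tags <- function(x) gsub("<[^>]+>", "", x)

# Turn "-05:00" into "-0500" so that %z can read the offset
pubdates_fixed <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", pubdates)

feed <- data.frame(
  date        = as.POSIXct(pubdates_fixed,
                           format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC"),
  title       = titles,
  description = strip_tags(descriptions),
  stringsAsFactors = FALSE
)

print(feed)
```

For the real feed, `doc` would come from `xmlParse(getURL(xml.url))` as in the script above, and the rest should carry over unchanged as long as every item has all three fields.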

(Editor: 李大同)