xml: Scraping hierarchical data

Published: 2020-12-16 22:45:03 | Category: Encyclopedia | Source: compiled from the web

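I am trying to scrape a list of department stores broken down by continent and country, starting from the global department stores page. I am running the following code to get the continents first, because, as the XML hierarchy shows, the countries of each continent are not child nodes of that continent.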

> library(XML)
> url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
> doc <- htmlTreeParse(url, useInternalNodes = TRUE)
> nodeNames <- getNodeSet(doc, "//h2/span[@class='mw-headline']")
> # For Africa
> xmlChildren(nodeNames[[1]])
$a
<a href="/wiki/Africa" title="Africa">Africa</a> 

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"        
> xmlSize(nodeNames[[1]])
[1] 1

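I know I could fetch the countries with a separate getNodeSet call, but I want to make sure I am not missing something. Is there a smarter way to get all the data within each continent, and then all the data within each country, in one go?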

Solution

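Using XPath, several paths can be combined with the | separator. I use that to get the countries and the shops in a single list, in page order. Then I get a second list containing only the countries, and I use that second list to split the first one.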

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)

## Here I use the combined XPath: list items (shops) and h3 headings (countries)
cont.shops <- xpathApply(xmltext, '//*[@id="mw-content-text"]/ul/li|
                                   //*[@id="mw-content-text"]/h3', xmlValue)
cont.shops <- do.call(rbind, cont.shops)                ## from list to one-column matrix


head(cont.shops)                  ## first element is country followed by shops
     [,1]
[1,] "[edit] ? Tunisia"
[2,] "Magasin Général"
[3,] "Mercure Market"
[4,] "Promogro"
[5,] "Geant"
[6,] "Carrefour"
## I get all the countries in one list
contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3', xmlValue)
contries <- do.call(rbind, contries)                    ## from list to one-column matrix

head(contries)
     [,1]
[1,] "[edit] ? Tunisia"
[2,] "[edit] ? Morocco"
[3,] "[edit] ? Ghana"
[4,] "[edit] ? Kenya"
[5,] "[edit] ? Nigeria"
[6,] "[edit] ? South Africa"
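
The approach depends on the combined XPath returning headings and list items interleaved in page order. A minimal sketch for checking this offline, assuming the XML package is installed; the HTML fragment and the names `html`, `toy_doc` and `vals` are made up for illustration and are not taken from the Wikipedia page:

```r
library(XML)

# Hypothetical miniature page: two country headings, each followed by a list.
html <- paste0(
  "<html><body><div id='content'>",
  "<h3>Tunisia</h3><ul><li>Carrefour</li><li>Monoprix</li></ul>",
  "<h3>Morocco</h3><ul><li>Alpha 55</li></ul>",
  "</div></body></html>"
)
toy_doc <- htmlParse(html, asText = TRUE)

# Union of two paths; the matched nodes come back in document order,
# so each heading is followed by its own list items.
vals <- unlist(xpathApply(toy_doc, "//h3 | //ul/li", xmlValue))
print(vals)
```

If the headings and items came back grouped by path instead of interleaved, the splitting step that follows would not work, so this is worth verifying on a small input first.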

Now I do some processing to split cont.shops by country.

dd <- which(cont.shops %in% contries)                    ## positions of the country headers
freq <- c(diff(dd), length(cont.shops) - tail(dd, 1) + 1) ## group sizes; the last group runs to the end
contries.f <- rep(contries, freq)                        ## one country label per element


ll <- split(cont.shops,contries.f)
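
The index arithmetic above is easier to follow on a toy vector. The sketch below uses made-up data (`items` and `heads` are hypothetical stand-ins for cont.shops and contries) and also shows a `cumsum`-based grouping, which is an alternative technique and not the one used in the answer:

```r
# Toy stand-ins: headers "A", "B", "C" interleaved with their items.
items <- c("A", "a1", "a2", "B", "b1", "C", "c1", "c2", "c3")
heads <- c("A", "B", "C")

# Same steps as in the answer.
dd       <- which(items %in% heads)                         # 1 4 6
freq     <- c(diff(dd), length(items) - tail(dd, 1) + 1)    # 3 2 4
splitter <- rep(heads, freq)
groups   <- split(items, splitter)

# Alternative: a running count of headers seen so far labels each element.
grp     <- cumsum(items %in% heads)                         # 1 1 1 2 2 3 3 3 3
groups2 <- split(items, heads[grp])

print(groups)
```

Both variants assume the first element is a header and that no shop shares its exact text with a country heading.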

I can inspect the result:

> ll[[contries[1]]]
[1] "[edit] ? Tunisia" "Magasin Général"  "Mercure Market"   "Promogro"         "Geant"
[6] "Carrefour"        "Monoprix"
> ll[[contries[2]]]
[1] "[edit] ? Morocco"
[2] "Alpha 55, one 6-story store in Casablanca"
[3] "Galeries Lafayette, to open in 2011[1] within Morocco Mall, in Casablanca"
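
Each group still begins with its own country header, and the names carry the "[edit]" prefix. One possible cleanup, shown on a made-up list (`ll_toy` is a hypothetical stand-in for `ll`; the real names also contain an extra marker character after "[edit]" that would need its own handling):

```r
# Toy stand-in for the split result.
ll_toy <- list(
  "[edit] Tunisia" = c("[edit] Tunisia", "Carrefour", "Monoprix"),
  "[edit] Morocco" = c("[edit] Morocco", "Alpha 55")
)

# Drop the leading header element from every group.
shops_toy <- lapply(ll_toy, function(x) x[-1])

# Strip the "[edit]" prefix and any following whitespace from the names.
names(shops_toy) <- sub("^\\[edit\\][[:space:]]*", "", names(shops_toy))

print(shops_toy)
```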

(Editor: 李大同)
