xml - Scraping hierarchical data
Published: 2020-12-16 22:45:03  Category: Encyclopedia  Source: compiled from the web
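I am trying to scrape a list of department stores by continent/country, starting from the global department stores page. I am running the code below to get the continents first; as the output shows, the XML hierarchy is such that the countries of each continent are not child nodes of that continent.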
> url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
> doc = htmlTreeParse(url, useInternalNodes = T)
> nodeNames = getNodeSet(doc, "//h2/span[@class='mw-headline']")
> # For Africa
> xmlChildren(nodeNames[[1]])
$a
<a href="/wiki/Africa" title="Africa">Africa</a>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
> xmlSize(nodeNames[[1]])
[1] 1

I know I can fetch the countries with a separate getNodeSet command, but I just want to make sure I am not missing something. Is there a smarter way to get all the data within each continent, and then within each country, in one go?

Solution
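Using XPath, several paths can be combined with the | separator. So I use it to get the countries and the shops in the same list. Then I get a second list containing only the countries, and I use that second list to split the first.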
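As a small offline illustration of the idea above, here is a sketch on a hypothetical miniature page (the `html` string, the `content` id and the variable names are made up for the example, not taken from the Wikipedia page). It assumes the XML package is installed and that the first selected node is a country heading. Instead of running a second query for the countries, it reads the tag name of each node, which is one possible variation rather than the method used in the answer.

```r
library(XML)  # same package the answer uses

# Hypothetical miniature page: country headings (h3) and shop lists (ul/li)
# are siblings, so the shops are not children of their country node.
html <- '<html><body><div id="content">
  <h3>Tunisia</h3>
  <ul><li>Mercure Market</li><li>Promogro</li></ul>
  <h3>Morocco</h3>
  <ul><li>Alpha 55</li></ul>
</div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# One query with "|" selects both kinds of node; the result comes back in
# document order, so each heading is followed by its own shops.
nodes <- getNodeSet(doc, '//*[@id="content"]/h3 | //*[@id="content"]/ul/li')
values <- sapply(nodes, xmlValue)
is_country <- sapply(nodes, xmlName) == "h3"

# cumsum() numbers the headings 1, 2, ...; every shop inherits the number
# of the heading that precedes it (assumes the first node is a heading).
group <- cumsum(is_country)
countries <- values[is_country]
shops_by_country <- split(values[!is_country],
                          factor(countries[group[!is_country]],
                                 levels = countries))
print(shops_by_country)
```

Using a factor with explicit levels keeps the countries in page order rather than alphabetical order.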
url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
library(XML)
xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)
## Here I use the combined xpath
cont.shops <- xpathApply(xmltext, '//*[@id="mw-content-text"]/ul/li | //*[@id="mw-content-text"]/h3', xmlValue)
cont.shops <- do.call(rbind, cont.shops)  ## from list to one-column matrix
head(cont.shops)  ## first element is the country, followed by its shops
     [,1]
[1,] "[edit] Tunisia"
[2,] "Magasin Général"
[3,] "Mercure Market"
[4,] "Promogro"
[5,] "Geant"
[6,] "Carrefour"
## I get all the countries in one list
contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3', xmlValue)
contries <- do.call(rbind, contries)  ## from list to one-column matrix
head(contries)
     [,1]
[1,] "[edit] Tunisia"
[2,] "[edit] Morocco"
[3,] "[edit] Ghana"
[4,] "[edit] Kenya"
[5,] "[edit] Nigeria"
[6,] "[edit] South Africa"

Now I do some processing to split cont.shops by country.

dd <- which(cont.shops %in% contries)  ## get the indices of the countries
freq <- c(diff(dd), length(cont.shops) - tail(dd, 1) + 1)  ## use diff to get the size of each country's block
contries.f <- rep(contries, freq)  ## create the splitting factor
ll <- split(cont.shops, contries.f)

I can inspect the result:

> ll[[contries[1]]]
[1] "[edit] Tunisia"  "Magasin Général" "Mercure Market"  "Promogro"        "Geant"
[6] "Carrefour"       "Monoprix"
> ll[[contries[2]]]
[1] "[edit] Morocco"
[2] "Alpha 55,one 6-story store in Casablanca"
[3] "Galeries Lafayette,to open in 2011[1] within Morocco Mall,in Casablanca"

(Editor: 李大同)