R: Converting XML data to a data frame

Published: 2020-12-16 07:42:14. Category: Encyclopedia. Source: compiled from the web.


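For a homework assignment, I am trying to convert an XML file into a data frame in R. I have tried many different things and searched the internet for ideas, but without success. Here is my code so far: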

library(XML)

url  <- 'http://www.ggobi.org/book/data/olive.xml'
doc  <- xmlParse(url)    # parse the document at the URL defined above
root <- xmlRoot(doc)

# apply xmlValue to every child of every top-level node
dataFrame <- xmlSApply(root,function(x) xmlSApply(x,xmlValue))
data.frame(t(dataFrame),row.names=NULL)

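The output I get looks like one huge vector of numbers. I am trying to organize the data into a data frame, but I don't know how to adjust my code to get there.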

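It may not be as verbose as the XML package, but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws(), which was added to base R fairly recently (R 3.2.0).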
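As a small aside on why the trimming matters: when a record's text starts with whitespace, splitting it on runs of spaces yields an empty first field, which would later turn into a stray NA column. The sketch below uses a made-up record string (not taken from the olive file) to show the difference.

```r
# Hypothetical record text with leading and trailing whitespace
raw <- "  1 1075 7823 "

# Without trimming, the leading separator produces an empty first field
untrimmed <- strsplit(raw, " +")[[1]]

# With trimws() the fields line up with the column names
trimmed <- strsplit(trimws(raw), " +")[[1]]

print(untrimmed)
print(trimmed)
```
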

library(xml2)

pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")

# get all the <record>s
recs <- xml_find_all(pg,"//record")

# extract and clean all the columns
vals <- trimws(xml_text(recs))

# extract and clean (if needed) the area names
labs <- trimws(xml_attr(recs,"label"))

# mine the column names from the two variable descriptions
# this XPath construct lets us grab either the <categ…> or <real…> tags
# and then grabs the 'name' attribute of them
cols <- xml_attr(xml_find_all(pg,"//data/variables/*[self::categoricalvariable or
                                                      self::realvariable]"),"name")

# this converts each set of <record> columns to a data frame
# after first converting each row to numeric and assigning
# names to each column (making it easier to do the matrix to data frame conv)
dat <- do.call(rbind,lapply(strsplit(vals," +"),function(x) {
                                   data.frame(rbind(setNames(as.numeric(x),cols)))
                                 }))

# then assign the area name column to the data frame
dat$area_name <- labs

head(dat)
##   region area palmitic palmitoleic stearic oleic linoleic linolenic
## 1      1    1     1075          75     226  7823      672        NA
## 2      1    1     1088          73     224  7709      781        31
## 3      1    1      911          54     246  8113      549        31
## 4      1    1      966          57     240  7952      619        50
## 5      1    1     1051          67     259  7771      672        50
## 6      1    1      911          49     268  7924      678        51
##   arachidic eicosenoic    area_name
## 1        60         29 North-Apulia
## 2        61         29 North-Apulia
## 3        63         29 North-Apulia
## 4        78         35 North-Apulia
## 5        80         46 North-Apulia
## 6        70         44 North-Apulia
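For anyone who wants to try the steps above without network access, here is a minimal sketch on an inline document. The XML string is a hypothetical miniature that only imitates the variables/records layout described above; it is not the real olive file. It also swaps in a slightly different assembly step: instead of building one data frame per record, it binds the numeric rows into a matrix first and converts once, which tends to be faster on larger inputs.

```r
library(xml2)

# Hypothetical miniature document (assumed layout, invented values)
xml_txt <- '<ggobidata>
  <data>
    <variables count="3">
      <categoricalvariable name="region"/>
      <realvariable name="palmitic"/>
      <realvariable name="oleic"/>
    </variables>
    <records count="2">
      <record label="North-Apulia"> 1 1075 7823 </record>
      <record label="Calabria">     1 1088 7709 </record>
    </records>
  </data>
</ggobidata>'

pg <- read_xml(xml_txt)

# one node per <record>, its whitespace-separated values, and its label
recs <- xml_find_all(pg, "//record")
vals <- trimws(xml_text(recs))
labs <- xml_attr(recs, "label")

# column names come from the 'name' attribute of each variable description
cols <- xml_attr(xml_find_all(pg, "//variables/*"), "name")

# split each record, convert to numeric, and stack the rows into a matrix
mat <- do.call(rbind, lapply(strsplit(vals, "[[:space:]]+"), as.numeric))
colnames(mat) <- cols

# a single conversion to a data frame, with the label as an extra column
dat <- data.frame(mat, area_name = labs, stringsAsFactors = FALSE)
print(dat)
```
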

UPDATE

I'd now do that last bit this way:

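library(tidyverse)

# split each record on whitespace, turn it into a one-row tibble, and row-bind;
# as_tibble() replaces the deprecated as_data_frame()
# note: the value columns stay character here, so convert them if needed
strsplit(vals,"[[:space:]]+") %>% 
  map_df(~as_tibble(as.list(setNames(.,cols)))) %>% 
  mutate(area_name=labs)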
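Splitting text without an explicit numeric conversion leaves every column as character. One possible way to get the types back, sketched here in base R on made-up values, is to run type.convert() over each column. The "na" token in the example is an assumption about how a missing value might be spelled; check the actual file before relying on it.

```r
# Hypothetical split records; "na" is an assumed missing-value token
rows <- strsplit(c("1 1075 7823", "1 1088 na"), "[[:space:]]+")
cols <- c("region", "palmitic", "oleic")

# character data frame, one row per record
chr <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(chr) <- cols

# let R pick a type per column, treating "na" as missing
num <- chr
num[] <- lapply(chr, type.convert, na.strings = c("NA", "na"), as.is = TRUE)

str(num)
```
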

(Editor: 李大同)
