LearningR: Data Processing
1. Base R functions

1.1 Transpose

Use t() to transpose a matrix or a data frame. For a data frame, the row names become the variable (column) names.

    cars <- mtcars[1:5, 1:4]
    cars
    t(cars)

Use aperm() to permute the dimensions of an array:

    x <- array(1:24, 2:4)
    xt <- aperm(x, c(2, 1, 3))
    dim(x)
    dim(xt)

1.2 Aggregating data: aggregate

In R, data can be collapsed using one or more by variables and a predefined function. The call format is:

    aggregate(x, by, FUN)

where x is the data object to collapse, by is a list of variables that are crossed to form the new observations, and FUN is the scalar function used to compute the summary statistics that make up the values of the new observations.

    options(digits = 2)
    attach(mtcars)
    mydata <- aggregate(mtcars, by = list(cyl, gear), FUN = mean, na.rm = TRUE)
    mydata
    detach(mtcars)

The variables in by must be in a list (even if there is only one). You can also give the groups custom names in the list, for example by = list(Group.cyl = cyl, Group.gears = gear).

    ## example with character variables and NAs
    testDF <- data.frame(v1 = c(1, 3, 5, 7, 8, 3, 5, NA, 4, 5, 7, 9),
                         v2 = c(11, 33, 55, 77, 88, 33, 55, NA, 44, 55, 77, 99))
    by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
    by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
    aggregate(x = testDF, by = list(by1, by2), FUN = "mean")

    # and if you want to treat NAs as a group
    fby1 <- factor(by1, exclude = "")
    fby2 <- factor(by2, exclude = "")
    aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")

    ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
    aggregate(weight ~ feed, data = chickwts, mean)
    aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
    aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
    aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

    ## Dot notation:
    aggregate(. ~ Species, data = iris, mean)
    aggregate(len ~ ., data = ToothGrowth, mean)

    ## Often followed by xtabs():
    ag <- aggregate(len ~ ., data = ToothGrowth, mean)
    xtabs(len ~ ., data = ag)

    ## Compute the average annual approval ratings for American presidents.
    aggregate(presidents, nfrequency = 1, FUN = mean)
    ## Give the summer less weight.
    aggregate(presidents, nfrequency = 1, FUN = weighted.mean, w = c(1, 1, 0.5, 1))

1.3 apply

To be written.

1.4 union and intersect

    x <- c(sort(sample(1:20, 9)), NA)
    y <- c(sort(sample(3:23, 7)), NA)
    union(x, y)
    intersect(x, y)
    setdiff(x, y)
    setdiff(y, x)
    setequal(x, y)

    # %in%
    (1:10) %in% c(3, 12)
    "%w/o%" <- function(x, y) x[!x %in% y]   # x without y
    (1:10) %w/o% c(3, 12)
    sstr <- c("c", "ab", "B", "bba", "c", "@", "bla", "a", "Ba", "%")
    sstr %in% c(letters, LETTERS)

1.5 Merging: cbind and rbind

Merging vertically is typically used to add observations to a data frame.
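As a rough sketch of adding observations with rbind(), the snippet below stacks two small data frames. The names df_a, df_b, score and passed are made up for illustration. Because df_b has no score column, it is first created as NA so that both frames have the same variables; cbind() is then used to add a variable rather than observations.

```r
# Two hypothetical data frames; df_b has no `score` column.
df_a <- data.frame(id = 1:3, score = c(90, 85, 70))
df_b <- data.frame(id = 4:5)

# Create the missing variable as NA so the columns match, then stack the rows.
df_b$score <- NA
stacked <- rbind(df_a, df_b)

# cbind() adds a variable instead; the number of rows must agree.
widened <- cbind(stacked, passed = stacked$score >= 80)
widened
```

The alternative to the NA column would be dropping score from df_a before stacking, which loses data, so filling with NA is usually the gentler choice.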
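Note: the two data frames must have the same number of rows (for cbind) or columns (for rbind). If dataframeA has variables that dataframeB lacks, do one of the following before merging them:
(1) delete the extra variables from dataframeA;
(2) create the additional variables in dataframeB and set their values to NA (missing).

    x1 <- c(1:5)
    x2 <- c(21:25)
    x3 <- c(31:35)
    r1 <- cbind(x1, x2)     # 5 x 2 matrix, one column per vector
    r2 <- rbind(x1, x2)     # 2 x 5 matrix, one row per vector
    r31 <- cbind(r1, x3)    # add x3 as a third column
    r32 <- rbind(r2, x3)    # add x3 as a third row

1.6 Matched merging: merge

merge() has the same effect as the join functions in dplyr; the joins are usually more efficient.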
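A minimal sketch of how the all, all.x and all.y arguments of merge() correspond to the usual join types. The tables left and right and their columns are hypothetical, chosen only so that some keys match and some do not.

```r
# Hypothetical tables sharing the key column `id`.
left  <- data.frame(id = c(1, 2, 3), l = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), r = c("x", "y", "z"))

inner_m <- merge(left, right, by = "id")                # keys present in both
left_m  <- merge(left, right, by = "id", all.x = TRUE)  # every row of `left`
right_m <- merge(left, right, by = "id", all.y = TRUE)  # every row of `right`
full_m  <- merge(left, right, by = "id", all = TRUE)    # every key from either

full_m
```

Unmatched cells are filled with NA, which is also how the dplyr joins behave.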
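    # authors and books
    authors <- data.frame(
        surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
        nationality = c("US", "Australia", "US", "UK", "Australia"),
        deceased = c("yes", rep("no", 4)))
    books <- data.frame(
        name = I(c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core")),
        title = c("Exploratory Data Analysis",
                  "Modern Applied Statistics ...",
                  "LISP-STAT",
                  "Spatial Statistics", "Stochastic Simulation",
                  "Interactive Data Analysis",
                  "An Introduction to R"),
        other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith"))

    m1 <- merge(authors, books, by.x = "surname", by.y = "name")
    m2 <- merge(books, authors, by.x = "name", by.y = "surname")
    # m1 and m2 hold the same data; only the column names and order differ.

    # left_join
    m3 <- merge(authors, books, by.x = "surname", by.y = "name", all.x = TRUE, all.y = FALSE)
    # right_join
    m4 <- merge(authors, books, by.x = "surname", by.y = "name", all.x = FALSE, all.y = TRUE)
    # full_join
    m5 <- merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

    # the dplyr equivalents
    library(dplyr)
    m11 <- inner_join(authors, books, by = c("surname" = "name"))
    m22 <- inner_join(books, authors, by = c("name" = "surname"))
    m33 <- left_join(authors, books, by = c("surname" = "name"))
    m44 <- right_join(authors, books, by = c("surname" = "name"))
    m55 <- full_join(authors, books, by = c("surname" = "name"))

1.7 Removing duplicates: unique

unique() removes duplicated elements from a vector, a data frame or an array.

    x <- c(9:20, 1:5, 3:7, 0:8)
    y <- unique(x)
    # The following also works, but unique() is more efficient.
    # duplicated() returns a logical vector marking the repeated elements.
    y1 <- x[!duplicated(x)]

2. The reshape2 package

First "melt" the data so that each row is a unique identifier-variable combination; the melted data can then be cast into the shape you want.

Note: the casting function in the reshape package is cast(); in reshape2 the casting functions are dcast() and acast().

    # data set mydata
    ID <- c(1, 1, 2, 2)
    Time <- c(1, 2, 1, 2)
    X1 <- c(5, 3, 6, 2)
    X2 <- c(6, 5, 1, 4)
    mydata <- data.frame(ID, Time, X1, X2)

2.1 Melting: melt

Melting a data set restructures it into a format in which each measured variable is on its own row, together with the identifier variables needed to uniquely identify that measurement.

    library(reshape2)
    md <- melt(mydata, id = c("ID", "Time"))
    md <- melt(mydata, id = 1:2)    # the same, giving the id columns by position

2.2 Casting: dcast and acast

Use acast or dcast depending on whether you want vector/matrix/array output or data frame output. Data frames can have at most two dimensions.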
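The call format is:

    newdata <- dcast(data, formula, fun.aggregate = NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess_value(data))
    newdata <- acast(data, formula, fun.aggregate = NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess_value(data))

where data is the melted data (md in the examples), formula describes the desired result, and fun.aggregate is the (optional) aggregating function. The formula takes the form

    rowvar1 + rowvar2 + ... ~ colvar1 + colvar2 + ...

In this formula, rowvar1 + rowvar2 + ... defines the set of crossed variables that determine the rows, and colvar1 + colvar2 + ... defines the set of crossed variables that determine the columns.

    # with aggregation
    acast(md, ID ~ variable, mean)
    dcast(md, Time ~ variable, mean)
    dcast(md, ID ~ Time, mean)
    # without aggregation
    dcast(md, ID + Time ~ variable)
    dcast(md, ID + variable ~ Time)
    dcast(md, ID ~ variable + Time)

2.3 Exercise

    library(reshape2)
    head(airquality)
    mydata <- airquality
    mydata1 <- melt(mydata, id = c("Month", "Day"), variable.name = "type", value.name = "val")
    # use only Ozone and Wind as measured variables
    mydata2 <- melt(mydata, id = c("Month", "Day"), measure = c("Ozone", "Wind"), variable.name = "type", value.name = "val")
    str(mydata1)
    str(mydata2)
    # convert the column names to lower case
    names(mydata) <- tolower(names(mydata))
    a <- melt(mydata, id = c("month", "day"), na.rm = TRUE)
    # b matches the original airquality data (apart from column order and
    # the lower-case names): the data has been restored.
    b <- dcast(a, month + day ~ variable)
    result1 <- dcast(a, month ~ variable, mean)
    # a function that counts missing values
    myfun <- function(x) {return(sum(is.na(x)))}
    result2 <- dcast(a, month ~ variable, myfun)          # all zero: a was melted with na.rm = TRUE
    result3 <- melt(mydata, id = c("month", "day"))       # keeps the NAs
    result4 <- dcast(result3, month ~ variable, myfun)
    result5 <- recast(mydata, month ~ variable, id.var = c("month", "day"), fun.aggregate = myfun)

3. dplyr

3.1 Basic operations

3.1.1 Data type

Convert a long, large data set to the tbl_df type, which prints in a more readable form.

    library(dplyr)
    iris_df <- as_tibble(iris)    # tbl_df(iris) in older dplyr versions

3.1.2 Filtering: filter

filter() selects the subset of rows that satisfy the given logical conditions, similar to base::subset().

    filter(iris_df, Species == 'setosa', Sepal.Length >= 5)
    filter(iris_df, Species == 'setosa' & Sepal.Length >= 5)

The same with base R:

    iris_df[iris_df$Species == 'setosa' & iris_df$Sepal.Length >= 5, ]

Besides being more concise, filter() supports any combination of conditions on the same object, for example:

    filter(iris_df, Species == 'setosa' | Sepal.Length >= 5)

Note: use & for AND, not &&.

3.1.3 Sorting: arrange

    arrange(iris_df, Sepal.Length, Sepal.Width)
    arrange(iris_df, desc(Sepal.Length))
    # this function works like plyr::arrange() and is similar to order()

The same with base R:

    iris_df[order(iris_df$Sepal.Length, iris_df$Sepal.Width), ]
    iris_df[order(iris_df$Sepal.Length, decreasing = TRUE), ]

3.1.4 Selecting: select

Select a subset of columns by passing column names as arguments:

    select(iris_df, 1:2)
    select(iris_df, Species, Sepal.Width)
    select(iris, everything())
    # rename columns while selecting
    select(iris_df, Length = Sepal.Length, Width = Sepal.Width)
    select(iris_df, petal = starts_with("Petal"))

Exclude columns:

    select(iris_df, -Petal.Length, -Petal.Width)

Helper functions for select: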
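    select(iris_df, everything())
    select(iris_df, starts_with("Petal"))
    select(iris_df, ends_with("Width"))
    select(iris_df, contains("etal"))
    select(iris_df, matches(".t."))    # columns whose names match a regular expression
    select(iris_df, Sepal.Length:Petal.Width)
    select(iris_df, Petal.Length, Petal.Width)
    vars <- c("Petal.Length", "Petal.Width")
    select(iris_df, one_of(vars))
    df <- as.data.frame(matrix(runif(100), nrow = 10))
    df <- as_tibble(df)                # tbl_df(df) in older dplyr versions
    select(df, V4:V6)
    select(df, num_range("V", 4:6))

":" selects a range of consecutive columns and contains() matches column names. This is again similar to base R's subset():

    subset(iris, select = c(1, 2))
    subset(iris, select = c(3, 4))
    subset(iris, select = c(Petal.Length, Petal.Width))

Programming with select (not yet verified):

    # select_() is the standard-evaluation variant; it has been deprecated since dplyr 0.7
    select_(iris_df, ~Petal.Length)
    select_(iris_df, "Petal.Length")
    select_(iris_df, lazyeval::interp(~matches(x), x = ".t."))
    select_(iris_df, quote(-Petal.Length), quote(-Petal.Width))
    select_(iris_df, .dots = list(quote(-Petal.Length), quote(-Petal.Width)))

3.1.5 Adding new variables: mutate

Compute on existing columns and add the result as new columns:

    mtcars_df <- as_tibble(mtcars)
    mutate(mtcars_df, displ_l = disp / 61.0237)
    # transmute keeps only the computed columns
    transmute(mtcars_df, displ_l = disp / 61.0237)

mutate_each() applies a window function to every column.

    # mutate_each(), summarise_each() and funs() come from older dplyr
    # releases; current versions use across() instead
    mutate_each(iris, funs(min_rank))

Like plyr::mutate(), mutate() is similar to base::transform(); its advantage is that a column created in the call can be used straight away in the same statement.

    library(hflights)                     # assumes the hflights package is installed
    hflights_df <- as_tibble(hflights)
    mutate(hflights_df, gain = ArrDelay - DepDelay, gain_per_hour = gain / (AirTime / 60))
    # the same operation with base R's transform() raises an error,
    # because gain does not exist yet when gain_per_hour is evaluated:
    transform(hflights, gain = ArrDelay - DepDelay, gain_per_hour = gain / (AirTime / 60))

A new column can also be added with data.frame():

    mtcars_df <- data.frame(mtcars_df, displ_l = mtcars_df$disp / 61.0237)

3.1.6 Summarising: summarise

    summarise(mtcars_df, mean(disp, na.rm = TRUE), n())
    summarise(group_by(mtcars_df, cyl), mean(disp), m = mean(disp), sd = sd(disp))
    # apply a summary function to every column
    summarise_each(iris, funs(mean))
    by_species <- iris %>% group_by(Species)
    by_species %>% summarise_each(funs(length))
    by_species %>% summarise_each(funs(mean))
    by_species %>% summarise_each(funs(mean), Petal.Width)
    by_species %>% summarise_each(funs(mean), matches("Width"))

count() counts the number of rows with each unique value of a variable (with or without weights).

    count(iris, Species, wt = Sepal.Length)
    count(iris, mycount = n())

3.1.7 tally

    mtcars %>% group_by(cyl, vs) %>% tally(sort = TRUE)
    # roughly the same as
    mtcars %>% group_by(cyl, vs) %>% summarise(n = n()) %>% arrange(desc(n))

3.2 Grouping: group_by

Once grouping information has been added to a data set with group_by(), mutate(), arrange() and summarise() automatically operate by group on these tbl objects (an advantage of R's generic functions). In current dplyr versions, arrange() sorts within groups only when called with .by_group = TRUE.

    summarise(mtcars_df, n(), n_distinct(gear))
    summarise(group_by(mtcars_df, cyl), sd = sd(disp))
    # a mutate/rename followed by a simple group_by
    group_by(mtcars_df, vsam = vs + am)
    group_by(mtcars_df, vs2 = vs)
    summarise(group_by(mtcars_df, cyl2 = cyl), sd = sd(disp))

Also, among the small helper functions for summaries: n() counts the rows.

3.3 Chaining (pipes): %>% or %.%

dplyr also introduces a new operator, read as "then". Start with the name of the data, then apply a series of operations to it one after another. For example:

    mtcars %>% group_by(cyl) %>% summarise(total = sum(disp)) %>% arrange(desc(total)) %>% head(5)

    x1 <- 1:5
    x2 <- 2:6
    (x1 - x2)^2 %>% sum() %>% sqrt()

The code follows the flow of the data processing step by step, which makes it easy both to write and to read, close to the left-to-right order of natural language. Compare the same steps written as nested calls:

    head(arrange(summarise(group_by(mtcars, cyl), total = sum(disp)), desc(total)), 5)
    sqrt(sum((x1 - x2)^2))

Or, using the approach from the article referred to:

    totals <- aggregate(. ~ cyl, data = mtcars[, c("cyl", "disp")], sum)
    ranks <- sort.list(-totals$disp)
    # ranks <- order(-totals$disp)
    head(totals[ranks, ], 5)

The article also reports that the %>% version runs many times faster than the code above. Whether this new concept will prove as versatile as the + connector in ggplot2 is something to find out through actual use.

3.5 Matched merging: join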
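    x <- data.frame(name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
                    instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"))
    y <- data.frame(name = c("John", "Paul", "George", "Ringo", "Brian"),
                    band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"))
    inner_join(x, y)    # rows of x with a match in y, plus the columns of y
    left_join(x, y)     # all rows of x
    semi_join(x, y)     # rows of x with a match in y, columns of x only
    anti_join(x, y)     # rows of x without a match in y
    full_join(x, y)     # all rows of x and y
    right_join(x, y)    # all rows of y

3.6 Connecting to databases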
3.7 Transforming data with window functions
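A window function returns one value per input row, unlike a summary function, which returns a single value. The sketch below illustrates the idea with base R only, so that it runs without extra packages; the data frame scores and its columns are hypothetical. rank(ties.method = "min") gives the same ranks as dplyr's min_rank(), and ave() evaluates a function within groups.

```r
# Hypothetical data: two teams with a few values each.
scores <- data.frame(
  team  = c("a", "a", "a", "b", "b"),
  value = c(10, 30, 30, 5, 20)
)

# Ranking window function: tied values share the smallest rank.
scores$rank_all <- rank(scores$value, ties.method = "min")

# Offset window function: the previous value, NA for the first row.
scores$prev <- c(NA, head(scores$value, -1))

# Cumulative window function, evaluated separately within each team.
scores$running <- ave(scores$value, scores$team, FUN = cumsum)

scores
```

With dplyr, the same columns would typically be written inside mutate(), after group_by() where a per-group result is wanted.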
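4. tidyr

The tidyr package is also by Hadley Wickham. It is used together with dplyr and is a replacement for reshape2.

5. String processing

5.1 Number of characters: nchar

nchar() returns the length of a string; its result is not the same as that of length().

    nchar(c("abc", "abcd"))     # number of characters in each string: returns c(3, 4)
    length(c("abc", "abcd"))    # returns 2, the number of elements in the vector

5.2 Concatenating strings: paste

paste() not only joins several strings, it also converts objects to strings automatically before joining them, and it is vectorised, which makes it quite powerful.

    month <- "jan"    # example value
    paste("fitbit", month, ".jpg", sep = "")
    paste("fitbit", 1:12, sep = "")

The default separator of paste() is a space, so sep = "" must be given to join without one. There is also a collapse argument, which joins the resulting strings into one long string instead of leaving them in a vector.

    paste("fitbit", 1:3, sep = "", collapse = "; ")

There is also paste0(), whose default is sep = "".

5.3 Splitting strings: strsplit

    strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

    x <- c(as = "asfef", qu = "qwerty", "yuiop[", "b", "stuff.blah.yech")
    strsplit(x, "e")
    # details to be aware of:
    # final empty strings are not produced
    strsplit(paste(c("", "a", ""), collapse = "#"), split = "#")[[1]]
    # an empty string is only produced before a definite match
    strsplit("", " ")[[1]]
    strsplit(" ", " ")[[1]]
    ## reversing strings:
    strReverse <- function(x) sapply(lapply(strsplit(x, NULL), rev), paste, collapse = "")
    strReverse(c("abc", "Statistics"))

5.4 Extracting substrings: substr and substring

    substr(x, start, stop)
    substring(text, first, last = 1000000L)
    substr(x, start, stop) <- value
    substring(text, first, last = 1000000L) <- value

    substr("abcdef", 2, 4)
    substring("abcdef", 1:6, 1:6)
    substr(rep("abcdef", 4), 1:4, 4:5)
    x <- c("asfef", "qwerty", "stuff.blah.yech")
    substr(x, 2, 5)
    substring(x, 2, 4:6)
    substring(x, 2) <- c("..", "+++")
    x

5.5 Replacing characters: sub and gsub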
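    sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
    gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Although sub and gsub are the functions used for string replacement, strictly speaking R has no function that replaces a string in place, because R passes arguments by value rather than by reference, whatever the operation. The original string is therefore left unchanged; to change the original variable, the result has to be assigned back to it.

    text <- "Hello Adam!\nHello Ava!"
    sub(pattern = "Adam", replacement = "World", text)
    text                                                      # unchanged
    sub(pattern = "Adam|Ava", replacement = "World", text)    # first match only
    gsub(pattern = "Adam|Ava", replacement = "World", text)   # all matches

sub and gsub can use backreferences (an escape character plus a digit) in the replacement, so that a captured part stands in for the whole match:

    sub(pattern = ".*(Adam).*", replacement = "\\1", text)

    str <- "Now is the time      "
    sub(" +$", "", str)                   # spaces only
    sub("[[:space:]]+$", "", str)         # white space, POSIX style
    sub("\\s+$", "", str, perl = TRUE)    # white space, Perl style

    txt <- "a test of capitalizing"
    gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl = TRUE)
    gsub("\\b(\\w)", "\\U\\1", txt, perl = TRUE)

5.6 Searching and matching: grep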
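    x <- c("abc", "abcdef", "def")
    grep("def", x)     # grep returns the indices of the matching elements
    # grepl returns a logical vector covering every element.
    # Either result can be used to extract a subset of the data.
    grepl("def", x)

See also regexpr, gregexpr and regexec.

5.7 Other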
    rep(1:4, 2)
    rep(1:4, each = 2)
    rep(1:4, c(2, 2, 2, 2))    # same as each = 2
    rep(1:4, c(2, 1, 2, 1))
    rep(1:4, each = 2, len = 4)
    rep(1:4, len = 10)
    rep(1:4, times = 3)

5.8 The stringr package

The stringr package is for string processing. (Placeholder, to be filled in later...)

Appendix A Regular expressions

To be written.

Appendix B

(Editor: 李大同)