
Aggregating/merging over date ranges with data.table

Published: 2020-12-14 04:58:53 | Category: Encyclopedia | Source: compiled from the web
Suppose I have two data.tables:

library(data.table)
library(lubridate)  # provides ymd()

summary <- data.table(
  period    = c("A", "B", "C", "D"),
  from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
  to_date   = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)

# event1/event2 values restored from the printed table below
log <- data.table(
  date   = ymd(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                 "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
  event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
  event2 = c(1, 8, 7, 3, 8, 4, 6, 3)
)

They look like this:

> summary
   period  from_date    to_date
1:      A 2017-01-01 2017-01-31
2:      B 2017-01-03 2017-04-01
3:      C 2017-02-08 2017-03-08
4:      D 2017-03-07 2017-05-01
> log
         date event1 event2
1: 2017-01-03      4      1
2: 2017-01-20      8      8
3: 2017-02-01      8      7
4: 2017-03-03      4      3
5: 2017-03-15      3      8
6: 2017-03-28      4      4
7: 2017-04-03      7      6
8: 2017-04-23      3      3

I want to get the sums of event1 and event2 for each period in the summary table.

I know I can do this:

summary[, c("event1", "event2") := .(
  sum(log[date >= from_date & date <= to_date, event1]),
  sum(log[date >= from_date & date <= to_date, event2])
), by = period][]

to get the desired result:

   period  from_date    to_date event1 event2
1:      A 2017-01-01 2017-01-31     12      9
2:      B 2017-01-03 2017-04-01     31     31
3:      C 2017-02-08 2017-03-08      4      3
4:      D 2017-03-07 2017-05-01     17     21

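Now, in my real problem I have about 30 columns to sum, and I may want to change which ones later; summary has ~35,000 rows and log has ~40,000,000 rows. Is there an efficient way to achieve this?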

Note: this is my first post here. I hope my question is clear and specific enough; if there is anything I should improve, please let me know. Thanks!

Solution

Yes, you can perform a non-equi join.

(Note that I have renamed log and summary to Log and Summary, since the original names are already functions in R.)

Log[Summary,
    on = c("date>=from_date", "date<=to_date"),
    nomatch = 0L, allow.cartesian = TRUE
  ][, .(from_date = date[1], to_date = date.1[1],
        event1 = sum(event1), event2 = sum(event2)),
    keyby = "period"]
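To make the join step easier to inspect, here is a minimal, self-contained sketch that rebuilds the example tables from the question and runs the same non-equi join, keeping the intermediate table before aggregating. It uses base as.Date() instead of ymd() only to avoid the lubridate dependency; that substitution, and the names joined and totals, are assumptions of this sketch and not part of the original answer.

```r
library(data.table)

# Example data from the question (dates built with base as.Date()).
Summary <- data.table(
  period    = c("A", "B", "C", "D"),
  from_date = as.Date(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
  to_date   = as.Date(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)
Log <- data.table(
  date   = as.Date(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                     "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
  event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
  event2 = c(1, 8, 7, 3, 8, 4, 6, 3)
)

# Non-equi join: one output row for every (log row, period) pair in which
# the log date falls inside [from_date, to_date]. A log row that lies in
# several overlapping periods appears once per period.
joined <- Log[Summary,
              on = c("date>=from_date", "date<=to_date"),
              nomatch = 0L, allow.cartesian = TRUE]

# Aggregate the matched rows per period.
totals <- joined[, .(event1 = sum(event1), event2 = sum(event2)),
                 keyby = "period"]
print(totals)
```

With the example data, joined has 13 rows (2 for A, 6 for B, 1 for C, 4 for D), and totals reproduces the event1 and event2 sums shown in the question's desired result.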

To sum every column whose name matches a pattern, use lapply with .SD:

joined_result <-
  Log[Summary, on = c("date>=from_date", "date<=to_date"), nomatch = 0L, allow.cartesian = TRUE]

cols <- grep("event[a-z]?[0-9]",names(joined_result),value = TRUE)

joined_result[,lapply(.SD,sum),.SDcols = cols,keyby = .(period,from_date = date,to_date = date.1)]
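Since the question asks for the sums as extra columns on the summary table, one possible way to package the pattern-based approach is sketched below. The helper name sum_by_period, its arguments, and the use of merge() to attach the totals are assumptions made for illustration and are not from the original answer; the example data again uses base as.Date() in place of ymd().

```r
library(data.table)

# Hypothetical helper (not part of the original answer): sums every column
# of `log_dt` whose name matches `pattern`, for each period in `summary_dt`,
# and returns `summary_dt` with those sums attached.
sum_by_period <- function(log_dt, summary_dt, pattern = "event[a-z]?[0-9]") {
  cols <- grep(pattern, names(log_dt), value = TRUE)
  joined <- log_dt[summary_dt,
                   on = c("date>=from_date", "date<=to_date"),
                   nomatch = 0L, allow.cartesian = TRUE]
  totals <- joined[, lapply(.SD, sum), .SDcols = cols, keyby = "period"]
  # all.x = TRUE keeps periods without any matching log rows (sums are NA).
  merge(summary_dt, totals, by = "period", all.x = TRUE)
}

# Example data from the question.
Summary <- data.table(
  period    = c("A", "B", "C", "D"),
  from_date = as.Date(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
  to_date   = as.Date(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)
Log <- data.table(
  date   = as.Date(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                     "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
  event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
  event2 = c(1, 8, 7, 3, 8, 4, 6, 3)
)

result <- sum_by_period(Log, Summary)
print(result)
```

Changing which columns are summed then only requires a different pattern argument, which fits the question's remark that the set of roughly 30 columns may change later.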

