使用重复键滚动加入data.table

发布时间：2020-12-12 16:46:35 所属栏目：MsSql教程来源：网络整理

导读：我正在努力了解data.table中的滚动联接.最后给出了再现这些数据的数据. 给定一个机场在特定时间的交易数据： dt t_id airport thisTime1: 1 a 5.12: 3 a 5.13: 2 a 6.2 (注意t_ids 1和3有相同的机场和时间) 以及从机场出发的航班查询表： dt_lookup f_id airp

我正在努力了解data.table中的滚动联接.最后给出了再现这些数据的数据.

给定一个机场在特定时间的交易数据：

> dt
   t_id airport thisTime
1:    1       a      5.1
2:    3       a      5.1
3:    2       a      6.2

(注意t_ids 1和3有相同的机场和时间)

以及从机场出发的航班查询表：

> dt_lookup
   f_id airport thisTime
1:    1       a        6
2:    2       a        6
3:    1       b        7
4:    1       c        8
5:    2       d        7
6:    1       d        9
7:    2       e        8

> tables()
     NAME      NROW NCOL MB COLS                  KEY             
[1,] dt           3    3  1 t_id,airport,thisTime airport,thisTime
[2,] dt_lookup    7    3  1 f_id,thisTime

我想将所有交易与从该机场出发的所有下一个航班进行匹配,以便：

t_id airport thisTime f_id
      1       a        6    1
      1       a        6    2
      3       a        6    1
      3       a        6    2

所以我以为这样会工作：

> dt[dt_lookup,nomatch=0,roll=Inf]
   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2

但是它没有返回事务t_id == 1.

从the documentation说：

Usually,there should be no duplicates in x’s key,…

但是,我的’x键'(即机场& thisTime)中有重复的,并且看不到/理解t_id = 1是否从输出中移除.

任何人都可以看出,为什么t_id = 1没有返回,当我有重复的时候怎么能让加入工作？

数据

library(data.table)
dt <- data.table(t_id = seq(1:3),airport = c("a","a","a"),thisTime = c(5.1,6.2,5.1),key=c( "airport","thisTime"))

dt_lookup <- data.table(f_id = c(rep(1,4),rep(2,3)),"b","c","d","e"),thisTime = c(6,7,8,9,6,8),key=c("airport","thisTime"))

解决方法

在输出中没有显示t_id = 1的原因是因为滚动连接采用最后一个键组合出现的行.从文档(强调我的)：

Applies to the last join column,generally a date but can be any
ordered variable,irregular and including gaps. If roll=TRUE and i’s
row matches to all but the last x join column,and its value in the
last i join column falls in a gap (including after the last
observation in x for that group),then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF).

我们考虑一些更大的数据集：

> DT
   t_id airport thisTime
1:    1       a      5.1
2:    4       a      5.1
3:    3       a      5.1
4:    2       d      6.2
5:    5       d      6.2
> DT_LU
   f_id airport thisTime
1:    1       a        6
2:    2       a        6
3:    2       a        8
4:    1       b        7
5:    1       c        8
6:    2       d        7
7:    1       d        9

当您按照您的问题进行滚动加盟：

DT[DT_LU,roll=Inf]

你得到：

t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2
3:    3       a        8    2
4:    5       d        7    2
5:    5       d        9    1

如您所见,从组合键a,5.1和d,6.2中,最后一行用于连接的数据.因为您使用Inf作为滚动值,所有将来的值都将并入到生成的数据表中.使用时：

DT[DT_LU,roll=1]

您将看到仅包含将来的第一个值：

t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2
3:    5       d        7    2

如果你想要f_id为机场& thisTime其中DT $thisTime低于DT_LU $thisTime,您可以通过使用ceiling函数创建一个新变量(或替换现有的thisTime)来实现.一个例子,我创建一个新的变量thisTime2,然后用DT_LU做正常的连接：

DT[,thisTime2 := ceiling(thisTime)]
setkey(DT,thisTime2)[DT_LU,nomatch=0]

这使：

t_id airport thisTime thisTime2 f_id
1:    1       a      5.1         6    1
2:    4       a      5.1         6    1
3:    3       a      5.1         6    1
4:    1       a      5.1         6    2
5:    4       a      5.1         6    2
6:    3       a      5.1         6    2
7:    2       d      6.2         7    2
8:    5       d      6.2         7    2

适用于您提供的数据：

> dt[,thisTime2 := ceiling(thisTime)]
> setkey(dt,thisTime2)[dt_lookup,nomatch=0]

   t_id airport thisTime thisTime2 f_id
1:    1       a      5.1         6    1
2:    3       a      5.1         6    1
3:    1       a      5.1         6    2
4:    3       a      5.1         6    2

当您想要包含未来的值而不是仅包含第一个值时,您需要一种稍微不同的方法,您将需要使用i.col功能(尚未记录)：

1：首先设置机场专栏的关键：

setkey(DT,airport)
setkey(DT_LU,airport)

2：使用j中的i.col功能(它还没有记录下来)得到你想要的内容如下：

DT1 <- DT_LU[DT,.(tid = i.t_id,tTime = i.thisTime,fTime = thisTime[i.thisTime < thisTime],fid = f_id[i.thisTime < thisTime]),by=.EACHI]

这给你：

> DT1
    airport tid tTime fTime fid
 1:       a   1   5.1     6   1
 2:       a   1   5.1     6   2
 3:       a   1   5.1     8   2
 4:       a   4   5.1     6   1
 5:       a   4   5.1     6   2
 6:       a   4   5.1     8   2
 7:       a   3   5.1     6   1
 8:       a   3   5.1     6   2
 9:       a   3   5.1     8   2
10:       d   2   6.2     7   2
11:       d   2   6.2     9   1
12:       d   5   6.2     7   2
13:       d   5   6.2     9   1

一些解释：如果你加入了使用相同列名的两个数据表,你可以在i中引用列的名字之间的i可以引用datatable中的列.现在可以将DT中的thisTime与DT_LU的thisTime进行比较.使用by = .EACHI,您可以确保使用条件成立的所有组合都包含在生成的数据表中.

或者,您可以实现相同的：

DT2 <- DT_LU[DT,.(airport=i.airport,tid=i.t_id,tTime=i.thisTime,fTime=thisTime[i.thisTime < thisTime],fid=f_id[i.thisTime < thisTime]),allow.cartesian=TRUE]

其结果相同：

> identical(DT1,DT2)
[1] TRUE

当您只想在某个边界内包含未来的值时,可以使用：

DT1 <- DT_LU[DT,{
               idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
               .(tid  = i.t_id,fTime = thisTime[idx],fid = f_id[idx])
               },by=.EACHI]

这使：

> DT1
   airport tid tTime fTime fid
1:       a   1   5.1     6   1
2:       a   1   5.1     6   2
3:       a   4   5.1     6   1
4:       a   4   5.1     6   2
5:       a   3   5.1     6   1
6:       a   3   5.1     6   2
7:       d   2   6.2     7   2
8:       d   5   6.2     7   2

当您与之前的结果进行比较时,您会看到现在已经删除了行3,10和12.

数据：

DT <- data.table(t_id = c(1,4,2,3,5),"d"),5.1,6.2),"thisTime"))

DT_LU <- data.table(f_id = c(rep(1,"thisTime"))

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!