python – Randomly concatenating DataFrames row by row

Published: 2020-12-20 11:52:59. Category: Python. Source: compiled from the web.

How can I randomly merge, join, or concatenate pandas DataFrames row by row? Suppose I have four DataFrames like these (with many more rows):

import pandas as pd

df1 = pd.DataFrame({'col1':["1_1","1_1"],'col2':["1_2","1_2"],'col3':["1_3","1_3"]})
df2 = pd.DataFrame({'col1':["2_1","2_1"],'col2':["2_2","2_2"],'col3':["2_3","2_3"]})
df3 = pd.DataFrame({'col1':["3_1","3_1"],'col2':["3_2","3_2"],'col3':["3_3","3_3"]})
df4 = pd.DataFrame({'col1':["4_1","4_1"],'col2':["4_2","4_2"],'col3':["4_3","4_3"]})

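How can I combine these four DataFrames into output like the following, where the frames are joined in a different random order on each row?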

col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  1_1  1_2  1_3  4_1  4_2  4_3  2_1  2_2  2_3  3_1  3_2  3_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

I thought I could do it like this:

import random

my_list = [df1,df2,df3,df4]
my_list = random.sample(my_list,len(my_list))  # shuffles the list of frames once
df = pd.DataFrame({'empty' : []})

for row in df:
    new_df = pd.concat(my_list,axis=1)

print(new_df)

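The for loop above never gets past the first row: every row after it (and I have many more) comes out in the same order, i.e. the frames are shuffled only once: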

col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  4_1  4_2  4_3  1_1  1_2  1_3  2_1  2_2  2_3  3_1  3_2  3_3
1  4_1  4_2  4_3  1_1  1_2  1_3  2_1  2_2  2_3  3_1  3_2  3_3
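
The behaviour described above can be reproduced in isolation. The sketch below is only an illustration: the small frames and the seed are hypothetical stand-ins, not the data from the question. It shows that a single random.sample call followed by pd.concat fixes one frame order for all rows.

```python
import random
import pandas as pd

# Hypothetical two-row frames standing in for df1..df4; the first
# character of every cell is the number of the frame it came from.
frames = [
    pd.DataFrame({'col1': [f'{k}_1a', f'{k}_1b'], 'col2': [f'{k}_2a', f'{k}_2b']})
    for k in (1, 2, 3, 4)
]

random.seed(0)                               # assumption: seeded only for reproducibility
order = random.sample(frames, len(frames))   # the list is shuffled exactly once
wide = pd.concat(order, axis=1)              # so every row gets the same frame order

# Read off which frame each cell came from, row by row.
prefix = [[cell[0] for cell in row] for row in wide.to_numpy()]
same_order = prefix[0] == prefix[1]
print(same_order)  # True
```

To get a different order on each row, the shuffle has to happen once per row rather than once per list.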

Solution

Update: a better solution from @Divakar:

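import numpy as np
import pandas as pd

# four-column frames, matching the 16-column output shown below
df1 = pd.DataFrame({'col1':["1_1","1_1"],'col2':["1_2","1_2"],'col3':["1_3","1_3"],'col4':["1_4","1_4"]})
df2 = pd.DataFrame({'col1':["2_1","2_1"],'col2':["2_2","2_2"],'col3':["2_3","2_3"],'col4':["2_4","2_4"]})
df3 = pd.DataFrame({'col1':["3_1","3_1"],'col2':["3_2","3_2"],'col3':["3_3","3_3"],'col4':["3_4","3_4"]})
df4 = pd.DataFrame({'col1':["4_1","4_1"],'col2':["4_2","4_2"],'col3':["4_3","4_3"],'col4':["4_4","4_4"]})

dfs = [df1,df2,df3,df4]
n = len(dfs)                # number of input frames
nrows = dfs[0].shape[0]
ncols = dfs[0].shape[1]
A = pd.concat(dfs,axis=1).values.reshape(nrows,-1,ncols)         # (nrows, n, ncols)
sidx = np.random.rand(nrows,n).argsort(1)                        # one random frame order per row
out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)   # reorder blocks, flatten back to 2D
df = pd.DataFrame(out_arr)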

Output:

In [203]: df
Out[203]:
    0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
0  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4
1  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4

Explanation: (c) Divakar

NumPy-based solution

Here is a vectorized, NumPy-based solution, which should hopefully be a fast one!

1) Reshape the concatenated values into a 3D array, "cutting" each row into groups of ncols, where ncols is the number of columns in each input DataFrame:

A = pd.concat(dfs,axis=1).values.reshape(nrows,-1,ncols)
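
To see what the reshape in step 1 does, here is a minimal sketch on a small numeric array. The array flat is a hypothetical stand-in for the concatenated values (2 rows, 3 frames of 2 columns each); it is not data from the answer.

```python
import numpy as np

# Hypothetical stand-in for pd.concat(dfs, axis=1).values:
# 2 rows, 3 source frames with 2 columns each.
flat = np.arange(12).reshape(2, 6)
nrows, ncols = 2, 2

A = flat.reshape(nrows, -1, ncols)   # shape (nrows, number of frames, ncols)
print(A.shape)   # (2, 3, 2)
print(A[0])      # the three 2-column blocks of row 0
```

Each A[i, k] is the slice of row i that came from frame k, so reordering along the middle axis reorders whole frames within a row.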

2) Next, we trick np.argsort into giving us random unique indices ranging from 0 to N-1, where N is the number of input DataFrames:

sidx = np.random.rand(nrows,n).argsort(1)
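
The argsort trick in step 2 works because sorting a row of distinct random numbers returns each position exactly once. The sketch below checks this; it uses a seeded np.random.default_rng generator instead of np.random.rand, which is an assumption made here only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumption: seeded generator, not part of the answer
nrows, n = 5, 4

# Random floats per row; argsort turns each row into an ordering of 0..n-1.
sidx = rng.random((nrows, n)).argsort(axis=1)

# Sorting each row of sidx must give back 0, 1, ..., n-1.
is_perm = bool((np.sort(sidx, axis=1) == np.arange(n)).all())
print(is_perm)  # True
```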

3) The final trick is NumPy's fancy indexing, combined with some broadcasting, to index into A with sidx and get the output array:

out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
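
The indexing in step 3 is easier to follow with a fixed, hand-picked order instead of a random one. In this sketch the array A and the order sidx are hypothetical values chosen for illustration.

```python
import numpy as np

A = np.arange(12).reshape(2, 3, 2)   # (nrows=2, n=3 blocks, ncols=2)
sidx = np.array([[2, 0, 1],          # hand-picked block order for row 0
                 [1, 2, 0]])         # and for row 1

rows = np.arange(2)[:, None]         # shape (2, 1), broadcasts against sidx (2, 3)
out = A[rows, sidx, :]               # out[i, j] is A[i, sidx[i, j]]
out_arr = out.reshape(2, -1)         # flatten the blocks back into one wide row
print(out_arr)
# [[ 4  5  0  1  2  3]
#  [ 8  9 10 11  6  7]]
```

The column vector rows pairs row i with its own ordering sidx[i], which is why each row can be reordered independently without a Python loop.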

4) If needed, convert to a DataFrame:

df = pd.DataFrame(out_arr)
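
The four steps can be collected into one helper. This is a sketch, not code from the answer: the name shuffle_blocks_per_row, the seed parameter, and the reuse of the first frame's column names are assumptions, and it assumes all frames share the same shape and index.

```python
import numpy as np
import pandas as pd

def shuffle_blocks_per_row(dfs, seed=None):
    """Join same-shaped frames side by side, in a fresh random order on every row."""
    n = len(dfs)
    nrows, ncols = dfs[0].shape
    # Step 1: (nrows, n, ncols) array, one block per input frame.
    A = pd.concat(dfs, axis=1).to_numpy().reshape(nrows, -1, ncols)
    # Step 2: one random ordering of 0..n-1 per row.
    rng = np.random.default_rng(seed)
    sidx = rng.random((nrows, n)).argsort(axis=1)
    # Step 3: reorder the blocks of each row, then flatten back to 2D.
    out_arr = A[np.arange(nrows)[:, None], sidx, :].reshape(nrows, -1)
    # Step 4: back to a DataFrame, repeating the first frame's column names.
    return pd.DataFrame(out_arr, index=dfs[0].index, columns=list(dfs[0].columns) * n)

# Hypothetical input frames, three rows each.
dfs = [pd.DataFrame({'col1': [f'{k}_1'] * 3, 'col2': [f'{k}_2'] * 3}) for k in (1, 2, 3, 4)]
result = shuffle_blocks_per_row(dfs, seed=0)
print(result)
```

Passing seed=None gives a different arrangement on every call; a fixed seed makes the result repeatable for testing.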

Old answer:

IIUC, you can do it this way (using the three-column DataFrames from the question):

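import random

dfs = [df1,df2,df3,df4]
n = len(dfs)
ncols = dfs[0].shape[1]
v = pd.concat(dfs,axis=1).values
a = np.arange(n * ncols).reshape(n,ncols)   # row k holds the column positions of frame k

# for every row, take the column blocks in a freshly sampled frame order
df = pd.DataFrame(np.asarray([v[i,a[random.sample(range(n),n)].reshape(n * ncols,)] for i in dfs[0].index]))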

This yields:

In [150]: df
Out[150]:
    0    1    2    3    4    5    6    7    8    9    10   11
0  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3  2_1  2_2  2_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

Explanation:

In [151]: v
Out[151]:
array([['1_1','1_2','1_3','2_1','2_2','2_3','3_1','3_2','3_3','4_1','4_2','4_3'],
       ['1_1','1_2','1_3','2_1','2_2','2_3','3_1','3_2','3_3','4_1','4_2','4_3']],dtype=object)

In [152]: a
Out[152]:
array([[ 0,1,2],[ 3,4,5],[ 6,7,8],[ 9,10,11]])
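
The array a is a lookup table from frame number to column positions in v. The sketch below shows how shuffling its rows and flattening yields the column order for one output row; the seed is an assumption added for reproducibility.

```python
import random
import numpy as np

n, ncols = 4, 3
a = np.arange(n * ncols).reshape(n, ncols)   # row k: column positions of frame k in v

random.seed(1)                               # assumption: seeded only for reproducibility
perm = random.sample(range(n), n)            # a random order of the frames
cols = a[perm].reshape(n * ncols)            # column positions, grouped by frame, in that order

print(perm)
print(cols)
```

Indexing a row of v with cols then gives that row with its frames rearranged, which is what the list comprehension in the old answer does once per row.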

(Editor: 李大同)
