python - Keep the top N values of each row within column-index groups of a DataFrame
I can't find an elegant solution to this problem (maybe there isn't one).

I have the following sample DataFrame:

              0         1         2         3         4         5         6
    0  1.764052  0.400157  0.978738  2.240893  1.867558  0.977278  0.950088
    1  0.144044  1.454274  0.761038  0.121675  0.443863  0.333674  1.494079
    2  2.552990  0.653619  0.864436  0.742165  2.269755  1.454366  0.045759
    3  0.154947  0.378163  0.887786  1.980796  0.347912  0.156349  1.230291
    4  1.048553  1.420018  1.706270  1.950775  0.509652  0.438074  1.252795
    5  0.895467  0.386902  0.510805  1.180632  0.028182  0.428332  0.066517
    6  0.672460  0.359553  0.813146  1.726283  0.177426  0.401781  1.630198
    7  0.729091  0.128983  1.139401  1.234826  0.402342  0.684810  0.870797
    8  1.165150  0.900826  0.465662  1.536244  1.488252  1.895889  1.178780
    9  0.403177  1.222445  0.208275  0.976639  0.356366  0.706573  0.010500

              7         8         9
    0  0.151357  0.103219  0.410599
    1  0.205158  0.313068  0.854096
    2  0.187184  1.532779  1.469359
    3  1.202380  0.387327  0.302303
    4  0.777490  1.613898  0.212740
    5  0.302472  0.634322  0.362741
    6  0.462782  0.907298  0.051945
    7  0.578850  0.311553  0.056165
    8  0.179925  1.070753  1.054452
    9  1.785870  0.126912  0.401989

I have the following zone map:

    zones = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}

The zone map shows which groups of columns I should examine together. For every row of the df[columns] DataFrame, I want to keep the top N items (NB: the top N within each row, i.e. cross-sectionally; see the example that follows) and set the rest to zero. For example, for zone "A" with N = 2, I would look at the following DataFrame:

              0         1         2
    0  1.764052  0.400157  0.978738
    1  0.144044  1.454274  0.761038
    2  2.552990  0.653619  0.864436
    3  0.154947  0.378163  0.887786
    4  1.048553  1.420018  1.706270
    5  0.895467  0.386902  0.510805
    6  0.672460  0.359553  0.813146
    7  0.729091  0.128983  1.139401
    8  1.165150  0.900826  0.465662
    9  0.403177  1.222445  0.208275

Since N = 2, I keep the top N items in each row:

              0         1         2
    0  1.764052  0.        0.978738
    1  0.        1.454274  0.761038
    2  2.552990  0.        0.864436
    3  0.        0.378163  0.887786
    4  0.        1.420018  1.706270
    5  0.895467  0.        0.510805
    6  0.672460  0.        0.813146
    7  0.729091  0.        1.139401
    8  1.165150  0.900826  0.
    9  0.403177  1.222445  0.

With the zone map above and N = 2, the whole output would look like this:

              0         1         2         3         4         5         6
    0  1.764052  0.        0.978738  2.240893  1.867558  0.977278  0.950088
    1  0.        1.454274  0.761038  0.121675  0.443863  0.333674  1.494079
    2  2.552990  0.        0.864436  0.742165  2.269755  1.454366  0.
    3  0.        0.378163  0.887786  1.980796  0.347912  0.        1.230291
    4  0.        1.420018  1.706270  1.950775  0.509652  0.        1.252795
    5  0.895467  0.        0.510805  1.180632  0.028182  0.428332  0.
    6  0.672460  0.        0.813146  1.726283  0.177426  0.        1.630198
    7  0.729091  0.        1.139401  1.234826  0.402342  0.684810  0.870797
    8  1.165150  0.900826  0.        1.536244  1.488252  1.895889  1.178780
    9  0.403177  1.222445  0.        0.976639  0.356366  0.706573  0.

              7         8         9
    0  0.        0.        0.410599
    1  0.        0.        0.854096
    2  0.        1.532779  1.469359
    3  1.202380  0.        0.302303
    4  0.        1.613898  0.212740
    5  0.        0.634322  0.362741
    6  0.        0.907298  0.051945
    7  0.        0.        0.056165
    8  0.        0.        1.054452
    9  1.785870  0.        0.401989

The way I tried to solve this feels a bit slow. I loop over the zones to get a zone_df for each one, then loop over its rows, sort each row, and call row.head(len(row) - N) to get the index and columns that need to be set to 0. I then use those values (stored in a dict) to set the cells of zone_df to zero, and finally combine the zone_dfs.

Solution
Here's one way -

    import numpy as np
    import pandas as pd

    def keeptopN_perkey(df, zones, N=2):
        a = df.values                        # underlying array of the frame
        indx = zones.values()                # one list of column indices per zone
        r = np.arange(a.shape[0])[:,None]    # row indices as a column vector
        for i in indx:
            b = a[:,i]                       # block of columns for this zone
            L = np.maximum(len(i)-N,0)       # number of entries to zero per row
            if L>0:
                # positions of the L smallest values in each row
                idx = np.argpartition(b,L,axis=1)[:,:L] # or np.argsort(b,axis=1)[:,:L]
                b[r,idx] = 0
            a[:,i] = b                       # write the block back
        return df

The benefit is that we write back into the input dataframe through its underlying array data, rather than creating an output dataframe. Note that this relies on df.values being a view of the frame's data rather than a copy.

Sample run -

    In [303]: np.random.seed(0)
         ...: N = 2
         ...: df = pd.DataFrame(np.random.randint(11,99,(4,10)))
         ...: zones = {"A": [0,1,2], "B": [3,4], "C": [5,6,7,8], "D": [9]}
         ...:

    In [304]: df
    Out[304]:
        0   1   2   3   4   5   6   7   8   9
    0  55  58  75  78  78  20  94  32  47  98
    1  81  23  69  76  50  98  57  92  48  36
    2  88  83  20  31  91  80  90  58  75  93
    3  60  40  30  30  25  50  43  76  20  68

    In [305]: keeptopN_perkey(df, zones, N=2)
    Out[305]:
        0   1   2   3   4   5   6   7   8   9
    0   0  58  75  78  78   0  94   0  47  98
    1  81   0  69  76  50  98   0  92   0  36
    2  88  83   0  31  91  80  90   0   0  93
    3  60  40   0  30  25  50   0  76   0  68

Benchmarking

Approaches from other posts -

    def mask_n(df, n): # @piRSquared's helper func
        v = np.zeros(df.shape, dtype=bool)
        n = min(n, v.shape[1])
        if v.shape[1] > n:
            j = np.argpartition(-df.values, n, 1)[:, :n].ravel()
            i = np.arange(v.shape[0]).repeat(n)
            v[i, j] = True
            return df.where(v, 0)
        else:
            return df

    def piRSquared1(df, zones): # @piRSquared's soln1
        zinv = {v: k for k in zones for v in zones[k]}
        return df.groupby(zinv, 1).apply(mask_n, n=2)

    def piRSquared2(df, zones): # @piRSquared's soln2
        zinv = {v: k for k in zones for v in zones[k]}
        return df.mask(df.groupby(zinv, 1).rank(axis=1, method='first', ascending=False) > 2, 0)

    def COLDSPEED1(df, zones): # @COLDSPEED's soln
        for z in zones:
            df2 = df.iloc[:, zones[z]]
            df.iloc[:, zones[z]] = np.where(((-df2).rank(axis=1) - 1) >= 2, 0, df2.values)
        return df

    def s5s1(df, zones, N=2): # @s5s's soln
        # As posted: iteritems() and Series.sort() need Python 2 and an older pandas
        final = []
        for zone_id, cols in zones.iteritems():
            values = {}
            d = df[cols] # zone A
            for i, row in d.iterrows():
                if len(row) > N:
                    row.sort()
                    row[row.head(len(row) - N).index] = 0
                values[i] = row
            d = pd.DataFrame(values).T
            final.append(d)
        return pd.concat(final, axis=1)[df.columns]

Timings on a bigger dataset -

    In [458]: # Setup
         ...: ncols = 1000
         ...: cuts = np.sort(np.random.choice(ncols, ncols//3, replace=0))
         ...: indx_split = np.split(np.arange(ncols), cuts)
         ...: zones = {i:p_i for i,p_i in enumerate(list(map(list,indx_split)))}
         ...: df = pd.DataFrame(np.random.randint(11,99,(10,ncols)))
         ...: N = 2
         ...:
         ...: df1 = df.copy()
         ...: df2 = df.copy()
         ...: df3 = df.copy()
         ...: df4 = df.copy()
         ...: df5 = df.copy()
         ...:

    In [459]: %timeit COLDSPEED1(df1,zones)
         ...: %timeit piRSquared1(df2,zones)
         ...: %timeit piRSquared2(df3,zones)
         ...: %timeit s5s1(df4,zones)
         ...: %timeit keeptopN_perkey(df5,zones)
         ...:
    1 loop, best of 3: 324 ms per loop
    10 loops, best of 3: 116 ms per loop
    10 loops, best of 3: 81.6 ms per loop
    1 loop, best of 3: 1.47 s per loop
    100 loops, best of 3: 2.99 ms per loop
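
Because keeptopN_perkey writes into the input frame, a variant that leaves the input untouched can be handy. The following is only a minimal sketch and not part of the original answer: keep_top_n_per_zone is a hypothetical name, the zone lists are assumed to hold positional column indices, the frame is assumed to have a single numeric dtype, and the zones dict is the one implied by the sample output. It uses the same np.argpartition idea, but on a copy obtained with df.to_numpy(copy=True).

```python
import numpy as np
import pandas as pd


def keep_top_n_per_zone(df, zones, n=2):
    """Return a new DataFrame with only the n largest values per row in each zone.

    Hypothetical helper (an assumption, not from the original answer).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    out = df.to_numpy(copy=True)               # work on a copy, input stays intact
    rows = np.arange(out.shape[0])[:, None]    # row indices as a column vector
    for cols in zones.values():
        cols = np.asarray(cols)
        n_drop = len(cols) - n                 # entries to zero in each row
        if n_drop <= 0:
            continue                           # zone has at most n columns
        block = out[:, cols]
        # With kth = n_drop - 1, the first n_drop positions of each row
        # hold the indices of the n_drop smallest values (in any order).
        drop = np.argpartition(block, n_drop - 1, axis=1)[:, :n_drop]
        out[rows, cols[drop]] = 0              # map block positions back to columns
    return pd.DataFrame(out, index=df.index, columns=df.columns)


# Sample frame copied from the sample run in the answer
df = pd.DataFrame([
    [55, 58, 75, 78, 78, 20, 94, 32, 47, 98],
    [81, 23, 69, 76, 50, 98, 57, 92, 48, 36],
    [88, 83, 20, 31, 91, 80, 90, 58, 75, 93],
    [60, 40, 30, 30, 25, 50, 43, 76, 20, 68],
])
zones = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}
result = keep_top_n_per_zone(df, zones, n=2)
print(result)
```

The trade-off is one extra copy of the data per call. When a zone contains tied values, which of the tied entries gets zeroed is not specified by np.argpartition.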