python – Filtering rows after a groupby in pandas
Published: 2020-12-16 23:04:56 Category: Python Source: compiled from the web
I have a DataFrame in pandas:
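import pandas as pd

# Column values are taken from the printed frame that follows.
df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 20, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1]
})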
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    20    54      9
9        5     10    45   112      0
10       1     70    20    23      1
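I want to do a groupby and then keep only the rows whose pidx value occurs more than 2 times. In other words, keep the rows where pidx is 10 or 20.
I tried df.groupby('pidx').count(), but it did not help. For the rows that remain, I also have to compute 0.4 * count + 0.6 * score.
The desired output is:
LeafID  count  pidx  pidy  final_score
     1     10    10    20
     1     20    10    20
     1     40    10    20
     6     50    10    43
     1     20    20    12
     3     30    20    54
     1     70    20    23

Solution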
You can use value_counts together with boolean indexing and isin:
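# Column values are taken from the printed frame that follows;
# note that row 8 has pidx 30 in this sample.
df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1]
})
print (df)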
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1
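# Count how often each pidx value occurs.
s = df.pidx.value_counts()
# Keep the pidx values that occur more than twice.
idx = s[s>2].index
# Select the rows whose pidx is one of those values.
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5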
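To check the same technique against the sample from the question (where row 8 has pidx 20, so both 10 and 20 occur more than twice), here is a minimal sketch. The helper name filter_by_frequency and the variable question_df are made up for this illustration and are not part of the original answer.

```python
import pandas as pd


def filter_by_frequency(frame, column, min_count):
    """Keep rows whose value in `column` occurs more than `min_count` times.

    Hypothetical helper wrapping the value_counts + isin idea from the answer.
    """
    counts = frame[column].value_counts()
    frequent = counts[counts > min_count].index
    return frame[frame[column].isin(frequent)]


# Sample data as printed in the question (row 8 has pidx 20 here).
question_df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 20, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

kept = filter_by_frequency(question_df, 'pidx', 2)
print(kept)
```

Boolean indexing keeps the original row order and index labels, so the result is not grouped by pidx the way the desired output is laid out.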
Timings:
import numpy as np

np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000,size=N),
                   'pidx': np.random.randint(10000,size=N),
                   'pidy': np.random.choice(L2,N),
                   'count':np.random.randint(1000,size=N)})
print (df)
print (df.groupby('pidx').filter(lambda x: len(x) > 120))
def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]
print (jez(df))
In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop
In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop
In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop
In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop
In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
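The slower variants above (filter with a lambda, transform(len)) call a Python function once per group, while value_counts and transform('size') use pandas' built-in aggregations. The sketch below checks that three of the approaches select the same rows. The frame name small, the reduced size, the threshold, and the use of np.random.default_rng are assumptions chosen so the example runs quickly; they are not the benchmark setup above.

```python
import numpy as np
import pandas as pd

# Smaller, hypothetical data set; default_rng replaces np.random.seed here.
rng = np.random.default_rng(123)
N = 10_000
small = pd.DataFrame({
    'pidx': rng.integers(100, size=N),
    'count': rng.integers(1000, size=N),
})
threshold = 100

# 1) groupby + filter: the lambda runs once per group.
by_filter = small.groupby('pidx').filter(lambda g: len(g) > threshold)

# 2) value_counts + isin: the approach from the answer.
counts = small['pidx'].value_counts()
by_isin = small[small['pidx'].isin(counts[counts > threshold].index)]

# 3) transform('size'): broadcasts each group's size back onto its rows.
group_size = small.groupby('pidx')['pidx'].transform('size')
by_transform = small[group_size > threshold]

same_rows = by_filter.equals(by_isin) and by_isin.equals(by_transform)
print(same_rows)
```

All three keep the original row order, so DataFrame.equals can compare them directly.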
For final_score, you can use:
df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
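Putting the two steps together on the sample from the question, one possible end-to-end sketch looks like this. The names question_df, occurrences and result are illustrative only, and dropping score and sorting by pidx are assumptions made to mirror the layout of the desired output.

```python
import pandas as pd

# Sample data as printed in the question (row 8 has pidx 20 here).
question_df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 20, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Number of rows sharing each row's pidx value.
occurrences = question_df.groupby('pidx')['pidx'].transform('size')

# Copy the filtered rows so that adding a column does not touch question_df.
result = question_df[occurrences > 2].copy()

# 'count' must be accessed with brackets: question_df.count is a method.
result['final_score'] = 0.4 * result['count'] + 0.6 * result['score']

# Assumed presentation step: drop score and group the rows by pidx.
result = result.drop(columns='score').sort_values('pidx', kind='stable')
print(result)
```

The stable sort keeps rows with the same pidx in their original relative order.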
