python – 从pandas中的组中获取最新值
发布时间:2020-12-20 11:09:50 所属栏目:Python 来源:网络整理
导读:我有一个具有以下结构的数据帧 Debtor ID | AccountRating | AccountRatingDate | AmountOutstanding |AmountPastDueJohn Snow Closed 2017-03-01 0 0John Snow Delayed 2017-04-22 2000 500John Snow Closed 2017-05-23 0 0John Snow Delayed 2017-07-15 60
我有一个具有以下结构的数据帧
Debtor ID | AccountRating | AccountRatingDate | AmountOutstanding |AmountPastDue John Snow Closed 2017-03-01 0 0 John Snow Delayed 2017-04-22 2000 500 John Snow Closed 2017-05-23 0 0 John Snow Delayed 2017-07-15 6000 300 Sarah Parker Closed 2017-02-01 0 0 Edward Hall Closed 2017-05-01 0 0 Douglas Core Delayed 2017-01-01 1000 200 Douglas Core Delayed 2017-06-01 1000 400 我想要实现的是 Debtor ID | Incidents of delay | TheMostRecentOutstanding | TheMostRecentPastDue John Snow 2 6000 300 Sarah Parker 0 0 0 Edward Hall 0 0 0 Douglas Core 2 1000 400 计算延迟事件非常简单 df_account["pastDuebool"] = df_account['amtPastDue'] > 0 new_df = pd.DataFrame(index = df_account.groupby("Debtor ID").groups.keys()) new_df['Incidents of delay'] = df_account.groupby("Debtor ID")["pastDuebool"].sum() 我正在努力提取最新的amonts和pastdue.我的代码是这样的 new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountOutstanding"] new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountPastDue"] 但他们返回具有所有NaN值的系列.请帮帮我,我在这里做错了什么? 解决方法
你可以试试这个:
df.sort_values('AccountRatingDate') .query('AccountRating == "Delayed"') .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']] .agg({'AccountRating':'count','AmountOutstanding':'last','AmountPastDue':'last'}) .reindex(df['Debtor ID'].unique(),fill_value=0) .reset_index() 输出: Debtor ID AccountRating AmountOutstanding AmountPastDue 0 John Snow 2 6000 300 1 Sarah Parker 0 0 0 2 Edward Hall 0 0 0 3 Douglas Core 2 1000 400 细节: >首先按AccountRatingDate排序数据框,以获取最后一个日期 并且,您可以使用重命名和字典进行列重命名: df.sort_values('AccountRatingDate') .query('AccountRating == "Delayed"') .groupby('Debtor ID')[['AccountRating',fill_value=0) .rename(columns={'AccoutRating':'Incidents of delay','AmountOutstanding':'TheMostRecentOutstanding','AmountPastDue':'TheMostRecentPastDue'}) .reset_index() 输出: Debtor ID AccountRating TheMostRecentOutstanding TheMostRecentPastDue 0 John Snow 2 6000 300 1 Sarah Parker 0 0 0 2 Edward Hall 0 0 0 3 Douglas Core 2 1000 400 (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |