python - Pandas: replicating a single row to populate a DataFrame
Published: 2020-12-20 12:33:46 | Category: Python | Source: compiled from the web
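I have hit a dead end, and I am using some code that is definitely not idiomatic pandas for what should be a very simple task in Pandas. I am sure there is a better way.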
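I have a DataFrame from which I pull a single row to create a new DataFrame, like so:

    >>> sampledata
       float_col  int_col  str_col  r    v  new_coltest       eddd
    0        0.1        1        a  5  1.0          0.1  -0.539783
    1        0.2        2        b  5  NaN          0.2  -1.394550
    2        0.2        6     None  5  NaN          0.2   0.290157
    3       10.1        8        c  5  NaN         10.1  -1.799373
    4        NaN       -1        a  5  NaN          NaN   0.694682
    >>> newsampledata = sampledata[(sampledata.new_coltest == 0.1) & (sampledata.float_col == 0.1)]
    >>> newsampledata
       float_col  int_col  str_col  r    v  new_coltest       eddd
    0        0.1        1        a  5  1.0          0.1  -0.539783

What I would like to do is replicate that single row in "newsampledata" n times, where n is a known integer. Ideally the final DataFrame with n rows would overwrite the single-row "newsampledata", but that is not crucial.

I am currently using a for loop that calls pd.concat n-1 times to populate the DataFrame, but this is not fast because of how concat works. I have also tried the same strategy with append, and that is slightly slower than concat.

I have seen a few other questions about similar tasks, but I have not found one that addresses this exact issue. I have also stayed away from map/apply because of performance concerns, but if you have seen good performance with that approach, please let me know and I will try it as well.

TIA

Solution: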
You can use the DataFrame constructor:
    import numpy as np
    import pandas as pd

    N = 10
    # Build an N-row frame from the single row's values
    df = pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)
    print(df)
       float_col  int_col  str_col  r    v  new_coltest       eddd
    0        0.1        1        a  5  1.0          0.1  -0.539783
    1        0.1        1        a  5  1.0          0.1  -0.539783
    2        0.1        1        a  5  1.0          0.1  -0.539783
    3        0.1        1        a  5  1.0          0.1  -0.539783
    4        0.1        1        a  5  1.0          0.1  -0.539783
    5        0.1        1        a  5  1.0          0.1  -0.539783
    6        0.1        1        a  5  1.0          0.1  -0.539783
    7        0.1        1        a  5  1.0          0.1  -0.539783
    8        0.1        1        a  5  1.0          0.1  -0.539783
    9        0.1        1        a  5  1.0          0.1  -0.539783

    print(df.dtypes)
    float_col      float64
    int_col          int64
    str_col         object
    r                int64
    v              float64
    new_coltest    float64
    eddd           float64
    dtype: object

Timings:

For small N the sample and reindex methods are faster; for large N the DataFrame constructor method is faster.

    N = 1000

    In [88]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
    1000 loops, best of 3: 745 μs per loop

    In [89]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
    The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 470 μs per loop

    In [90]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
    1000 loops, best of 3: 476 μs per loop

    N = 10000

    In [92]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
    best of 3: 946 μs per loop

    In [93]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
    1000 loops, best of 3: 775 μs per loop

    In [94]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
    1000 loops, best of 3: 827 μs per loop

    N = 100000

    In [97]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
    The slowest run took 12.98 times longer than the fastest. This could mean that an intermediate result is being cached.
    100 loops, best of 3: 6.93 ms per loop

    In [98]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
    100 loops, best of 3: 7.07 ms per loop

    In [99]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
    100 loops, best of 3: 7.87 ms per loop

    N = 10000000

    In [83]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
    1 loop, best of 3: 589 ms per loop

    In [84]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
    1 loop, best of 3: 757 ms per loop

    In [85]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
    1 loop, best of 3: 731 ms per loop

(Editor: 李大同)
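For readers who want to try the alternatives from the timings outside an interactive session, here is a minimal, self-contained sketch. It rebuilds the question's sample frame from the printed values and replicates the selected row with the reindex and sample calls shown above. The helper name `repeat_row` and the `pd.concat([...] * N)` variant are illustrative additions that do not appear in the original answer. The sketch avoids the constructor call on purpose, on the assumption (not something the source states) that some pandas versions may not broadcast a one-row list against a longer index.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the values printed in the question.
sampledata = pd.DataFrame({
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "int_col": [1, 2, 6, 8, -1],
    "str_col": ["a", "b", None, "c", "a"],
    "r": [5, 5, 5, 5, 5],
    "v": [1.0, np.nan, np.nan, np.nan, np.nan],
    "new_coltest": [0.1, 0.2, 0.2, 10.1, np.nan],
    "eddd": [-0.539783, -1.394550, 0.290157, -1.799373, 0.694682],
})

# Same boolean selection as in the question: keeps only the first row.
newsampledata = sampledata[
    (sampledata.new_coltest == 0.1) & (sampledata.float_col == 0.1)
]

N = 10


def repeat_row(df, n):
    """Hypothetical helper: repeat every row of df n times, then renumber the index."""
    return df.reindex(df.index.repeat(n)).reset_index(drop=True)


# Variant 1: reindex on a repeated index (taken from the timings above).
repeated = repeat_row(newsampledata, N)

# Variant 2: sampling with replacement (taken from the timings above).
# With a single source row, every draw returns that same row.
sampled = newsampledata.sample(N, replace=True).reset_index(drop=True)

# Variant 3 (illustrative addition): one concat call over a list of N references,
# instead of calling concat inside a loop as described in the question.
via_concat = pd.concat([newsampledata] * N, ignore_index=True)

print(repeated)
```

To overwrite the one-row frame as the question asks, assign the result back, for example `newsampledata = repeat_row(newsampledata, N)`. Which variant is fastest depends on N and on the pandas version, so it is worth timing them on your own data.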