How to get the nearest multiple of 100 in pandas (Python)

Published: 2020-12-20 11:08:37 | Category: Python | Source: compiled from the web


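I want to add a new column to a pandas DataFrame based on an input column. The new column must be filled as follows.

>The first row must be filled with the nearest multiple of 100.
>From the next row on, that output is repeated until its difference from the input value is greater than or equal to 100.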

input       output
11700.15    11700
11695.20    11700
11661.00    11700
11630.40    11700
11666.10    11700
11600.30    11700
11600.00    11600
11555.40    11600
11655.20    11600
11699.00    11600
11701.55    11700
11799.44    11700
11604.65    11700
11600.33    11700
11599.65    11600

What is the most elegant way to do this in pandas?

Solution

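As far as I know, there is no intuitive way to do this without explicit iteration, which is not ideal for numpy and pandas. However, the time complexity of this problem is O(n), which makes it a good target for the numba library. That lets us come up with a very efficient solution.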

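One note on my solution: I use (a + threshold // 2) // threshold * threshold, which looks verbose compared with np.round(a, decimals=-2). This is because of numba's nopython=True flag, which is not compatible with the np.round function.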
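As a quick sanity check of that note, the sketch below compares the two rounding expressions in plain NumPy (no numba). The sample values are taken from the question's input column; the variable names are only illustrative. One caveat worth knowing: the two expressions agree away from ties, but np.round rounds exact halves to the nearest even multiple, while the floor-division form rounds them up.

```python
import numpy as np

# A few values from the question's input column.
a = np.array([11700.15, 11695.20, 11661.00, 11630.40, 11600.30, 11599.65])
threshold = 100

# Floor-division form used in the answer: add half a step, floor to a multiple.
by_formula = (a + threshold // 2) // threshold * threshold

# The shorter NumPy form it replaces.
by_round = np.round(a, decimals=-2)

# Away from exact ties the two agree.
same_on_sample = np.array_equal(by_formula, by_round)

# On an exact tie they can differ: np.round goes to the even multiple,
# the floor-division form always rounds the half up.
tie = np.array([11650.0])
tie_formula = (tie + threshold // 2) // threshold * threshold
tie_round = np.round(tie, decimals=-2)
```

If tie behavior matters for your data, pick one convention deliberately rather than treating the two expressions as interchangeable.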

import numpy as np
from numba import jit

@jit(nopython=True)
def cumsum_with_threshold(a, threshold):
    """
    Rounds values in an array, propagating the last rounded value until
    an input differs from the reference value by more than a threshold
    :param a: the array to round
    :param threshold: the rounding step and the difference that stops propagation
    :return: rounded output array
    """

    s = a.shape[0]
    o = np.empty(s)
    d = a[0]  # reference value that later inputs are compared against
    r = (a + threshold // 2) // threshold * threshold  # nearest multiple of threshold
    o[0] = r[0]

    for i in range(1, s):
        if np.abs(a[i] - d) > threshold:
            # moved more than threshold away: emit a new rounded value
            o[i] = r[i]
            d = a[i]
        else:
            # otherwise repeat the previous output
            o[i] = o[i - 1]

    return o

Let's test it:

a = df['input'].values
pd.Series(cumsum_with_threshold(a,100))

0     11700.0
1     11700.0
2     11700.0
3     11700.0
4     11700.0
5     11700.0
6     11600.0
7     11600.0
8     11600.0
9     11600.0
10    11700.0
11    11700.0
12    11700.0
13    11600.0
14    11600.0
dtype: float64

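If you want to compare the input against the rounded value rather than the actual value, just make the following change to the loop in the function above, which gives the output shown in the question.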

for i in range(1, s):
    if np.abs(a[i] - d) > threshold:
        o[i] = r[i]
        # OLD d = a[i]
        d = r[i]
    else:
        o[i] = o[i - 1]
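For readers who want to try this variant without installing numba, here is a minimal self-contained sketch in plain Python and NumPy. The function name round_with_threshold is hypothetical and the DataFrame is rebuilt from the table in the question; the loop follows the modified version above (the reference value becomes the rounded value after each change), just without the @jit decorator, so it will be much slower on large arrays.

```python
import numpy as np
import pandas as pd

def round_with_threshold(a, threshold):
    # Hypothetical un-jitted counterpart of the modified loop above.
    r = (a + threshold // 2) // threshold * threshold  # nearest multiple
    o = np.empty(a.shape[0])
    o[0] = r[0]
    d = a[0]  # reference value, as in the answer's function
    for i in range(1, a.shape[0]):
        if abs(a[i] - d) > threshold:
            o[i] = r[i]
            d = r[i]  # compare later inputs against the rounded value
        else:
            o[i] = o[i - 1]
    return o

# Input column copied from the question.
df = pd.DataFrame({"input": [
    11700.15, 11695.20, 11661.00, 11630.40, 11666.10,
    11600.30, 11600.00, 11555.40, 11655.20, 11699.00,
    11701.55, 11799.44, 11604.65, 11600.33, 11599.65,
]})

# Add the new column; cast to int to match the question's formatting.
df["output"] = round_with_threshold(df["input"].to_numpy(), 100).astype(int)
```

On the question's data this reproduces the output column given in the question. Note that the comparison is strict (>), as in the answer, whereas the question words the rule as "greater than or equal to"; on this data both readings give the same column.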

To test the efficiency, let's run it on a much larger dataset:

l = np.random.choice(df['input'].values,10_000_000)

%timeit cumsum_with_threshold(l,100)
1.54 μs ± 7.93 ns per loop (mean ± std. dev. of 7 runs,1000000 loops each)

(Editor: 李大同)
