scala – Spark数据帧中两行之间的差异
发布时间:2020-12-16 09:08:23 所属栏目:安全 来源:网络整理
导读:我通过groupby column1和date在Spark中创建了一个数据框,并计算了数量. val table = df1.groupBy($"column1",$"date").sum("amount") Column1 |Date |AmountA |1-jul |1000A |1-june |2000A |1-May |2000A |1-dec |3000A |1-Nov |2000B |1-jul |100B |1-june
我通过groupby column1和date在Spark中创建了一个数据框,并计算了数量.
val table = df1.groupBy($"column1",$"date").sum("amount") Column1 |Date |Amount A |1-jul |1000 A |1-june |2000 A |1-May |2000 A |1-dec |3000 A |1-Nov |2000 B |1-jul |100 B |1-june |300 B |1-May |400 B |1-dec |300 现在,我想添加新列,表中任意两个日期的数量之间存在差异. 解决方法
如果计算固定为计算前几个月之间的差异,或者计算前两个月之间的差异,则可以使用窗口函数…等等.为此,您可以使用窗口的滞后和导联功能.
但为此您需要更改日期列,如下所示,以便可以订购. +-------+------+--------------+------+ |Column1|Date |Date_Converted|Amount| +-------+------+--------------+------+ |A |1-jul |2017-07-01 |1000 | |A |1-june|2017-06-01 |2000 | |A |1-May |2017-05-01 |2000 | |A |1-dec |2017-12-01 |3000 | |A |1-Nov |2017-11-01 |2000 | |B |1-jul |2017-07-01 |100 | |B |1-june|2017-06-01 |300 | |B |1-May |2017-05-01 |400 | |B |1-dec |2017-12-01 |300 | +-------+------+--------------+------+ 您可以通过执行操作找到上个月和当前月份之间的差异 import org.apache.spark.sql.expressions._ val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted") import org.apache.spark.sql.functions._ df.withColumn("diff_Amt_With_Prev_Month",$"Amount" - when((lag("Amount",1).over(windowSpec)).isNull,0).otherwise(lag("Amount",1).over(windowSpec))) .show(false) 你应该有 +-------+------+--------------+------+------------------------+ |Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_Month| +-------+------+--------------+------+------------------------+ |B |1-May |2017-05-01 |400 |400.0 | |B |1-june|2017-06-01 |300 |-100.0 | |B |1-jul |2017-07-01 |100 |-200.0 | |B |1-dec |2017-12-01 |300 |200.0 | |A |1-May |2017-05-01 |2000 |2000.0 | |A |1-june|2017-06-01 |2000 |0.0 | |A |1-jul |2017-07-01 |1000 |-1000.0 | |A |1-Nov |2017-11-01 |2000 |1000.0 | |A |1-dec |2017-12-01 |3000 |1000.0 | +-------+------+--------------+------+------------------------+ 您可以增加前两个月的滞后位置 df.withColumn("diff_Amt_With_Prev_two_Month",2).over(windowSpec)).isNull,2).over(windowSpec))) .show(false) 哪个会给你 +-------+------+--------------+------+----------------------------+ |Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_two_Month| +-------+------+--------------+------+----------------------------+ |B |1-May |2017-05-01 |400 |400.0 | |B |1-june|2017-06-01 |300 |300.0 | |B |1-jul |2017-07-01 |100 |-300.0 | |B |1-dec |2017-12-01 |300 |0.0 | |A |1-May |2017-05-01 |2000 |2000.0 | |A |1-june|2017-06-01 |2000 |2000.0 | |A |1-jul |2017-07-01 |1000 |-1000.0 | |A |1-Nov |2017-11-01 |2000 |0.0 | |A |1-dec |2017-12-01 |3000 |2000.0 | +-------+------+--------------+------+----------------------------+ 我希望答案是有帮助的 (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |