scala – 如何仅对spark数据帧上的特定字段使用“cube”？

发布时间：2020-12-16 18:48:04 所属栏目：安全来源：网络整理

我正在使用spark 1.6.1,我有这样的数据帧.

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|
......

我想通过使用立方体操作获得如下数据帧.

(由所有字段分组,但只有“os_name”,“country”,“app_ver”字段为立方体)

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
|    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
|    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
|    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
|    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
|    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
|    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
|    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
......

我尝试过如下,但它看起来很慢而且丑陋……

var cubed = df
  .cube($"scene_id",$"action_id",$"classifier",$"country",$"os_name",$"app_ver",$"p0value",$"p1value",$"p2value",$"p3value",$"p4value")
  .count
  .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND p3value IS NOT NULL AND p4value IS NOT NULL")

更好的解决方案？
请执行我的坏英语.. ^^;

提前致谢..

解决方法

我相信你无法完全避免这个问题,但有一个简单的技巧可以减少它的规模.我们的想法是用一个占位符替换所有不应被边缘化的列.

例如,如果您有一个DataFrame：

val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")

你对d和e边缘化的多维数据集感兴趣并按a..c分组,你可以将a..c的替换定义为：

import org.apache.spark.sql.functions.struct
import sparkSql.implicits._

// alias here may not work in Spark 1.6
val rest = struct(Seq($"a",$"b",$"c"): _*).alias("rest")

和立方体：

val cubed =  Seq($"d",$"e")

// If there is a problem with aliasing rest it can done here.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count

快速过滤和选择应该处理其余的：

tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)

结果如：

+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!