sql – What is the difference between the cube, rollup and groupBy operators?

Published: 2020-12-12 06:56:11  Category: MsSql tutorials  Source: compiled from the web


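The question is pretty much in the title. I can't find any detailed documentation on the differences.

I did notice one difference: when I swap the cube and groupBy function calls, I get different results. With 'cube', I get a lot of null values in the expressions I group by.

Solution

These do not work in the same way. groupBy is simply the equivalent of the GROUP BY clause in standard SQL. In other words,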

table.groupBy($"foo",$"bar")

is equivalent to:

SELECT foo,bar,[agg-expressions] FROM table GROUP BY foo,bar

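cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies the aggregate expressions to every possible combination of the grouping columns. Let's say you have data like this:

val df = Seq(("foo",1L),("foo",2L),("bar",2L),("bar",2L)).toDF("x","y")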

df.show

// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

and you compute cube(x, y) with count as the aggregation:

df.cube($"x",$"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+
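To make the "all possible combinations" idea concrete, here is a minimal sketch in plain Scala collections (no Spark involved) that reproduces the counts in the table above. The names CubeSketch and cubeCounts are hypothetical, and None stands in for the null placeholder. This is only an illustration of the grouping logic, not how Spark implements cube internally.

```scala
object CubeSketch {
  // Same four rows as the DataFrame shown above: (x, y)
  val rows: Seq[(String, Long)] =
    Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L))

  // Count rows per (x, y) key; None plays the role of the NULL placeholder.
  def cubeCounts(data: Seq[(String, Long)]): Map[(Option[String], Option[Long]), Int] = {
    // cube(x, y) has four grouping sets: each column is either kept or dropped.
    val masks = for (keepX <- Seq(true, false); keepY <- Seq(true, false)) yield (keepX, keepY)
    masks.flatMap { case (keepX, keepY) =>
      data
        .groupBy { case (x, y) => (if (keepX) Some(x) else None, if (keepY) Some(y) else None) }
        .map { case (key, group) => key -> group.size }
    }.toMap
  }

  def main(args: Array[String]): Unit =
    cubeCounts(rows).foreach { case ((x, y), n) =>
      println(s"${x.getOrElse("null")}\t${y.map(_.toString).getOrElse("null")}\t$n")
    }
}
```

Each row is counted once per grouping set, which is why the subtotal and grand-total rows appear with nulls in the dropped columns.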

A function similar to cube is rollup, which computes hierarchical subtotals from left to right:

df.rollup($"x",$"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to  2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+
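The left-to-right hierarchy can be sketched the same way in plain Scala (again without Spark): rollup(x, y) only uses the prefixes of the column list, so a grouping on y alone never appears. RollupSketch and rollupCounts are hypothetical names, and None stands in for the null placeholder.

```scala
object RollupSketch {
  // Same four rows as the DataFrame shown above: (x, y)
  val rows: Seq[(String, Long)] =
    Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L))

  // Count rows per (x, y) key; None plays the role of the NULL placeholder.
  def rollupCounts(data: Seq[(String, Long)]): Map[(Option[String], Option[Long]), Int] = {
    // rollup(x, y) keeps only prefixes of the column list: (x, y), (x), ().
    val masks = Seq((true, true), (true, false), (false, false))
    masks.flatMap { case (keepX, keepY) =>
      data
        .groupBy { case (x, y) => (if (keepX) Some(x) else None, if (keepY) Some(y) else None) }
        .map { case (key, group) => key -> group.size }
    }.toMap
  }

  def main(args: Array[String]): Unit =
    rollupCounts(rows).foreach { case ((x, y), n) =>
      println(s"${x.getOrElse("null")}\t${y.map(_.toString).getOrElse("null")}\t$n")
    }
}
```

Compared with cube, the only change is the list of grouping sets: the (null, y) combinations are missing, which matches the six rows in the output above.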

Just for comparison, let's look at the result of a plain groupBy:

df.groupBy($"x",$"y").count.show

// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

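To summarize:

>With a plain GROUP BY, every row is included exactly once, in its corresponding summary.
>With GROUP BY CUBE(..), every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the output shown above is equivalent to something like this (assuming we could use NULL placeholders):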

SELECT NULL,NULL,COUNT(*) FROM table
UNION ALL
SELECT x,NULL,COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL,y,COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x,y,COUNT(*) FROM table GROUP BY x,y

>GROUP BY ROLLUP(...) is similar to CUBE, but works hierarchically by filling in columns from left to right.

SELECT NULL,NULL,COUNT(*) FROM table
UNION ALL
SELECT x,NULL,COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x,y,COUNT(*) FROM table GROUP BY x,y

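ROLLUP and CUBE come from data warehousing extensions, so if you want a better understanding of how they work, you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented there.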

(Editor: 李大同)
