References:
1. GROUPING SETS, CUBE, and ROLLUP in Hive
https://blog.csdn.net/weixin_34352449/article/details/92169948
2. The GROUPING function
https://blog.csdn.net/u013053796/article/details/18619315
3. GROUP BY ... WITH ROLLUP: subtotals on top of grouped results
https://blog.csdn.net/u013677636/article/details/52353812
When working with Hive, we frequently need aggregate statistics. The everyday tool is the GROUP BY clause, but we often have to aggregate the same data along several dimensions at once, and that is where Hive's enhanced aggregation syntax comes in.
This article explains the meaning and usage of ROLLUP, GROUPING SETS, and CUBE, working through each with a concrete case.
Suppose we have salary data for several companies, each with multiple departments and employees, and we need salary totals at several levels of granularity. The DDL for the source table and the target aggregate tables:
- USE data_warehouse_test;
-
-
- CREATE TABLE IF NOT EXISTS datacube_salary_org (
- company_name STRING COMMENT 'company name'
- ,dep_name STRING COMMENT 'department name'
- ,user_id BIGINT COMMENT 'user id'
- ,user_name STRING COMMENT 'user name'
- ,salary DECIMAL(10,2) COMMENT 'salary'
- ,create_time DATE COMMENT 'create time'
- ,update_time DATE COMMENT 'update time'
- )
- PARTITIONED BY(
- pt STRING COMMENT 'data partition'
- )
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE
- ;
-
-
- CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
- company_name STRING COMMENT 'company name'
- ,dep_name STRING COMMENT 'department name'
- ,user_id BIGINT COMMENT 'user id'
- ,salary DECIMAL(10,2) COMMENT 'salary'
- )
- STORED AS ORC
- ;
-
-
- CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
- company_name STRING COMMENT 'company name'
- ,dep_name STRING COMMENT 'department name'
- ,total_salary DECIMAL(10,2) COMMENT 'total salary'
- )
- STORED AS ORC
- ;
-
-
- CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
- company_name STRING COMMENT 'company name'
- ,total_salary DECIMAL(10,2) COMMENT 'total salary'
- )
- STORED AS ORC
- ;
-
-
- CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
- total_salary DECIMAL(10,2) COMMENT 'total salary'
- )
- STORED AS ORC
- ;

There are two ways to load the sample data:
1. Insert directly via SQL
2. LOAD from a file

1. Insert via SQL
- INSERT OVERWRITE TABLE datacube_salary_org PARTITION (pt = '20200407') VALUES
- ('s.zh','enginer',1,'szh',28000.0,'2020-04-07','2020-04-07'),
- ('s.zh','enginer',2,'zyq',26000.0,'2020-04-03','2020-04-03'),
- ('s.zh','tester',3,'gkm',20000.0,'2020-04-07','2020-04-07'),
- ('x.qx','finance',4,'pip',13400.0,'2020-04-07','2020-04-07'),
- ('x.qx','finance',5,'kip',24500.0,'2020-04-07','2020-04-07'),
- ('x.qx','finance',6,'zxxc',13000.0,'2020-04-07','2020-04-07'),
- ('x.qx','kiccp',7,'xsz',8600.0,'2020-04-07','2020-04-07')
- ;
2. LOAD from a file
Create a txt file (org_data.txt) with the content below. (Note that this file spells the department 'engineer', while the INSERT above used 'enginer'; the query results later in this article come from the pt='20200407' partition written by the INSERT.)
- s.zh,engineer,1,szh,28000.0,2020-04-07,2020-04-07
- s.zh,engineer,2,zyq,26000.0,2020-04-03,2020-04-03
- s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
- x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
- x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
- x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
- x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07
We can load this file in either of two ways.
Method 1:
1. Create the partition directory
2. Upload the file
3. Repair the table
hdfs dfs -mkdir /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
hdfs dfs -put org_data.txt /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
Then run the following in the beeline / hive client:
MSCK REPAIR TABLE datacube_salary_org;
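To confirm that the repair registered the partition, a quick sanity check (a minimal sketch, assuming the table and data above):
- SHOW PARTITIONS datacube_salary_org;
-
- -- The seven rows from org_data.txt should now be visible in pt=20200406
- SELECT * FROM datacube_salary_org WHERE pt = '20200406' LIMIT 10;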
Method 2:
1. Create the partition
2. LOAD the file

The general syntax of LOAD is:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Create the partition:
ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200405');
Load the data:
LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt' OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200405');
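A quick check that the load landed in the new partition (again a sketch, assuming the paths above):
- SELECT COUNT(*) FROM datacube_salary_org WHERE pt = '20200405';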
ROLLUP, GROUPING SETS, CUBE
Let's work through GROUPING SETS, ROLLUP, and CUBE one at a time.
GROUPING SETS is a clause attached to GROUP BY that lets you specify several grouping combinations in a single statement. You can think of it as several GROUP BY queries whose results are combined with UNION ALL; the examples below make this concrete.
First, GROUPING SETS. Its usage:
- SELECT
- a
- ,b
- ...
- ,f
- FROM test_table
- GROUP BY
- a
- ,b
- ...
- ,f
- GROUPING SETS ((?,...,?),(xxx),(yyy))

GROUPING SETS accepts multiple grouping combinations; each (?,...,?) can be any set of distinct columns drawn from a through f, and an empty set () denotes the grand total.
A concrete example:
- SELECT
- a
- ,b
- ,SUM(c)
- FROM test_table
- GROUP BY
- a
- ,b
- GROUPING SETS ((a),(a,b),())
This is equivalent to:
- SELECT
- a
- ,NULL
- ,SUM(c)
- FROM test_table
- GROUP BY
- a
-
- UNION ALL
-
- SELECT
- a
- ,b
- ,SUM(c)
- FROM test_table
- GROUP BY
- a
- ,b
-
- UNION ALL
-
- SELECT
- NULL
- ,NULL
- ,SUM(c)
- FROM test_table
- ;

A practical case: we want to total employee salaries per company, and overall, in a single statement. How do we write that SQL?
As follows:
- SELECT
- grouping__id
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- GROUPING SETS ((company_name), ())
- ORDER BY
- grouping__id
- ;

The output:
- INFO : MapReduce Jobs Launched:
- INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.13 sec HDFS Read: 11666 HDFS Write: 175 SUCCESS
- INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.73 sec HDFS Read: 7060 HDFS Write: 188 SUCCESS
- INFO : Total MapReduce CPU Time Spent: 7 seconds 860 msec
- INFO : Completed executing command(queryId=hive_20200408032038_18e04047-b8c0-4d07-a5de-00ccbc7cb4cc); Time taken: 51.459 seconds
- INFO : OK
- +---------------+---------------+-----------+----------+---------------+
- | grouping__id | company_name | dep_name | user_id | total_salary |
- +---------------+---------------+-----------+----------+---------------+
- | 0 | NULL | NULL | NULL | 133500.00 |
- | 1 | x.qx | NULL | NULL | 59500.00 |
- | 1 | s.zh | NULL | NULL | 74000.00 |
- +---------------+---------------+-----------+----------+---------------+
As you can see, the row with grouping__id 0 is the overall salary total, while the rows with grouping__id 1 are the per-company totals.
For how GROUPING__ID is computed, and for the GROUPING function, see my other article.
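As a quick illustration of the GROUPING function (a sketch; the grouping() UDF is only available in Hive 2.3.0 and later, so treat its availability as an assumption about your version), you can tag each output row explicitly:
- SELECT
- GROUPING__ID
- ,grouping(company_name) AS g_company -- 0 when company_name is part of this row's grouping set, 1 when it is rolled up
- ,company_name
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- GROUPING SETS ((company_name), ())
- ;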
We said above that GROUPING SETS is equivalent to GROUP BY + UNION ALL, but are the two really the same? Let's check with EXPLAIN.
Below are the EXPLAIN results for two SQL statements that return the same rows.
SQL1
- EXPLAIN
-
- SELECT
- *
- FROM
- (
- SELECT
- 0 AS mark
- ,NULL
- ,NULL
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
-
- UNION ALL
-
- SELECT
- 1 AS mark
- ,company_name
- ,NULL
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ) tmp
- ORDER BY mark
- ;

SQL2
- EXPLAIN
-
- SELECT
- grouping__id
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- GROUPING SETS ((company_name), ())
- ORDER BY
- grouping__id
- ;

Here are the EXPLAIN results of the two statements.
The UNION ALL version
Note: every additional branch in the UNION ALL adds more jobs.
- INFO : Starting task [Stage-4:EXPLAIN] in serial mode
- INFO : Completed executing command(queryId=hive_20200408035719_94eb75b2-c6fc-4804-bacb-d3555c61e7f3); Time taken: 0.016 seconds
- INFO : OK
- +----------------------------------------------------+
- | Explain |
- +----------------------------------------------------+
- | STAGE DEPENDENCIES: |
- | Stage-1 is a root stage |
- | Stage-2 depends on stages: Stage-1, Stage-3 |
- | Stage-3 is a root stage |
- | Stage-0 depends on stages: Stage-2 |
- | |
- | STAGE PLANS: |
- | Stage: Stage-1 |
- | Map Reduce |
- | Map Operator Tree: |
- | TableScan |
- | alias: datacube_salary_org |
- | filterExpr: (pt = '20200407') (type: boolean) |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: salary (type: decimal(10,2)) |
- | outputColumnNames: salary |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Group By Operator |
- | aggregations: sum(salary) |
- | mode: hash |
- | outputColumnNames: _col0 |
- | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
- | Reduce Output Operator |
- | sort order: |
- | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
- | value expressions: _col0 (type: decimal(20,2)) |
- | Reduce Operator Tree: |
- | Group By Operator |
- | aggregations: sum(VALUE._col0) |
- | mode: mergepartial |
- | outputColumnNames: _col0 |
- | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: 0 (type: int), null (type: string), _col0 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col4 |
- | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
- | File Output Operator |
- | compressed: false |
- | table: |
- | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
- | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
- | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
- | |
- | Stage: Stage-2 |
- | Map Reduce |
- | Map Operator Tree: |
- | TableScan |
- | Union |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col4 |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | Reduce Output Operator |
- | key expressions: _col0 (type: int) |
- | sort order: + |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
- | TableScan |
- | Union |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col4 |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | Reduce Output Operator |
- | key expressions: _col0 (type: int) |
- | sort order: + |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
- | Reduce Operator Tree: |
- | Select Operator |
- | expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), null (type: void), null (type: void), VALUE._col3 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | File Output Operator |
- | compressed: false |
- | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
- | table: |
- | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
- | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
- | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
- | |
- | Stage: Stage-3 |
- | Map Reduce |
- | Map Operator Tree: |
- | TableScan |
- | alias: datacube_salary_org |
- | filterExpr: (pt = '20200407') (type: boolean) |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: company_name (type: string), salary (type: decimal(10,2)) |
- | outputColumnNames: company_name, salary |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Group By Operator |
- | aggregations: sum(salary) |
- | keys: company_name (type: string) |
- | mode: hash |
- | outputColumnNames: _col0, _col1 |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Reduce Output Operator |
- | key expressions: _col0 (type: string) |
- | sort order: + |
- | Map-reduce partition columns: _col0 (type: string) |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | value expressions: _col1 (type: decimal(20,2)) |
- | Reduce Operator Tree: |
- | Group By Operator |
- | aggregations: sum(VALUE._col0) |
- | keys: KEY._col0 (type: string) |
- | mode: mergepartial |
- | outputColumnNames: _col0, _col1 |
- | Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: 1 (type: int), _col0 (type: string), _col1 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col4 |
- | Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
- | File Output Operator |
- | compressed: false |
- | table: |
- | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
- | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
- | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
- | |
- | Stage: Stage-0 |
- | Fetch Operator |
- | limit: -1 |
- | Processor Tree: |
- | ListSink |
- | |
- +----------------------------------------------------+

The GROUPING SETS version
- INFO : Starting task [Stage-3:EXPLAIN] in serial mode
- INFO : Completed executing command(queryId=hive_20200408035951_8153b57c-9f60-4c2d-bc26-92659fdc8afd); Time taken: 0.007 seconds
- INFO : OK
- +----------------------------------------------------+
- | Explain |
- +----------------------------------------------------+
- | STAGE DEPENDENCIES: |
- | Stage-1 is a root stage |
- | Stage-2 depends on stages: Stage-1 |
- | Stage-0 depends on stages: Stage-2 |
- | |
- | STAGE PLANS: |
- | Stage: Stage-1 |
- | Map Reduce |
- | Map Operator Tree: |
- | TableScan |
- | alias: datacube_salary_org |
- | filterExpr: (pt = '20200407') (type: boolean) |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), salary (type: decimal(10,2)) |
- | outputColumnNames: company_name, dep_name, user_id, salary |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Group By Operator |
- | aggregations: sum(salary) |
- | keys: company_name (type: string), dep_name (type: string), user_id (type: bigint), 0 (type: int) |
- | mode: hash |
- | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
- | Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
- | Reduce Output Operator |
- | key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
- | sort order: ++++ |
- | Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
- | Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
- | value expressions: _col4 (type: decimal(20,2)) |
- | Reduce Operator Tree: |
- | Group By Operator |
- | aggregations: sum(VALUE._col0) |
- | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: int) |
- | mode: mergepartial |
- | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | Select Operator |
- | expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col4 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | File Output Operator |
- | compressed: false |
- | table: |
- | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
- | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
- | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
- | |
- | Stage: Stage-2 |
- | Map Reduce |
- | Map Operator Tree: |
- | TableScan |
- | Reduce Output Operator |
- | key expressions: _col0 (type: int) |
- | sort order: + |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: decimal(20,2)) |
- | Reduce Operator Tree: |
- | Select Operator |
- | expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint), VALUE._col3 (type: decimal(20,2)) |
- | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | File Output Operator |
- | compressed: false |
- | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
- | table: |
- | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
- | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
- | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
- | |
- | Stage: Stage-0 |
- | Fetch Operator |
- | limit: -1 |
- | Processor Tree: |
- | ListSink |
- | |
- +----------------------------------------------------+

Comparing the two plans, the GROUPING SETS version is split into fewer jobs,
and in practice it also runs faster than the UNION ALL + GROUP BY form.
Here are the execution times side by side.
GROUPING SETS
- INFO : MapReduce Jobs Launched:
- INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.62 sec HDFS Read: 11666 HDFS Write: 175 SUCCESS
- INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.51 sec HDFS Read: 7060 HDFS Write: 188 SUCCESS
- INFO : Total MapReduce CPU Time Spent: 7 seconds 130 msec
- INFO : Completed executing command(queryId=hive_20200408045412_4ab9e09f-436e-4433-9a1f-a03d5b32ef3e); Time taken: 49.676 seconds
- INFO : OK
- +---------------+---------------+-----------+----------+---------------+
- | grouping__id | company_name | dep_name | user_id | total_salary |
- +---------------+---------------+-----------+----------+---------------+
- | 0 | NULL | NULL | NULL | 133500.00 |
- | 1 | x.qx | NULL | NULL | 59500.00 |
- | 1 | s.zh | NULL | NULL | 74000.00 |
- +---------------+---------------+-----------+----------+---------------+
UNION ALL + GROUP BY
- INFO : MapReduce Jobs Launched:
- INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.7 sec HDFS Read: 10541 HDFS Write: 119 SUCCESS
- INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.34 sec HDFS Read: 10919 HDFS Write: 152 SUCCESS
- INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 5.08 sec HDFS Read: 12932 HDFS Write: 188 SUCCESS
- INFO : Total MapReduce CPU Time Spent: 13 seconds 120 msec
- INFO : Completed executing command(queryId=hive_20200408045141_2033cbf6-a457-4bdb-aaec-65900b386972); Time taken: 84.365 seconds
- INFO : OK
- +-----------+----------+----------+----------+-------------------+
- | tmp.mark | tmp._c1 | tmp._c2 | tmp._c3 | tmp.total_salary |
- +-----------+----------+----------+----------+-------------------+
- | 0 | NULL | NULL | NULL | 133500.00 |
- | 1 | x.qx | NULL | NULL | 59500.00 |
- | 1 | s.zh | NULL | NULL | 74000.00 |
- +-----------+----------+----------+----------+-------------------+
As you can see, GROUPING SETS executes more efficiently (about 50 s versus 84 s here).
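One related knob worth knowing about (an assumption to verify against your Hive version's documentation): when a query expands into many grouping sets, Hive can split the work into an additional job to limit map-side data expansion.
- -- If the number of grouping sets (rows emitted per input row) exceeds this
- -- threshold, Hive launches an extra job; the default is 30.
- SET hive.new.job.grouping.set.cardinality = 30;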
ROLLUP builds hierarchical aggregates that roll up from right to left: it groups by all the listed columns, then repeatedly drops the rightmost column, down to the grand total. That is,
- SELECT
- grouping__id
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- WITH ROLLUP
- ORDER BY
- grouping__id
- ;

is equivalent to
- SELECT
- grouping__id
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- GROUPING SETS ((company_name, dep_name, user_id), (company_name, dep_name), (company_name),())
- ORDER BY
- grouping__id
- ;

which in turn is equivalent (apart from the grouping__id column) to
- SELECT
- company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
-
- UNION ALL
-
- SELECT
- company_name
- ,dep_name
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
-
- UNION ALL
-
- SELECT
- company_name
- ,NULL
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
-
- UNION ALL
-
- SELECT
- NULL
- ,NULL
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- ;

The result:
- INFO : MapReduce Jobs Launched:
- INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.21 sec HDFS Read: 11674 HDFS Write: 563 SUCCESS
- INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.91 sec HDFS Read: 7448 HDFS Write: 602 SUCCESS
- INFO : Total MapReduce CPU Time Spent: 8 seconds 120 msec
- INFO : Completed executing command(queryId=hive_20200408052638_740f42b9-6f08-49a6-8123-9a77aedc6b19); Time taken: 50.563 seconds
- INFO : OK
- +---------------+---------------+-----------+----------+---------------+
- | grouping__id | company_name | dep_name | user_id | total_salary |
- +---------------+---------------+-----------+----------+---------------+
- | 0 | NULL | NULL | NULL | 133500.00 |
- | 1 | s.zh | NULL | NULL | 74000.00 |
- | 1 | x.qx | NULL | NULL | 59500.00 |
- | 3 | s.zh | tester | NULL | 20000.00 |
- | 3 | x.qx | kiccp | NULL | 8600.00 |
- | 3 | x.qx | finance | NULL | 50900.00 |
- | 3 | s.zh | enginer | NULL | 54000.00 |
- | 7 | x.qx | kiccp | 7 | 8600.00 |
- | 7 | x.qx | finance | 6 | 13000.00 |
- | 7 | x.qx | finance | 5 | 24500.00 |
- | 7 | x.qx | finance | 4 | 13400.00 |
- | 7 | s.zh | enginer | 2 | 26000.00 |
- | 7 | s.zh | enginer | 1 | 28000.00 |
- | 7 | s.zh | tester | 3 | 20000.00 |
- +---------------+---------------+-----------+----------+---------------+

- SELECT
- grouping__id
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- WITH ROLLUP
- ORDER BY
- grouping__id
- ;
Looking at this ROLLUP query again: GROUP BY has three fields, in order company_name, dep_name, user_id.
Treat them as a 3-bit binary number,
i.e. (0/1)(0/1)(0/1),
where the low bit corresponds to company_name
and the high bit to user_id.
Under the legacy semantics shown here (Hive before 2.3.0), a bit is 1 when its field is part of the row's grouping set and 0 when the field has been rolled up; the SQL-standard GROUPING function is the other way around, returning 0 for a grouped field.
GROUPING__ID is then the decimal value of that binary number.
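A worked example under this rule: for the grouping set (company_name), the bits are user_id = 0, dep_name = 0, company_name = 1, i.e. binary 001, so GROUPING__ID = 1, which matches the grouping__id 1 rows in the output above.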
1) The full-grouping subset of ROLLUP
- SELECT
- company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
The corresponding rows:
- +---------------+---------------+-----------+----------+---------------+
- | grouping__id | company_name | dep_name | user_id | total_salary |
- +---------------+---------------+-----------+----------+---------------+
- | 7 | x.qx | kiccp | 7 | 8600.00 |
- | 7 | x.qx | finance | 6 | 13000.00 |
- | 7 | x.qx | finance | 5 | 24500.00 |
- | 7 | x.qx | finance | 4 | 13400.00 |
- | 7 | s.zh | enginer | 2 | 26000.00 |
- | 7 | s.zh | enginer | 1 | 28000.00 |
- | 7 | s.zh | tester | 3 | 20000.00 |
- +---------------+---------------+-----------+----------+---------------+
Since all three fields are in the grouping set, the bits are 111, so GROUPING__ID is 7.
2) The (company_name, dep_name) subset of ROLLUP
- SELECT
- company_name
- ,dep_name
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
The corresponding rows:
- +---------------+---------------+-----------+----------+---------------+
- | grouping__id | company_name | dep_name | user_id | total_salary |
- +---------------+---------------+-----------+----------+---------------+
- | 3 | s.zh | tester | NULL | 20000.00 |
- | 3 | x.qx | kiccp | NULL | 8600.00 |
- | 3 | x.qx | finance | NULL | 50900.00 |
- | 3 | s.zh | enginer | NULL | 54000.00 |
- +---------------+---------------+-----------+----------+---------------+
Here the grouping set is (company_name, dep_name).
Applying the rule above (low bit = company_name, high bit = user_id, a bit is 1 when the field is grouped),
the bits are 011, so GROUPING__ID is 3.
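Note the pattern: ROLLUP over n columns produces exactly n + 1 grouping sets. Here the four sets (company_name, dep_name, user_id), (company_name, dep_name), (company_name), and () yield GROUPING__ID values 7, 3, 1, 0 — exactly the ids in the ROLLUP output above.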
That should make the rule clear. Now let's look at CUBE.
CUBE (the "data cube") aggregates over every combination of the selected dimensions: CUBE(a,b,c) groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b), (c), and finally over the whole table. In other words, it computes aggregates for all 2^3 = 8 combinations of the listed columns.
- SELECT
- grouping__id
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- WITH CUBE
- ORDER BY
- grouping__id
- ;

is equivalent to the UNION ALL of all eight grouping sets of (company_name, dep_name, user_id):
- SELECT
- company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
-
- UNION ALL
-
- SELECT
- company_name
- ,dep_name
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
-
- UNION ALL
-
- SELECT
- company_name
- ,NULL
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,user_id
-
-
- UNION ALL
-
- SELECT
- company_name
- ,NULL
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
-
- UNION ALL
-
- SELECT
- NULL
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- dep_name
- ,user_id
-
-
- UNION ALL
-
- SELECT
- NULL
- ,dep_name
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- dep_name
-
-
- UNION ALL
-
- SELECT
- NULL
- ,NULL
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- user_id
-
- UNION ALL
-
- SELECT
- NULL
- ,NULL
- ,NULL
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
-
-
- ;
-

The result:
- +---------------+---------------+-----------+----------+---------------+
- | grouping__id | company_name | dep_name | user_id | total_salary |
- +---------------+---------------+-----------+----------+---------------+
- | 0 | NULL | NULL | NULL | 133500.00 |
- | 1 | s.zh | NULL | NULL | 74000.00 |
- | 1 | x.qx | NULL | NULL | 59500.00 |
- | 2 | NULL | finance | NULL | 50900.00 |
- | 2 | NULL | kiccp | NULL | 8600.00 |
- | 2 | NULL | tester | NULL | 20000.00 |
- | 2 | NULL | enginer | NULL | 54000.00 |
- | 3 | s.zh | tester | NULL | 20000.00 |
- | 3 | s.zh | enginer | NULL | 54000.00 |
- | 3 | x.qx | kiccp | NULL | 8600.00 |
- | 3 | x.qx | finance | NULL | 50900.00 |
- | 4 | NULL | NULL | 7 | 8600.00 |
- | 4 | NULL | NULL | 5 | 24500.00 |
- | 4 | NULL | NULL | 4 | 13400.00 |
- | 4 | NULL | NULL | 3 | 20000.00 |
- | 4 | NULL | NULL | 2 | 26000.00 |
- | 4 | NULL | NULL | 1 | 28000.00 |
- | 4 | NULL | NULL | 6 | 13000.00 |
- | 5 | s.zh | NULL | 2 | 26000.00 |
- | 5 | s.zh | NULL | 3 | 20000.00 |
- | 5 | x.qx | NULL | 5 | 24500.00 |
- | 5 | x.qx | NULL | 6 | 13000.00 |
- | 5 | s.zh | NULL | 1 | 28000.00 |
- | 5 | x.qx | NULL | 7 | 8600.00 |
- | 5 | x.qx | NULL | 4 | 13400.00 |
- | 6 | NULL | enginer | 1 | 28000.00 |
- | 6 | NULL | finance | 4 | 13400.00 |
- | 6 | NULL | tester | 3 | 20000.00 |
- | 6 | NULL | finance | 5 | 24500.00 |
- | 6 | NULL | kiccp | 7 | 8600.00 |
- | 6 | NULL | enginer | 2 | 26000.00 |
- | 6 | NULL | finance | 6 | 13000.00 |
- | 7 | x.qx | finance | 5 | 24500.00 |
- | 7 | x.qx | finance | 4 | 13400.00 |
- | 7 | x.qx | kiccp | 7 | 8600.00 |
- | 7 | s.zh | tester | 3 | 20000.00 |
- | 7 | s.zh | enginer | 2 | 26000.00 |
- | 7 | s.zh | enginer | 1 | 28000.00 |
- | 7 | x.qx | finance | 6 | 13000.00 |
- +---------------+---------------+-----------+----------+---------------+
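A practical follow-up (a sketch under the legacy GROUPING__ID semantics shown above): to keep only one combination out of the cube, say the per-department totals (bits 010 = 2), filter on the id in an outer query:
- SELECT *
- FROM
- (
- SELECT
- grouping__id AS gid
- ,company_name
- ,dep_name
- ,user_id
- ,SUM(salary) AS total_salary
- FROM datacube_salary_org
- WHERE pt = '20200407'
- GROUP BY
- company_name
- ,dep_name
- ,user_id
- WITH CUBE
- ) tmp
- WHERE tmp.gid = 2 -- 010: only dep_name is grouped
- ;
Of course, if you only ever need one combination, a plain GROUP BY dep_name is simpler; filtering a cube makes sense when several combinations are consumed from the same pass.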
