
Hive: ROLLUP, GROUPING SETS, CUBE Aggregations, and the GROUPING Function


References:

1. GROUPING SETS, CUBE, and ROLLUP in Hive

https://blog.csdn.net/weixin_34352449/article/details/92169948

 

2. The GROUPING function

https://blog.csdn.net/u013053796/article/details/18619315

 

3. GROUP BY ... WITH ROLLUP: adding subtotals to grouped statistics

https://blog.csdn.net/u013677636/article/details/52353812

 

When working with Hive, we frequently need to compute aggregate statistics.

There are many ways to aggregate data; the one we reach for most often is a plain GROUP BY.

 

Often, though, we need to aggregate the same data along several dimensions at once, and that is where Hive's grouping extensions come in.

This post explains the meaning and usage of ROLLUP, GROUPING SETS, and CUBE.

 

We will walk through the rules for these constructs with a concrete example.

 

Background:

We have salary records for several companies, departments, and employees, and we need to aggregate the salaries along multiple dimensions.

 

First, let's create the base tables.

use data_warehouse_test;

CREATE TABLE IF NOT EXISTS datacube_salary_org (
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,user_name STRING COMMENT 'user name'
 ,salary DECIMAL(10,2) COMMENT 'salary'
 ,create_time DATE COMMENT 'creation time'
 ,update_time DATE COMMENT 'update time'
)
PARTITIONED BY(
 pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,salary DECIMAL(10,2) COMMENT 'salary'
)
STORED AS ORC
;

CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;

CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
 company_name STRING COMMENT 'company name'
 ,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;

CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
 total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;

 

Next, we load the base data. There are two ways to do it:

1. Insert directly with SQL

2. LOAD from a file

1. Insert with SQL

INSERT OVERWRITE TABLE datacube_salary_org PARTITION (pt = '20200407') VALUES
('s.zh','enginer',1,'szh',28000.0,'2020-04-07','2020-04-07'),
('s.zh','enginer',2,'zyq',26000.0,'2020-04-03','2020-04-03'),
('s.zh','tester',3,'gkm',20000.0,'2020-04-07','2020-04-07'),
('x.qx','finance',4,'pip',13400.0,'2020-04-07','2020-04-07'),
('x.qx','finance',5,'kip',24500.0,'2020-04-07','2020-04-07'),
('x.qx','finance',6,'zxxc',13000.0,'2020-04-07','2020-04-07'),
('x.qx','kiccp',7,'xsz',8600.0,'2020-04-07','2020-04-07')
;

 

2. LOAD from a file

Create a txt file with the following content:

s.zh,engineer,1,szh,28000.0,2020-04-07,2020-04-07
s.zh,engineer,2,zyq,26000.0,2020-04-03,2020-04-03
s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07

The file can then be loaded in either of the following two ways.

 

Method 1:

 create the partition directory in HDFS,

 put the file into it,

 and repair the table metadata.

hdfs dfs -mkdir /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
hdfs dfs -put org_data.txt /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406

Then run the following statement in the beeline / hive client:

MSCK REPAIR TABLE datacube_salary_org;

 

Method 2:

   add the partition,

   then LOAD the file.

 

The standard syntax of LOAD is:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]  

  

Add the partition:

ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200405');

Load the data:

LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt'  OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200405');
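
Whichever method you use, it is worth confirming that the partition is registered and the rows parsed correctly. A minimal check (the partition value below matches the LOAD example above; adjust it to whichever partition you loaded):

SHOW PARTITIONS datacube_salary_org;

SELECT
 company_name
 ,dep_name
 ,user_id
 ,user_name
 ,salary
FROM datacube_salary_org
WHERE pt = '20200405'
ORDER BY user_id
;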

 

 

With the data in place, let's look at how these constructs are used.

 

ROLLUP,  GROUPING SETS, CUBE

We will go through GROUPING SETS, ROLLUP, and CUBE one at a time.

 

GROUPING SETS

GROUPING SETS is a clause attached to GROUP BY that lets you specify several groupings in a single statement. You can think of it as several GROUP BY queries whose results are combined with UNION ALL. The examples below should make this concrete.

 

 

The general form of GROUPING SETS is:

SELECT
 a
 ,b
 ...
 ,f
FROM test_table
GROUP BY
 a
 ,b
 ...
 ,f
GROUPING SETS ((?,...,?),(xxx),(yyy))

GROUPING SETS can list any number of grouping sets, and each (?,...,?) can be any combination of distinct columns from a through f.

 

A concrete example:

SELECT
 a
 ,b
 ,SUM(c)
FROM test_table
GROUP BY
 a
 ,b
GROUPING SETS ((a),(a,b),())

is equivalent to

SELECT
 a
 ,NULL
 ,SUM(c)
FROM test_table
GROUP BY
 a
UNION ALL
SELECT
 a
 ,b
 ,SUM(c)
FROM test_table
GROUP BY
 a
 ,b
UNION ALL
SELECT
 NULL
 ,NULL
 ,SUM(c)
FROM test_table
;

 

A practical case: we want the total salary per company and the overall total, computed in a single statement.

How do we write that SQL?

Like this:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
GROUPING SETS ((company_name), ())
ORDER BY
 grouping__id
;

 

The output:

INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.13 sec HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.73 sec HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 860 msec
INFO : Completed executing command(queryId=hive_20200408032038_18e04047-b8c0-4d07-a5de-00ccbc7cb4cc); Time taken: 51.459 seconds
INFO : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
+---------------+---------------+-----------+----------+---------------+

A grouping__id of 0 marks the overall total salary,

while grouping__id 1 marks the per-company totals.

For how GROUPING__ID is computed and for the GROUPING function, see my other article; a short explanation also appears later in this post.
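
Incidentally, newer Hive releases (2.3.0 and later, to the best of my knowledge) also provide a grouping() function. It follows the SQL standard: it returns 0 when a column is part of the current grouping set and 1 when the column has been rolled up. A minimal sketch under that assumption:

SELECT
 company_name
 ,grouping(company_name) AS g_company  -- 0 on the per-company rows, 1 on the grand-total row (assumes Hive 2.3.0+)
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
GROUPING SETS ((company_name), ())
;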

 

 

 

We said above that GROUPING SETS is equivalent to GROUP BY plus UNION ALL, but are they really the same under the hood? Let's run EXPLAIN and check.

Below are the EXPLAIN results for two SQL statements that produce the same output.

SQL1

EXPLAIN
SELECT
 *
FROM
(
 SELECT
  0 AS mark
  ,NULL
  ,NULL
  ,NULL
  ,SUM(salary) AS total_salary
 FROM datacube_salary_org
 WHERE pt = '20200407'
 UNION ALL
 SELECT
  1 AS mark
  ,company_name
  ,NULL
  ,NULL
  ,SUM(salary) AS total_salary
 FROM datacube_salary_org
 WHERE pt = '20200407'
 GROUP BY
  company_name
) tmp
ORDER BY mark
;

 

SQL2

EXPLAIN
SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
GROUPING SETS ((company_name), ())
ORDER BY
 grouping__id
;

 

 

Here are the EXPLAIN outputs for the two statements.

The UNION ALL version

Note: every additional branch in the UNION ALL adds more jobs.

  1. INFO : Starting task [Stage-4:EXPLAIN] in serial mode
  2. INFO : Completed executing command(queryId=hive_20200408035719_94eb75b2-c6fc-4804-bacb-d3555c61e7f3); Time taken: 0.016 seconds
  3. INFO : OK
  4. +----------------------------------------------------+
  5. | Explain |
  6. +----------------------------------------------------+
  7. | STAGE DEPENDENCIES: |
  8. | Stage-1 is a root stage |
  9. | Stage-2 depends on stages: Stage-1, Stage-3 |
  10. | Stage-3 is a root stage |
  11. | Stage-0 depends on stages: Stage-2 |
  12. | |
  13. | STAGE PLANS: |
  14. | Stage: Stage-1 |
  15. | Map Reduce |
  16. | Map Operator Tree: |
  17. | TableScan |
  18. | alias: datacube_salary_org |
  19. | filterExpr: (pt = '20200407') (type: boolean) |
  20. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  21. | Select Operator |
  22. | expressions: salary (type: decimal(10,2)) |
  23. | outputColumnNames: salary |
  24. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  25. | Group By Operator |
  26. | aggregations: sum(salary) |
  27. | mode: hash |
  28. | outputColumnNames: _col0 |
  29. | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
  30. | Reduce Output Operator |
  31. | sort order: |
  32. | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
  33. | value expressions: _col0 (type: decimal(20,2)) |
  34. | Reduce Operator Tree: |
  35. | Group By Operator |
  36. | aggregations: sum(VALUE._col0) |
  37. | mode: mergepartial |
  38. | outputColumnNames: _col0 |
  39. | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
  40. | Select Operator |
  41. | expressions: 0 (type: int), null (type: string), _col0 (type: decimal(20,2)) |
  42. | outputColumnNames: _col0, _col1, _col4 |
  43. | Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
  44. | File Output Operator |
  45. | compressed: false |
  46. | table: |
  47. | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
  48. | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
  49. | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
  50. | |
  51. | Stage: Stage-2 |
  52. | Map Reduce |
  53. | Map Operator Tree: |
  54. | TableScan |
  55. | Union |
  56. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  57. | Select Operator |
  58. | expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
  59. | outputColumnNames: _col0, _col1, _col4 |
  60. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  61. | Reduce Output Operator |
  62. | key expressions: _col0 (type: int) |
  63. | sort order: + |
  64. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  65. | value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
  66. | TableScan |
  67. | Union |
  68. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  69. | Select Operator |
  70. | expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
  71. | outputColumnNames: _col0, _col1, _col4 |
  72. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  73. | Reduce Output Operator |
  74. | key expressions: _col0 (type: int) |
  75. | sort order: + |
  76. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  77. | value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
  78. | Reduce Operator Tree: |
  79. | Select Operator |
  80. | expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), null (type: void), null (type: void), VALUE._col3 (type: decimal(20,2)) |
  81. | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
  82. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  83. | File Output Operator |
  84. | compressed: false |
  85. | Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
  86. | table: |
  87. | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
  88. | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
  89. | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
  90. | |
  91. | Stage: Stage-3 |
  92. | Map Reduce |
  93. | Map Operator Tree: |
  94. | TableScan |
  95. | alias: datacube_salary_org |
  96. | filterExpr: (pt = '20200407') (type: boolean) |
  97. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  98. | Select Operator |
  99. | expressions: company_name (type: string), salary (type: decimal(10,2)) |
  100. | outputColumnNames: company_name, salary |
  101. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  102. | Group By Operator |
  103. | aggregations: sum(salary) |
  104. | keys: company_name (type: string) |
  105. | mode: hash |
  106. | outputColumnNames: _col0, _col1 |
  107. +----------------------------------------------------+
  108. | Explain |
  109. +----------------------------------------------------+
  110. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  111. | Reduce Output Operator |
  112. | key expressions: _col0 (type: string) |
  113. | sort order: + |
  114. | Map-reduce partition columns: _col0 (type: string) |
  115. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  116. | value expressions: _col1 (type: decimal(20,2)) |
  117. | Reduce Operator Tree: |
  118. | Group By Operator |
  119. | aggregations: sum(VALUE._col0) |
  120. | keys: KEY._col0 (type: string) |
  121. | mode: mergepartial |
  122. | outputColumnNames: _col0, _col1 |
  123. | Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
  124. | Select Operator |
  125. | expressions: 1 (type: int), _col0 (type: string), _col1 (type: decimal(20,2)) |
  126. | outputColumnNames: _col0, _col1, _col4 |
  127. | Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
  128. | File Output Operator |
  129. | compressed: false |
  130. | table: |
  131. | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
  132. | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
  133. | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
  134. | |
  135. | Stage: Stage-0 |
  136. | Fetch Operator |
  137. | limit: -1 |
  138. | Processor Tree: |
  139. | ListSink |
  140. | |
  141. +----------------------------------------------------+

 

The GROUPING SETS version

  1. INFO : Starting task [Stage-3:EXPLAIN] in serial mode
  2. INFO : Completed executing command(queryId=hive_20200408035951_8153b57c-9f60-4c2d-bc26-92659fdc8afd); Time taken: 0.007 seconds
  3. INFO : OK
  4. +----------------------------------------------------+
  5. | Explain |
  6. +----------------------------------------------------+
  7. | STAGE DEPENDENCIES: |
  8. | Stage-1 is a root stage |
  9. | Stage-2 depends on stages: Stage-1 |
  10. | Stage-0 depends on stages: Stage-2 |
  11. | |
  12. | STAGE PLANS: |
  13. | Stage: Stage-1 |
  14. | Map Reduce |
  15. | Map Operator Tree: |
  16. | TableScan |
  17. | alias: datacube_salary_org |
  18. | filterExpr: (pt = '20200407') (type: boolean) |
  19. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  20. | Select Operator |
  21. | expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), salary (type: decimal(10,2)) |
  22. | outputColumnNames: company_name, dep_name, user_id, salary |
  23. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  24. | Group By Operator |
  25. | aggregations: sum(salary) |
  26. | keys: company_name (type: string), dep_name (type: string), user_id (type: bigint), 0 (type: int) |
  27. | mode: hash |
  28. | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
  29. | Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
  30. | Reduce Output Operator |
  31. | key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
  32. | sort order: ++++ |
  33. | Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
  34. | Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
  35. | value expressions: _col4 (type: decimal(20,2)) |
  36. | Reduce Operator Tree: |
  37. | Group By Operator |
  38. | aggregations: sum(VALUE._col0) |
  39. | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: int) |
  40. | mode: mergepartial |
  41. | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
  42. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  43. | Select Operator |
  44. | expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col4 (type: decimal(20,2)) |
  45. | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
  46. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  47. | File Output Operator |
  48. | compressed: false |
  49. | table: |
  50. | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
  51. | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
  52. | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
  53. | |
  54. | Stage: Stage-2 |
  55. | Map Reduce |
  56. | Map Operator Tree: |
  57. | TableScan |
  58. | Reduce Output Operator |
  59. | key expressions: _col0 (type: int) |
  60. | sort order: + |
  61. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  62. | value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: decimal(20,2)) |
  63. | Reduce Operator Tree: |
  64. | Select Operator |
  65. | expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint), VALUE._col3 (type: decimal(20,2)) |
  66. | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
  67. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  68. | File Output Operator |
  69. | compressed: false |
  70. | Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
  71. | table: |
  72. | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
  73. | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
  74. | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
  75. | |
  76. | Stage: Stage-0 |
  77. | Fetch Operator |
  78. | limit: -1 |
  79. | Processor Tree: |
  80. | ListSink |
  81. | |
  82. +----------------------------------------------------+

 

Comparing the two plans, the GROUPING SETS version is split into fewer stages and jobs.

In practice, the GROUPING SETS form also runs faster than the UNION ALL of separate GROUP BY queries.

 

Here are the execution times for comparison.

GROUPING SETS

INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.62 sec HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.51 sec HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 130 msec
INFO : Completed executing command(queryId=hive_20200408045412_4ab9e09f-436e-4433-9a1f-a03d5b32ef3e); Time taken: 49.676 seconds
INFO : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
+---------------+---------------+-----------+----------+---------------+

 

 

UNION ALL of GROUP BY queries

INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.7 sec HDFS Read: 10541 HDFS Write: 119 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.34 sec HDFS Read: 10919 HDFS Write: 152 SUCCESS
INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 5.08 sec HDFS Read: 12932 HDFS Write: 188 SUCCESS
INFO : Total MapReduce CPU Time Spent: 13 seconds 120 msec
INFO : Completed executing command(queryId=hive_20200408045141_2033cbf6-a457-4bdb-aaec-65900b386972); Time taken: 84.365 seconds
INFO : OK
+-----------+----------+----------+----------+-------------------+
| tmp.mark | tmp._c1 | tmp._c2 | tmp._c3 | tmp.total_salary |
+-----------+----------+----------+----------+-------------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
+-----------+----------+----------+----------+-------------------+

 

As you can see, the GROUPING SETS version finishes faster (about 50 seconds versus 84 seconds here).

 

 

 

ROLLUP

ROLLUP produces hierarchical aggregates: it rolls the grouping columns up from right to left, so a single query returns every level of one hierarchy.

 

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
WITH ROLLUP
ORDER BY
 grouping__id
;

 

is equivalent to

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
GROUPING SETS ((company_name, dep_name, user_id), (company_name, dep_name), (company_name),())
ORDER BY
 grouping__id
;

 

which in turn is equivalent to

SELECT
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
UNION ALL
SELECT
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
UNION ALL
SELECT
 company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
UNION ALL
SELECT
 NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;

 

The result:

INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.21 sec HDFS Read: 11674 HDFS Write: 563 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.91 sec HDFS Read: 7448 HDFS Write: 602 SUCCESS
INFO : Total MapReduce CPU Time Spent: 8 seconds 120 msec
INFO : Completed executing command(queryId=hive_20200408052638_740f42b9-6f08-49a6-8123-9a77aedc6b19); Time taken: 50.563 seconds
INFO : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 3 | s.zh | tester | NULL | 20000.00 |
| 3 | x.qx | kiccp | NULL | 8600.00 |
| 3 | x.qx | finance | NULL | 50900.00 |
| 3 | s.zh | enginer | NULL | 54000.00 |
| 7 | x.qx | kiccp | 7 | 8600.00 |
| 7 | x.qx | finance | 6 | 13000.00 |
| 7 | x.qx | finance | 5 | 24500.00 |
| 7 | x.qx | finance | 4 | 13400.00 |
| 7 | s.zh | enginer | 2 | 26000.00 |
| 7 | s.zh | enginer | 1 | 28000.00 |
| 7 | s.zh | tester | 3 | 20000.00 |
+---------------+---------------+-----------+----------+---------------+

 

Let's briefly go over how GROUPING__ID is calculated, using this query:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH ROLLUP
ORDER BY
 grouping__id
;
 

The GROUP BY clause lists three columns, in order: company_name, dep_name, user_id.

Think of them as a 3-bit binary number,

i.e. (0/1),(0/1),(0/1),

where the low bit corresponds to company_name

and the high bit corresponds to user_id.

For a given result row, a column that is part of that row's grouping set (i.e. not rolled up) contributes a 1 in its bit position, and a column that has been rolled up contributes a 0; this matches the output above. (This is the legacy GROUPING__ID behaviour; in Hive 2.3.0 and later the semantics were changed to follow the SQL standard, so the values differ there.)

GROUPING__ID is then the decimal value of that binary number.
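
A practical consequence of this rule is that grouping__id can be decoded right in the query to label each aggregation level. A minimal sketch, assuming the legacy values shown in this article (0 = grand total, 1 = company, 3 = company + department, 7 = detail row):

SELECT
 grouping__id
 ,CASE grouping__id
   WHEN 0 THEN 'grand total'
   WHEN 1 THEN 'company'
   WHEN 3 THEN 'company + department'
   WHEN 7 THEN 'detail row'
  END AS agg_level
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
WITH ROLLUP
ORDER BY
 grouping__id
;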

 

 

1) The fully grouped subset of ROLLUP

SELECT
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id

The corresponding rows:

+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 7 | x.qx | kiccp | 7 | 8600.00 |
| 7 | x.qx | finance | 6 | 13000.00 |
| 7 | x.qx | finance | 5 | 24500.00 |
| 7 | x.qx | finance | 4 | 13400.00 |
| 7 | s.zh | enginer | 2 | 26000.00 |
| 7 | s.zh | enginer | 1 | 28000.00 |
| 7 | s.zh | tester | 3 | 20000.00 |
+---------------+---------------+-----------+----------+---------------+

All three columns are present in this grouping, so the bits are 111 and GROUPING__ID is 7.

 

 

2) The (company_name, dep_name) subset of ROLLUP

SELECT
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name

The corresponding rows:

+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 3 | s.zh | tester | NULL | 20000.00 |
| 3 | x.qx | kiccp | NULL | 8600.00 |
| 3 | x.qx | finance | NULL | 50900.00 |
| 3 | s.zh | enginer | NULL | 54000.00 |
+---------------+---------------+-----------+----------+---------------+

Here the grouping is on company_name and dep_name only.

Applying the rule above: company_name (the low bit) and dep_name are present, while user_id (the high bit) has been rolled up,

so the bits are 011 and GROUPING__ID is 3.

That should make the rule clear. Now let's look at CUBE.

 

 

 

CUBE 

CUBE (the "data cube") computes aggregates for every combination of the selected dimensions. CUBE over (a, b, c) first groups by (a, b, c), then by (a, b), (a, c), (a), (b, c), (b), (c), and finally aggregates the whole table, so the result contains the aggregate for every possible combination of the chosen columns.

 

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
WITH CUBE
ORDER BY
 grouping__id
;


 

Over the grouping columns (company_name, dep_name, user_id) this is equivalent to:

SELECT
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
UNION ALL
SELECT
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
UNION ALL
SELECT
 company_name
 ,NULL
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,user_id
UNION ALL
SELECT
 company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
UNION ALL
SELECT
 NULL
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 dep_name
 ,user_id
UNION ALL
SELECT
 NULL
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 dep_name
UNION ALL
SELECT
 NULL
 ,NULL
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 user_id
UNION ALL
SELECT
 NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;

 

The result is:

+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 2 | NULL | finance | NULL | 50900.00 |
| 2 | NULL | kiccp | NULL | 8600.00 |
| 2 | NULL | tester | NULL | 20000.00 |
| 2 | NULL | enginer | NULL | 54000.00 |
| 3 | s.zh | tester | NULL | 20000.00 |
| 3 | s.zh | enginer | NULL | 54000.00 |
| 3 | x.qx | kiccp | NULL | 8600.00 |
| 3 | x.qx | finance | NULL | 50900.00 |
| 4 | NULL | NULL | 7 | 8600.00 |
| 4 | NULL | NULL | 5 | 24500.00 |
| 4 | NULL | NULL | 4 | 13400.00 |
| 4 | NULL | NULL | 3 | 20000.00 |
| 4 | NULL | NULL | 2 | 26000.00 |
| 4 | NULL | NULL | 1 | 28000.00 |
| 4 | NULL | NULL | 6 | 13000.00 |
| 5 | s.zh | NULL | 2 | 26000.00 |
| 5 | s.zh | NULL | 3 | 20000.00 |
| 5 | x.qx | NULL | 5 | 24500.00 |
| 5 | x.qx | NULL | 6 | 13000.00 |
| 5 | s.zh | NULL | 1 | 28000.00 |
| 5 | x.qx | NULL | 7 | 8600.00 |
| 5 | x.qx | NULL | 4 | 13400.00 |
| 6 | NULL | enginer | 1 | 28000.00 |
| 6 | NULL | finance | 4 | 13400.00 |
| 6 | NULL | tester | 3 | 20000.00 |
| 6 | NULL | finance | 5 | 24500.00 |
| 6 | NULL | kiccp | 7 | 8600.00 |
| 6 | NULL | enginer | 2 | 26000.00 |
| 6 | NULL | finance | 6 | 13000.00 |
| 7 | x.qx | finance | 5 | 24500.00 |
| 7 | x.qx | finance | 4 | 13400.00 |
| 7 | x.qx | kiccp | 7 | 8600.00 |
| 7 | s.zh | tester | 3 | 20000.00 |
| 7 | s.zh | enginer | 2 | 26000.00 |
| 7 | s.zh | enginer | 1 | 28000.00 |
| 7 | x.qx | finance | 6 | 13000.00 |
+---------------+---------------+-----------+----------+---------------+
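
Finally, one possible application: the aggregate tables created at the beginning of this post can all be populated from a single ROLLUP scan using Hive's multi-insert syntax, instead of running three separate GROUP BY queries. This is only a sketch, and it assumes the same legacy grouping__id values observed above (0 = grand total, 1 = company level, 3 = company + department):

FROM (
 SELECT
  grouping__id AS gid
  ,company_name
  ,dep_name
  ,SUM(salary) AS total_salary
 FROM datacube_salary_org
 WHERE pt = '20200407'
 GROUP BY
  company_name
  ,dep_name
 WITH ROLLUP
) t
-- one scan of the source table feeds all three aggregate tables
INSERT OVERWRITE TABLE datacube_salary_total_aggr
 SELECT CAST(total_salary AS DECIMAL(10,2)) WHERE gid = 0
INSERT OVERWRITE TABLE datacube_salary_company_aggr
 SELECT company_name, CAST(total_salary AS DECIMAL(10,2)) WHERE gid = 1
INSERT OVERWRITE TABLE datacube_salary_dep_aggr
 SELECT company_name, dep_name, CAST(total_salary AS DECIMAL(10,2)) WHERE gid = 3
;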

 

 
