当前位置:   article > 正文

Spark SQL/DataFrame/DataSet操作(三)-----分组聚合groupBy_spark dataframe groupby

spark dataframe groupby

分组函数groupBy

(1)分组计数

select address,count(1) from people group by address;  等价的算子如下

  1. scala> peopleDF.show()
  2. +--------+---+--------+
  3. | name|age| address|
  4. +--------+---+--------+
  5. |zhangsan| 22| chengdu|
  6. | wangwu| 33| beijing|
  7. | lisi| 28|shanghai|
  8. |xiaoming| 28| beijing|
  9. | mm| 21| chengdu|
  10. |xiaoming| 18| beijing|
  11. | mm| 11| chengdu|
  12. +--------+---+--------+
  13. scala> peopleDF.groupBy("address").count().show
  14. +--------+-----+
  15. | address|count|
  16. +--------+-----+
  17. | beijing| 3|
  18. | chengdu| 3|
  19. |shanghai| 1|
  20. +--------+-----+

(2)分组后求最值、平均值、求和的方法

  1. //等价于select address,max(age) from people group by address;
  2. scala> peopleDF.groupBy("address").max("age").show
  3. +--------+--------+
  4. | address|max(age)|
  5. +--------+--------+
  6. | beijing| 33|
  7. | chengdu| 22|
  8. |shanghai| 28|
  9. +--------+--------+
  10. //等价于select address,avg(age) from people group by address;
  11. scala> peopleDF.groupBy("address").avg("age").show
  12. +--------+------------------+
  13. | address| avg(age)|
  14. +--------+------------------+
  15. | beijing|26.333333333333332|
  16. | chengdu| 18.0|
  17. |shanghai| 28.0|
  18. +--------+------------------+
  19. //等价于select address,min(age) from people group by address;
  20. scala> peopleDF.groupBy("address").min("age").show
  21. +--------+--------+
  22. | address|min(age)|
  23. +--------+--------+
  24. | beijing| 18|
  25. | chengdu| 11|
  26. |shanghai| 28|
  27. +--------+--------+
  28. //等价于select address,sum(age) from people group by address;
  29. scala> peopleDF.groupBy("address").sum("age").show
  30. +--------+--------+
  31. | address|sum(age)|
  32. +--------+--------+
  33. | beijing| 79|
  34. | chengdu| 54|
  35. |shanghai| 28|
  36. +--------+--------+

(3)分组后,求多个聚合值(最值、平均值等)。使用算子groupBy+agg

  1. //等价于select address,count(age),max(age),min(age),avg(age),sum(age) from people group by address;
  2. scala> peopleDF.groupBy("address").agg(count("age"),max("age"),min("age"),avg("age"),sum("age")).show
  3. +--------+----------+--------+--------+------------------+--------+
  4. | address|count(age)|max(age)|min(age)| avg(age)|sum(age)|
  5. +--------+----------+--------+--------+------------------+--------+
  6. | beijing| 3| 33| 18|26.333333333333332| 79|
  7. | chengdu| 3| 22| 11| 18.0| 54|
  8. |shanghai| 1| 28| 28| 28.0| 28|
  9. +--------+----------+--------+--------+------------------+--------+

(4)分组聚合后取别名

  1. scala> peopleDF.groupBy("address").agg(count("age").as("cnt"),avg("age").as("avg")).show
  2. +--------+---+------------------+
  3. | address|cnt| avg|
  4. +--------+---+------------------+
  5. | beijing| 3|26.333333333333332|
  6. | chengdu| 3| 18.0|
  7. |shanghai| 1| 28.0|
  8. +--------+---+------------------+

(5)分组后行转列,使用pivot。

  1. //求同名用户在同一个地址的平均年龄
  2. //把name的不同值作为列名
  3. scala> peopleDF.groupBy("address").pivot("name").avg("age").show
  4. +--------+----+----+------+--------+--------+
  5. | address|lisi| mm|wangwu|xiaoming|zhangsan|
  6. +--------+----+----+------+--------+--------+
  7. | beijing|null|null| 33.0| 23.0| null|
  8. | chengdu|null|16.0| null| null| 22.0|
  9. |shanghai|28.0|null| null| null| null|
  10. +--------+----+----+------+--------+--------+

 

声明:本文内容由网友自发贡献,转载请注明出处:【wpsshop】
推荐阅读
相关标签
  

闽ICP备14008679号