赞
踩
2023B站黑马Hadoop、Hive、云平台实战项目
目录
1. 清洗数据
1)部分数据缺失地理位置信息(sender_gps),需要剔除
- select count(if(length(sender_gps)=0,1,null)) count_null_sender_gps,
- count(if(length(receiver_gps)=0,1,null)) count_null_receiver_gps
- from tb_msg_source;
2)为方便处理,从时间中提取天、小时字段,GPS中提取经度、纬度
创建etl表,存储清洗后的数据
- insert overwrite table tb_msg_etl30
- select *, date(msg_time) msg_day,
- hour(msg_time) msg_hour,
- split(sender_gps,',')[0] sender_lng,
- split(sender_gps,',')[1] sender_lat
- from tb_msg_source30
- where length(sender_gps)>0;
- -- 统计今日消息总量
- create table tb_rs_total_msg_cnt comment '每日消息总量' as
- select msg_day, count(*) total_msg_cnt from tb_msg_etl30 group by msg_day;
- -- 统计每小时消息量、发送和接收用户数
- create table tb_rs_hour_msg_cnt comment '每小时消息趋势' as
- select msg_hour, count(*) total_msg_cnt, count(distinct sender_account) sender_user_cnt,
- count(distinct receiver_account) receiver_user_cnt
- from tb_msg_etl30 group by msg_hour;
- -- 统计今日各地区发送消息总量
- create table tb_rs_loc_cnt comment '今日各地区发送消息总量' as
- select msg_day, sender_lng, sender_lat, count(*) total_msg_cnt
- from tb_msg_etl30
- group by msg_day, sender_lng, sender_lat;
- -- 统计今日发送和接收用户数
- create table tb_rs_user_cnt comment '每日发送和接收消息的人数' as
- select msg_day, count(distinct sender_account) sender_user_cnt,
- count(distinct receiver_account) receiver_user_cnt
- from tb_msg_etl30
- group by msg_day;
- -- 统计发送消息条数最多的前十个用户
- create table tb_rs_s_user_top10 comment '发送消息最多的10个用户' as
- select sender_name, count(*) sender_msg_cnt
- from tb_msg_etl30
- group by sender_name
- order by sender_msg_cnt desc
- limit 10;
- -- 统计接收消息条数最多的前十个用户
- create table tb_rs_r_user_top10 comment '接收消息最多的10个用户' as
- select receiver_name, count(*) receiver_msg_cnt
- from tb_msg_etl30
- group by receiver_name
- order by receiver_msg_cnt desc
- limit 10;
- -- 统计发送人的手机型号分布情况
- create table tb_rs_sender_phone comment '发送人的手机型号分布' as
- select sender_phonetype, count(*) cnt
- from tb_msg_etl30
- group by sender_phonetype;
- -- 统计发送人的手机操作系统分布
- create table tb_rs_sender_os comment '发送人的手机操作系统分布' as
- select sender_os, count(*) cnt
- from tb_msg_etl30
- group by sender_os;
![](https://csdnimg.cn/release/blogv2/dist/pc/img/newCodeMoreWhite.png)
1)统计今日消息总量
2)统计每小时消息量、发送和接收用户数
3)统计今日各地区发送消息总量
注:模拟数据中只有100个不同用户,实际中按经纬度坐标group by太细了,应该换算出地区
4)统计今日发送和接收用户数
5)统计发送消息条数最多的前十个用户
6)统计接收消息条数最多的前十个用户
7)统计发送人的手机型号分布情况
8)统计发送人的手机操作系统分布
记:在阿里云上搭建了3台虚拟机Hadoop集群,最开始觉得性能应该够用,想跑1000W数据,结果计算创建etl表的时候,用了快2小时都没跑出来。后面换成30W数据,同一条sql语句花了2秒多。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。