
Hadoop Learning: A Hive-Based Education Platform Data Warehouse Analysis Case (Part 1)

For Part 2 of this series, see: A Hive-Based Education Platform Data Warehouse Analysis Case (Part 2)

For Part 3, see: A Hive-Based Education Platform Data Warehouse Analysis Case (Part 3)

Case background

Big data technology can mine and analyze massive amounts of user behavior data and, based on the analysis results, improve the platform's quality of service to ultimately meet user needs. The education big data analysis platform project applies big data technology to the training industry, providing data support for business operations.

How the case data is produced:

How the case data flows:

The business data is first stored in a MySQL database. Sqoop imports the MySQL data into the ODS layer of Hive, where it is cleansed and transformed; the processed data is then statistically analyzed, and the results are exported back to MySQL, again via Sqoop. Finally, FineBI renders the charts. Because the analysis runs periodically, Oozie handles the automated scheduling, and the whole project is monitored and managed through Cloudera Manager.

Case requirements:

  1. Build a group data warehouse as the group's data center, unifying the management and storage of scattered business data.
  2. Based on business analysis needs, mine and analyze the massive user behavior data and build customized multidimensional data sets, forming data marts for each scenario and subject area.
  3. For front-end presentation of business data, select and adopt appropriate front-end tools to display the statistical analysis results.

Case architecture:

Case tools:

Zookeeper: cluster coordination tool, serving Hadoop high availability as well as other big data software managed through Zookeeper
HDFS: responsible for the final storage of the data
YARN: provides resource allocation

HIVE: used to write SQL for data analysis
Oozie: used for automated, scheduled job execution
Sqoop: used for data import and export

HUE: improves the user experience of operating Hadoop; HDFS, HIVE, etc. can be operated through HUE
FineBI: a data reporting and visualization tool from FanRuan

Case data architecture:

Converting research requirements into development requirements

How to convert:
    from each requirement, extract the dimensions and metrics involved, and identify which tables and which fields are touched

Goal:
    dimensions involved
    metrics involved
    tables involved
    fields involved

On top of that, determine which data needs cleansing, which data needs transformation, and, when multiple tables are involved, what the join conditions between them are...

Visit and consultation user module:

total visiting customers, heat map of unique visitors by region, visitor consultation-rate trend, dual-axis trend of customer visits and consultation rate, visiting customers by time slot, share of visits by origin channel, share of visits by search source, ranking of most-visited pages

Intention user module:

heat map of prospective students' locations, total intentions, ranking of intended subjects, ranking of intended campuses, share by origin channel, share of intention contributions by center

Valid leads module:

valid-lead conversion rate, valid-lead conversion rate by time slot, number of valid leads

Enrollment user module:

enrollments by campus (bar chart), enrollments by subject (bar chart), total enrollments, online enrollments, intention-to-enrollment conversion rate, valid-lead-to-enrollment conversion rate, daily enrollment trend, share of top origin channels for enrolled students by campus and subject, enrollment contribution by consultation center

Student attendance module:

class attendance count, class attendance rate, class tardiness count, class tardiness rate, class leave count, class leave rate, class absence count, class absence rate

Case module analysis:

Visit and consultation user data module (full-volume analysis):

Required metrics:

Requirement 1: count the total number of visiting customers within a given time range, with drill-down to hourly data.

Requirement 2: heat map of visiting customers by region within a given time range, with drill-down to hourly data.

Requirement 3: within a given time range, the share of visiting customers who initiated a consultation, by region (province, city).

Requirement 4: within a given time range, a dual-axis trend chart of daily customer visits / consultation rate, with drill-down to hourly data.

Requirement 5: within a given time range, the number of visiting customers in each of the 24 hourly slots of the day.

Requirement 6: within a given time range, the share of visiting customers by origin channel, with drill-down to hourly data.

Requirement 7: within a given time range, the share of visiting customers by search source, with drill-down to hourly data.

Requirement 8: within a given time range, the TOP-N most-visited pages, with drill-down to hourly data.

Overall analysis:

Dimensions involved:
    inherent dimensions:
        time dimension: year, quarter, month, day, hour
    product attribute dimensions:
        region
        origin channel
        search source
        visited page

Metrics involved:
            visit volume
            consultation volume
Tables involved:
            fact tables: web_chat_ems_2019_12 and web_chat_text_ems_2019_11
            dimension tables: none (given this modeling, no DIM layer is needed)

Fields involved:
            time dimension:
                web_chat_ems: create_time
            region dimension:
                web_chat_ems: area
            origin channel:
                web_chat_ems: origin_channel
            search source:
                web_chat_ems: seo_source
            visited page:
                web_chat_text_ems: from_url
            metric fields:
                visit volume: sid
                consultation volume: sid

            distinguishing visits from consultations:
                web_chat_ems: msg_count >= 1 marks a consultation record

Data to cleanse: none

Fields to transform: the time field
    create_time needs to be transformed into yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo
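
As a quick illustration (a minimal sketch; the actual conversion happens in the DWD load further below), the split can be expressed with Hive's substr() and quarter() functions on a 'yyyy-MM-dd HH:mm:ss' string:

    -- example with a literal timestamp
    select
      substr('2019-07-01 09:15:30',1,4)  as yearinfo,    -- '2019'
      quarter('2019-07-01 09:15:30')     as quarterinfo, -- 3
      substr('2019-07-01 09:15:30',6,2)  as monthinfo,   -- '07'
      substr('2019-07-01 09:15:30',9,2)  as dayinfo,     -- '01'
      substr('2019-07-01 09:15:30',12,2) as hourinfo;    -- '09'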
    
One-to-one relationship: id = id
            a one-to-one relationship is, in essence, a single table

Data preparation: load the raw data into the local MySQL database

The visiting-customer data comes from web_chat_ems, the consultation system's monthly visit-session table:

Table names follow the pattern web_chat_ems_year_month: the year is four digits and the month two digits, zero-padded when the month is a single digit, e.g. web_chat_ems_2019_07.

web_chat_text_ems is the attached monthly visit table.

Its table names follow the same pattern as web_chat_ems. web_chat_ems and web_chat_text_ems correspond one-to-one and are joined on the primary key id.

Run the SQL script file in the current database to import the data.

Click: SQL file for the visit-and-consultation user data module

Modeling analysis:

ODS layer: source data layer

Purpose: interfaces with the data source, generally at the same granularity as the source (the source data is copied into the ODS layer in full)

Table creation is straightforward:
    for each source table, create an ODS table with the same fields; in addition, make it a partitioned table whose partition field is a time field that marks the date on which the data was extracted into ODS
This layer has two tables.

DIM layer: dimension layer

Purpose: stores dimension table data

Not needed here, because the current subject area has no dimension tables.

DWD layer: detail layer

Purpose: 1) cleansing and transformation  2) a small amount of dimension degeneration

Question 1: what cleansing is needed here?
    none

Question 2: what transformation is needed here?
    the time field must be transformed into yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo

Question 3: what dimension degeneration is needed here?
    merge the two fact tables into one

Table fields: original fields + transformed fields + cleansed fields
    sid,session_id,ip,create_time,area,origin_channel,seo_source,
    from_url,msg_count,yearinfo,quarterinfo,monthinfo,dayinfo,
    hourinfo,referrer,landing_page_url,url_title,
    platform_description,other_params,history

Guideline:
    when merging tables and extracting data, there are three approaches:
    1) when a table has many fields, extract only the fields you need
    2) when you cannot yet tell which fields will be needed, extract everything
    3) when a table has few fields, extract them all, whether or not they end up being used

DWM layer: middle layer (skipped)

Purpose: 1) dimension degeneration  2) pre-aggregation

Question 1: what dimension degeneration is needed here?
    none at all; there is no DIM layer to begin with
Question 2: what pre-aggregation is needed here?
    one could try pre-aggregating at the hour level to make later statistics easier

Question 3: can this subject area actually be pre-aggregated by hour?
    not at the moment: the data contains duplicates, so it cannot be pre-aggregated; once aggregated, the subsequent statistics would become inaccurate

DWS layer: business layer

Purpose: statistics broken down by dimension

Generally, each metric gets its own result table.
Visit volume:
        inherent dimensions:
                    time dimension: year, quarter, month, day, hour
        product attribute dimensions:
                    region
                    origin channel
                    search source
                    visited page

Table fields: metric fields + dimension fields + three conventional fields (time_type, groupType, time_str)
    sid_total,sessionid_total,ip_total,yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,
    area,origin_channel,seo_source,from_url, time_type,groupType,time_str

Consultation volume:
        inherent dimensions:
                    time dimension: year, quarter, month, day, hour
        product attribute dimensions:
                    region
                    origin channel

Table fields: metric fields + dimension fields + three conventional fields (time_type, groupType, time_str)
    sid_total,sessionid_total,ip_total,yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,
    area,origin_channel, time_type,groupType,time_str


Note: when a dimension field does not apply to a given row, set its value to -1 (a business convention meaning "this dimension is absent").
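
For example (a hypothetical read-back query, assuming the DWS tables defined below): to fetch the monthly total-visit series, filter on the marker fields and let the '-1' placeholders fall out of the result:

    select time_str, sid_total
    from itcast_dws.visit_dws
    where grouptype = '5'   -- 5 = total visits (no product attribute dimension)
      and time_type = '3'   -- 3 = aggregated by month
    order by time_str;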

DA layer:

Purpose: serves the applications; whatever data an application needs is fetched from the DWS layer

Nothing is done in this layer for now: all the required statistics have already been computed in detail, and what gets used later depends on what the charts need...

Modeling operations (table creation):

ODS layer: the MySQL data will be imported into these Hive tables

CREATE DATABASE IF NOT EXISTS `itcast_ods`;
--make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- visit/consultation main table
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_ems (
id INT comment 'primary key',
create_date_time STRING comment 'record creation time',
session_id STRING comment '7moor sessionId',
sid STRING comment 'visitor id',
create_time STRING comment 'session creation time',
seo_source STRING comment 'search source',
seo_keywords STRING comment 'keywords',
ip STRING comment 'IP address',
area STRING comment 'region',
country STRING comment 'country',
province STRING comment 'province',
city STRING comment 'city',
origin_channel STRING comment 'origin channel',
user_match STRING comment 'assigned agent',
manual_time STRING comment 'manual start time',
begin_time STRING comment 'agent pick-up time',
end_time STRING comment 'session end time',
last_customer_msg_time_stamp STRING comment 'time of the last customer message',
last_agent_msg_time_stamp STRING comment 'time of the last agent reply',
reply_msg_count INT comment 'number of agent replies',
msg_count INT comment 'number of customer messages',
browser_name STRING comment 'browser name',
os_info STRING comment 'OS name')
comment 'visit session table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
-- visit/consultation attached table
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_text_ems (
id INT COMMENT 'primary key from MySQL',
referrer STRING comment 'referrer page',
from_url STRING comment 'session source page',
landing_page_url STRING comment 'visitor landing page',
url_title STRING comment 'title of the consulted page',
platform_description STRING comment 'client platform info',
other_params STRING comment 'extended-field data',
history STRING comment 'visit history'
) comment 'EMS-PV table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_text_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
DWD layer:

CREATE DATABASE IF NOT EXISTS `itcast_dwd`;
create table if not exists itcast_dwd.visit_consult_dwd(
session_id STRING comment '7moor sessionId',
sid STRING comment 'visitor id',
create_time bigint comment 'session creation time',
seo_source STRING comment 'search source',
ip STRING comment 'IP address',
area STRING comment 'region',
msg_count int comment 'number of customer messages',
origin_channel STRING COMMENT 'origin channel',
referrer STRING comment 'referrer page',
from_url STRING comment 'session source page',
landing_page_url STRING comment 'visitor landing page',
url_title STRING comment 'title of the consulted page',
platform_description STRING comment 'client platform info',
other_params STRING comment 'extended-field data',
history STRING comment 'visit history',
hourinfo string comment 'hour'
)
comment 'visit/consultation DWD table'
partitioned by(yearinfo String,quarterinfo string, monthinfo String, dayinfo string)
row format delimited fields terminated by '\t'
stored as orc
location '/user/hive/warehouse/itcast_dwd.db/visit_consult_dwd'
tblproperties ('orc.compress'='SNAPPY');
DWS layer:

CREATE DATABASE IF NOT EXISTS `itcast_dws`;
-- visit volume result table
CREATE TABLE IF NOT EXISTS itcast_dws.visit_dws (
sid_total INT COMMENT 'distinct count of sid',
sessionid_total INT COMMENT 'distinct count of session_id',
ip_total INT COMMENT 'distinct count of IP',
area STRING COMMENT 'region',
seo_source STRING COMMENT 'search source',
origin_channel STRING COMMENT 'origin channel',
hourinfo STRING COMMENT 'creation time, down to the hour',
time_str STRING COMMENT 'time detail',
from_url STRING comment 'session source page',
groupType STRING COMMENT 'product attribute type: 1. region; 2. search source; 3. origin channel; 4. session source page; 5. total visits',
time_type STRING COMMENT 'time aggregation type: 1. hourly; 2. daily; 3. monthly; 4. quarterly; 5. yearly')
comment 'EMS visitor log DWS table'
PARTITIONED BY(yearinfo STRING,quarterinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_dws.db/visit_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- consultation volume result table
CREATE TABLE IF NOT EXISTS itcast_dws.consult_dws
(
sid_total INT COMMENT 'distinct count of sid',
sessionid_total INT COMMENT 'distinct count of session_id',
ip_total INT COMMENT 'distinct count of IP',
area STRING COMMENT 'region',
origin_channel STRING COMMENT 'origin channel',
hourinfo STRING COMMENT 'creation time, down to the hour',
time_str STRING COMMENT 'time detail',
groupType STRING COMMENT 'product attribute type: 1. region; 2. origin channel',
time_type STRING COMMENT 'time aggregation type: 1. hourly; 2. daily; 3. monthly; 4. quarterly; 5. yearly'
)
COMMENT 'consultation volume DWS wide table'
PARTITIONED BY (yearinfo string,quarterinfo STRING, monthinfo STRING, dayinfo string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
LOCATION '/user/hive/warehouse/itcast_dws.db/consult_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');

Data collection

Purpose: use Sqoop to import the business-side (MySQL) data into the corresponding ODS-layer tables (Hive).

Note: field names must match one-to-one; rename any that do not before importing.

-- visit/consultation main table
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, "2021-09-24" AS starts_time
FROM web_chat_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 1

-- visit/consultation attached table
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'SELECT
*, "2021-09-24" AS start_time
FROM web_chat_text_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 1

Verify that the import succeeded:

-- 1) check how many rows MySQL has in total
SELECT COUNT(1) FROM web_chat_ems_2019_07;         -- 211197
SELECT COUNT(1) FROM web_chat_text_ems_2019_07;    -- 105599
-- 2) query the Hive tables and compare the row counts
SELECT COUNT(1) FROM itcast_ods.web_chat_ems;      -- 211197
SELECT COUNT(1) FROM itcast_ods.web_chat_text_ems; -- 105599
-- 3) query a sample of rows and check that the column mapping looks right
select * from itcast_ods.web_chat_ems limit 10;
SELECT * FROM itcast_ods.web_chat_text_ems limit 10;

Data cleansing and transformation

Purpose: load the ODS-layer data into the DWD layer

DWD layer responsibilities:
    1) cleansing and transformation  2) a small amount of dimension degeneration

Question 1: are cleansing or transformation operations needed, and if so, which?
    cleansing: not needed
    transformation: convert the create_time date into yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo
    one extra transformation: convert the create_time date into a Unix timestamp
Question 2: is dimension degeneration needed, and if so, what?
    yes: merge the two fact tables into a single fact table

--dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time) as create_time,
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time,12,2) as hourinfo,
substr(wce.create_time,1,4) as yearinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time,6,2) as monthinfo,
substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id;

Data analysis

Purpose: load the DWD-layer data into the DWS layer; one wide table holds several requirements, distinguished by the -1 placeholders.

Visit volume table:

Visit volume:
    inherent dimensions:
                time dimension: year, quarter, month, day, hour
    product attribute dimensions:
                region, origin channel, search source, visited page, total visits

That makes 5*5=25 requirements in total, all stored in the single table visit_dws. The data volume is large, so only a few are run here for demonstration.

How to compute visit volume:
Total visits by time (5 requirements):

--total visits per year
--total visits per year and quarter
--total visits per year, quarter and month
--total visits per year, quarter, month and day
--total visits per year, quarter, month, day and hour

-- total visits per year
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'-1' as from_url,
'5' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo;

-- total visits per year and quarter
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo;

-- total visits per year, quarter and month
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo;

-- total visits per year, quarter, month and day
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo;

-- total visits per year, quarter, month, day and hour
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
Visits per visited page, by time (5 requirements):

--visits per visited page, per year
--visits per visited page, per year and quarter
--visits per visited page, per year, quarter and month
--visits per visited page, per year, quarter, month and day
--visits per visited page, per year, quarter, month, day and hour

-- visits per visited page, per year
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
from_url,
'4' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,from_url;

-- visits per visited page, per year and quarter
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
from_url,
'4' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,from_url;

-- visits per visited page, per year, quarter and month
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
from_url,
'4' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,from_url;

-- visits per visited page, per year, quarter, month and day
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
from_url,
'4' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,from_url;

-- visits per visited page, per year, quarter, month, day and hour
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
from_url,
'4' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,from_url;

Consultation volume table

Consultation volume:
    dimensions:
        inherent dimensions: time: year, quarter, month, day, hour
        product attribute dimensions:
            region, origin channel, total consultations

Difference between consultations and visits:
    msg_count >= 1 marks a consultation record

That makes 3*5=15 requirements in total, all stored in the single table consult_dws. The data volume is large, so only a few are run here for demonstration.

How to compute consultation volume:
Total consultations by time (5 requirements):

--total consultations per year
--total consultations per year and quarter
--total consultations per year, quarter and month
--total consultations per year, quarter, month and day
--total consultations per year, quarter, month, day and hour

-- total consultations per year
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'3' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo;

-- total consultations per year and quarter
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'3' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo;

-- total consultations per year, quarter and month
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'3' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo;

-- total consultations per year, quarter, month and day
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'3' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo;

-- total consultations per year, quarter, month, day and hour
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'3' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
Consultations per region, by time (5 requirements; a sample query follows the list):

-- consultations per region, per year
-- consultations per region, per year and quarter
-- consultations per region, per year, quarter and month
-- consultations per region, per year, quarter, month and day
-- consultations per region, per year, quarter, month, day and hour
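
The original stops at the requirement list here; a sketch of the first variant (consultations per region, per year), following the same pattern as the total-consultation SQL above, with area added to the grouping and grouptype '1' (region) per the consult_dws column comment; the remaining four follow analogously:

-- consultations per region, per year (sketch)
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'1' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,area;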

Data export

Purpose: export the data from Hive's DWS layer into the corresponding target tables in MySQL

Step 1: create the target tables in MySQL (create the database, then the tables):
create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;
-- visit volume result table
CREATE TABLE IF NOT EXISTS scrm_bi.visit_dws (
sid_total INT COMMENT 'distinct count of sid',
sessionid_total INT COMMENT 'distinct count of session_id',
ip_total INT COMMENT 'distinct count of IP',
area varchar(32) COMMENT 'region',
seo_source varchar(32) COMMENT 'search source',
origin_channel varchar(32) COMMENT 'origin channel',
hourinfo varchar(32) COMMENT 'creation time, down to the hour',
time_str varchar(32) COMMENT 'time detail',
from_url varchar(1000) comment 'session source page',
groupType varchar(32) COMMENT 'product attribute type: 1. region; 2. search source; 3. origin channel; 4. session source page; 5. total visits',
time_type varchar(32) COMMENT 'time aggregation type: 1. hourly; 2. daily; 3. monthly; 4. quarterly; 5. yearly',
yearinfo varchar(32) COMMENT 'year',
quarterinfo varchar(32) COMMENT 'quarter',
monthinfo varchar(32) COMMENT 'month',
dayinfo varchar(32) COMMENT 'day'
)comment 'EMS visitor log DWS table';
-- consultation volume result table
CREATE TABLE IF NOT EXISTS scrm_bi.consult_dws
(
sid_total INT COMMENT 'distinct count of sid',
sessionid_total INT COMMENT 'distinct count of session_id',
ip_total INT COMMENT 'distinct count of IP',
area varchar(32) COMMENT 'region',
origin_channel varchar(32) COMMENT 'origin channel',
hourinfo varchar(32) COMMENT 'creation time, down to the hour',
time_str varchar(32) COMMENT 'time detail',
groupType varchar(32) COMMENT 'product attribute type: 1. region; 2. origin channel',
time_type varchar(32) COMMENT 'time aggregation type: 1. hourly; 2. daily; 3. monthly; 4. quarterly; 5. yearly',
yearinfo varchar(32) COMMENT 'year',
quarterinfo varchar(32) COMMENT 'quarter',
monthinfo varchar(32) COMMENT 'month',
dayinfo varchar(32) COMMENT 'day'
)COMMENT 'consultation volume DWS wide table';
Step 2: run the Sqoop export:
-- export the consultation data first
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1

-- export the visit data
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table visit_dws \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 1

Visit and consultation user data module (incremental analysis):

Simulate newly added data; in real production the upstream side provides the new data:

-- simulate data for the visit/consultation main table
CREATE TABLE web_chat_ems_2021_09 AS
SELECT * FROM web_chat_ems_2019_07 WHERE create_time BETWEEN '2019-07-01 00:00:00' AND '2019-07-01 23:59:59';
-- shift the time field
UPDATE web_chat_ems_2021_09 SET create_time = CONCAT('2021-09-25 ',SUBSTR(create_time,12));
-- simulate data for the visit/consultation attached table
CREATE TABLE web_chat_text_ems_2021_09 AS
SELECT
temp2.*
FROM
(SELECT * FROM web_chat_ems_2019_07
WHERE create_time BETWEEN '2019-07-01 00:00:00' AND '2019-07-01 23:59:59') temp1
JOIN web_chat_text_ems_2019_07 temp2 ON temp1.id = temp2.id ;

Incremental data collection:

Fold the incremental SQL into an incremental Sqoop shell script; the script automatically derives the previous day's date. The shell script is then placed in Oozie for automated scheduling.

Write the incremental-collection shell script

hadoop01, in the home directory:

cd /root

vim edu_mode_1_collect.sh

The edu_mode_1_collect.sh script automatically derives the previous day's date and also supports collecting the data of a specified date; put the script into Oozie to complete the automated scheduling.

#!/bin/bash
export SQOOP_HOME=/usr/bin/sqoop
# use the date passed as $1 if given, otherwise default to yesterday
if [ $# == 1 ]
then
dateStr=$1
else
dateStr=`date -d '-1 day' +'%Y-%m-%d'`
fi
dateNowStr=`date +'%Y-%m-%d'`
yearStr=`date -d ${dateStr} +'%Y'`
monthStr=`date -d ${dateStr} +'%m'`
jdbcUrl='jdbc:mysql://192.168.52.150:3306/nev'
username='root'
password='123456'
m='1'
${SQOOP_HOME} import \
--connect ${jdbcUrl} \
--username ${username} \
--password ${password} \
--query "SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, '${dateNowStr}' AS starts_time
FROM web_chat_ems_${yearStr}_${monthStr} WHERE create_time BETWEEN '${dateStr} 00:00:00' AND '${dateStr} 23:59:59' and \$CONDITIONS" \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m ${m}
${SQOOP_HOME} import \
--connect ${jdbcUrl} \
--username ${username} \
--password ${password} \
--query "SELECT
temp2.*, '${dateNowStr}' AS start_time
FROM (SELECT id FROM web_chat_ems_${yearStr}_${monthStr} WHERE create_time BETWEEN '${dateStr} 00:00:00' AND '${dateStr} 23:59:59') temp1
JOIN web_chat_text_ems_${yearStr}_${monthStr} temp2 ON temp1.id = temp2.id where 1=1 and \$CONDITIONS" \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m ${m}

Test that the script runs correctly:

sh edu_mode_1_collect.sh
-- check the corresponding partition in Hive; if it has data, the extraction succeeded
select * from itcast_ods.web_chat_ems where starts_time='2021-09-26' limit 10;

Note: starts_time is the date on which the extraction runs, while the extracted rows themselves belong to the previous day, so when building the simulated table the create_time values should be set to the previous day. For example, if today is 2021-09-26, the simulated table created above is web_chat_ems_2021_09 and its create_time values fall on 2021-09-25.

You can also pass a date argument when extracting, which lets you pull any chosen day's data.
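
For example, with a (hypothetical) date argument:

sh edu_mode_1_collect.sh 2021-09-25

pulls exactly that day's rows instead of defaulting to yesterday's.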

Put the shell script into Oozie to complete the automated scheduling.

Incremental data cleansing and transformation

Only the newly added day's data needs cleansing and transformation; its granularity is consistent with the existing data, so the earlier data does not need reprocessing.

Running Hive SQL from the shell:

./hive -e|-f 'SQL statement | SQL script' -S
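
For example (a sketch): hive -S -e "select count(1) from itcast_dwd.visit_consult_dwd;" runs a statement in silent mode, while hive -S -f /root/my.sql (hypothetical path) would run a script file.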

Write a shell script to perform the incremental cleansing and transformation:

hadoop01:
    cd /root
    vim edu_mode_1_handle.sh

#!/bin/bash
export HIVE_HOME=/usr/bin/hive
# default to today's date: the ODS partitions (starts_time/start_time) are stamped
# with the date the collection ran, not the date of the data itself
if [ $# == 1 ]
then
dateStr=$1
else
dateStr=`date +'%Y-%m-%d'`
fi
sqlStr="
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time) as create_time,
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time,12,2) as hourinfo,
substr(wce.create_time,1,4) as yearinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time,6,2) as monthinfo,
substr(wce.create_time,9,2) as dayinfo
from (select * from itcast_ods.web_chat_ems where starts_time='${dateStr}') wce join (select * from itcast_ods.web_chat_text_ems where start_time='${dateStr}') wcte
on wce.id = wcte.id;
"
${HIVE_HOME} -e "${sqlStr}" -S

Test the script:

sh edu_mode_1_handle.sh

Check that the data was produced:

select * from itcast_dwd.visit_consult_dwd where yearinfo='2024' and monthinfo='05' and dayinfo='23';
-- this queries by the data's own date (the previous day); yearinfo/monthinfo/dayinfo are derived from create_time

Configure the workflow:

Incremental statistical analysis

The first table has 25 requirements and the second 15, which is a lot; pick 5 of each for demonstration:

Visit volume: 5 statements

-- total visits per year

-- total visits per year and quarter

-- total visits per year, quarter and month

-- total visits per year, quarter, month and day

-- total visits per year, quarter, month, day and hour

Consultation volume: 5 statements
-- consultations per region, per year

-- consultations per region, per year and quarter

-- consultations per region, per year, quarter and month

-- consultations per region, per year, quarter, month and day

-- consultations per region, per year, quarter, month, day and hour

The DWD table has gained the previous day's data. For yearly statistics, recompute the whole current year; for monthly statistics, the whole current month; otherwise only the previous day is involved. In other words, after the new day's data lands, the year, quarter and month aggregates are recomputed, and so on:

Explanation:

for yearly statistics, only the current year (now including the new day) needs recomputing; earlier years do not
for quarterly statistics, only the current year's current quarter (now including the new day) needs recomputing; earlier quarters do not
for monthly statistics, only the current year's current month (now including the new day) needs recomputing; earlier months do not
for daily statistics, only the newly added day needs computing
for hourly statistics, only the hours of the newly added day need computing

Question: when recomputing, say, the yearly statistics, doesn't the DWS table still hold the previous (now stale) yearly result? How do we handle that?

For example, if today is 2024-05-04 and this year's figures were already computed yesterday, then recomputing them today requires deleting yesterday's historical result, but Hive does not support deleting individual rows.

Explanation:
    the recomputation invalidates some of the earlier results, mainly: year, quarter, month
    note: the day and hour results are not affected by history

Solution:
    delete the previously computed results that are affected

How to delete? Remember that Hive cannot delete individual rows from a table
    handle it by dropping partitions instead

Question:
    which partitions hold the current year's statistics?

        this partition stores the yearly results for every product attribute dimension
        yearinfo='2021' and quarterinfo='-1' and monthinfo='-1' and dayinfo='-1'
        this partition stores the current year's current quarter's results
        yearinfo='2021' and quarterinfo='3' and monthinfo='-1' and dayinfo='-1'
        this partition stores the current year's current quarter's current month's results
        yearinfo='2021' and quarterinfo='3' and monthinfo='09' and dayinfo='-1'

Run the deletes (note that the drop-partition spec uses commas, not and):
    alter table xxx drop partition(yearinfo='2021', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
    alter table xxx drop partition(yearinfo='2021', quarterinfo='3', monthinfo='-1', dayinfo='-1');
    alter table xxx drop partition(yearinfo='2021', quarterinfo='3', monthinfo='09', dayinfo='-1');

Note: delete first, then insert!!!

Write the script:

hadoop01:
cd /root
vim edu_mode_1_analyse.sh

#!/bin/bash
export HIVE_HOME=/usr/bin/hive
# use the date passed as $1 if given, otherwise default to yesterday
if [ $# == 1 ]
then
dateStr=$1
else
dateStr=`date -d '-1 day' +'%Y-%m-%d'`
fi
yearStr=`date -d ${dateStr} +'%Y'`
monthStr=`date -d ${dateStr} +'%m'`
# %-m drops the leading zero so the quarter arithmetic below works
month_for_quarter=`date -d ${dateStr} +'%-m'`
quarterStr=$((($month_for_quarter-1)/3+1))
dayStr=`date -d ${dateStr} +'%d'`
sqlStr="
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set hive.exec.orc.compression.strategy=COMPRESSION;
alter table itcast_dws.visit_dws drop partition(yearinfo='${yearStr}',quarterinfo='-1',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.visit_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.visit_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='${monthStr}',dayinfo='-1');
alter table itcast_dws.consult_dws drop partition(yearinfo='${yearStr}',quarterinfo='-1',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.consult_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.consult_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='${monthStr}',dayinfo='-1');
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'-1' as from_url,
'5' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}'
group by yearinfo;
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}' and quarterinfo='${quarterStr}'
group by yearinfo,quarterinfo;
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}'
group by yearinfo,quarterinfo,monthinfo;
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo;
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'1' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}'
group by yearinfo,area;
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'1' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}'
group by yearinfo,quarterinfo,area;
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'1' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}'
group by yearinfo,quarterinfo,monthinfo,area;
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'1' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'1' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;
"
${HIVE_HOME} -e "${sqlStr}" -S

Test:

sh edu_mode_1_analyse.sh

Check in Hive:

select * from itcast_dws.visit_dws where yearinfo='2024';

Upload to the workflow:

Incremental data export

Explanation:

When exporting, the previously exported current-year, current-quarter and current-month results in MySQL also need deleting before re-importing.

    The approach can be simplified here: the largest affected scope is the current year, so simply delete the whole current year's results in MySQL, then re-export the DWS layer's current-year data, i.e. export the current-year partitions. (MySQL supports direct deletes.)

Write the script:

hadoop01:
    cd /root
    vim edu_mode_1_export.sh

#!/bin/bash
export SQOOP_HOME=/usr/bin/sqoop
# use the date passed as $1 if given, otherwise default to yesterday
if [ $# == 1 ]
then
TD_DATE=$1
else
TD_DATE=`date -d '-1 day' +'%Y-%m-%d'`
fi
TD_YEAR=`date -d ${TD_DATE} +%Y`
# delete first: remove the current year's results from MySQL (MySQL supports row deletes)
mysql -uroot -p123456 -h192.168.52.150 -P3306 -e "delete from scrm_bi.visit_dws where yearinfo='$TD_YEAR'; delete from scrm_bi.consult_dws where yearinfo='$TD_YEAR';"
jdbcUrl='jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8'
username='root'
password='123456'
${SQOOP_HOME} export \
--connect ${jdbcUrl} \
--username ${username} \
--password ${password} \
--table visit_dws \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values $TD_YEAR \
-m 1
${SQOOP_HOME} export \
--connect ${jdbcUrl} \
--username ${username} \
--password ${password} \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values $TD_YEAR \
-m 1

Test, in the local MySQL:

select * from scrm_bi.visit_dws where yearinfo='2024';

Upload to the workflow:

This concludes the visit-and-consultation user data module. For the follow-up modules, see:

Hadoop Learning: A Hive-Based Education Platform Data Warehouse Analysis Case (Part 2)

For Part 3, see: A Hive-Based Education Platform Data Warehouse Analysis Case (Part 3)
