赞
踩
注意点:
业务系统中,数据表的后缀是年_月
,这就表明随着时间推移,被采集的表的后缀是动态变化的。
我们的脚本也要做到这一点
这个功能要做成自动化的脚本,脚本能够每天定时执行
增量采集,一天执行一次
当天采集昨日数据
在Linux系统中要求运行好数据模拟器,方便以后验证自己的增量脚本工作是否正常
模拟器地址:git@gitee.com:javacaoyu/edu-data-gen.git
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make zlib zlib-devel libffi-devel -y
wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
解压安装包,并cd进去
# 编译并安装
./configure --enable-optimizations --prefix=/usr/local/python3.7.4
make && make install
# 将Python执行程序放入/usr/bin/中
ln -s /usr/local/python3.7.4/bin/python3.7 /usr/bin/python3
# 安装代码依赖的第三方库
# 更新pip
python3 -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
# 安装依赖的库
python3 -m pip install pymysql -i https://pypi.tuna.tsinghua.edu.cn/simple
执行:
python3 current_visit_consult_data_gen.py
配置定时任务的示例:
* * * * * /usr/bin/python3 /root/current_visit_consult_data_gen.py >> /root/py_gen.log 2>&1
执行:
python3 visit_consult_data_gen.py
模拟器内默认是从2019-07 生成到2020-12月
如果想修改,就修改python文件内的year_month_list对象的内容即可。
年_月
,这就表明随着时间推移,被采集的表的后缀是动态变化的。执行脚本:
#!/bin/bash
# 需求:能够手动传入参数指定采集的日期。 如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
DATESTR=$1
fi
# Define var
SQOOP_HOME=/usr/bin/sqoop
JDBCSTR="jdbc:mysql://192.168.52.150:3306/nev"
MYSQL_USERNAME=root
MYSQL_PASSWORD=123456
YEARSTR=`date -d "-1 day" +%Y`
MONTHSTR=`date -d "-1 day" +%m`
HIVE_DB=itcast_ods
MAP_NUMBER=3
echo "执行采集web_chat_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,create_date_time,session_id,sid,create_time,
seo_source,seo_keywords,ip,area,country,province,
city,origin_channel,user AS user_match,manual_time,
begin_time,end_time,last_customer_msg_time_stamp,
last_agent_msg_time_stamp,reply_msg_count,msg_count,
browser_name,os_info, CURRENT_DATE() AS start_time
FROM web_chat_ems_${YEARSTR}_${MONTHSTR}
WHERE create_time >= '${DATESTR} 00:00:00' AND create_time <= '${DATESTR} 23:59:59' and \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_ems \
-m $MAP_NUMBER \
--split-by id
wait
echo "执行采集web_chat_text_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,referrer,from_url,landing_page_url,
url_title,platform_description,
other_params,history, CURRENT_DATE() AS start_time
FROM web_chat_text_ems_${YEARSTR}_${MONTHSTR}
WHERE 1=1 AND \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_text_ems \
-m $MAP_NUMBER \
--split-by id
Oozie是Cloudera公司推出的。无缝和CM集群集成。可以在Hue中直接配置Oozie,更加方便
Azakaban:Linked公司推出
Oozie:也是一款任务调度的框架,和Azkaban基本差不多,功能性上其实Azkaban更好
项目中选择Oozie有2点:
如图,Oozie在Hue中的位置
WorkFlow:表示的是一个单独的任务
计划:对配置好的WorkFlow配置执行的计划,比如定时的规则
Bundle:对一堆计划做整体规划
# 配置从ODS层到DWD层的脚本
# 需求:能够手动传入参数指定采集的日期。 如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
DATESTR=$1
fi
HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`
# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
QUARTERSTR=3
else
QUARTERSTR=4
fi
echo "即将执行:${DATESTR} 这一天的ODS -> DWD的过程......"
$HIVE_HOME -e "
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
INSERT INTO TABLE itcast_dwd.visit_consult_dwd partition(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
w1.session_id,
w1.sid,
UNIX_TIMESTAMP(w1.create_time, 'yyyy-MM-dd HH:mm:ss') AS create_time,
w1.seo_source,
w1.ip,
w1.area,
w1.country,
w1.province,
w1.city,
CAST(IF (w1.msg_count IS NULL, 0, w1.msg_count) AS INT) AS msg_count,
w1.origin_channel,
w2.referrer,
w2.from_url,
w2.landing_page_url,
w2.url_title,
w2.platform_description,
w2.other_params,
w2.history,
SUBSTRING(w1.create_time, 12, 2) AS hourinfo,
SUBSTRING(w1.create_time, 1, 4) AS yearinfo,
QUARTER(w1.create_time) AS quarterinfo,
SUBSTRING(w1.create_time, 6, 2) AS monthinfo,
SUBSTRING(w1.create_time, 9, 2) AS dayinfo
FROM (SELECT * FROM itcast_ods.web_chat_ems WHERE start_time='${DATESTR}') AS w1 INNER JOIN itcast_ods.web_chat_text_ems AS w2
ON w1.id = w2.id;"
我们DWS中是包含了年、季度、月等维度的结果数据
由于新增了一天的数据,当前年、当前季度、当前月的结果数据就失效了
需要重新计算
问题在于:DWS表用的失效数据如何处理。
删除过期数据
优点:
缺点:
增加新列,表名当前数据计算的时间
用的时候,取最新时间的即可
优点:
缺点:
增加新表
每一天的结果都生成一个表(一天一个表)
优点:
缺点:
这三种方式,没有绝对的谁好谁坏
合适的需求下,选择合适的方式即可。
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='当前月', dayinfo='-1');
脚本采用全量脚本即可,需要改动的是:
针对年的带上:
WHERE yearinfo = '${YEARSTR}'
针对季度的带上:
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
针对月的带上:
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
针对天和小时的带上:
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
不管是访问量还是咨询量,这两个都需要带上上面的WHERE条件过滤
千万不要忘记了先删除过期数据
# DWD层数据聚合到DWS层的操作脚本(增量)
# 需求:能够手动传入参数指定采集的日期。 如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
DATESTR=$1
fi
HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`
# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
QUARTERSTR=3
else
QUARTERSTR=4
fi
echo "先删除过期的${YEARSTR}年, ${QUARTERSTR}季度, ${MONTHSTR}月数据"
$HIVE_HOME -e "
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='${MONTHSTR}', dayinfo='-1');"
echo "执行新数据的计算,计算的日期是:${DATESTR}"
$HIVE_HOME -e "
/*
指标:访问量
维度:时间(年、季度、月、天、小时)、来源渠道、受访页面、搜索渠道、区域维度
*/
-- 时间维度表开发
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
yearinfo AS time_str,
'-1' AS from_url,
'5' AS groupType,
'5' AS time_type,
yearinfo,
'-1' AS quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo;
-- 统计某年某季度的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'4' AS time_type,
yearinfo,
quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo;
-- 统计某年某季度某月的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'3' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo;
-- 统计某年某季度某月某日的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'2' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo;
-- 统计某年某季度某月某日某小时的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'1' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
-- 按照时间和其它维度汇合进行聚合
-- 按照时间和国家进行组合计算
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
yearinfo AS time_str,
'-1' AS from_url,
'1' AS groupType,
'5' AS time_type,
yearinfo,
'-1' AS quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo,country;
-- 统计某年某季度某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'4' AS time_type,
yearinfo,
quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo, country;
-- 统计某年某季度某月某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'3' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, country;
-- 统计某年某季度某月某天某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'2' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, country;
-- 统计某年某季度某月某天某小时某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'1' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo, country;
"
这个也需要每天执行一次。
配置到原有的Oozie的流程后面即可(都是同一个WorkFlow)
如果DWS用的是方式1,MySQL中也要删除过期
导出脚本执行之前需要在MySQL中删除即可
示例:
mysql -uroot -p123456 scrm_bi -e "
# 删除过期年
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='-1' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期季度
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期月
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='${MONTHSTR}' AND dayinfo='-1';
"
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root --password 123456 \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${YEARSTR} \
-m 1
从头到尾(从业务数据库 -> ODS -> DWD -> DWS -> MySQL)全流程能够完成下来
课堂上省略了不少的SQL没写,自己做的时候可不能省略
将你们的仓库地址发给我。
BI:Business Intelligence :商业智能
BI就是一种分析、探索数据的数据可视化工具
功能:
BI工具,用的最多是是数据分析师
的岗位的人。
对于开发人员来说,数据分析探索的过程,我们通过数仓建模、代码分析等已经完成了,我们一般只使用可视化这个功能。
PowerBI - 微软出品(收费)
行业标杆,贵
Tableau - 收费BI工具
行业标杆,贵
SuperSet - (开源、免费BI工具)
现阶段比较,简陋的BI工具
帆软BI(FineBi) - 国内的BI产品,收费(可以免费试用)
功能做的基本OK,常用的BI功能都能实现
课程中使用的
在课程资料中有提供安装包,双击安装即可。
这个工具,本质上是一个B/S
架构的软件
不是传统的C/S架构软件
B/S:browser server架构
C/S:Client server架构
FineBi就是一个在浏览器中使用的工具
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。