当前位置:   article > 正文

大数据项目--知行教育(5)_黑马知行教育项目业务流程

黑马知行教育项目业务流程

看板1:访问咨询主题-增量采集

注意点:

  • 业务系统中,数据表的后缀是年_月,这就表明随着时间推移,被采集的表的后缀是动态变化的。

    我们的脚本也要做到这一点

  • 这个功能要做成自动化的脚本,脚本能够每天定时执行

增量采集,一天执行一次

当天采集昨日数据

数据模拟器

Linux系统中要求运行好数据模拟器,方便以后验证自己的增量脚本工作是否正常

模拟器地址:git@gitee.com:javacaoyu/edu-data-gen.git

在Linux安装Python3

yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make zlib zlib-devel libffi-devel -y

wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz

解压安装包,并cd进去

# 编译并安装
./configure --enable-optimizations --prefix=/usr/local/python3.7.4
make && make install

# 将Python执行程序放入/usr/bin/中
ln -s /usr/local/python3.7.4/bin/python3.7 /usr/bin/python3

# 安装代码依赖的第三方库
# 更新pip
python3 -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple

# 安装依赖的库
python3 -m pip install pymysql -i https://pypi.tuna.tsinghua.edu.cn/simple
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19

运行模拟器

运行增量模拟器

执行:

python3 current_visit_consult_data_gen.py

配置定时任务的示例:

* * * * * /usr/bin/python3 /root/current_visit_consult_data_gen.py >> /root/py_gen.log 2>&1
  • 1
运行全量数据生产器

执行:

python3 visit_consult_data_gen.py

模拟器内默认是从2019-07 生成到2020-12月

如果想修改,就修改python文件内的year_month_list对象的内容即可。

业务数据库 - > ODS的流程

  • 业务系统中,数据表的后缀是年_月,这就表明随着时间推移,被采集的表的后缀是动态变化的。
  • 我们的增量采集是按天来的,采集的数据能够带上按天过滤的条件

执行脚本:

#!/bin/bash

# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

# Define var
SQOOP_HOME=/usr/bin/sqoop
JDBCSTR="jdbc:mysql://192.168.52.150:3306/nev"
MYSQL_USERNAME=root
MYSQL_PASSWORD=123456
YEARSTR=`date -d "-1 day" +%Y`
MONTHSTR=`date -d "-1 day" +%m`
HIVE_DB=itcast_ods
MAP_NUMBER=3


echo "执行采集web_chat_ems_${YEARSTR}_${MONTHSTR}的任务......"

$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
  id,create_date_time,session_id,sid,create_time,
  seo_source,seo_keywords,ip,area,country,province,
  city,origin_channel,user AS user_match,manual_time,
  begin_time,end_time,last_customer_msg_time_stamp,
  last_agent_msg_time_stamp,reply_msg_count,msg_count,
  browser_name,os_info, CURRENT_DATE() AS start_time
FROM  web_chat_ems_${YEARSTR}_${MONTHSTR}
WHERE create_time >= '${DATESTR} 00:00:00' AND create_time <= '${DATESTR} 23:59:59' and \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_ems \
-m $MAP_NUMBER \
--split-by id



wait

echo "执行采集web_chat_text_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,referrer,from_url,landing_page_url,
url_title,platform_description,
other_params,history, CURRENT_DATE() AS start_time
FROM web_chat_text_ems_${YEARSTR}_${MONTHSTR}
WHERE 1=1 AND \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_text_ems \
-m $MAP_NUMBER \
--split-by id
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60

配置Oozie的定时任务

Oozie是Cloudera公司推出的。无缝和CM集群集成。可以在Hue中直接配置Oozie,更加方便

Azakaban:Linked公司推出

Oozie:也是一款任务调度的框架,和Azkaban基本差不多,功能性上其实Azkaban更好

项目中选择Oozie有2点:

  1. 在学习上,多用一款框架
  2. 在真实的企业中,一般选择CM的企业,多数都会选择Oozie

如图,Oozie在Hue中的位置

WorkFlow:表示的是一个单独的任务

计划:对配置好的WorkFlow配置执行的计划,比如定时的规则

Bundle:对一堆计划做整体规划

配置Oozie的WorkFlow

image-20201228104045825

配置Oozie的计划

image-20201228103912479

配置ODS -> DWD的流程

# 配置从ODS层到DWD层的脚本

# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`

# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
    QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
    QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
    QUARTERSTR=3
else
    QUARTERSTR=4
fi


echo "即将执行:${DATESTR} 这一天的ODS -> DWD的过程......"

$HIVE_HOME -e "
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
INSERT INTO TABLE itcast_dwd.visit_consult_dwd partition(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    w1.session_id,
    w1.sid,
    UNIX_TIMESTAMP(w1.create_time, 'yyyy-MM-dd HH:mm:ss') AS create_time,
    w1.seo_source,
    w1.ip,
    w1.area,
    w1.country,
    w1.province,
    w1.city,
    CAST(IF (w1.msg_count IS NULL, 0, w1.msg_count) AS INT) AS msg_count,
    w1.origin_channel,
    w2.referrer,
    w2.from_url,
    w2.landing_page_url,
    w2.url_title,
    w2.platform_description,
    w2.other_params,
    w2.history,
    SUBSTRING(w1.create_time, 12, 2) AS hourinfo,
    SUBSTRING(w1.create_time, 1, 4) AS yearinfo,
    QUARTER(w1.create_time) AS quarterinfo,
    SUBSTRING(w1.create_time, 6, 2) AS monthinfo,
    SUBSTRING(w1.create_time, 9, 2) AS dayinfo
FROM (SELECT * FROM itcast_ods.web_chat_ems WHERE start_time='${DATESTR}') AS w1 INNER JOIN itcast_ods.web_chat_text_ems AS w2
ON w1.id = w2.id;"
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68

DWD -> DWS 流程

问题

我们DWS中是包含了年、季度、月等维度的结果数据

由于新增了一天的数据,当前年、当前季度、当前月的结果数据就失效了

需要重新计算

问题在于:DWS表用的失效数据如何处理。

问题解决方式

方式1

删除过期数据

优点:

  • 对BI友好,没有历史数据的混淆,直接取就是最新的
  • 表中的数据清晰

缺点:

  • 执行复杂
  • 破坏了数仓设计中尽量不删除的原则
方式2:

增加新列,表名当前数据计算的时间

用的时候,取最新时间的即可

优点:

  • 表中存储了历史结果的变更
  • 不会执行删除,不破坏数仓的原则

缺点:

  • 对BI分析来说,需要先过滤最新的数据(稍微不友好)
  • 修改表结构(全量操作要重来)
方式3:

增加新表

每一天的结果都生成一个表(一天一个表)

优点:

  • 每个表都是清晰的,指向了具体某天的结果
  • 通过多个表的方式也记录的历史结果的变更

缺点:

  • 数据冗余过多(只要业务需要,冗余不是事)
  • 对BI不友好(换一天,换一个表,如果BI不支持动态规则配置自动变更表,就要手动变更)

这三种方式,没有绝对的谁好谁坏

合适的需求下,选择合适的方式即可。

脚本编写

过期数据的删除
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='当前月', dayinfo='-1');
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
脚本编写的说明

脚本采用全量脚本即可,需要改动的是:

针对年的带上:

WHERE yearinfo = '${YEARSTR}'

针对季度的带上:

WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'

针对月的带上:

WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'

针对天和小时的带上:

WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'

不管是访问量还是咨询量,这两个都需要带上上面的WHERE条件过滤

千万不要忘记了先删除过期数据

示例脚本

# DWD层数据聚合到DWS层的操作脚本(增量)



# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`

# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
    QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
    QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
    QUARTERSTR=3
else
    QUARTERSTR=4
fi


echo "先删除过期的${YEARSTR}年, ${QUARTERSTR}季度, ${MONTHSTR}月数据"
$HIVE_HOME -e "
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='${MONTHSTR}', dayinfo='-1');"


echo "执行新数据的计算,计算的日期是:${DATESTR}"
$HIVE_HOME -e "
/*
指标:访问量
维度:时间(年、季度、月、天、小时)、来源渠道、受访页面、搜索渠道、区域维度
*/
-- 时间维度表开发
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    yearinfo AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '5' AS time_type,
    yearinfo,
    '-1' AS quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo;
-- 统计某年某季度的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '4' AS time_type,
    yearinfo,
    quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo;
-- 统计某年某季度某月的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '3' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo;
-- 统计某年某季度某月某日的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '2' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo;
-- 统计某年某季度某月某日某小时的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '1' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;




-- 按照时间和其它维度汇合进行聚合
-- 按照时间和国家进行组合计算
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    yearinfo AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '5' AS time_type,
    yearinfo,
    '-1' AS quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo,country;
-- 统计某年某季度某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '4' AS time_type,
    yearinfo,
    quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo, country;
-- 统计某年某季度某月某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '3' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, country;
-- 统计某年某季度某月某天某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '2' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, country;
-- 统计某年某季度某月某天某小时某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '1' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo, country;
"
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143
  • 144
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151
  • 152
  • 153
  • 154
  • 155
  • 156
  • 157
  • 158
  • 159
  • 160
  • 161
  • 162
  • 163
  • 164
  • 165
  • 166
  • 167
  • 168
  • 169
  • 170
  • 171
  • 172
  • 173
  • 174
  • 175
  • 176
  • 177
  • 178
  • 179
  • 180
  • 181
  • 182
  • 183
  • 184
  • 185
  • 186
  • 187
  • 188
  • 189
  • 190
  • 191
  • 192
  • 193
  • 194
  • 195
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • 207
  • 208
  • 209
  • 210
  • 211
  • 212
  • 213
  • 214
  • 215
  • 216
  • 217
  • 218
  • 219
  • 220
  • 221
  • 222
  • 223
  • 224
  • 225
  • 226
  • 227
  • 228
  • 229
  • 230
  • 231
  • 232
  • 233
  • 234
  • 235
  • 236
  • 237
  • 238
  • 239
  • 240
  • 241
  • 242
  • 243
  • 244
  • 245
  • 246
  • 247
  • 248
  • 249
  • 250
  • 251
  • 252
  • 253
  • 254
  • 255
  • 256
  • 257
  • 258
  • 259
  • 260
  • 261
  • 262
  • 263
  • 264
  • 265
  • 266
  • 267
  • 268
  • 269
  • 270
  • 271
  • 272
  • 273
  • 274
  • 275
  • 276
  • 277
  • 278
  • 279
  • 280
  • 281
  • 282
  • 283
  • 284
  • 285
  • 286
  • 287
  • 288
  • 289
  • 290
  • 291
  • 292
  • 293
  • 294
  • 295
  • 296
  • 297
  • 298
  • 299
  • 300
  • 301
  • 302

配置Oozie定时

这个也需要每天执行一次。

配置到原有的Oozie的流程后面即可(都是同一个WorkFlow)

DWS -> MySQL

如果DWS用的是方式1,MySQL中也要删除过期

导出脚本执行之前需要在MySQL中删除即可

示例:

mysql -uroot -p123456 scrm_bi -e "
# 删除过期年
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='-1' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期季度
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期月
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='${MONTHSTR}' AND dayinfo='-1';
"

sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root --password 123456 \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${YEARSTR} \
-m 1
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18

最终的Oozie WorkFlow

image-20201228120539121

看板一作业

答辩

  • 能够说出项目介绍,项目是什么行业,要做什么。看板一是关注什么
  • 能够说出看板一的指标和维度有哪些
  • 能够说出建模分析的思路和过程
  • 能够分析针对增量数据的不同做法以及优缺点

实践

  • 从头到尾(从业务数据库 -> ODS -> DWD -> DWS -> MySQL)全流程能够完成下来

    课堂上省略了不少的SQL没写,自己做的时候可不能省略

    将你们的仓库地址发给我。

BI

简介

BI:Business Intelligence :商业智能

BI就是一种分析、探索数据的数据可视化工具

功能:

  • 对数据进行深入分析和探索
  • 在过程中将结果进行可视化展示

BI工具,用的最多是是数据分析师的岗位的人。

对于开发人员来说,数据分析探索的过程,我们通过数仓建模、代码分析等已经完成了,我们一般只使用可视化这个功能。

市面上常见的BI软件

  • PowerBI - 微软出品(收费)

    行业标杆,贵

  • Tableau - 收费BI工具

    行业标杆,贵

  • SuperSet - (开源、免费BI工具)

    现阶段比较,简陋的BI工具

  • 帆软BI(FineBi) - 国内的BI产品,收费(可以免费试用)

    功能做的基本OK,常用的BI功能都能实现

    课程中使用的

安装FineBi

在课程资料中有提供安装包,双击安装即可。

这个工具,本质上是一个B/S架构的软件

不是传统的C/S架构软件

B/S:browser server架构

C/S:Client server架构

FineBi就是一个在浏览器中使用的工具

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/码创造者/article/detail/976959
推荐阅读
相关标签
  

闽ICP备14008679号