赞
踩
学习尚硅谷电商数仓项目笔记,以下是自己记录的DWD层用户行为日志解析的一些理解以及要点
启动日志结构
页面埋点日志结构
DWD层用户行为日志解析,主要是将ODS层中的 ods_log 日志解析出来
主要有以下5张表:
该表中一行对应一个明细的启动记录,应当包含日志中的公共信息和启动信息
drop table if exists dwd_start_log; CREATE EXTERNAL TABLE dwd_start_log( `area_code` string COMMENT '地区编码', `brand` string COMMENT '手机品牌', `channel` string COMMENT '渠道', `model` string COMMENT '手机型号', `mid_id` string COMMENT '设备id', `os` string COMMENT '操作系统', `user_id` string COMMENT '会员id', `version_code` string COMMENT 'app版本号', `entry` string COMMENT ' icon手机图标 notice 通知 install 安装后启动', `loading_time` bigint COMMENT '启动加载时间', `open_ad_id` string COMMENT '广告页ID ', `open_ad_ms` bigint COMMENT '广告总共播放时间', `open_ad_skip_ms` bigint COMMENT '用户跳过广告时点', `ts` bigint COMMENT '时间' ) COMMENT '启动日志表' PARTITIONED BY (dt string) -- 按照时间创建分区 stored as parquet -- 采用parquet列式存储 LOCATION '/warehouse/gmall/dwd/dwd_start_log' -- 指定在HDFS上存储位置 TBLPROPERTIES('parquet.compression'='lzo') -- 采用LZO压缩为parquet内部的列进行压缩 ;
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_start_log partition(dt='2020-06-14') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.ts') from ods_log where dt='2020-06-14' and get_json_object(line,'$.start') is not null;
hive (gmall)> set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
与启动日志表同理,页面日志表中每行数据对应一次明细的页面访问记录,应当包含公共信息核页面信息。建表语句和导入数据语句和启动日志表基本一致,这里不在赘述,可见下文中的脚本
动作日志表中每行数据对应用户的一个动作记录,一个动作记录应当包含公共信息、页面信息以及动作信息。可是现在的问题是,动作记录是包含在一个 action 字段数组中,这时候需要通过 UDTF 函数将action数组炸开,然后在解析每个字段
drop table if exists dwd_action_log; CREATE EXTERNAL TABLE dwd_action_log( `area_code` string COMMENT '地区编码', `brand` string COMMENT '手机品牌', `channel` string COMMENT '渠道', `model` string COMMENT '手机型号', `mid_id` string COMMENT '设备id', `os` string COMMENT '操作系统', `user_id` string COMMENT '会员id', `version_code` string COMMENT 'app版本号', `during_time` bigint COMMENT '持续时间毫秒', `page_item` string COMMENT '目标id ', `page_item_type` string COMMENT '目标类型', `last_page_id` string COMMENT '上页类型', `page_id` string COMMENT '页面id ', `source_type` string COMMENT '来源类型', `action_id` string COMMENT '动作id', `item` string COMMENT '目标id ', `item_type` string COMMENT '目标类型', `ts` bigint COMMENT '时间' ) COMMENT '动作日志表' PARTITIONED BY (dt string) stored as parquet LOCATION '/warehouse/gmall/dwd/dwd_action_log' TBLPROPERTIES('parquet.compression'='lzo');
可参考上文
https://editor.csdn.net/md/?articleId=114650015
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_action_log partition(dt='2020-06-14') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(action,'$.action_id'), get_json_object(action,'$.item'), get_json_object(action,'$.item_type'), get_json_object(action,'$.ts') from ods_log lateral view explode_json_array(get_json_object(line,'$.actions')) tmp as action where dt='2020-06-14' and get_json_object(line,'$.actions') is not null;
曝光日志表中每行数据对应一个曝光记录,一个曝光记录应当包含公共信息、页面信息以及曝光信息。先将包含display字段的日志过滤出来,然后通过UDTF函数,将display数组“炸开”,这里因为 display 数组中有多个曝光记录,因此和动作日志表解析大致相同,这里不再赘述
错误日志表中每行数据对应一个错误记录,为方便定位错误,一个错误记录应当包含与之对应的公共信息、页面信息、曝光信息、动作信息、启动信息以及错误信息。先将包含err字段的日志过滤出来,然后使用get_json_object函数解析所有字段。建表语句和导数据和启动日志、页面日志基本一致,这里也不再赘述
现在需要将上面各个表的导数据封装成脚本形式
脚本内容如下:
#!/bin/bash hive=/opt/module/hive/bin/hive APP=gmall # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi sql=" SET mapreduce.job.queuename=hive; SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_start_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.ts') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.start') is not null; insert overwrite table ${APP}.dwd_action_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(action,'$.action_id'), get_json_object(action,'$.item'), get_json_object(action,'$.item_type'), get_json_object(action,'$.ts') from ${APP}.ods_log lateral view ${APP}.explode_json_array(get_json_object(line,'$.actions')) tmp as action where dt='$do_date' and get_json_object(line,'$.actions') is not null; insert overwrite table ${APP}.dwd_display_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.ts'), get_json_object(display,'$.displayType'), get_json_object(display,'$.item'), get_json_object(display,'$.item_type'), get_json_object(display,'$.order') from ${APP}.ods_log lateral view ${APP}.explode_json_array(get_json_object(line,'$.displays')) tmp as display where dt='$do_date' and get_json_object(line,'$.displays') is not null; insert overwrite table ${APP}.dwd_page_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.ts') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.page') is not null; insert overwrite table ${APP}.dwd_error_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.actions'), get_json_object(line,'$.displays'), get_json_object(line,'$.ts'), get_json_object(line,'$.err.error_code'), get_json_object(line,'$.err.msg') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.err') is not null; " $hive -e "$sql"
脚本注意事项:
代码如下(示例):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
本文是自己学习尚硅谷电商数仓项目过程中对DWD层用户行为日志解析的一些记录,主要内容是建表、导数据语句的注意事项以及脚本编写的一些注意点。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。