Using the sample data provided, compute the following metrics:
1. Page views (PV) per product
2. Unique visitors (UV) per product
3. Page views (PV) per product category
4. Unique visitors (UV) per product category
5. Session counts for product and category pages
6. Unique visitors (UV) per external referral source
access.log contains one day of access data from a JD-style e-commerce site; fields are separated by the tab character "\t". The fields are:
Column 1, time: when the user accessed the site, i.e. the timestamp of this log entry
Column 2, IP address: the IP address of the client
Column 3, method: the access method, GET or POST only
Column 4, URL: the URL the user is currently visiting
Column 5, HTTP protocol
Column 6, user ID: uniquely identifies a user
Column 7, Referer URL: the URL the user visited previously
Column 8, status code: the status of the request, e.g. 200 means success
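For orientation, a hypothetical record in this format (fields separated by tabs, all values invented) might look like:
2020-06-04 10:15:32	36.25.13.1	GET	https://item.jd.com/12345.html	HTTP/1.1	f9e8d7c6b5a4	https://www.baidu.com/s?wd=phone	200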
Start the Hadoop cluster by running the start-dfs.sh and start-yarn.sh commands. Once the cluster is up, open the hive CLI and run the following statements to create the tables:
-- column order follows the file layout described above
create table log_data(
localtime string,
ip string,
method string,
url string,
http_protocol string,
uuid string,
referer_url string,
status_code string
)
PARTITIONED BY (dt int)
row format delimited fields terminated by '\t';
create table fact_goods(
goods_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);
create table fact_category(
category_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);
create table fact_source(
source_name string,
uv int
)
PARTITIONED BY (dt int);
1) Upload access.log to /usr/local/qst/logs on the virtual machine.
2) Run the following statement to load the log into the log table:
load data local inpath '/usr/local/qst/logs/access.log' overwrite into table log_data partition(dt=20200604);
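To sanity-check the load, a quick query such as the following can confirm that rows landed in the partition:
select * from log_data where dt = 20200604 limit 5;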
Requirements 1-4 ask for product and category metrics. From the log we can see there are only two kinds of URLs:
https://item.jd.com/<number>.html
https://list.jd.com/list.html?cat=<number>,<number>,<number>
Opening these URLs in a browser reveals the pattern: URLs of the first kind always open product pages, while URLs of the second kind always open category pages. We can therefore infer that the number in the first kind of URL is the product ID, and the number combination in the second kind is the category ID.
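To see how the IDs are extracted, the expressions used below can be tested on hypothetical URLs (the IDs here are invented):
-- returns 12345: 'PATH' yields '/12345.html', then '.html' and '/' are stripped
select regexp_replace(regexp_replace(parse_url('https://item.jd.com/12345.html', 'PATH'), '.html', ''), '/', '');
-- returns 670,671,672: the value of the 'cat' query parameter
select parse_url('https://list.jd.com/list.html?cat=670,671,672', 'QUERY', 'cat');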
insert overwrite table fact_goods partition(dt=20200604)
select
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num -- sessions are filled in by the optimized version below
from log_data
where dt = 20200604
and status_code == 200
and length(uuid) >= 10 -- skip records without a valid user ID
and url like 'https://item.jd.com/%'
group by regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','');
insert overwrite table fact_category partition(dt=20200604)
select
parse_url(url,'QUERY','cat') as category_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data
where dt = 20200604
and status_code == 200
and length(uuid) >= 10
and url like 'https://list.jd.com/list.html?cat=%'
group by parse_url(url,'QUERY','cat');
An optimized version: computing the product and category metrics separately reads log_data twice, so here we use Hive's from ... insert ... multi-insert syntax to read the data only once. The subquery also derives stay_time, the gap in seconds between a user's current page view and the next one: the windowed sum over "current row and 1 following" adds the current and next timestamps, and subtracting twice the current timestamp leaves next minus current. A gap of 1800 seconds (30 minutes) or more is treated as the end of a session, so counting such rows gives the session count.
from (
select
url,
uuid,
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
parse_url(url,'QUERY','cat') as category_id,
-- (ts_current + ts_next) - 2 * ts_current = ts_next - ts_current, the gap to the next page view
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code == 200
and length(uuid) >= 10
) tempTable
insert overwrite table fact_goods partition(dt=20200604)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num -- a gap of 30 minutes or more closes a session
where url like 'https://item.jd.com/%'
group by goods_id
insert overwrite table fact_category partition(dt=20200604)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id;
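The stay_time expression could also be written more directly with the lead() window function; for the session count the two forms behave the same (a sketch):
select
uuid,
lead(unix_timestamp(localtime)) over (partition by uuid order by unix_timestamp(localtime)) - unix_timestamp(localtime) as stay_time
from log_data
where dt = 20200604;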
To attribute external traffic we analyze the log table's referer_url: first check that referer_url is non-empty and points to an off-site domain, then extract that external domain as the source.
insert overwrite table fact_source partition(dt=20200604)
select
parse_url(referer_url,'HOST') as source,
count(distinct uuid) as uv
from log_data
where dt = 20200604
and referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%' -- keep only off-site referers
and status_code == 200
and length(uuid) >= 10
group by parse_url(referer_url,'HOST');
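For example, applied to a hypothetical search-engine referer, parse_url extracts just the host:
-- returns www.baidu.com
select parse_url('https://www.baidu.com/s?wd=hive', 'HOST');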
After optimizing, we fold this into the from ... insert ... statement above:
from (
select
url,
referer_url,
uuid,
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
parse_url(url,'QUERY','cat') as category_id,
parse_url(referer_url,'HOST') as source, -- the external referer domain
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code == 200
and length(uuid) >= 10
) tempTable
insert overwrite table fact_goods partition(dt=20200604)
select goods_id, count(1) as pv, count(distinct uuid) as uv, sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://item.jd.com/%'
group by goods_id
insert overwrite table fact_category partition(dt=20200604)
select category_id, count(1) as pv, count(distinct uuid) as uv, sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id
insert overwrite table fact_source partition(dt=20200604)
select source, count(distinct uuid) as uv
where referer_url is not null and referer_url <> '-' and referer_url <> 'null' and length(referer_url) > 8
and referer_url not like '%jd.com%'
group by source;
1. Under /usr/local/qst, create the project directory data_helper with the following subdirectories (one way to create them is shown after the table):
| Directory | Purpose |
|---|---|
| bin | run scripts and code |
| conf | configuration files |
| data | data files |
| lib | libraries |
| log | log files |
| tmp | temporary files |
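A minimal way to create this layout:
mkdir -p /usr/local/qst/data_helper/{bin,conf,data,lib,log,tmp}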
2. Write the control script main.sh (in the bin folder); a sketch follows.
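A minimal sketch of main.sh, assuming user_action.sql takes the target date through a hivevar named dt (the variable name and log-file layout are illustrative, not from the original):
#!/bin/bash
# Resolve the project root from this script's location
BASE_DIR=$(cd "$(dirname "$0")/.." && pwd)
# The 03:00 cron run processes the previous day's partition
DT=$(date -d "1 day ago" +%Y%m%d)
# Run the Hive script and append its output to the log directory
hive --hivevar dt="${DT}" -f "${BASE_DIR}/bin/user_action.sql" >> "${BASE_DIR}/log/main_${DT}.log" 2>&1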
3. Write the Hive script user_action.sql (also in the bin folder). It holds the combined from ... insert ... statement above, with every hard-coded date replaced by ${hivevar:dt} so that main.sh can pass the date in.
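A quick way to verify that the variable substitution works, assuming the tables above exist:
hive --hivevar dt=20200604 -e 'select count(1) from log_data where dt = ${hivevar:dt};'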
4. Schedule the job with crontab. The entry below runs main.sh every day at 03:00 (minute 0, hour 3, every day of every month):
0 3 * * * /usr/local/qst/data_helper/bin/main.sh
References:
Hive functions: https://www.cnblogs.com/end/archive/2012/06/18/2553682.html
Hive from ... insert ... statement: http://blog.chinaunix.net/uid-30041424-id-5766715.html
Window functions (Hive implements the same functionality as Oracle's): https://blog.csdn.net/qq_41708308/article/details/89374701 and https://blog.csdn.net/qq_33879355/article/details/78447430
UTM parameters: https://blog.csdn.net/yiguanfangzhou/article/details/94559120
Linux crontab usage: https://blog.csdn.net/yeguxin/article/details/79543514
Linux crontab testing: https://tool.lu/crontab
What #!/bin/bash and #!/bin/sh mean and how they differ: https://blog.csdn.net/u010486679/article/details/78534841
"/bin/bash^M: bad interpreter: No such file or directory" (caused by Windows line endings; fix with sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh): https://blog.csdn.net/ouyang_peng/article/details/86488451