Hive-Based Website User Behavior Analysis Project

I. Project Requirements

Compute the following metrics from the sample data provided:
1. Page views (PV) per product
2. Unique visitors (UV) per product
3. Page views (PV) per product category
4. Unique visitors (UV) per product category
5. Number of sessions on product and category pages
6. Unique visitors (UV) per external referral source

II. The Data (the access.log file)

access.log holds one day of access data from a JD-style e-commerce site. Fields within each record are separated by tab characters ("\t").
The fields mean the following:
Column 1, time: when the user accessed the site, i.e. the timestamp of the log entry
Column 2, IP address: the client's IP address
Column 3, method: GET or POST only
Column 4, URL: the URL the user is currently visiting
Column 5, HTTP protocol
Column 6, user ID: uniquely identifies a user
Column 7, referer URL: the previous URL the user visited
Column 8, status code: the HTTP status, e.g. 200 means the request succeeded
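For illustration only, a record in this layout might look like the line below. Every field value here is invented purely to show the format (tabs are rendered as wide spaces); the real file will differ:

2020-06-04 10:15:32    202.106.0.20    GET    https://item.jd.com/123456.html    HTTP/1.1    u0001234567    https://www.baidu.com/s?wd=jd    200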

III. Defining the Data Structures

Start the Hadoop cluster by running start-dfs.sh and start-yarn.sh.
Once the cluster is up, enter Hive and run the commands below.
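For reference, the startup sequence might look like this (standard Hadoop/Hive commands; exact paths depend on your installation):

start-dfs.sh     # start HDFS (NameNode/DataNodes)
start-yarn.sh    # start YARN (ResourceManager/NodeManagers)
jps              # check that the daemons are running
hive             # open the Hive CLI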

1. Create the log table matching the known log format. Because fields are split on tabs, the column order in the table must match the field order described in Section II.

create table log_data(
localtime string,
ip string,
method string,
url string,
http_protocol string,
uuid string,
referer_url string,
status_code string
)
PARTITIONED BY (dt int)
row format delimited fields terminated by '\t';

2. Create the result tables required by the project

Create the fact_goods table:
create table fact_goods(
goods_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);
Create the fact_category table:
create table fact_category(
category_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);
Create the fact_source table:
create table fact_source(
source_name string,
UV int
)
PARTITIONED BY (dt int);

3. Load the user behavior logs into the log table

1) Upload access.log to /usr/local/qst/logs on the virtual machine.
2) Run the following to load the log into the log table:

load data local inpath '/usr/local/qst/logs/access.log' overwrite into table log_data partition(dt=20200604); 
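To sanity-check the load (the partition value should match the date you loaded):

show partitions log_data;
select * from log_data where dt = 20200604 limit 5;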

IV. Implementation

1. Data analysis

Requirements 1-4 ask for product and category metrics. From the logs we can see there are only two types of URLs:

https://item.jd.com/<number>.html
https://list.jd.com/list.html?cat=<number>,<number>,<number>

Opening these URLs in a browser reveals the following pattern:
URLs of the form https://item.jd.com/<number>.html open product pages.
URLs of the form https://list.jd.com/list.html?cat=<number>,<number>,<number> open product category pages.
So the number in the first type is the product ID ("goods_id"), and the comma-separated numbers in the second type form the category ID ("category_id").
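Hive's parse_url and regexp_replace functions can extract these IDs directly from the URL. A quick check in the Hive CLI (the URLs below are made-up examples of the two patterns):

-- product ID: returns 123456
select regexp_replace(regexp_replace(parse_url('https://item.jd.com/123456.html', 'PATH'), '.html', ''), '/', '');
-- category ID: returns 9987,653,655
select parse_url('https://list.jd.com/list.html?cat=9987,653,655', 'QUERY', 'cat');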

2. Implementing the product and category metrics

1) Compute the product metrics. (These examples read from and write to partition dt=20200603; use the date of the partition you actually loaded, e.g. 20200604 from the load step above.)
insert overwrite table fact_goods partition(dt=20200603) 
select 
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data 
where dt = 20200603
and status_code == 200 
and length(uuid) >= 10 
and url  like 'https://item.jd.com/%' 
group by regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','');
2) Compute the category metrics
insert overwrite table fact_category partition(dt=20200603) 
select 
parse_url(url,'QUERY','cat') as category_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data 
where dt = 20200603
and status_code == 200 
and length(uuid) >= 10 
and url  like 'https://list.jd.com/list.html?cat=%' 
group by parse_url(url,'QUERY','cat') ;
3) Optimization

Computing the product and category metrics separately reads log_data twice, so the optimized version below uses Hive's from ... insert ... (multi-insert) form to read the data only once. It also fills in session_num: the window sum over the current row and the next row (partitioned by uuid, ordered by timestamp) equals the current timestamp plus the next one, so subtracting twice the current timestamp leaves the gap to the user's next event (stay_time). For example, if a user's consecutive events fall at t = 1000 and t = 1500 seconds, stay_time for the first event is (1000 + 1500) - 2 × 1000 = 500 seconds; for a user's last event there is no following row, so stay_time comes out negative and is ignored. An event followed by a gap of 1800 seconds (30 minutes) or more is treated as the end of a session, which is what session_num counts.

from(
select
url,
uuid,
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
parse_url(url,'QUERY','cat') as category_id,
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following)  - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200603
and status_code == 200 
and length(uuid) >= 10 
) tempTable 

insert overwrite table fact_goods partition(dt=20200603) 
select 
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://item.jd.com/%' 
group by goods_id

insert overwrite table fact_category partition(dt=20200603) 
select 
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url  like 'https://list.jd.com/list.html?cat=%' 
group by category_id;

3. Count unique visitors (UV) per external referral source

Here we determine the external source from the referer_url column in the log table: first make sure referer_url is non-empty and points to an off-site domain, then extract the referring domain.
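parse_url with the 'HOST' part returns just the domain, which is what we group by. For example (an illustrative referer value):

-- returns www.baidu.com
select parse_url('https://www.baidu.com/s?wd=hive', 'HOST');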

insert overwrite table fact_source partition(dt=20200603)
select
parse_url(referer_url, 'HOST') as source,
count(distinct uuid) as uv
from log_data
where dt = 20200603
and referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%'
and status_code == 200
and length(uuid) >= 10
group by parse_url(referer_url, 'HOST');

After optimization, we fold this into the from ... insert ... statement above:

from(
select
url,
referer_url,
uuid,
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
parse_url(url,'QUERY','cat') as category_id,
parse_url(referer_url,'HOST') as source,
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following)  - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code == 200 
and length(uuid) >= 10 
) tempTable 

insert overwrite table fact_goods partition(dt=20200604) 
select 
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://item.jd.com/%' 
group by goods_id

insert overwrite table fact_category partition(dt=20200604) 
select 
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url  like 'https://list.jd.com/list.html?cat=%' 
group by category_id 

insert overwrite table fact_source partition(dt=20200604) 
select 
source,
count(distinct uuid) as uv
where referer_url is not null and referer_url <> '-' and referer_url <> 'null' and length(referer_url) > 8 
and referer_url not like '%jd.com%' 
group by source;

V. Writing an Automated Run Script

1. Under /usr/local/qst, create the project directory data_helper, and inside data_helper create the following subdirectories:

bin - run scripts and code
conf - configuration files
data - data files
lib - libraries
log - log files
tmp - temporary files
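One way to create this layout in a single command (bash brace expansion):

mkdir -p /usr/local/qst/data_helper/{bin,conf,data,lib,log,tmp}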

2. Write the control script main.sh (in the bin folder)
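The script body is not included here; a minimal sketch of what main.sh might look like, assuming user_action.sql lives in bin, logs go to log, and the job processes the previous day's partition (all of these are assumptions):

#!/bin/bash
# Sketch of main.sh: run the Hive job for yesterday's partition and log the output.
BASE_DIR=/usr/local/qst/data_helper
DT=$(date -d "yesterday" +%Y%m%d)        # e.g. 20200604
LOG_FILE=${BASE_DIR}/log/run_${DT}.log

# Pass the partition date to the Hive script as a hivevar.
hive --hivevar dt=${DT} -f ${BASE_DIR}/bin/user_action.sql >> ${LOG_FILE} 2>&1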
3. Write the Hive script user_action.sql (folder)
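Likewise, user_action.sql is not shown; a sketch of how it could pick up the date passed from main.sh via ${hivevar:dt} (the variable name matches the main.sh sketch above and is an assumption):

-- user_action.sql (sketch): load the day's log and rebuild the result partitions.
load data local inpath '/usr/local/qst/logs/access.log'
overwrite into table log_data partition(dt=${hivevar:dt});
-- ...followed by the from ... insert ... multi-insert from Section IV,
-- with every hard-coded date replaced by ${hivevar:dt}.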
4. Schedule the job with crontab (the entry below runs main.sh every day at 03:00):

0 3 * * * /usr/local/qst/data_helper/bin/main.sh
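To install the schedule (assuming the entry is added with crontab -e and main.sh has been made executable):

chmod +x /usr/local/qst/data_helper/bin/main.sh   # make the script executable
crontab -e                                        # add the line above to the crontab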
References:
Hive functions
https://www.cnblogs.com/end/archive/2012/06/18/2553682.html

Hive from ... insert ... statement
http://blog.chinaunix.net/uid-30041424-id-5766715.html

Oracle window functions (Hive's window functions follow the same model)
https://blog.csdn.net/qq_41708308/article/details/89374701
https://blog.csdn.net/qq_33879355/article/details/78447430

UTM parameters
https://blog.csdn.net/yiguanfangzhou/article/details/94559120

Linux crontab usage
https://blog.csdn.net/yeguxin/article/details/79543514
Linux crontab testing (online tool)
https://tool.lu/crontab

What #!/bin/bash and #!/bin/sh mean, and how they differ
https://blog.csdn.net/u010486679/article/details/78534841

"/bin/bash^M: bad interpreter: No such file or directory" (fix CRLF line endings)
https://blog.csdn.net/ouyang_peng/article/details/86488451
sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh
