This article is an exercise in big-data Spark operations, working through some simple analyses. In a later stage I will tune the setup on the three virtual machines, aiming for the best possible performance.
Dataset: Taobao User Behavior dataset, Alibaba Cloud Tianchi (aliyun.com)
Downloaded to: /opt/module/taobao/UserBehavior.csv
Item-category dimension data source: bigdata_analyse/category.sql at main · TurboWay/bigdata_analyse (github.com)
Downloaded to: /opt/module/taobao/category.sql
Field | Attribute | Notes |
---|---|---|
user_id | User ID | Integer, serialized user ID |
sku_id | Item ID | Integer, serialized item ID |
category_id | Item category ID | Integer, serialized ID of the category the item belongs to |
behavior | Behavior type | String, enum, one of ('pv', 'buy', 'cart', 'fav') |
create_time | Time of the behavior | String, time at which the behavior occurred |
dt | Partition key | string |
Machine configuration:
Three virtual machines were created, configured as follows:
 | hadoop1102 | hadoop1103 | hadoop1104 |
---|---|---|---|
CPU | | | |
Memory | 8GB | 4GB | 3GB |
Disk | 50GB | 50GB | 50GB |
HDFS | DataNode, NameNode | DataNode | DataNode, SecondaryNameNode |
YARN | NodeManager | NodeManager, ResourceManager | NodeManager |
In addition, ZooKeeper, Spark on Hive, and other components were configured. The software versions are as follows…
For the detailed setup, refer to the 尚硅谷 big data course.
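To confirm the role layout after starting the cluster, one quick check is to list the Java daemons on each node (a sketch; jps ships with the JDK, and the host names follow the table above):
ssh hadoop1102 jps   # expect NameNode, DataNode, NodeManager, ...
ssh hadoop1103 jps   # expect ResourceManager, DataNode, NodeManager, ...
ssh hadoop1104 jps   # expect SecondaryNameNode, DataNode, NodeManager, ...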
Run the category.sql file to import the item-category dimension data into the MySQL database.
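A minimal way to do this from the shell (a sketch; it assumes MySQL runs on hadoop1102 with the root account, as in the Sqoop script below, and that category.sql contains only the table DDL and inserts):
# create the MySQL taobao database, then load the category dimension script into it
mysql -h hadoop1102 -uroot -p -e "create database if not exists taobao;"
mysql -h hadoop1102 -uroot -p taobao < /opt/module/taobao/category.sql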
In the /opt/module/taobao directory, create mysql_to_hdfs.sh and run it to import the MySQL data into HDFS.
#!/bin/bash

APP=taobao
sqoop=/opt/module/sqoop-1.4.6/bin/sqoop

import_data(){
  $sqoop import \
    --connect jdbc:mysql://hadoop1102:3306/$APP \
    --username root \
    --password XXXX \
    --target-dir /origin_data/$APP/db/category \
    --delete-target-dir \
    --query "$1 and \$CONDITIONS" \
    --num-mappers 1 \
    --fields-terminated-by '\t' \
    --null-string '\\N' \
    --null-non-string '\\N'
}

import_data "select * from category where 1=1"
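A usage sketch (assuming the script was saved under /opt/module/taobao as described above):
chmod +x mysql_to_hdfs.sh
./mysql_to_hdfs.sh
# the imported files should now be visible under the target directory
hadoop fs -ls /origin_data/taobao/db/category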
From the shell, start the Hive CLI and create the taobao database:
hive
hive(default)>create database taobao;
hive(default)>use taobao;
hive(taobao)>create external table ods_category (
`sub_category_id` bigint comment 'id',
`parent_category_name` string comment 'category name'
) comment 'item category dimension table'
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/ods/ods_category/';
hive(taobao)>load data inpath '/origin_data/taobao/db/category' into table taobao.ods_category;
Go to /opt/module/spark-yarn/ and run the following command to start the Spark SQL Thrift server:
sbin/start-thriftserver.sh
You can then write SQL from the Database plugin in IDEA, connected to this Thrift server as sketched below.
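A connection sketch (assumptions: the Thrift server was started on hadoop1102 and listens on the default HiveServer2 port 10000; adjust the host, port, and user to your setup):
# verify the server from the command line with Spark's beeline
bin/beeline -u jdbc:hive2://hadoop1102:10000 -n root
# in IDEA's Database plugin, add an Apache Hive data source with the same URL:
# jdbc:hive2://hadoop1102:10000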
drop table if exists taobao.ods_user_behavior;
create external table ods_user_behavior (
`user_id` bigint,
`sku_id` bigint,
`category_id` bigint,
`behavior` string,
`timestamp` bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/warehouse/taobao/ods/ods_user_behavior/';
load data local inpath '/opt/module/taobao/UserBehavior.csv' into table taobao.ods_user_behavior;
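A quick sanity check on the load (a sketch; the count should match the number of lines in UserBehavior.csv):
select count(*) from taobao.ods_user_behavior;
select * from taobao.ods_user_behavior limit 5;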
Next, build the DWD-layer table by joining the ODS behavior data with the category dimension and converting the Unix timestamp into a readable time plus a date partition:
drop table if exists taobao.dwd_user_behavior;
create external table dwd_user_behavior (
    `user_id` bigint,
    `sku_id` bigint,
    `category_id` bigint,
    `category_name` string,
    `behavior` string,
    `create_time` string
)
partitioned by (`dt` string)
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/dwd/dwd_user_behavior/';
insert overwrite table dwd_user_behavior
select user_id,
       sku_id,
       category_id,
       parent_category_name category_name,
       behavior,
       from_unixtime(`timestamp`, 'yyyy-MM-dd HH:mm:ss') as create_time,
       from_unixtime(`timestamp`, 'yyyy-MM-dd') as dt
from ods_user_behavior oub
join ods_category oc on oub.category_id = oc.sub_category_id
where from_unixtime(`timestamp`, 'yyyy-MM-dd') >= '2017-11-25'
  and from_unixtime(`timestamp`, 'yyyy-MM-dd') <= '2017-12-03';
Here dt is used as the partition key, i.e. the data is partitioned by date. This has a nice side effect: queries that filter on dt only scan the matching date partitions instead of the whole table.
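For example, the following only reads the 2017-12-02 partition directory under /warehouse/taobao/dwd/dwd_user_behavior/ (a sketch; any date in the 2017-11-25 to 2017-12-03 range works):
select count(*) from dwd_user_behavior where dt = '2017-12-02';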
At this point a dataset of roughly 10 million behavior records is sitting in Hive, and the fun analysis can begin!
All of the analysis below runs on Spark.
drop table if exists ads_date_hour_behavior;
create table ads_date_hour_behavior(
    date string,
    hour string,
    pv bigint,
    cart bigint,
    fav bigint,
    buy bigint
)
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/ads/ads_date_hour_behavior/';
insert into ads_date_hour_behavior
select date_format(create_time, 'yyyy-MM-dd'),
       hour(create_time),
       sum(`if`(behavior='pv', 1, 0)) pv,
       sum(`if`(behavior='cart', 1, 0)) cart,
       sum(`if`(behavior='fav', 1, 0)) fav,
       sum(`if`(behavior='buy', 1, 0)) buy
from dwd_user_behavior
group by date_format(create_time, 'yyyy-MM-dd'), hour(create_time)
order by date_format(create_time, 'yyyy-MM-dd'), hour(create_time);
All four behavior types follow a similar distribution over the day, with very little activity in the small hours of the morning.
-- daily PV, UV and average page views per user (PV/UV)
drop table if exists ads_pv_uv_puv;
create table ads_pv_uv_puv(
    dates string,
    pv bigint,
    uv bigint,
    puv decimal(16,2)
)
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/ads/ads_pv_uv_puv/';
insert overwrite table ads_pv_uv_puv
select date_format(create_time, 'yyyy-MM-dd') dates,
       count(behavior) pv,
       count(distinct user_id) uv,
       cast(count(behavior)/count(distinct user_id) as decimal(16, 2)) as puv
from dwd_user_behavior
where behavior = 'pv'
group by dates;
drop table if exists ads_category_sell_rank;
create table ads_category_sell_rank (
    category_name string,
    sku_id bigint,
    sell_num bigint,
    rank bigint
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_category_sell_rank';
-- top 10 best-selling items within each category
insert overwrite table ads_category_sell_rank
select category_name, sku_id, sell_num, rank
from (
    select distinct category_name,
           sku_id,
           cnt sell_num,
           dense_rank() over (partition by category_name order by cnt desc) rank
    from (
        select sku_id,
               category_name,
               count(sku_id) over (partition by sku_id) cnt
        from dwd_user_behavior
        where behavior='buy'
    ) t1
) t2
where rank <= 10;
The item with sku_id 1115492 sold best in the clothing category, taking the sales crown.
Electronics were the hottest sellers over these days. I wonder which product it was; surely not the Xiaomi Mi 6? (Mi fans rejoice)
-- count users with only a single behavior record over the whole period
select count(*)
from
(
select user_id from dwd_user_behavior
group by user_id
having count(behavior)=1
) a;
There are 105 such users, i.e. users who left only a single behavior record.
drop table if exists dws_user_behavior;
create table dws_user_behavior (
    user_id bigint,
    all_count bigint,
    pv_count bigint,
    fav_count bigint,
    cart_count bigint,
    buy_count bigint
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/dws/dws_user_behavior';
-- per-user behavior summary
insert overwrite table dws_user_behavior
select user_id,
       count(behavior) all_count,
       sum(`if`(behavior='pv', 1, 0)) pv_count,
       sum(`if`(behavior='fav', 1, 0)) fav_count,
       sum(`if`(behavior='cart', 1, 0)) cart_count,
       sum(`if`(behavior='buy', 1, 0)) buy_count
from dwd_user_behavior
group by user_id;
-- overall repurchase rate: share of purchasing users who bought more than once
select concat(round(buy_two/buy_one*100, 2), '%') as `repurchase_rate`
from
(select sum(case when buy_count > 1 then 1 else 0 end) buy_two,
        sum(case when buy_count > 0 then 1 else 0 end) buy_one
 from dws_user_behavior) t1;
The result is 65.9%.
Since the dataset covers 2017-11-25 through 2017-12-03, 2017-12-04 is taken as the reference date. For each customer, count how many days lie between their most recent purchase and 2017-12-04, then rank customers by that interval: the fewer the days, the more valuable the customer and the higher the rank.
Then count each user's purchase frequency and rank by it: the higher the frequency, the more valuable the customer and the higher the rank.
select user_id, recent, dense_rank() over (order by recent asc) as recent_rank, frequency, dense_rank() over (order by frequency desc) as frequency_rank
from (
select user_id, datediff('2017-12-04', max(create_time)) as recent, count(behavior) frequency
from dwd_user_behavior
where behavior='buy'
group by user_id
) t1
Users with purchase behavior are divided into four groups by rank: the top quarter scores 4 points, the second quarter (between 1/4 and 1/2) scores 3 points, the third quarter (between 1/2 and 3/4) scores 2 points, and the remaining users score 1 point. This rule is applied separately to the recency rank and the frequency rank, and the two scores are added together to give each user's final score. For example, a user whose recency rank falls in the top quarter (4 points) and whose frequency rank falls in the third quarter (2 points) gets a final score of 4 + 2 = 6; final scores therefore range from 2 to 8.
-- count the number of distinct users who made a purchase
select count(distinct user_id) as user_buy_num
from dwd_user_behavior
where behavior = 'buy';
The result is 671,043.
drop table if exists ads_rfm_score;
create table ads_rfm_score(
    user_id bigint,
    score int
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_rfm_score';
insert overwrite table ads_rfm_score
select user_id,
       case when recent_rank < 671043 * 1 / 4 then 4
            when recent_rank < 671043 * 2 / 4 then 3
            when recent_rank < 671043 * 3 / 4 then 2
            else 1
       end
       +
       case when frequency_rank < 671043 * 1 / 4 then 4
            when frequency_rank < 671043 * 2 / 4 then 3
            when frequency_rank < 671043 * 3 / 4 then 2
            else 1
       end as score
from (
    select user_id,
           recent,
           dense_rank() over (order by recent asc) as recent_rank,
           frequency,
           dense_rank() over (order by frequency desc) as frequency_rank
    from (
        select user_id,
               datediff('2017-12-04', max(create_time)) as recent,
               count(behavior) frequency
        from dwd_user_behavior
        where behavior='buy'
        group by user_id
    ) t1
) rfm;
-- distribution of purchasing users over each score
drop table if exists ads_score_summary;
create table ads_score_summary(
    score bigint,
    cnt bigint,
    buy_user_num bigint,
    rate string
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_score_summary';
with all_score as (
    select 0 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 1 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 2 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 3 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 4 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 5 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 6 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 7 score, 0 cnt, 0 buy_user_num, 0 rate union all
    select 8 score, 0 cnt, 0 buy_user_num, 0 rate
),
now_score as (
    select score,
           count(1) cnt,
           671043 `buy_user_num`,
           concat(round(count(1) / 671043 * 100, 2), '%') rate
    from ads_rfm_score
    group by score
    order by score asc
)
insert overwrite table ads_score_summary
select nvl(now_score.score, all_score.score),
       nvl(now_score.cnt, 0),
       nvl(now_score.buy_user_num, 0),
       nvl(now_score.rate, 0)
from now_score
right join all_score on now_score.score = all_score.score;
The same user may buy the same item multiple times, so here we compute a per-item repurchase rate: among the users who bought an item, the share who bought it more than once. A high rate marks items that are bought again and again, such as consumables like tissues or food.
drop table if exists ads_sku_rebuy;
create table ads_sku_rebuy(
    sku_id bigint,
    `rebuy_rate` string
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_sku_rebuy';
with t1 as (
    select sku_id, user_id, count(*) as buy_count
    from dwd_user_behavior
    where behavior='buy'
    group by sku_id, user_id
),
t2 as (
    select sku_id,
           user_id,
           sum(`if`(buy_count > 0, 1, 0)) buy_one,
           sum(`if`(buy_count > 1, 1, 0)) buy_two
    from t1
    group by sku_id, user_id
)
insert overwrite table ads_sku_rebuy
select sku_id,
       concat(round(sum(buy_two)/sum(buy_one)*100, 2), '%') rebuy_rate
from t2
group by sku_id;
That wraps up the analysis of Taobao user purchase behavior! Next up, I will continue with Spark tuning and try to get the best performance out of this virtual machine cluster. Stay tuned!