当前位置:   article > 正文

Hive两小案例_hive 综合应用案例 店铺

hive 综合应用案例 店铺

Hive案例

第1题

我们有如下的用户访问数据

userIdvisitDatevisitCount
u012017/1/215
u022017/1/236
u032017/1/228
u042017/1/203
u012017/1/236
u012017/2/218
u022017/1/236
u012017/2/224

要求使用SQL统计出每个用户的累积访问次数,如下表所示:

用户id月份小计累积
u012017-011111
u012017-021223
u022017-011212
u032017-0188
u042017-0133

数据:

u01 2017/1/21 5

u02 2017/1/23 6

u03 2017/1/22 8

u04 2017/1/20 3

u01 2017/1/23 6

u01 2017/2/21 8

u02 2017/1/23 6

u01 2017/2/22 4

  1. 创建table,加载数据

    create tbale action(
     userId string,visitDate     string ,visitCount   int 
    )
    row format delimited fields terminated by '\t';
    load data local inpath '/opt/module/datas/viste.txt' into table action;
    
    • 1
    • 2
    • 3
    • 4
    • 5
  2. 分析原表中的数据以及结果输出的表数据,发现visitDate字段需要进行格式转换以及分割,格式转换联想到date_format函数,

    select action.userid ,date_format(regexp_rep(action.visitDate,'/','-'),'yyyy-MM') date_dt,action.visitCount  from action;--t1
    
    • 1
  3. 分析结果表的小计字段,发现其为每个用户的每个月的花费小计,故使用分组,进行sum统计即可

    --select t1.userid,t1.date_dt,sum(t1.visitCount) from t1 group by t1.userid,t1.date_dt;  --t2
    
    select t1.userid,t1.date_dt,sum(t1.visitCount) sumvisitCount from 
    (select action.userid ,date_format(regexp_rep(action.visitDate,'/','-'),'yyyy-MM') date_dt,action.visitCount  from action) t1 group by t1.userid,t1.date_dt; --t2
    
    • 1
    • 2
    • 3
    • 4
  4. 继续观察总计字段,发现其为小计的当前行与其起始行的总计,即为到目前为止,每个用户所花费的总金额。所以需要开窗进行对行的范围限制,sum起始行到当前行的小计值,作为总计。

    --select t2.userid,sum(t2.sumvisitCount) over (partiotion by userid order by sumvisitCount rows between unbounded preceding and current row) totalvisitCount from t2;
    
    
    select t2.userid,sum(t2.sumvisitCount) over (partiotion by userid order by sumvisitCount rows between unbounded preceding and current row) totalvisitCount from (select t1.userid,t1.date_dt,sum(t1.visitCount) sumvisitCount from 
    (select action.userid ,date_format(regexp_rep(action.visitDate,'/','-'),'yyyy-MM') date_dt,action.visitCount  from action) t1 group by t1.userid,t1.date_dt) t2
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
第2题 京东

有50W个京东店铺,每个顾客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为visit,访客的用户id为user_id,被访问的店铺名称为shop,请统计:

u1 a

u2 b

u1 b

u1 a

u3 c

u4 b

u1 a

u2 c

u5 b

u4 b

u6 c

u2 c

u1 b

u2 a

u2 a

u3 a

u5 a

u5 a

u5 a

建表和加载数据:

create table visit(user_id string,shop string) row format delimited fields terminated by ‘\t’;

load data local inpath ‘/opt/module/datas/visit.txt’ into table visit;

1)每个店铺的UV(访客数)

–题目分析:对店铺进行分组统计,使用count函数进行对用户访问次数的统计(去重)

select shop,count(distinct user_id) uv from visit group by shop;
  • 1

2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数

–题目分析:

  1. 注意与上题的访客数不同,本题统计店铺的访问次数,即按照店铺分组统计总的访问次数(不是访客)

    select shop,user_id,count(*) visitCount from visit group by shop ,user_id; --t1
    
    • 1
  2. 对访问次数进行开窗,分区排序

    --select t1.shop,t1.user_id,t1.visitCount 
    rank() over (partiton by t1.shop order by t1.visitCount desc) vc1,
    dense_rank() over (partition by t1.shop order by t1.visitCount desc) vc2,
    row_number() over (partition by t1.shop order by t1.visitCount desc) vc3,
    from t1;--t2
    
    select t1.shop,t1.user_id,t1.visitCount 
    rank() over (partiton by t1.shop order by t1.visitCount desc) vc1,
    dense_rank() over (partition by t1.shop order by t1.visitCount desc) vc2,
    row_number() over (partition by t1.shop order by t1.visitCount desc) vc3,
    from (select shop,user_id,count(*) visitCount from visit group by shop ,user_id) t1; --t2
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
  3. 对访问数的top3进行筛选

    select * from t2 where t2.vc2<4;
    
    • 1
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/weixin_40725706/article/detail/588141
推荐阅读
相关标签
  

闽ICP备14008679号