
Website/App Traffic Analysis & User Visit Analysis: Page Traffic Analysis

Page Traffic Analysis




  1. 2. Data warehouse design in this project (note: a star schema is adopted)
  2. 1. Fact table design

 

	2. Dimension table design

  1. Note:
  2. Dimension-table data is generally generated with your own scripts according to business rules (or with a tool), so that later join-based analysis is convenient.
  3. For example, the time dimension table is usually populated in advance, spanning from the earliest date the business needs up to the current date; depending on the analysis granularity,
  4. attributes such as year, quarter, month, week, day and hour can be generated for analysis.
  5. 3. Module development -- ETL
  6. The essence of ETL is to extract data from the various sources, transform it, and finally load it into the dimensionally modeled tables of the data warehouse.
  7. Only once these dimension/fact tables are populated is the ETL work complete.
  8. The analysis in this project runs on a Hadoop cluster, mainly with the Hive data warehouse tool, so the collected and preprocessed data must be loaded into Hive for the subsequent analysis.
  9. 1. Create the ODS-layer tables
  10. 1. Raw log data table
  11. 1.drop table if exists ods_weblog_origin;
  12. 2.create table ods_weblog_origin(
  13. valid string,
  14. remote_addr string,
  15. remote_user string,
  16. time_local string,
  17. request string,
  18. status string,
  19. body_bytes_sent string,
  20. http_referer string,
  21. http_user_agent string)
  22. partitioned by (datestr string)
  23. row format delimited fields terminated by '\001';
  24. 2. Clickstream pageviews model table
  25. 1.drop table if exists ods_click_pageviews;
  26. 2.create table ods_click_pageviews(
  27. session string,
  28. remote_addr string,
  29. remote_user string,
  30. time_local string,
  31. request string,
  32. visit_step string,
  33. page_staylong string,
  34. http_referer string,
  35. http_user_agent string,
  36. body_bytes_sent string,
  37. status string)
  38. partitioned by (datestr string)
  39. row format delimited fields terminated by '\001';
  40. 3. Clickstream visit model table
  41. 1.drop table if exists ods_click_stream_visit;
  42. 2.create table ods_click_stream_visit(
  43. session string,
  44. remote_addr string,
  45. inTime string,
  46. outTime string,
  47. inPage string,
  48. outPage string,
  49. referal string,
  50. pageVisits int)
  51. partitioned by (datestr string)
  52. row format delimited fields terminated by '\001';
  53. 2. Load data into the ODS layer
  54. 1. Load the data: load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918');
  55. 2. View partitions: show partitions ods_weblog_origin;
  56. 3. Count the loaded rows: select count(*) from ods_weblog_origin;
  57. 4. The two clickstream model tables (pageviews and visit) are loaded the same way.
  58. 5. Note: in production, the load commands should go into a script scheduled in Azkaban; mind the schedule, which must run after data preprocessing completes.
  59. 3. Build the ODS detail wide table
  60. 1. Requirement
  61. The whole analysis proceeds layer by layer through the data warehouse. Broadly, intermediate tables are derived from the ODS raw data
  62. (for instance, to ease later analysis, unstructured fields in the raw data such as time and url are extracted into structured columns, refining the fields into a detail table),
  63. and the various metrics are then computed on top of those intermediate tables.
  64. 2. ETL implementation: create the detail table ods_weblog_detail
  65. 1.drop table ods_weblog_detail;
  66. 2.create table ods_weblog_detail(
  67. valid string, -- valid flag
  68. remote_addr string, -- source IP
  69. remote_user string, -- user identifier
  70. time_local string, -- full access time
  71. daystr string, -- access date
  72. timestr string, -- access time
  73. month string, -- access month
  74. day string, -- access day
  75. hour string, -- access hour
  76. request string, -- requested url
  77. status string, -- response code
  78. body_bytes_sent string, -- bytes transferred
  79. http_referer string, -- referer url
  80. ref_host string, -- referer host
  81. ref_path string, -- referer path
  82. ref_query string, -- referer query string
  83. ref_query_id string, -- referer query value
  84. http_user_agent string) -- client user agent
  85. partitioned by(datestr string);
  86. 3. Insert data into the detail wide table ods_weblog_detail via query
  87. 1. Extract refer_url into the intermediate table t_ods_tmp_referurl, i.e. split the referring url into host, path, query and query id.
  88. 2.drop table if exists t_ods_tmp_referurl;
  89. 3.create table t_ods_tmp_referurl as
  90. SELECT a.*,b.*
  91. FROM ods_weblog_origin a
  92. LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b
  93. as host, path, query, query_id;
  94. 4. LATERAL VIEW is used together with UDTFs such as split and explode; it can expand one column into multiple rows.
  95. 5. UDTF (User-Defined Table-Generating Functions):
  96. used when one input row must produce multiple output rows (a one-to-many mapping).
  97. explode is such an expansion function: for explode(ARRAY), each element of the array yields one row.
  98. 4. Extract and transform the time_local field into the intermediate detail table t_ods_tmp_detail
  99. 1.drop table if exists t_ods_tmp_detail;
  100. 2.create table t_ods_tmp_detail as
  101. select b.*,substring(time_local,0,10) as daystr,
  102. substring(time_local,12) as tmstr,
  103. substring(time_local,6,2) as month,
  104. substring(time_local,9,2) as day,
  105. substring(time_local,11,3) as hour
  106. from t_ods_tmp_referurl b;
  107. 5. The statements above can be merged into a single overall statement
  108. insert into table shizhan.ods_weblog_detail partition(datestr='2013-09-18')
  109. select c.valid,c.remote_addr,c.remote_user,c.time_local,
  110. substring(c.time_local,0,10) as daystr,
  111. substring(c.time_local,12) as tmstr,
  112. substring(c.time_local,6,2) as month,
  113. substring(c.time_local,9,2) as day,
  114. substring(c.time_local,11,3) as hour,
  115. c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
  116. from (SELECT a.valid,a.remote_addr,a.remote_user,a.time_local, a.request,a.status,a.body_bytes_sent,a.http_referer,
  117. a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id
  118. FROM shizhan.ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST',
  119. 'PATH','QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id) c;
  120. 4. Module development -- statistical analysis
  121. Once the data warehouse is built, users can run Hive SQL against it to analyze the data.
  122. In practice, which metrics are needed is usually decided by the data-requesting departments, and new requirements keep arriving; below are some typical metrics for website traffic analysis.
  123. Note: every metric can be drilled through against the dimension tables.
  124. 1. Traffic analysis
  125. 1. Multi-dimensional total PV
  126. 1. By time dimension
  127. 1. Compute hourly pvs; note the group by syntax
  128. select count(*) as pvs,month,day,hour from ods_weblog_detail group by month,day,hour;
  129. 2. Option one: query the single table ods_weblog_detail directly
  130. 1. Compute the hourly pvs within this processing batch (one day)
  131. 1.drop table dw_pvs_everyhour_oneday;
  132. 2.create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint)
  133. partitioned by(datestr string);
  134. 3.insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
  135. select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a
  136. where a.datestr='20130918' group by a.month,a.day,a.hour;
  137. 2. Compute daily pvs
  138. 1.drop table dw_pvs_everyday;
  139. 2.create table dw_pvs_everyday(pvs bigint,month string,day string);
  140. 3.insert into table dw_pvs_everyday
  141. select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a
  142. group by a.month,a.day;
  143. 3. Option two: join against the time dimension table
  144. 1. Dimension: day
  145. 1.drop table dw_pvs_everyday;
  146. 2.create table dw_pvs_everyday(pvs bigint,month string,day string);
  147. 3.insert into table dw_pvs_everyday
  148. select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a
  149. join ods_weblog_detail b
  150. on a.month=b.month and a.day=b.day
  151. group by a.month,a.day;
  152. 2. Dimension: month
  153. 1.drop table dw_pvs_everymonth;
  154. 2.create table dw_pvs_everymonth (pvs bigint,month string);
  155. 3.insert into table dw_pvs_everymonth
  156. 4.select count(*) as pvs,a.month from (select distinct month from t_dim_time) a
  157. join ods_weblog_detail b on a.month=b.month group by a.month;
  158. 3. Alternatively, reuse earlier results, e.g. derive each day's total from the hourly results already computed
  159. insert into table dw_pvs_everyday
  160. select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';
  161. 2. By terminal dimension
  162. 1. The field in the data that reflects the user's terminal is http_user_agent.
  163. 2. User Agent is abbreviated UA.
  164. 1. It is a special header string that identifies to the visited site the browser type and version, operating system and version, browser engine, and so on.
  165. 2. For example: User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
  166. Chrome/58.0.3029.276 Safari/537.36
  167. 3. From the UA above one can extract:
  168. chrome 58.0: browser chrome, browser version 58.0, platform windows, engine webkit
  169. 3. A tentative count can be made with the statement below, though its accuracy is limited.
  170. select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Chrome%' limit 200;
  171. 3. By site-section dimension
  172. A site section can be understood as a topical grouping of related content on the site.
  173. In urls this shows up as different sections having different second-level directories.
  174. For example, for a site at www.xxxx.cn, its sections might be reached as:
  175. section: ../job
  176. section: ../news
  177. section: ../sports
  178. section: ../technology
  179. The visited section can thus be parsed from the requested url and statistics grouped by section (a hedged sketch is given after this list).
  180. 4. By referer dimension
  181. 1. Count the pv produced by each referring url per hour
  182. 1.drop table dw_pvs_referer_everyhour;
  183. 2.create table dw_pvs_referer_everyhour(
  184. referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint)
  185. partitioned by(datestr string);
  186. 3.insert into table dw_pvs_referer_everyhour partition(datestr='20130918')
  187. select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
  188. from ods_weblog_detail
  189. group by http_referer,ref_host,month,day,hour
  190. having ref_host is not null
  191. order by hour asc,day asc,month asc,pv_referer_cnt desc;
  192. 2. Count and rank the pv produced by each referring host per hour
  193. 1.drop table dw_pvs_refererhost_everyhour;
  194. 2.create table dw_pvs_refererhost_everyhour(
  195. ref_host string,month string,day string,hour string,ref_host_cnts bigint)
  196. partitioned by(datestr string);
  197. 3.insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')
  198. select ref_host,month,day,hour,count(1) as ref_host_cnts
  199. from ods_weblog_detail
  200. group by ref_host,month,day,hour
  201. having ref_host is not null
  202. order by hour asc,day asc,month asc,ref_host_cnts desc;
  203. 3. Note: the same can also be computed by visitor region, terminal, and other dimensions
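  Note: the original gives no query for the site-section dimension above. A minimal sketch, assuming the section is the first path segment of the requested url; the regexp pattern and the alias "section" are illustrative assumptions, not part of the original project:
  -- PV per site section, parsed from the first path segment of request
  select regexp_extract(request, '^/([^/]+)', 1) as section,
  count(*) as pvs
  from ods_weblog_detail
  where datestr='20130918'
  group by regexp_extract(request, '^/([^/]+)', 1);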

  1. 2. Page views per visitor
  2. 1. Requirement: compute the average number of pages requested per visitor today.
  3. 2. Page views per visitor, also called average pages browsed per person, indicates how sticky the site is for users.
  4. It represents the average number of pages a user browses within a given period.
  5. Formula: total page requests / deduplicated visitor count
  6. remote_addr stands in for distinct users.
  7. First compute the pv count per distinct remote_addr, then sum all pvs as the total page requests, and count the remote_addr values as the deduplicated visitor total.
  8. 3. Total page requests / deduplicated visitor count
  9. 1.drop table dw_avgpv_user_everyday;
  10. 2.create table dw_avgpv_user_everyday(day string, avgpv string);
  11. 3.insert into table dw_avgpv_user_everyday
  12. select '20130918',sum(b.pvs)/count(b.remote_addr) from
  13. (select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;
  14. 3. Top-N sources by total pv (grouped TOP)
  15. 1. Requirement: per hour, find the top N referring hosts by pvs.
  16. 2. The row_number() function
  17. 1. Syntax: row_number() over (partition by xxx order by xxx) rank.
  18. 2. rank is an alias for the numbering, effectively adding a new column named rank.
  19. 3. partition by defines the grouping, e.g. partition by the sex field
  20. 4. order by sorts within each group, e.g. partition by sex and order by age within each group
  21. 5. After sorting, each record within a group is numbered starting from 1
  22. 6. To select rows within a group, use something like "where tablename.rank > x"
  23. 3. The statement below numbers the referring hosts within each hour in descending order of their counts:
  24. select ref_host,ref_host_cnts,concat(month,day,hour),
  25. row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
  26. from dw_pvs_refererhost_everyhour;
  27. 4. Sample output: see the row_number output table in the Grouped TopN section later in this article.

  1. 2. Received-visit analysis (from the page perspective)
  2. 1. Per-page visit statistics
  3. Mainly statistics over the request field in the data, e.g. per-page PV, per-page UV, and so on.
  4. These metrics all come down to a group by on the page field
  5. Example: compute per-page pv
  6. select request as request,count(request) as request_counts from ods_weblog_detail
  7. group by request having request is not null order by request_counts desc limit 20;
  8. 2. Hot-page statistics
  9. Compute the top 10 hottest pages per day
  10. 1.drop table dw_hotpages_everyday;
  11. 2.create table dw_hotpages_everyday(day string,url string,pvs string);
  12. 3.insert into table dw_hotpages_everyday
  13. select '20130918',a.request,a.request_counts from
  14. (
  15. select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918'
  16. group by request having request is not null
  17. ) a order by a.request_counts desc limit 10;
  18. 3. Visitor analysis
  19. 1. Unique visitors
  20. 1. Requirement: count unique visitors and the pv they produce by time dimension, e.g. per hour.
  21. 2. If the raw logs carried a user identifier, unique visitors could be identified from it directly;
  22. here, since the raw logs carry no user identifier, the visitor IP is used as a stand-in; the technique is the same, just less precise.
  23. 3. Time dimension: hour
  24. 1.drop table dw_user_dstc_ip_h;
  25. 2.create table dw_user_dstc_ip_h(remote_addr string, pvs bigint, hour string);
  26. 3.insert into table dw_user_dstc_ip_h
  27. select remote_addr,count(1) as pvs,concat(month,day,hour) as hour from ods_weblog_detail
  28. Where datestr='20130918'
  29. group by concat(month,day,hour),remote_addr;
  30. 4. On top of this result table, further statistics can be derived, e.g. the total number of unique visitors per hour:
  31. select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;
  32. 4. Time dimension: day
  33. select remote_addr,count(1) as counts,concat(month,day) as day
  34. from ods_weblog_detail
  35. Where datestr='20130918'
  36. group by concat(month,day),remote_addr;
  37. 5. Time dimension: month
  38. select remote_addr,count(1) as counts,month
  39. from ods_weblog_detail
  40. group by month,remote_addr;
  41. 4. New visitors per day
  42. 1. Requirement: identify each day's new visitors.
  43. 2. Approach: maintain a cumulative table of deduplicated visitors, then compare each day's visitors against it.

  1. 3. Historical deduplicated-visitor cumulative table
  2. 1.drop table dw_user_dsct_history;
  3. 2.create table dw_user_dsct_history(day string, ip string)
  4. partitioned by(datestr string);
  5. 4. Daily new-visitor table
  6. 1.drop table dw_user_new_d;
  7. 2.create table dw_user_new_d (day string, ip string)
  8. partitioned by(datestr string);
  9. 5. Insert each day's new visitors into the new-visitor table
  10. 1.insert into table dw_user_new_d partition(datestr='20130918')
  11. select tmp.day as day,tmp.today_addr as new_ip
  12. from(
  13. select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
  14. from (
  15. select distinct remote_addr as remote_addr,"20130918" as day
  16. from ods_weblog_detail where datestr="20130918"
  17. ) today left outer join dw_user_dsct_history old on today.remote_addr=old.ip
  18. ) tmp
  19. where tmp.old_addr is null;
  20. 6. Append each day's new visitors to the cumulative table
  21. insert into table dw_user_dsct_history partition(datestr='20130918')
  22. select day,ip from dw_user_new_d where datestr='20130918';
  23. 7. Verification:
  24. select count(distinct remote_addr) from ods_weblog_detail;
  25. select count(1) from dw_user_dsct_history where datestr='20130918';
  26. select count(1) from dw_user_new_d where datestr='20130918';
  27. 8. Note: the same can also be computed by visitor region, terminal, and other dimensions
  28. 5. Visitor visit analysis (clickstream model)
  29. 1. Returning / single-visit visitor statistics
  30. 1. Requirement: query all of today's returning visitors and their visit counts.

  1. 2. Approach: visitors in the table above with visit count > 1 are returning visitors; the rest are single-visit visitors.
  2. 1.drop table dw_user_returning;
  3. 2.create table dw_user_returning(day string, remote_addr string, acc_cnt string)
  4. partitioned by (datestr string);
  5. 3.insert overwrite table dw_user_returning partition(datestr='20130918')
  6. select tmp.day,tmp.remote_addr,tmp.acc_cnt
  7. from (select '20130918' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit group by remote_addr) tmp
  8. where tmp.acc_cnt > 1;
  9. 2. Average visit frequency per user
  10. 1. Requirement: compute the average number of visits per user to the site per day
  11. 2. Total visits / deduplicated user count
  12. select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918';
  13. 6. Key-path conversion analysis (funnel model)
  14. 1. Requirement analysis
  15. Conversion: in a specified business flow, the number of people completing each step and the percentage relative to the previous step (for example, if 1000 visitors complete step 1 and 300 of them complete step 2, step 2's conversion relative to step 1 is 30%).

  1. 2. Model design
  2. Define the page identifiers for the business flow; the steps in the example below are:
  3. Step1/item
  4. Step2/category
  5. Step3/index
  6. Step4/order
  7. 3. Implementation
  8. 1. Query the total visitor count of each step; store the per-step counts in dw_oute_numbs
  9. 1.create table dw_oute_numbs as
  10. select 'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
  11. where datestr='20130920' and request like '/item%'
  12. union
  13. select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
  14. where datestr='20130920' and request like '/category%'
  15. union
  16. select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
  17. where datestr='20130920' and request like '/order%'
  18. union
  19. select 'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
  20. where datestr='20130920' and request like '/index%';
  21. Note: UNION merges the result sets of multiple SELECT statements into a single result set.
  22. 2. Query each step's ratio of visitors relative to the start of the path
  23. Idea: a cascading query using a self join
  24. 1. Join dw_oute_numbs with itself
  25. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  26. from dw_oute_numbs rn inner join dw_oute_numbs rr;
  27. 2. Each step's count / the first step's count == each step's ratio relative to the starting point
  28. select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
  29. from (
  30. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  31. from dw_oute_numbs rn inner join dw_oute_numbs rr
  32. ) tmp where tmp.rrstep='step1';
  33. 3. Query each step's leakage rate relative to the previous step: self-join the table to pair each step with the one before it
  34. 1.select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  35. from dw_oute_numbs rn inner join dw_oute_numbs rr
  36. where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;
  37. 2.select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
  38. from (
  39. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  40. from dw_oute_numbs rn inner join dw_oute_numbs rr
  41. ) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;
  42. 4. Combine the two metrics above
  43. select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
  44. from (
  45. select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
  46. from (
  47. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  48. from dw_oute_numbs rn inner join dw_oute_numbs rr
  49. ) tmp where tmp.rrstep='step1'
  50. )
  51. abs left outer join
  52. (
  53. select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
  54. from (
  55. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  56. from dw_oute_numbs rn inner join dw_oute_numbs rr
  57. ) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
  58. ) rel on abs.step=rel.step;

 


Website Traffic Log Analysis -- Module Development -- ETL -- Creating ODS-Layer Tables

  1. 1. Time sync command: ntpdate ntp6.aliyun.com
  2. 2. Start Hive with the MySQL metastore; launch hive from its local path
  3. 1. Local connection:
  4. cd /root/hive/bin
  5. ./hive
  6. 2. Connecting from another Linux host to Hive on this one (note: the hiveserver2 server must be started first when using an external connection)
  7. 1. Start hiveserver2 in the background:
  8. cd /root/hive/bin
  9. nohup ./hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err &
  10. This returns the hiveserver2 server's process id
  11. 2. Connect from the external Linux host to Hive on this host
  12. cd /root/hive/bin
  13. ./beeline -u jdbc:hive2://NODE1:10000 -n root
  14. Then enter the username and password for the Linux host NODE1
  15. 3. Local mode:
  16. -- set local mode (current machine only) for executing queries; otherwise queries run on the yarn cluster (multiple machines)
  17. -- local mode is recommended only in development to speed up queries; in production, switch back to yarn cluster mode
  18. set hive.exec.mode.local.auto=true;
  19. 4. Create the database:
  20. create database itheima;
  21. use itheima;
  22. 5. Create tables:
  23. 1. Raw data table: holds the data after MR cleaning, not the raw log data
  24. 1.drop table if exists ods_weblog_origin;
  25. 2.create table ods_weblog_origin(
  26. valid string,
  27. remote_addr string,
  28. remote_user string,
  29. time_local string,
  30. request string,
  31. status string,
  32. body_bytes_sent string,
  33. http_referer string,
  34. http_user_agent string)
  35. partitioned by (datestr string)
  36. row format delimited fields terminated by '\001';

  1. 2. Clickstream pageview table
  2. 1.drop table if exists ods_click_pageviews;
  3. 2.create table ods_click_pageviews(
  4. session string,
  5. remote_addr string,
  6. remote_user string,
  7. time_local string,
  8. request string,
  9. visit_step string,
  10. page_staylong string,
  11. http_referer string,
  12. http_user_agent string,
  13. body_bytes_sent string,
  14. status string)
  15. partitioned by (datestr string)
  16. row format delimited fields terminated by '\001';

  1. 3. Clickstream visit table
  2. 1.drop table if exists ods_click_stream_visit;
  3. 2.create table ods_click_stream_visit(
  4. session string,
  5. remote_addr string,
  6. inTime string,
  7. outTime string,
  8. inPage string,
  9. outPage string,
  10. referal string,
  11. pageVisits int)
  12. partitioned by (datestr string)
  13. row format delimited fields terminated by '\001';

  1. 4. Dimension table example (a generation sketch follows this list):
  2. 1.drop table if exists t_dim_time;
  3. 2.create table t_dim_time(date_key int,year string,month string,day string,hour string) row format delimited fields terminated by ',';
  4. 5.show tables;
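  As noted in the first half, dimension data is normally pre-generated by script. For reference, a minimal in-Hive sketch that fills t_dim_time with hourly rows; it assumes Hive 0.13+ (posexplode, FROM-less subqueries), and the 30-day span starting 2013-09-01 and the yyyyMMddHH date_key encoding are illustrative assumptions, not the project's actual dim_time_dat.txt:
  -- generate 30 days x 24 hours of time-dimension rows
  insert overwrite table t_dim_time
  select cast(concat(regexp_replace(dt, '-', ''), lpad(h.hpos, 2, '0')) as int) as date_key,
  substr(dt, 1, 4) as year,
  substr(dt, 6, 2) as month,
  substr(dt, 9, 2) as day,
  lpad(h.hpos, 2, '0') as hour
  from (
  select date_add('2013-09-01', d.dpos) as dt
  from (select 1 as dummy) x
  lateral view posexplode(split(space(29), ' ')) d as dpos, dval
  ) days
  lateral view posexplode(split(space(23), ' ')) h as hpos, hval;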


Website Traffic Log Analysis -- Module Development -- ETL -- Loading ODS-Layer Data

  1. 1. Create the target directories in hdfs to hold the data files
  2. hdfs dfs -mkdir -p /weblog/preprocessed
  3. hdfs dfs -mkdir -p /weblog/clickstream/pageviews
  4. hdfs dfs -mkdir -p /weblog/clickstream/visits
  5. hdfs dfs -mkdir -p /weblog/dim_time
  6. 2. View the hdfs file system in a browser: 192.168.25.100:50070

  1. 3. First upload the data files to the designated locations
  2. hdfs dfs -put /root/hivedata/weblog/output/part-m-00000 /weblog/preprocessed
  3. hdfs dfs -put /root/hivedata/weblog/pageviews/part-r-00000 /weblog/clickstream/pageviews
  4. hdfs dfs -put /root/hivedata/weblog/visitout/part-r-00000 /weblog/clickstream/visits
  5. hdfs dfs -put /root/hivedata/weblog/dim_time_dat.txt /weblog/dim_time

  1. 4. Load the data files from the hdfs paths into the Hive tables (in production this would be scripted; a parameterized sketch follows below):
  2. 1. Load the cleaned data into the source table ods_weblog_origin
  3. load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918');
  4. show partitions ods_weblog_origin; -- returns datestr=20130918
  5. select count(*) from ods_weblog_origin; -- returns 13770
  6. 2. Load the clickstream pageviews data into ods_click_pageviews
  7. load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='20130918');
  8. select count(*) from ods_click_pageviews; -- returns 76
  9. 3. Load the clickstream visit data into ods_click_stream_visit
  10. load data inpath '/weblog/clickstream/visits' overwrite into table ods_click_stream_visit partition(datestr='20130918');
  11. select count(*) from ods_click_stream_visit; -- returns 57
  12. 4. Load dim_time_dat.txt into the time dimension table
  13. load data inpath '/weblog/dim_time' overwrite into table t_dim_time;
  14. select count(*) from t_dim_time; -- returns 29
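  As stated in the first half, in production these load statements belong in a script scheduled by Azkaban to run after preprocessing finishes. A minimal sketch using Hive variable substitution; the file name load_ods.hql is an illustrative assumption:
  -- load_ods.hql, invoked as: hive -f load_ods.hql --hivevar datestr=20130918
  load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='${hivevar:datestr}');
  load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='${hivevar:datestr}');
  load data inpath '/weblog/clickstream/visits' overwrite into table ods_click_stream_visit partition(datestr='${hivevar:datestr}');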

Website Traffic Log Analysis -- Module Development -- ETL -- ODS Detail Wide Table

  1. 1. Create the detail wide table ods_weblog_detail
  2. 1.drop table ods_weblog_detail;
  3. 2.create table ods_weblog_detail(
  4. valid string, -- valid flag
  5. remote_addr string, -- source IP
  6. remote_user string, -- user identifier
  7. time_local string, -- full access time
  8. daystr string, -- access date
  9. timestr string, -- access time
  10. month string, -- access month
  11. day string, -- access day
  12. hour string, -- access hour
  13. request string, -- requested url
  14. status string, -- response code
  15. body_bytes_sent string, -- bytes transferred
  16. http_referer string, -- referer url
  17. ref_host string, -- referer host
  18. ref_path string, -- referer path
  19. ref_query string, -- referer query string
  20. ref_query_id string, -- referer query value
  21. http_user_agent string -- client user agent
  22. )partitioned by(datestr string);
  23. 2. Extract refer_url into the intermediate table t_ods_tmp_referurl
  24. 1.drop table if exists t_ods_tmp_referurl;
  25. 2.create table t_ods_tmp_referurl as
  26. SELECT a.*,b.*
  27. FROM ods_weblog_origin a
  28. LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id;
  29. 3. Explanation:
  30. regexp_replace(field, "\"", ""): replaces double quotes with the empty string
  31. parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id'): splits the referring url into four columns: host, path, query, query_id
  32. 3. Create the intermediate detail table t_ods_tmp_detail, extracting and transforming the time_local field into it
  33. 1.drop table if exists t_ods_tmp_detail;
  34. 2.create table t_ods_tmp_detail as
  35. select b.*,substring(time_local,0,10) as daystr,
  36. substring(time_local,12) as tmstr,
  37. substring(time_local,6,2) as month,
  38. substring(time_local,9,2) as day,
  39. substring(time_local,11,3) as hour
  40. From t_ods_tmp_referurl b;
  41. 4. Insert the queried data into the detail wide table ods_weblog_detail
  42. insert into table ods_weblog_detail partition(datestr='20130918')
  43. select c.valid,c.remote_addr,c.remote_user,c.time_local,
  44. substring(c.time_local,0,10) as daystr,
  45. substring(c.time_local,12) as tmstr,
  46. substring(c.time_local,6,2) as month,
  47. substring(c.time_local,9,2) as day,
  48. substring(c.time_local,11,3) as hour,
  49. c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
  50. from
  51. (SELECT a.valid,a.remote_addr,a.remote_user,a.time_local,
  52. a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id
  53. FROM ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b
  54. as ref_host, ref_path, ref_query, ref_query_id) c;

 Website Traffic Log Analysis -- Module Development -- Statistical Analysis -- PVs by Time & Referer Dimension 

  1. 1. Traffic analysis
  2. 1. Compute hourly pvs; note the group by syntax
  3. select count(*) as pvs,month,day,hour from ods_weblog_detail group by month,day,hour;

  1. 2. Multi-dimensional total PV
  2. 1. Option one: query the single table ods_weblog_detail directly
  3. 1. Compute the hourly pvs within this processing batch (one day)
  4. 1.drop table dw_pvs_everyhour_oneday;
  5. 2.create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);
  6. 3.insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
  7. select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a
  where a.datestr='20130918' group by a.month,a.day,a.hour;
  8. 2. Compute daily pvs
  9. 1.drop table dw_pvs_everyday;
  10. 2.create table dw_pvs_everyday(pvs bigint,month string,day string);
  11. 3.insert into table dw_pvs_everyday
  12. select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a
  13. group by a.month,a.day;

  1. 2. Option two: join against the time dimension table
  2. 1. Dimension: day
  3. 1.drop table dw_pvs_everyday;
  4. 2.create table dw_pvs_everyday(pvs bigint,month string,day string);
  5. 3.insert into table dw_pvs_everyday
  6. select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a
  7. join ods_weblog_detail b
  8. on a.month=b.month and a.day=b.day
  9. group by a.month,a.day;

  1. 2. Dimension: month
  2. 1.drop table dw_pvs_everymonth;
  3. 2.create table dw_pvs_everymonth (pvs bigint,month string);
  4. 3.insert into table dw_pvs_everymonth
  5. select count(*) as pvs,a.month from (select distinct month from t_dim_time) a
  6. join ods_weblog_detail b on a.month=b.month group by a.month;

  1. 3. Alternatively, reuse earlier results, e.g. derive each day's total from the hourly results already computed
  2. insert into table dw_pvs_everyday
  3. select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';


 Website Traffic Log Analysis -- Module Development -- Statistical Analysis -- Other Dimensions & Page Views per Visitor

  1. 1. PV by referer dimension
  2. 1. Count the pv produced by each referring url per hour; store the result in "dw_pvs_referer_everyhour"
  3. 1.drop table dw_pvs_referer_everyhour;
  4. 2.create table dw_pvs_referer_everyhour(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);
  5. 3.insert into table dw_pvs_referer_everyhour partition(datestr='20130918')
  6. select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
  7. from ods_weblog_detail
  8. group by http_referer,ref_host,month,day,hour
  9. having ref_host is not null
  10. order by hour asc,day asc,month asc,pv_referer_cnt desc;
  11. +-------------------------------------------------------------+----------------------------------------+---------------------------------+-------------------------------+--------------------------------+------------------------------------------+-----------------------------------+--+
  12. | dw_pvs_referer_everyhour.referer_url | dw_pvs_referer_everyhour.referer_host | dw_pvs_referer_everyhour.month | dw_pvs_referer_everyhour.day | dw_pvs_referer_everyhour.hour | dw_pvs_referer_everyhour.pv_referer_cnt | dw_pvs_referer_everyhour.datestr |
  13. +-------------------------------------------------------------+----------------------------------------+---------------------------------+-------------------------------+--------------------------------+------------------------------------------+-----------------------------------+--+
  14. | "http://blog.fens.me/r-density/" | blog.fens.me | 09 | 19 | 00 | 26 | 20130918 |
  15. | "http://blog.fens.me/r-json-rjson/" | blog.fens.me | 09 | 19 | 00 | 21 | 20130918 |
  16. | "http://blog.fens.me/vpn-pptp-client-ubuntu/" | blog.fens.me | 09 | 19 | 00 | 20 | 20130918 |
  17. | "http://blog.fens.me/hadoop-mahout-roadmap/" | blog.fens.me | 09 | 19 | 00 | 20 | 20130918 |
  18. | "http://blog.fens.me/hadoop-zookeeper-intro/" | blog.fens.me | 09 | 19 | 00 | 20 | 20130918 |
  19. | "http://www.fens.me/" | www.fens.me | 09 | 19 | 00 | 12 | 20130918 |
  20. | "http://h2w.iask.cn/jump.php?url=http%3A%2F%2Fwww.fens.me" | h2w.iask.cn | 09 | 19 | 00 | 5 | 20130918 |
  21. | "https://www.google.com.hk/" | www.google.com.hk | 09 | 19 | 00 | 3 | 20130918 |
  22. | "http://angularjs.cn/A0eQ" | angularjs.cn | 09 | 19 | 00 | 2 | 20130918 |
  23. | "http://blog.fens.me/about/" | blog.fens.me | 09 | 19 | 00 | 2 | 20130918 |
  24. +-------------------------------------------------------------+----------------------------------------+---------------------------------+-------------------------------+--------------------------------+------------------------------------------+-----------------------------------+--+
  25. 2. Count and rank the pv produced by each referring host per hour
  26. 1.drop table dw_pvs_refererhost_everyhour;
  27. 2.create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);
  28. 3.insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')
  29. select ref_host,month,day,hour,count(1) as ref_host_cnts
  30. from ods_weblog_detail
  31. group by ref_host,month,day,hour
  32. having ref_host is not null
  33. order by hour asc,day asc,month asc,ref_host_cnts desc;
  34. +----------------------------------------+-------------------------------------+-----------------------------------+------------------------------------+---------------------------------------------+---------------------------------------+--+
  35. | dw_pvs_refererhost_everyhour.ref_host | dw_pvs_refererhost_everyhour.month | dw_pvs_refererhost_everyhour.day | dw_pvs_refererhost_everyhour.hour | dw_pvs_refererhost_everyhour.ref_host_cnts | dw_pvs_refererhost_everyhour.datestr |
  36. +----------------------------------------+-------------------------------------+-----------------------------------+------------------------------------+---------------------------------------------+---------------------------------------+--+
  37. | blog.fens.me | 09 | 19 | 00 | 111 | 20130918 |
  38. | www.fens.me | 09 | 19 | 00 | 13 | 20130918 |
  39. | h2w.iask.cn | 09 | 19 | 00 | 6 | 20130918 |
  40. | www.google.com.hk | 09 | 19 | 00 | 3 | 20130918 |
  41. | angularjs.cn | 09 | 19 | 00 | 3 | 20130918 |
  42. | cnodejs.org | 09 | 19 | 00 | 1 | 20130918 |
  43. | www.leonarding.com | 09 | 19 | 00 | 1 | 20130918 |
  44. | www.itpub.net | 09 | 19 | 00 | 1 | 20130918 |
  45. | blog.fens.me | 09 | 19 | 01 | 89 | 20130918 |
  46. | cos.name | 09 | 19 | 01 | 3 | 20130918 |
  47. +----------------------------------------+-------------------------------------+-----------------------------------+------------------------------------+---------------------------------------------+---------------------------------------+--+

 Website Traffic Log Analysis -- Module Development -- Statistical Analysis -- Grouped TopN (row_number)

  1. 1. Top-N sources by total pv
  2. 1. Requirement: by time dimension, find the top-N referers producing the most pvs in each hour of a day
  3. 2. The row_number function
  4. select ref_host,ref_host_cnts,concat(month,day,hour),
  5. row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
  6. from dw_pvs_refererhost_everyhour;
  7. +-------------------------+----------------+----------+-----+--+
  8. | ref_host | ref_host_cnts | _c2 | od |
  9. +-------------------------+----------------+----------+-----+--+
  10. | blog.fens.me | 68 | 0918 06 | 1 |
  11. | www.angularjs.cn | 3 | 0918 06 | 2 |
  12. | www.google.com | 2 | 0918 06 | 3 |
  13. | www.baidu.com | 1 | 0918 06 | 4 |
  14. | cos.name | 1 | 0918 06 | 5 |
  15. | blog.fens.me | 711 | 0918 07 | 1 |
  16. | www.google.com.hk | 20 | 0918 07 | 2 |
  17. | www.angularjs.cn | 20 | 0918 07 | 3 |
  18. | www.dataguru.cn | 10 | 0918 07 | 4 |
  19. 3. Putting the above together:
  20. 1.drop table dw_pvs_refhost_topn_everyhour;
  21. 2.create table dw_pvs_refhost_topn_everyhour(
  22. hour string,
  23. toporder string,
  24. ref_host string,
  25. ref_host_cnts string
  26. )partitioned by(datestr string);
  27. 3.insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20130918')
  28. select t.hour,t.od,t.ref_host,t.ref_host_cnts from
  29. (select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
  30. row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
  31. from dw_pvs_refererhost_everyhour) t where od<=3;
  32. +-------------------------------------+-----------------------------------------+-----------------------------------------+----------------------------------------------+----------------------------------------+--+
  33. | dw_pvs_refhost_topn_everyhour.hour | dw_pvs_refhost_topn_everyhour.toporder | dw_pvs_refhost_topn_everyhour.ref_host | dw_pvs_refhost_topn_everyhour.ref_host_cnts | dw_pvs_refhost_topn_everyhour.datestr |
  34. +-------------------------------------+-----------------------------------------+-----------------------------------------+----------------------------------------------+----------------------------------------+--+
  35. | 0918 06 | 1 | blog.fens.me | 68 | 20130918 |
  36. | 0918 06 | 2 | www.angularjs.cn | 3 | 20130918 |
  37. | 0918 06 | 3 | www.google.com | 2 | 20130918 |
  38. | 0918 07 | 1 | blog.fens.me | 711 | 20130918 |
  39. | 0918 07 | 2 | www.google.com.hk | 20 | 20130918 |
  40. | 0918 07 | 3 | www.angularjs.cn | 20 | 20130918 |
  41. | 0918 08 | 1 | blog.fens.me | 1556 | 20130918 |
  42. | 0918 08 | 2 | www.fens.me | 26 | 20130918 |
  43. | 0918 08 | 3 | www.baidu.com | 15 | 20130918 |
  44. | 0918 09 | 1 | blog.fens.me | 1047 | 20130918 |
  45. +-------------------------------------+-----------------------------------------+-----------------------------------------+----------------------------------------------+----------------------------------------+--+
  46. 2. Page views per visitor
  47. 1. Requirement: compute the average number of pages requested per visitor today.
  48. 2. Total page requests / deduplicated visitor count
  49. 1.drop table dw_avgpv_user_everyday;
  50. 2.create table dw_avgpv_user_everyday(
  51. day string,
  52. avgpv string);
  53. 3.insert into table dw_avgpv_user_everyday
  54. select '20130918',sum(b.pvs)/count(b.remote_addr) from
  55. (select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;


Per-Page Visit Statistics

  1. Per-page PV (a per-page UV sketch follows this list)
  2. select request as request,count(request) as request_counts from
  3. ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;
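  The first half also mentions per-page UV. A minimal sketch along the same lines, using remote_addr as the visitor identifier as elsewhere in this project:
  -- per-page UV: distinct visitors per requested url
  select request, count(distinct remote_addr) as uv
  from ods_weblog_detail
  where datestr='20130918' and request is not null
  group by request
  order by uv desc limit 20;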


Website Traffic Log Analysis -- Module Development -- Received-Visit Analysis -- Hot Pages

  1. Hot-page statistics
  2. Compute the top 10 hottest pages per day
  3. 1.drop table dw_hotpages_everyday;
  4. 2.create table dw_hotpages_everyday(day string,url string,pvs string);
  5. 3.insert into table dw_hotpages_everyday
  6. select '20130918',a.request,a.request_counts from
  7. (select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918' group by request having request is not null) a
  8. order by a.request_counts desc limit 10;


 Website Traffic Log Analysis -- Module Development -- Visitor Analysis -- Unique Visitors & New Visitors

  1. 1. Unique visitors
  2. 1. Requirement: count unique visitors and the pv they produce, by time dimension
  3. 2. Time dimension: hour
  4. 1.drop table dw_user_dstc_ip_h;
  5. 2.create table dw_user_dstc_ip_h(
  6. remote_addr string,
  7. pvs bigint,
  8. hour string);
  9. 3.insert into table dw_user_dstc_ip_h
  10. select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
  11. from ods_weblog_detail
  12. where datestr='20130918'
  13. group by concat(month,day,hour),remote_addr;

  1. 3. On this basis further analysis is possible, e.g. the total number of unique visitors per hour
  2. select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;
  3. +---------------+----------+--+
  4. | dstc_ip_cnts | hour |
  5. +---------------+----------+--+
  6. | 19 | 0918 06 |
  7. | 98 | 0918 07 |
  8. | 129 | 0918 08 |
  9. | 149 | 0918 09 |
  10. | 107 | 0918 10 |
  11. | 54 | 0918 11 |
  12. | 52 | 0918 12 |
  13. | 71 | 0918 13 |
  14. | 62 | 0918 14 |
  15. | 72 | 0918 15 |
  16. | 93 | 0918 16 |
  17. | 55 | 0918 17 |
  18. 4. Time dimension: day
  19. select remote_addr,count(1) as counts,concat(month,day) as day
  20. from ods_weblog_detail
  21. where datestr='20130918'
  22. group by concat(month,day),remote_addr;
  23. +------------------+---------+-------+--+
  24. | remote_addr | counts | day |
  25. +------------------+---------+-------+--+
  26. | 1.162.203.134 | 1 | 0918 |
  27. | 1.202.186.37 | 28 | 0918 |
  28. | 1.202.222.147 | 1 | 0918 |
  29. | 1.202.70.78 | 1 | 0918 |
  30. | 1.206.126.5 | 1 | 0918 |
  31. | 1.34.23.44 | 1 | 0918 |
  32. | 1.80.249.223 | 5 | 0918 |
  33. | 1.82.139.173 | 24 | 0918 |
  34. | 101.226.102.97 | 1 | 0918 |
  35. | 101.226.166.214 | 1 | 0918 |
  36. | 101.226.166.216 | 1 | 0918 |
  37. | 101.226.166.222 | 1 | 0918 |
  38. | 101.226.166.235 | 2 | 0918 |
  39. | 101.226.166.236 | 1 | 0918 |
  40. | 101.226.166.237 | 2 | 0918 |
  41. 5. Time dimension: month
  42. select remote_addr,count(1) as counts,month
  43. from ods_weblog_detail
  44. group by month,remote_addr;
  45. +------------------+---------+--------+--+
  46. | remote_addr | counts | month |
  47. +------------------+---------+--------+--+
  48. | 1.162.203.134 | 1 | 09 |
  49. | 1.202.186.37 | 35 | 09 |
  50. | 1.202.222.147 | 1 | 09 |
  51. | 1.202.70.78 | 1 | 09 |
  52. | 1.206.126.5 | 34 | 09 |
  53. | 1.34.23.44 | 1 | 09 |
  54. | 1.80.245.79 | 1 | 09 |
  55. | 1.80.249.223 | 5 | 09 |
  56. | 1.82.139.173 | 24 | 09 |
  57. | 101.226.102.97 | 1 | 09 |
  58. | 101.226.166.214 | 1 | 09 |
  59. 2. New visitors per day
  60. 1. Requirement: identify each day's new visitors.
  61. 2. Historical deduplicated-visitor cumulative table
  62. 1.drop table dw_user_dsct_history;
  63. 2.create table dw_user_dsct_history(
  64. day string,
  65. ip string
  66. ) partitioned by(datestr string);
  67. 3. Daily new-visitor table
  68. 1.drop table dw_user_new_d;
  69. 2.create table dw_user_new_d (
  70. day string,
  71. ip string
  72. ) partitioned by(datestr string);
  73. 4. Insert each day's new visitors into the new-visitor table
  74. insert into table dw_user_new_d partition(datestr='20130918')
  75. select tmp.day as day,tmp.today_addr as new_ip from
  76. (
  77. select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
  78. from
  79. (select distinct remote_addr as remote_addr,"20130918" as day from ods_weblog_detail where datestr="20130918") today
  80. left outer join
  81. dw_user_dsct_history old
  82. on today.remote_addr=old.ip
  83. ) tmp where tmp.old_addr is null;

  1. 5. Append each day's new visitors to the cumulative table
  2. insert into table dw_user_dsct_history partition(datestr='20130918')
  3. select day,ip from dw_user_new_d where datestr='20130918';

  1. 6. Verify:
  2. select count(distinct remote_addr) from ods_weblog_detail; -- returns 1027
  3. select count(1) from dw_user_dsct_history where datestr='20130918'; -- returns 1027
  4. select count(1) from dw_user_new_d where datestr='20130918'; -- returns 1027

 Website Traffic Log Analysis -- Module Development -- Visitor Analysis -- Returning Visitors & Average Visit Frequency (Clickstream Model)

  1. 1. Returning / single-visit visitor statistics
  2. 1.drop table dw_user_returning;
  3. 2.create table dw_user_returning(
  4. day string,
  5. remote_addr string,
  6. acc_cnt string)
  7. partitioned by (datestr string);
  8. 3.insert overwrite table dw_user_returning partition(datestr='20130918')
  9. select tmp.day,tmp.remote_addr,tmp.acc_cnt
  10. from (select '20130918' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit group by remote_addr) tmp where tmp.acc_cnt>1;
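  The complement, single-visit visitors (acc_cnt = 1), falls out of the same aggregation; a minimal sketch:
  -- single-visit visitors: exactly one session on the day
  select '20130918' as day, remote_addr, count(session) as acc_cnt
  from ods_click_stream_visit
  where datestr='20130918'
  group by remote_addr
  having count(session) = 1;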

  1. 2. Average visit frequency per user
  2. select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918'; -- returns 1.4339622641509433

 Website Traffic Log Analysis -- Module Development -- Conversion Analysis -- Funnel Conversion Rate, Step by Step

  1. 1. Funnel-model raw data click-part-r-00000
  2. 1.hdfs dfs -put /root/hivedata/weblog/click-part-r-00000 /weblog/clickstream/pageviews
  3. 2.load data inpath '/weblog/clickstream/pageviews/click-part-r-00000' overwrite into table ods_click_pageviews partition(datestr='20130920');
  4. 3.select * from ods_click_pageviews where datestr='20130920' limit 10;
  5. +---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+-----------------------------------+---------------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
  6. | ods_click_pageviews.session | ods_click_pageviews.remote_addr | ods_click_pageviews.remote_user | ods_click_pageviews.time_local | ods_click_pageviews.request | ods_click_pageviews.visit_step | ods_click_pageviews.page_staylong | ods_click_pageviews.http_referer | ods_click_pageviews.http_user_agent | ods_click_pageviews.body_bytes_sent | ods_click_pageviews.status | ods_click_pageviews.datestr |
  7. +---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+-----------------------------------+---------------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
  8. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:15:42 | /item/HZxEY8vF | 1 | 340 | /item/qaLW7pa5 | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  9. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:21:22 | /item/IyA5hVop | 2 | 1 | /item/MQtiwwhj | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  10. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:21:23 | /item/RDqibwBo | 3 | 44 | /item/RCbNqxIy | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  11. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:22:07 | /item/IzrJixZc | 4 | 101 | /item/RCbNqxIy | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  12. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:23:48 | /item/yrZqXxfN | 5 | 19 | /item/1Wvc1NeH | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  13. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:24:07 | /item/hWBn8VCg | 6 | 442 | /item/LwOziljH | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  14. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:31:29 | /item/1nQESbrT | 7 | 348 | /item/GFDdR8SR | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  15. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:37:17 | /item/c | 8 | 2 | /category/d | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  16. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:37:19 | /item/a | 9 | 11 | /category/c | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  17. | 47826dd6-be71-42df-96b2-14ff65425975 | | - | 2013-09-20 00:37:30 | /item/X2b5exuV | 10 | 348 | /item/N2Pos96N | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 | 1800 | 200 | 20130920 |
  18. +---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+-----------------------------------+---------------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
  19. 2. Query the total visitor count of each step
  20. create table dw_oute_numbs as
  21. select 'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/item%'
  22. union
  23. select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/category%'
  24. union
  25. select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/order%'
  26. union
  27. select 'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/index%';

  1. 3. Query each step's ratio relative to the start of the path
  2. 1. Cascading query: join the table with itself
  3. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
  4. inner join
  5. dw_oute_numbs rr;
  6. The self-join result looks as follows:
  7. +---------+----------+---------+----------+--+
  8. | rnstep | rnnumbs | rrstep | rrnumbs |
  9. +---------+----------+---------+----------+--+
  10. | step1 | 1029 | step1 | 1029 |
  11. | step2 | 1029 | step1 | 1029 |
  12. | step3 | 1028 | step1 | 1029 |
  13. | step4 | 1018 | step1 | 1029 |
  14. | step1 | 1029 | step2 | 1029 |
  15. | step2 | 1029 | step2 | 1029 |
  16. | step3 | 1028 | step2 | 1029 |
  17. | step4 | 1018 | step2 | 1029 |
  18. | step1 | 1029 | step3 | 1028 |
  19. | step2 | 1029 | step3 | 1028 |
  20. | step3 | 1028 | step3 | 1028 |
  21. | step4 | 1018 | step3 | 1028 |
  22. | step1 | 1029 | step4 | 1018 |
  23. | step2 | 1029 | step4 | 1018 |
  24. | step3 | 1028 | step4 | 1018 |
  25. | step4 | 1018 | step4 | 1018 |
  26. +---------+----------+---------+----------+--+

  1. 2. Each step's count / the first step's count == each step's ratio relative to the starting point
  2. select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
  3. from
  4. (
  5. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
  6. inner join
  7. dw_oute_numbs rr
  8. ) tmp where tmp.rrstep='step1';

  1. 4. Query each step's leakage rate relative to the previous step
  2. 1. First self-join the table to pair each step with the one before it
  3. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
  4. inner join
  5. dw_oute_numbs rr
  6. where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;
  7. +---------+----------+---------+----------+--+
  8. | rnstep | rnnumbs | rrstep | rrnumbs |
  9. +---------+----------+---------+----------+--+
  10. | step1 | 1029 | step2 | 1029 |
  11. | step2 | 1029 | step3 | 1028 |
  12. | step3 | 1028 | step4 | 1018 |
  13. +---------+----------+---------+----------+--+

  1. 2. Each step's leakage rate relative to the previous step then follows directly
  2. select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
  3. from
  4. (
  5. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
  6. inner join
  7. dw_oute_numbs rr
  8. ) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;

  1. 5. Combine the two metrics above
  2. select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
  3. from
  4. (
  5. select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
  6. from
  7. (
  8. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
  9. inner join
  10. dw_oute_numbs rr
  11. ) tmp where tmp.rrstep='step1'
  12. ) abs
  13. left outer join
  14. (
  15. select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
  16. from
  17. (
  18. select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
  19. inner join
  20. dw_oute_numbs rr
  21. ) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
  22. ) rel on abs.step=rel.step;


 Website Traffic Log Analysis -- Module Development -- Conversion Analysis -- Cascading Sum (Running Total)

  1. 1. Create the table
  2. create table t_access_times(username string,month string,salary int)
  3. row format delimited fields terminated by ',';
  4. 2. Load the data
  5. 1.hdfs dfs -put /root/hivedata/weblog/t_access_times.dat /weblog
  6. 2.load data inpath '/weblog/t_access_times.dat' overwrite into table t_access_times;
  7. 3.select * from t_access_times limit 10;

  1. 3. Step 1: compute each user's monthly total
  2. select username,month,sum(salary) as salary from t_access_times group by username,month;
  3. +-----------+----------+---------+--+
  4. | username | month | salary |
  5. +-----------+----------+---------+--+
  6. | A | 2015-01 | 33 |
  7. | A | 2015-02 | 10 |
  8. | B | 2015-01 | 30 |
  9. | B | 2015-02 | 15 |
  10. +-----------+----------+---------+--+

  1. 4. Step 2: self-join the monthly-total table with itself
  2. select A.*,B.* FROM
  3. (select username,month,sum(salary) as salary from t_access_times group by username,month) A
  4. inner join
  5. (select username,month,sum(salary) as salary from t_access_times group by username,month) B
  6. on A.username=B.username
  7. where B.month <= A.month;
  8. +-------------+----------+-----------+-------------+----------+-----------+--+
  9. | a.username | a.month | a.salary | b.username | b.month | b.salary |
  10. +-------------+----------+-----------+-------------+----------+-----------+--+
  11. | A | 2015-01 | 33 | A | 2015-01 | 33 |
  12. | A | 2015-01 | 33 | A | 2015-02 | 10 |
  13. | A | 2015-02 | 10 | A | 2015-01 | 33 |
  14. | A | 2015-02 | 10 | A | 2015-02 | 10 |
  15. | B | 2015-01 | 30 | B | 2015-01 | 30 |
  16. | B | 2015-01 | 30 | B | 2015-02 | 15 |
  17. | B | 2015-02 | 15 | B | 2015-01 | 30 |
  18. | B | 2015-02 | 15 | B | 2015-02 | 15 |
  19. +-------------+----------+-----------+-------------+----------+-----------+--+

  1. 5. Step 3: on the result of the previous step,
  2. group by a.username and a.month,
  3. and compute the monthly running total by summing all b.salary with b.month <= a.month
  4. select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
  5. from
  6. (select username,month,sum(salary) as salary from t_access_times group by username,month) A
  7. inner join
  8. (select username,month,sum(salary) as salary from t_access_times group by username,month) B
  9. on A.username=B.username
  10. where B.month <= A.month
  11. group by A.username,A.month
  12. order by A.username,A.month;
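  On Hive 0.11+ the same running total can also be expressed with a window function instead of the self-join; a minimal equivalent sketch:
  -- running total per user via sum() over a window
  select username, month, salary,
  sum(salary) over (partition by username order by month rows between unbounded preceding and current row) as accumulate
  from (select username, month, sum(salary) as salary from t_access_times group by username, month) t;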

 
