当前位置:   article > 正文

离线数仓(12):DWD层实现之流量日志解析_一条流量日志

一条流量日志

目录

0. 相关文章链接

1. DWD层概述 

1.1. 数仓分层图

1.2. 数仓数据走向图

1.3. 实现目的

2. APP商品点击

2.1. 建表语句

2.2. 执行脚本

2.3. 创建视图

2.4. 结果展示

3. APP商品曝光

3.1. 建表语句

3.2. 执行脚本

3.3. 创建视图

3.4. 结果展示

4. APP启动

4.1. 建表语句

4.2. 执行脚本

4.3. 创建视图

4.4. 结果展示

5. 总结

5.1. 为什么从ODS层读取当天全量数据

5.2. DWD层流量表视图的创建

5.3. 对JSONArray进行炸裂处理


0. 相关文章链接

 离线数仓文章汇总 

1. DWD层概述 

1.1. 数仓分层

1.2. 数仓数据走向图

1.3. 实现目的

DWD层的目的是进行维度建模(尽可能的使用星型模型,减少雪花模型的使用),所以一般情况下一个DWD层的表为一个事实表:

  • 针对流量日志,在ODS层很多流量日志主题会聚合在一起(博主这里ODS层分为9个大主题,例:APP点击、APP曝光、H5所有、路径等),所以到DWD层会将ODS层大主题根据业务分为小主题,比如APP商品点击、APP商品曝光、商品详情页面曝光等;
  • 一般情况下,流量日志使用的是批量上报的方式,所以需要对一些主题的流量日志进行炸裂(例:APP商品曝光等);
  • 流量日志在DWD层为每天一个分区,存储当天的增量数据,并注意要使用压缩;
  • 流量日志在DWD层一个业务主题为一张表;
  • 流量日志在DWD层的粒度为最细(例:一条数据为用户针对一个商品的一次曝光);

2. APP商品点击

注:脚本实现如下所示,关于具体使用和总结可以查看文章最后DWD层流量日志解析总结章节。

2.1. 建表语句

  1. -- 建表语句
  2. create table if not exists ${yishou_data_dbname}.dwd_goods_click_dt (
  3. `user_id` STRING COMMENT '用户id',
  4. `goods_id` STRING COMMENT '商品id',
  5. `is_rec` STRING COMMENT '是否搜索结果推荐商品',
  6. `special_id` STRING COMMENT '专场id',
  7. `os` STRING COMMENT '操作系统',
  8. `goods_no` STRING COMMENT '货号',
  9. `pid` STRING COMMENT '来源',
  10. `ptime` STRING COMMENT '页面停留时间',
  11. `source` STRING COMMENT '专场列表来源',
  12. `report_time` STRING COMMENT '上报时间',
  13. `event_id` STRING COMMENT '事件id',
  14. `search_event_id` STRING COMMENT '搜索事件id',
  15. `keyword` STRING COMMENT '关键词',
  16. `app_version` STRING COMMENT 'APP版本',
  17. `index` STRING COMMENT '商品下标',
  18. `strategy_id` STRING COMMENT '策略id',
  19. `is_default` STRING COMMENT '主被动切换(1主动,0被动)',
  20. `is_operat` STRING COMMENT '是否广告位商品'
  21. )
  22. comment 'APP商品点击分区表'
  23. PARTITIONED BY (`dt` STRING COMMENT '日期分区(yyyymmdd)')
  24. STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dwd_goods_click_dt'
  25. ;

2.2. 执行脚本

  1. -- 写数脚本
  2. insert overwrite table ${yishou_data_dbname}.dwd_goods_click_dt partition(dt)
  3. select
  4. user_id,
  5. coalesce(get_json_object(click_goods, '$.goods_id'), get_json_object(click_goods, '$.good_id')) as goods_id,
  6. get_json_object(click_goods, '$.is_rec') as is_rec,
  7. special_id,
  8. os,
  9. get_json_object(click_goods, '$.goods_no') as goods_no,
  10. pid,
  11. ptime,
  12. source,
  13. time as report_time,
  14. event_id,
  15. search_event_id,
  16. keyword,
  17. app_version,
  18. get_json_object(click_goods, '$.index') as index,
  19. get_json_object(click_goods, '$.strategy_id') as strategy_id,
  20. get_json_object(click_goods, '$.is_default') as is_default,
  21. get_json_object(click_goods, '$.is_operat') as is_operat,
  22. from_unixtime(time - 25200, 'yyyyMMdd') as dt
  23. from (
  24. select
  25. get_json_object(scdata, '$.user_id') as user_id,
  26. get_json_object(scdata, '$.special_id') as special_id,
  27. get_json_object(scdata, '$.os') as os,
  28. get_json_object(scdata, '$.pid') as pid,
  29. get_json_object(scdata, '$.ptime') as ptime,
  30. get_json_object(scdata, '$.source') as source,
  31. __time__ as time,
  32. get_json_object(scdata, '$.event_id') as event_id,
  33. get_json_object(scdata, '$.search_event_id') as search_event_id,
  34. get_json_object(scdata, '$.keyword') as keyword,
  35. get_json_object(scdata, '$.app_version') as app_version,
  36. regexp_replace1(
  37. regexp_replace1(
  38. regexp_replace1(
  39. get_json_object(scdata, '$.click_goods_arr'),
  40. '^\\[',
  41. ''
  42. ),
  43. '\\]$',
  44. ''
  45. ),
  46. '},\\{',
  47. '}|-|{'
  48. ) as click_goods_arr
  49. from
  50. ${yishou_data_dbname}.ods_yishou_log_exposure_dt
  51. where
  52. dt BETWEEN ${one_day_ago} and ${today}
  53. and from_unixtime(__time__ - 25200, 'yyyyMMdd') = ${one_day_ago}
  54. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
  55. )
  56. lateral view outer explode (split(click_goods_arr, '\\|-\\|'))t as click_goods
  57. where
  58. click_goods is not null
  59. and click_goods != ''
  60. DISTRIBUTE by floor(rand() * 10)
  61. ;

2.3. 创建视图

  1. -- 创建视图
  2. drop view if exists ${yishou_data_dbname}.dwd_goods_click;
  3. create view if not exists ${yishou_data_dbname}.dwd_goods_click (
  4. `user_id` COMMENT '用户id',
  5. `goods_id` COMMENT '商品id',
  6. `is_rec` COMMENT '是否搜索结果推荐商品',
  7. `special_id` COMMENT '专场id',
  8. `os` COMMENT '操作系统',
  9. `goods_no` COMMENT '货号',
  10. `pid` COMMENT '来源',
  11. `ptime` COMMENT '页面停留时间',
  12. `source` COMMENT '专场列表来源',
  13. `report_time` COMMENT '上报时间',
  14. `event_id` COMMENT '事件id',
  15. `search_event_id` COMMENT '搜索事件id',
  16. `keyword` COMMENT '关键词',
  17. `app_version` COMMENT 'APP版本',
  18. `index` COMMENT '商品下标',
  19. `strategy_id` COMMENT '策略id',
  20. `is_default` COMMENT '主被动切换(1主动,0被动)',
  21. `is_operat` COMMENT '是否广告位商品',
  22. `dt` COMMENT '日期分区(yyyymmdd)'
  23. ) comment 'APP商品点击视图'
  24. as
  25. select
  26. user_id,
  27. coalesce(get_json_object(click_goods, '$.goods_id'), get_json_object(click_goods, '$.good_id')) as goods_id,
  28. get_json_object(click_goods, '$.is_rec') as is_rec,
  29. special_id,
  30. os,
  31. get_json_object(click_goods, '$.goods_no') as goods_no,
  32. pid,
  33. ptime,
  34. source,
  35. time as report_time,
  36. event_id,
  37. search_event_id,
  38. keyword,
  39. app_version,
  40. get_json_object(click_goods, '$.index') as index,
  41. get_json_object(click_goods, '$.strategy_id') as strategy_id,
  42. get_json_object(click_goods, '$.is_default') as is_default,
  43. get_json_object(click_goods, '$.is_operat') as is_operat,
  44. from_unixtime(time - 25200, 'yyyyMMdd') as dt
  45. from (
  46. select
  47. get_json_object(scdata, '$.user_id') as user_id,
  48. get_json_object(scdata, '$.special_id') as special_id,
  49. get_json_object(scdata, '$.os') as os,
  50. get_json_object(scdata, '$.pid') as pid,
  51. get_json_object(scdata, '$.ptime') as ptime,
  52. get_json_object(scdata, '$.source') as source,
  53. __time__ as time,
  54. get_json_object(scdata, '$.event_id') as event_id,
  55. get_json_object(scdata, '$.search_event_id') as search_event_id,
  56. get_json_object(scdata, '$.keyword') as keyword,
  57. get_json_object(scdata, '$.app_version') as app_version,
  58. regexp_replace1(
  59. regexp_replace1(
  60. regexp_replace1(
  61. get_json_object(scdata, '$.click_goods_arr'),
  62. '^\\[',
  63. ''
  64. ),
  65. '\\]$',
  66. ''
  67. ),
  68. '},\\{',
  69. '}|-|{'
  70. ) as click_goods_arr
  71. from ${yishou_data_dbname}.ods_yishou_log_exposure_dt
  72. where
  73. dt >= ${thirty_day_ago}
  74. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
  75. )
  76. lateral view outer explode (split(click_goods_arr, '\\|-\\|'))t as click_goods
  77. where
  78. click_goods is not null
  79. and click_goods != ''
  80. ;

2.4. 结果展示

分区表查询结果展示:

视图结果展示: 

3. APP商品曝光

注:脚本实现如下所示,关于具体使用和总结可以查看文章最后DWD层流量日志解析总结章节。

3.1. 建表语句

  1. -- 建表语句
  2. create table if not exists ${yishou_data_dbname}.dwd_goods_exposure_dt (
  3. `user_id` STRING COMMENT '用户id',
  4. `goods_id` STRING COMMENT '商品id',
  5. `is_rec` STRING COMMENT '是否搜索结果推荐商品',
  6. `special_id` STRING COMMENT '专场id',
  7. `os` STRING COMMENT '操作系统',
  8. `goods_no` STRING COMMENT '货号',
  9. `pid` STRING COMMENT '来源',
  10. `ptime` STRING COMMENT '页面停留时间',
  11. `source` STRING COMMENT '专场列表来源',
  12. `report_time` STRING COMMENT '上报时间',
  13. `event_id` STRING COMMENT '事件id',
  14. `search_event_id` STRING COMMENT '搜索事件id',
  15. `keyword` STRING COMMENT '关键词',
  16. `app_version` STRING COMMENT 'APP版本',
  17. `exposure_index` STRING COMMENT '曝光商品下标',
  18. `strategy_id` STRING COMMENT '策略id',
  19. `is_default` STRING COMMENT '主被动切换(1主动,0被动)',
  20. `is_operat` STRING COMMENT '是否广告位商品'
  21. )
  22. comment 'APP商品曝光分区表'
  23. PARTITIONED BY (`dt` STRING COMMENT '日期分区(yyyymmdd)')
  24. STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dwd_goods_exposure_dt'
  25. ;

3.2. 执行脚本

  1. -- 写数脚本
  2. insert overwrite table ${yishou_data_dbname}.dwd_goods_exposure_dt partition(dt)
  3. select
  4. user_id,
  5. coalesce(get_json_object(exposure_goods, '$.goods_id'), get_json_object(exposure_goods, '$.good_id')) as goods_id,
  6. get_json_object(exposure_goods, '$.is_rec') as is_rec,
  7. special_id,
  8. os,
  9. get_json_object(exposure_goods, '$.goods_no') as goods_no,
  10. pid,
  11. ptime,
  12. source,
  13. time as report_time,
  14. event_id,
  15. search_event_id,
  16. keyword,
  17. app_version,
  18. get_json_object(exposure_goods, '$.index') as exposure_index,
  19. get_json_object(exposure_goods, '$.strategy_id') as strategy_id,
  20. get_json_object(exposure_goods, '$.is_default') as is_default,
  21. get_json_object(exposure_goods, '$.is_operat') as is_operat,
  22. from_unixtime(time - 25200, 'yyyyMMdd') as dt
  23. from (
  24. select
  25. get_json_object(scdata, '$.user_id') as user_id,
  26. get_json_object(scdata, '$.special_id') as special_id,
  27. get_json_object(scdata, '$.os') as os,
  28. get_json_object(scdata, '$.pid') as pid,
  29. get_json_object(scdata, '$.ptime') as ptime,
  30. get_json_object(scdata, '$.source') as source,
  31. __time__ as time,
  32. get_json_object(scdata, '$.event_id') as event_id,
  33. get_json_object(scdata, '$.search_event_id') as search_event_id,
  34. get_json_object(scdata, '$.keyword') as keyword,
  35. get_json_object(scdata, '$.app_version') as app_version,
  36. regexp_replace1(
  37. regexp_replace1(
  38. regexp_replace1(
  39. get_json_object(scdata, '$.goods_arr'),
  40. '^\\[',
  41. ''
  42. ),
  43. '\\]$',
  44. ''
  45. ),
  46. '},\\{',
  47. '}|-|{'
  48. ) as exposure_goods_arr
  49. from
  50. ${yishou_data_dbname}.ods_yishou_log_exposure_dt
  51. where
  52. dt BETWEEN ${one_day_ago} and ${today}
  53. and from_unixtime(__time__ - 25200, 'yyyyMMdd') = ${one_day_ago}
  54. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
  55. )
  56. lateral view outer explode (split(exposure_goods_arr, '\\|-\\|'))t as exposure_goods
  57. where
  58. exposure_goods is not null
  59. and exposure_goods != ''
  60. DISTRIBUTE by floor(rand() * 20)
  61. ;

3.3. 创建视图

  1. -- 创建视图
  2. drop view if exists ${yishou_data_dbname}.dwd_goods_exposure;
  3. create view if not exists ${yishou_data_dbname}.dwd_goods_exposure (
  4. `user_id` COMMENT '用户id',
  5. `goods_id` COMMENT '商品id',
  6. `is_rec` COMMENT '是否搜索结果推荐商品',
  7. `special_id` COMMENT '专场id',
  8. `os` COMMENT '操作系统',
  9. `goods_no` COMMENT '货号',
  10. `pid` COMMENT '来源',
  11. `ptime` COMMENT '页面停留时间',
  12. `source` COMMENT '专场列表来源',
  13. `report_time` COMMENT '上报时间',
  14. `event_id` COMMENT '事件id',
  15. `search_event_id` COMMENT '搜索事件id',
  16. `keyword` COMMENT '关键词',
  17. `app_version` COMMENT 'APP版本',
  18. `exposure_index` COMMENT '曝光商品下标',
  19. `strategy_id` COMMENT '策略id',
  20. `is_default` COMMENT '主被动切换(1主动,0被动)',
  21. `is_operat` COMMENT '是否广告位商品',
  22. `dt` COMMENT '日期分区(yyyymmdd)'
  23. ) comment 'APP商品曝光视图'
  24. as
  25. select
  26. user_id,
  27. coalesce(get_json_object(exposure_goods, '$.goods_id'), get_json_object(exposure_goods, '$.good_id')) as goods_id,
  28. get_json_object(exposure_goods, '$.is_rec') as is_rec,
  29. special_id,
  30. os,
  31. get_json_object(exposure_goods, '$.goods_no') as goods_no,
  32. pid,
  33. ptime,
  34. source,
  35. time as report_time,
  36. event_id,
  37. search_event_id,
  38. keyword,
  39. app_version,
  40. get_json_object(exposure_goods, '$.index') as exposure_index,
  41. get_json_object(exposure_goods, '$.strategy_id') as strategy_id,
  42. get_json_object(exposure_goods, '$.is_default') as is_default,
  43. get_json_object(exposure_goods, '$.is_operat') as is_operat,
  44. from_unixtime(time - 25200, 'yyyyMMdd') as dt
  45. from (
  46. select
  47. get_json_object(scdata, '$.user_id') as user_id,
  48. get_json_object(scdata, '$.special_id') as special_id,
  49. get_json_object(scdata, '$.os') as os,
  50. get_json_object(scdata, '$.pid') as pid,
  51. get_json_object(scdata, '$.ptime') as ptime,
  52. get_json_object(scdata, '$.source') as source,
  53. __time__ as time,
  54. get_json_object(scdata, '$.event_id') as event_id,
  55. get_json_object(scdata, '$.search_event_id') as search_event_id,
  56. get_json_object(scdata, '$.keyword') as keyword,
  57. get_json_object(scdata, '$.app_version') as app_version,
  58. regexp_replace1(
  59. regexp_replace1(
  60. regexp_replace1(
  61. get_json_object(scdata, '$.goods_arr'),
  62. '^\\[',
  63. ''
  64. ),
  65. '\\]$',
  66. ''
  67. ),
  68. '},\\{',
  69. '}|-|{'
  70. ) as exposure_goods_arr
  71. from
  72. ${yishou_data_dbname}.ods_yishou_log_exposure_dt
  73. where
  74. dt > ${thirty_day_ago}
  75. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
  76. )
  77. lateral view outer explode (split(exposure_goods_arr, '\\|-\\|'))t as exposure_goods
  78. where
  79. exposure_goods is not null
  80. and exposure_goods != ''
  81. ;

3.4. 结果展示

分区表查询结果展示:

视图结果展示: 

4. APP启动

注:脚本实现如下所示,关于具体使用和总结可以查看文章最后DWD层流量日志解析总结章节。

4.1. 建表语句

  1. -- 建表语句
  2. create table if not exists ${yishou_data_dbname}.dwd_app_start_dt (
  3. `report_time` bigint COMMENT '上报时间(10位时间戳,秒值)',
  4. `time` bigint COMMENT '触发时间(10位时间戳,秒值)',
  5. `distinct_id` STRING COMMENT '唯一标识码',
  6. `os` STRING COMMENT '手机操作系统',
  7. `ab_test` STRING COMMENT 'AB版本区分',
  8. `is_first_day` STRING COMMENT '是否首日访问',
  9. `is_first_time` STRING COMMENT '是否首次访问',
  10. `resume_from_background` STRING COMMENT '是否从后台唤起',
  11. `user_id` STRING COMMENT '用户id',
  12. `app_version` STRING COMMENT 'app版本',
  13. `network_type` STRING COMMENT '网络环境',
  14. `manufacturer` STRING COMMENT '设备名称',
  15. `device_id` STRING COMMENT '设备号',
  16. `os_version` STRING COMMENT '手机操作系统版本号',
  17. `carrier` STRING COMMENT '运营商',
  18. `is_pro` STRING COMMENT '是否为Pro版{1:童装货源APP,0:一手APP}',
  19. `model` STRING COMMENT '手机型号'
  20. )
  21. comment 'APP启动分区表'
  22. PARTITIONED BY (`dt` STRING COMMENT '日期分区(yyyymmdd)')
  23. STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dwd_app_start_dt'
  24. ;

4.2. 执行脚本

  1. -- 写数脚本
  2. insert overwrite table ${yishou_data_dbname}.dwd_app_start_dt partition(dt)
  3. select
  4. __time__ as report_time,
  5. get_json_object(scdata, '$.time') / 1000 as time,
  6. get_json_object(scdata, '$.distinct_id') as distinct_id,
  7. get_json_object(scdata, '$.properties.os') as os,
  8. get_json_object(scdata, '$.properties.isAorB') as ab_test,
  9. get_json_object(scdata, '$.properties.is_first_day') as is_first_day,
  10. get_json_object(scdata, '$.properties.is_first_time') as is_first_time,
  11. get_json_object(scdata, '$.properties.resume_from_background') as resume_from_background,
  12. get_json_object(scdata, '$.properties.userid') as user_id,
  13. get_json_object(scdata, '$.properties.app_version') as app_version,
  14. get_json_object(scdata, '$.properties.network_type') as network_type,
  15. get_json_object(scdata, '$.properties.manufacturer') as manufacturer,
  16. get_json_object(scdata, '$.properties.device_id') as device_id,
  17. get_json_object(scdata, '$.properties.os_version') as os_version,
  18. get_json_object(scdata, '$.properties.carrier') as carrier,
  19. get_json_object(scdata, '$.properties.isPro') as is_pro,
  20. get_json_object(scdata, '$.properties.model') as model,
  21. from_unixtime(__time__ - 25200, 'yyyyMMdd') as dt
  22. from
  23. ${yishou_data_dbname}.ods_new_app_log_store_dt
  24. where
  25. dt BETWEEN ${one_day_ago} and ${today}
  26. and from_unixtime(__time__ - 25200, 'yyyyMMdd') = ${one_day_ago}
  27. and (lower(topic) = 'appstart' or lower(__topic__) = 'appstart')
  28. DISTRIBUTE by floor(rand() * 3)
  29. ;

4.3. 创建视图

  1. -- 创建视图
  2. drop view if exists ${yishou_data_dbname}.dwd_app_start;
  3. create view if not exists ${yishou_data_dbname}.dwd_app_start (
  4. `report_time` COMMENT '上报时间(10位时间戳,秒值)',
  5. `time` COMMENT '触发时间(10位时间戳,秒值)',
  6. `distinct_id` COMMENT '唯一标识码',
  7. `os` COMMENT '手机操作系统',
  8. `ab_test` COMMENT 'AB版本区分',
  9. `is_first_day` COMMENT '是否首日访问',
  10. `is_first_time` COMMENT '是否首次访问',
  11. `resume_from_background` COMMENT '是否从后台唤起',
  12. `user_id` COMMENT '用户id',
  13. `app_version` COMMENT 'app版本',
  14. `network_type` COMMENT '网络环境',
  15. `manufacturer` COMMENT '设备名称',
  16. `device_id` COMMENT '设备号',
  17. `os_version` COMMENT '手机操作系统版本号',
  18. `carrier` COMMENT '运营商',
  19. `is_pro` COMMENT '是否为Pro版{1:童装货源APP,0:一手APP}',
  20. `model` COMMENT '手机型号',
  21. `dt` COMMENT '日期分区(yyyymmdd)'
  22. )
  23. as
  24. select
  25. __time__ as report_time,
  26. get_json_object(scdata, '$.time') / 1000 as time,
  27. get_json_object(scdata, '$.distinct_id') as distinct_id,
  28. get_json_object(scdata, '$.properties.os') as os,
  29. get_json_object(scdata, '$.properties.isAorB') as ab_test,
  30. get_json_object(scdata, '$.properties.is_first_day') as is_first_day,
  31. get_json_object(scdata, '$.properties.is_first_time') as is_first_time,
  32. get_json_object(scdata, '$.properties.resume_from_background') as resume_from_background,
  33. get_json_object(scdata, '$.properties.userid') as user_id,
  34. get_json_object(scdata, '$.properties.app_version') as app_version,
  35. get_json_object(scdata, '$.properties.network_type') as network_type,
  36. get_json_object(scdata, '$.properties.manufacturer') as manufacturer,
  37. get_json_object(scdata, '$.properties.device_id') as device_id,
  38. get_json_object(scdata, '$.properties.os_version') as os_version,
  39. get_json_object(scdata, '$.properties.carrier') as carrier,
  40. get_json_object(scdata, '$.properties.isPro') as is_pro,
  41. get_json_object(scdata, '$.properties.model') as model,
  42. from_unixtime(__time__ - 25200, 'yyyyMMdd') as dt
  43. from
  44. ${yishou_data_dbname}.ods_new_app_log_store_dt
  45. where
  46. dt > ${thirty_day_ago}
  47. and (lower(topic) = 'appstart' or lower(__topic__) = 'appstart')
  48. ;

4.4. 结果展示

分区表查询结果展示:

视图结果展示: 

5. 总结

5.1. 为什么从ODS层读取当天全量数据

        在上述代码中,会从ODS中获取数据,然后根据不同业务主题进行数据ETL工作,再写入到DWD层对应表中,每次都是获取ODS所有的数据,一个ODS表中会包含几十个DWD的主题;这里可以考虑在ODS层就对以日期和主题名进行分区,但博主这里没有划分,因为ODS层会根据APP曝光、APP点击、H5日志等已经划分为9个大主题了,并且每次读取数据也很快(20秒以内),如下图SparkUI所示:

        在对ODS是否进行主题分区,可以综合考虑读取文件的时间和文件系统中小文件的大小个数;根据 离线数仓(8):ODS层实现之导入流量日志 文章中介绍,会3分钟生成一个文件,每天会有480个文件生成,每个文件大小平均为20M左右,但如果进行分区,那有多少个分区就会有多少倍的小文件;可以对这2方面综合考虑,来看是否在ODS层中对日期和主题进行分区。

5.2. DWD层流量表视图的创建

        在上述代码中,我们可以看到每天会将前一天的数据进行ELT,并将结果数据写入到DWD层对应表的分区中;那为什么还要创建视图呢?这是因为博主这会有近实时作业,要求使用最新的数据,每小时跑一次,这种作业如果每小时将数据重写一次到文件系统中会比较麻烦;而且现在是每小时,后续可能会要求30分钟,但综合考虑又不需要使用Flink(使用Flink完全实时对资源要求比较高);这时候就能使用视图了,在ODS中为3分钟级别更新,这样就能满足需要了。

        并且注意,视图为每天早晨会删除旧视图,然后创建新的视图,视图中获取的是近30天的数据;这是因为需要当天的数据时,一般来说30天的历史数据就完全足够,这可以根据实际情况增长或者缩短。

5.3. 对JSONArray进行炸裂处理

        在HQL中(博主这里使用的是华为云的DLI服务,使用的是Spark on Hive,所以编写的为SparkSQL),如果要获取JSONObject中的数据,可以使用get_json_object函数来获取;但没有现成的解析JSONArray的函数,所以一般情况下需要通过编写代码,自定义UDTF函数来实现此功能,然后通过 lateral view outer explode () 来将数据由一行转换成多行。

        博主这里由于历史原因,导致埋点的数据不是很符合JSONArray格式(即商品点击或商品曝光中,既有JSONArray,又有JSONObject);所以博主使用的方法是,通过regexp_replace方法和正则表达式(在代码中是regexp_replace1,是由于原先是阿里云后面迁移到华为云,迁移导致的),将其中的开始的 [ 和最后的 ] 进行替换,再将其中的 },{ 进行替换,最后使用 split 切割成集合,最后炸裂实现,具体可以查看上述代码。


注:其他 离线数仓 相关文章链接由此进 -> 离线数仓文章汇总


声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/weixin_40725706/article/detail/943129
推荐阅读
相关标签
  

闽ICP备14008679号