赞
踩
目录
根据如上数仓分层图和数仓数据走向图,可以看出dim层的主要目的有如下2点:
ODS建表语句如下:
- create EXTERNAL table ${yishou_data_dbname}.ods_fmys_region_dt(
- region_id BIGINT comment "地区id",
- parent_id BIGINT comment "父地区id",
- region_name STRING comment "地区名",
- region_type BIGINT comment "区域级别{0:国家,1:省份,2:城市,4:区域 }",
- agency_id BIGINT comment "未知用途",
- auto_time STRING comment "处理更新时间"
- )
- PARTITIONED BY (`dt` BIGINT COMMENT '日期分区(yyyymmdd)')
- STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/ods_fmys_region_dt'
- ;
DIM建表语句如下:
- create EXTERNAL table ${yishou_data_dbname}.dim_region_info(
- id BIGINT comment '区域id(包括区域、城市、省份、国家)'
- , name STRING comment '区域名称(包括区域、城市、省份、国家)'
- , type BIGINT comment '区域级别{0:国家,1:省份,2:城市,3:地区}'
- , region_id BIGINT comment '地区id'
- , region_name STRING comment '地区名称'
- , city_id BIGINT comment '城市id'
- , city_name STRING comment '城市名称'
- , province_id BIGINT comment '省份id'
- , province_name STRING comment '省份名称'
- , country_id BIGINT comment '国家id'
- , country_name STRING comment '国家名称'
- , agency_id BIGINT comment '未知用途'
- , auto_time STRING comment '处理更新时间'
- )
- PARTITIONED BY (`dt` BIGINT COMMENT '日期分区(yyyymmdd)')
- STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dim_region_info'
- ;
- -- 跑数脚本(从ODS层获取出昨天的数据,进行计算,写入到DIM昨天的分区中)
- with
- region_temp as (
- select
- region_id
- , parent_id
- , region_name
- , region_type
- , agency_id
- , auto_time
- from ${yishou_data_dbname}.ods_fmys_region_dt
- where dt = ${one_day_ago}
- )
-
- insert overwrite table ${yishou_data_dbname}.dim_region_info partition(dt = ${one_day_ago})
- SELECT
- region_table.region_id as id
- , region_table.region_name as name
- , region_table.region_type as type
- , case
- when region_table.region_type = 3 then region_table.region_id
- end as region_id
- , case
- when region_table.region_type = 3 then region_table.region_name
- end as region_name
- , case
- when region_table.region_type = 3 then city_table.region_id
- when region_table.region_type = 2 then region_table.region_id
- end as city_id
- , case
- when region_table.region_type = 3 then city_table.region_name
- when region_table.region_type = 2 then region_table.region_name
- end as city_name
- , case
- when region_table.region_type = 3 then province_table.region_id
- when region_table.region_type = 2 then city_table.region_id
- when region_table.region_type = 1 then region_table.region_id
- end as province_id
- , case
- when region_table.region_type = 3 then province_table.region_name
- when region_table.region_type = 2 then city_table.region_name
- when region_table.region_type = 1 then region_table.region_name
- end as province_name
- , case
- when region_table.region_type = 3 then country_table.region_id
- when region_table.region_type = 2 then province_table.region_id
- when region_table.region_type = 1 then city_table.region_id
- when region_table.region_type = 0 then region_table.region_id
- end as country_id
- , case
- when region_table.region_type = 3 then country_table.region_name
- when region_table.region_type = 2 then province_table.region_name
- when region_table.region_type = 1 then city_table.region_name
- when region_table.region_type = 0 then region_table.region_name
- end as country_name
- , region_table.agency_id as agency_id
- , region_table.auto_time as auto_time
- FROM region_temp as region_table
-
- left join region_temp as city_table
- on region_table.parent_id = city_table.region_id
-
- left join region_temp as province_table
- on city_table.parent_id = province_table.region_id
-
- left join region_temp as country_table
- on province_table.parent_id = country_table.region_id
-
- DISTRIBUTE BY floor(rand()*1)
- ;
执行的SQL:
SELECT * FROM ${yishou_data_dbname}.dim_region_info where dt = 20220902 order by id;
结果展示:
注意:下述代码只写出了部分,将商品表(博主这里的电商描述跟业界的有点不一样,下述的goods_no可以理解成业界的spu_id)和商品扩展表关联,并将品类和供应商相关信息并入
-
- with
- category_temp as (
- -- 将ods层的品类表获取出来(ods层品类表为分区表,只需要最新的分区即可)
- select
- cat_id
- , cat_name
- , keywords
- , cat_desc
- , parent_id
- , sort_order
- , template_file
- , measure_unit
- , show_in_nav
- , is_show
- , grade
- , lv
- , rv
- , filter_attr
- , goods_weight
- , goods_model_id
- , img
- , icon
- , normal_weight
- , hot_order
- , is_recommend
- , auto_time
- , normal_refund
- , completely_refund
- , normal_refund_grade
- from ${yishou_data_dbname}.ods_fmys_category_dt
- where dt = ${one_day_ago}
- ),
- cat_temp as (
- -- 业务库的品类表里面信息较少,进行聚集,需要获取一个品类的三级品类、二级品类、一级品类等所有信息
- select
- fc.cat_id,
- fc.cat_name,
- fc.sort_order,
- fc.is_show,
- fc.grade,
- fc.goods_model_id,
- fc.img,
- fc.icon,
- fc.normal_weight,
- fc.hot_order,
- fc.is_recommend,
- fc.auto_time,
- case
- when fc.grade = 3 then fc.cat_id
- end as third_cat_id,
- case
- when fc.grade = 3 then fc.cat_name
- end as third_cat_name,
- case
- when fc.grade = 3 then fc1.cat_id
- when fc.grade = 2 then fc.cat_id
- end as second_cat_id,
- case
- when fc.grade = 3 then fc1.cat_name
- when fc.grade = 2 then fc.cat_name
- end as second_cat_name,
- case
- when fc.grade = 3 then fc2.cat_id
- when fc.grade = 2 then fc1.cat_id
- when fc.grade = 1 then fc.cat_id
- end as primary_cat_id,
- case
- when fc.grade = 3 then fc2.cat_name
- when fc.grade = 2 then fc1.cat_name
- when fc.grade = 1 then fc.cat_name
- end as primary_cat_name,
- fc.normal_refund,
- fc.completely_refund
- from
- category_temp fc
- left join category_temp fc1 on fc.parent_id = fc1.cat_id
- left join category_temp fc2 on fc1.parent_id = fc2.cat_id
- ),
- goods_lib_temp as (
- -- 将ods层的商品表获取出来(ods层商品表为分区表,只需要最新的分区即可)
- select
- goods_no
- , supply_id
- , supplier_id
- , brand_id
- , market_id
- , picker_group
- , picker_group_code
- , pgid
- , cat_id
- , goods_kh
- , goods_name
- , goods_desc
- , goods_tag
- , custom_tag
- , enough_number
- , goods_weight
- , goods_model_id
- , fabric_id
- , market_price
- , shop_price
- , extra_price
- , shoppe_price
- , vip_discount
- , old_price
- , attribute
- , goods_model_setting
- , goods_img
- , video_url
- , video_status
- , origin
- , add_admin
- , goods_status
- , is_on_sale
- , is_action
- , action_value
- , add_time
- , action_time
- , stop_time
- , update_time
- , update_admin
- , size_chart
- , fitting_report
- , size_chart_id
- , parts
- , auto_time
- , estimated_return_time
- , activity_goods
- , picker_assist
- , vendor_supply_id
- from ${yishou_data_dbname}.ods_fmys_goods_lib_dt
- where dt = ${one_day_ago}
- ),
- goods_lib_ext_temp as (
- -- 将ods层的商品扩展表获取出来(ods层商品扩展表为分区表,只需要最新的分区即可)
- select
- goods_no
- , limit_day
- , first_sale_period
- , handle_time
- , handle_admin
- , is_stop_sale
- , off_sale_remark
- , pgm_recommend_desc
- , ys_sort
- , update_time
- , create_time
- , goods_type
- , scheduling_day
- , scheduling_day_time
- , scheduling_erp_day
- , scheduling_erp_day_time
- , goods_from
- , customize_one
- , customize_two
- , customize_three
- , extra_plus
- , put_off_times
- , change_to_sale_stock
- , if_supply_limit
- , fabric_desc
- , special_remark
- , supply_channel
- , grade
- , is_sole
- , model_param
- , if_lock
- , start_lock_time
- , lock_time
- , business_time
- , benefit_point
- , goods_style
- , live_model_id
- , is_delete
- , shop_push_new
- , one_hand
- , one_hand_size_title
- , extra_price_type
- , is_no_reason_support
- , is_advance_support
- , advance_time
- from ${yishou_data_dbname}.ods_fmys_goods_lib_ext_dt
- where dt = ${one_day_ago}
- ),
- supply_temp as (
- -- 将ods层的供应商表获取出来(ods层供应商表为分区表,只需要最新的分区即可)
- select
- supply_id
- , supply_name
- , supply_mobile_phone
- , supply_telephone
- , supply_wechat
- , supply_status
- , limit_day
- , create_time
- , create_admin
- , update_time
- , pgid
- , supply_floor
- , supply_area_id
- , is_scal
- , market_id
- , supply_level
- , settle_type
- , settle_days
- , next_settle_start_time
- , last_settle_time
- , need_settle_flag
- , purchase_status
- , share_status
- , the_new_flag
- , is_signed_contract
- , city
- , remark
- , supplier_num
- , pic
- , auto_time
- , is_strength
- , license
- , season
- , age_group
- , cat_ids
- , styles
- , styles_char
- , is_show_brand
- , advantage
- , disadvantage
- , avatar
- , is_full
- , is_self
- , supply_nickname
- , pg_member_id
- , pg_id
- , is_live
- , live_type
- , live_num
- , is_lower_enter
- , is_ka
- , m_goal
- , pg_say
- , disable_count
- , disable_remark
- , is_order_ability
- , shopId
- , is_all_style
- , baobmobile
- , baob_allot
- , fans
- , picture
- , baob_stock_sharing
- , first_sync_time
- , wcg_order_sync_time
- , supply_grade
- , second_styles
- , supply_picker_group_member_id
- , is_benchmarking
- , is_self_need_audit
- , as_plan
- , all_rebuild_plan
- , supply_developer_admin
- , dgid
- , dg_member_id
- , recommend_desc
- , follower_admin_id
- , sub_acct_no
- , is_sign
- , is_screenshot_for_add_goods
- , supply_action_value
- from ${yishou_data_dbname}.ods_fmys_supply_dt
- where dt = ${one_day_ago}
- ),
- insert overwrite table ${yishou_data_dbname}.dim_goods_no_info partition(dt = ${one_day_ago})
- select
- *
- from goods_lib_temp
- left join goods_lib_ext_temp on goods_lib_temp.goods_no = goods_lib_ext_temp.goods_no
- left join cat_temp on goods_lib_temp.cat_id = cat_temp.cat_id
- left join supply_temp on goods_lib_temp.supply_id = supply_temp.supply_id
- ;
在上述代码中,主要是以业务库的商品表(注意:不同公司业务逻辑不同,博主这里的goods_no就相当于通俗的spu_id,goods_id就相当于sku_id);对商品维度表(dim_goods_no_info)是一个每天的分区表,每天会保留全量数据;对商品属性关联时,比如上述商品关联品类和供应商时,会尽可能多的将品类和供应商的属性信息也关联到商品维度表中,虽然在这里计算会多点,但后续使用的时候就不需要关联其他属性了,在建设中,会尽可能的使用星型模型,减少后续的join操作。
- CREATE EXTERNAL TABLE `dim_calendar`(
- `date_time` STRING COMMENT '日期',
- `date_str` STRING COMMENT '日期字符串',
- `date_year` BIGINT COMMENT '年',
- `date_month` BIGINT COMMENT '月',
- `date_week` BIGINT COMMENT '周',
- `date_his_week` BIGINT COMMENT '历史以来的第几周(周一到周日算一周)',
- `date_his_week_fri` BIGINT COMMENT '历史以来的第几周(周五到周四算一周)',
- `date_his_week_sat` BIGINT COMMENT '历史以来的第几周(周六到周五算一周)',
- `date_his_week_sun` BIGINT COMMENT '历史以来的第几周(周日到周六算一周)',
- `date_season` BIGINT COMMENT '季',
- `weekday` BIGINT COMMENT '周几{周一:1,周二:2,周三:3,周四:4,周五:5,周六:6,周日:7}',
- `lunar_date` STRING COMMENT '农历'
- )
- STORED AS ORC LOCATION 'obs://yishou-test/yishou_data_qa_test/dim_calendar'
- ;
通常情况下,时间维度表的数据并不是来自于业务系统,而是手动写入,并且由于时间维度表数据的可预见性,无须每日导入,博主这里一次性将数据导入到了2030年。
因为用户的信息数据会变化,但变化频率又不高,如果每天在DIM层做一个分区会造成资源浪费(用户数据相对品类、供应商、商品等数据量比较多,而在ods层已经做了每天的快照备份);这时就可以引入拉链表的概念,将用户的变化情况记录下来,每个变化记录一条数据。
拉链表,记录每条信息的生命周期,一旦一条记录的生命周期结束,就重新开始一条新的记录,并把当前日期放入生效开始日期。 如果当前信息至今有效,在生效结束日期中填入一个极大值(如9999-99-99)。
拉链表适合于:数据会发生变化,但是变化频率并不高的维度(即:缓慢变化维) 。比如:用户信息会发生变化,但是每天变化的比例不高。如果数据量有一定规模,按照每日全量的方式保存效率很低。 比如:1亿用户*365天,每天一份用户信息。(做每日全量效率低)
通过,生效开始日期<=某个日期 且 生效结束日期>=某个日期 ,能够得到某个时间点的数据全量切片。例:
通过业务库表的历史变化行为记录表(从上线之日起的历史行为记录,上线之前的为全量表),根据主键进行分区,按记录时间升序排序;当这条记录更改完之后,那在下次更改前,该主键对应的信息就是这条记录的信息;如果没有后续更改,可以设置一个最大值(如 9999-99-99);第一条数据,可以设置开始时间为最小值(如 1970-01-01),表示从最开始就一直是这个信息。
因为拉链表是记录数据的新增和变化,所以使用ods层的记录表(ods_fmys_users_record_dt);如下代码所示,对每个user_id开窗,每个更改时间一条数据,当在这个时间点,用户修改了信息,那也就是说,从这个时间点到下个时间点,用户都使用的是这条记录的信息;如果下个时间没有,可以将end_date设置成 '9999-99-99 00:00:00' ,表示这是用户当前正在使用的信息;最后还可以将第一条数据的start_date设置成'1970-01-01 00:00:00',表示用户截止到有记录之前都是使用的这条数据的信息。
注意:当前拉链表只有等ods_fmys_users_record_dt上线之后才能使用,并且之前的数据已不能记录,只能统计上线之日起后续的更改;另外在下述代码中只贴出了3个字段,真正的用户信息是有很多字段的。
- insert overwrite table ${yishou_data_dbname}.dim_user_info
- select
- user_id
- , user_name
- , password
- , event
- , record_time
- , nano_time
- , case
- when row_number_id = 1 then '1970-01-01 00:00:00'
- else start_date
- end as start_date
- , end_date
- , row_number_id
- from (
- select
- user_id
- , user_name
- , password
- , event
- , record_time
- , nano_time
- , from_unixtime(record_time / 1000, 'yyyy-MM-dd HH:mm:ss') as start_date
- , lead(from_unixtime(record_time / 1000, 'yyyy-MM-dd HH:mm:ss'), 1, '9999-99-99 00:00:00') over(partition by user_id order by record_time, nano_time) as end_date
- , row_number() over(partition by user_id order by record_time, nano_time) as row_number_id
- from ${yishou_data_dbname}.ods_fmys_users_record_dt
- where dt = ${one_day_ago}
- )
- ;
在DIM层维度表中一共分为如下3种:
维度表的注意事项:
注:其他 离线数仓 相关文章链接由此进 -> 离线数仓文章汇总
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。