赞
踩
在数据仓库的数据模型设计过程中,经常会遇到下面这种表的设计:
拉链表可以解决上述数据缓慢化场景问题,既能保存历史状态,又能尽量节省空间资源,数仓中某些维度缓慢变化的应用场景可以使用缓慢变化维拉链表。
1)下图为业务后台数据库中每日的数据变化情况,每天只保留客户最新手机号码。
2)根据最后修改时间字段从业务后台数据库抽取最新被修改数据,形成分区表。
3)拉链表的形成过程(重点!!!)
首先根据上面的分析过程,对sql比较熟悉的小伙伴估计已经能够想出整套的解决方案了。一般在工作中,和业务人员沟通之后,自己如果能清楚的描述出业务过程,基本上需求已经实现一大半了,接下来就是写代码啦!
1)整体思路分析
2)进一步思考
1)算法思想
- 1、采集当日全量数据到ND(NewDay)表;
-
- 2、可从历史表中取出昨日全量数据存储到OD(OldDay)表;
-
- 3、(ND-OD)就是当日新增和变化的数据,也就是当天的增量,用W_I表示;
-
- 4、(OD-ND)为状态到此结束需要封链的数据,用W_U表示;
-
- 5、将W_I表的内容全部插入到历史表中,这些是新增记录,start_date为当天,而end_date为max值;
-
- 6、对历史表进行W_U部份的更新操作,start_date保持不变,而end_date改为当天,也就是关链操作
2)代码实现
1. ods层user表结构
- CREATE EXTERNAL TABLE ods.user (
- user_num STRING COMMENT '用户编号',
- mobile STRING COMMENT '手机号码',
- reg_date STRING COMMENT '注册日期',
- modified_time STRING COMMENT '更新日期'
- COMMENT '用户资料表'
- PARTITIONED BY (dt string)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
- STORED AS ORC
- LOCATION '/ods/user';
- )
2. ods层user_update表结构
- CREATE EXTERNAL TABLE ods.user_update (
- user_num STRING COMMENT '用户编号',
- mobile STRING COMMENT '手机号码',
- reg_date STRING COMMENT '注册日期',
- modified_time STRING COMMENT '更新日期'
- COMMENT '每日用户资料更新表'
- PARTITIONED BY (dt string)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
- STORED AS ORC
- LOCATION '/ods/user_update';
- )
3. 拉链表结构
- CREATE EXTERNAL TABLE dws.user_his (
- user_num STRING COMMENT '用户编号',
- mobile STRING COMMENT '手机号码',
- reg_date STRING COMMENT '注册日期',
- modified_time STRING COMMENT '更新日期',
- t_start_date STRING COMMENT '数据开始时间',
- t_end_date STRING COMMENT '数据结束时间'
- COMMENT '用户资料拉链表'
- ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
- STORED AS ORC
- LOCATION '/dws/user_his';
- )
4. 拉链表从2017-01-01更新到2017-01-02
- INSERT OVERWRITE TABLE dws.user_his
- SELECT * FROM
- (
- SELECT A.user_num,
- A.mobile,
- A.reg_date,
- A.modified_time
- A.t_start_time,
- CASE
- WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-01'
- ELSE A.t_end_time
- END AS t_end_time
- FROM dws.user_his AS A
- LEFT JOIN ods.user_update AS B
- ON A.user_num = B.user_num
- UNION
- SELECT C.user_num,
- C.mobile,
- C.reg_date,
- C.modified_time,
- '2017-01-02' AS t_start_time,
- '9999-12-31' AS t_end_time
- FROM ods.user_update AS C
- ) AS T
5. 拉链表从2017-01-02更新到2017-01-03
- INSERT OVERWRITE TABLE dws.user_his
- SELECT * FROM
- (
- SELECT A.user_num,
- A.mobile,
- A.reg_date,
- A.modified_time
- A.t_start_time,
- CASE
- WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-02'
- ELSE A.t_end_time
- END AS t_end_time
- FROM dws.user_his AS A
- LEFT JOIN ods.user_update AS B
- ON A.user_num = B.user_num
- UNION
- SELECT C.user_num,
- C.mobile,
- C.reg_date,
- C.modified_time,
- '2017-01-03' AS t_start_time,
- '9999-12-31' AS t_end_time
- FROM ods.user_update AS C
- ) AS T
6. 替换日期为变量,比如今日是2017-01-03,数仓中可以更新2017-01-02数据,此时$date = 2017-01-02
- INSERT OVERWRITE TABLE dws.user_his
- SELECT * FROM
- (
- SELECT A.user_num,
- A.mobile,
- A.reg_date,
- A.modified_time
- A.t_start_time,
- CASE
- WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN date_add($date,-1)
- ELSE A.t_end_time
- END AS t_end_time
- FROM dws.user_his AS A
- LEFT JOIN ods.user_update AS B
- ON A.user_num = B.user_num
- UNION
- SELECT C.user_num,
- C.mobile,
- C.reg_date,
- C.modified_time,
- $date AS t_start_time,
- '9999-12-31' AS t_end_time
- FROM ods.user_update AS C
- ) AS T
1. 下图展示拉链表,分区表和全量表的结果
2.三种类型的表各自的对比如下,当然每种表有每种表各自适合的使用场景,具体问题要具体分析!
文中若有不正之处,望请指正!
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。