
Causal Inference – Study Notes: The Difference Between Covariates and Confounders

The three levels of causal inference are association (correlation), intervention (treatment/manipulation), and counterfactuals (imagination). Association, i.e. correlation, is the lowest level; today's (weak) artificial intelligence operates at this level, and although it is the lowest level, it still solves many real-world problems. Intervention means acting (do): in an environment with covariates X, confounders W, and instrumental variables Z (variables that affect T but not Y directly), we intervene on the treatment T (do) and observe the outcome Y. Counterfactuals ask "what if" questions about the imagined world corresponding to the real one, reasoning from effects back to causes; this is the level of human intelligence, the level at which the world can be changed.

Microsoft's main frameworks for causal inference are DoWhy and EconML. DoWhy implements the causal-inference workflow in four steps: model, identify, estimate, and refute. EconML is a framework for estimating expectations under the potential-outcomes model (outcomes that cannot be observed); it provides many methods, based on three core assumptions, for reducing selection bias (the gap between the modeling sample and the real world). These methods rely mainly on machine-learning models such as linear regression, random forests, decision trees, and deep learning. DoWhy can delegate its estimation step to EconML's methods.

The unit of study in causal inference is U, e.g. a hotel customer. Its attributes/features are X (X being covariates or confounders), the outcome is Y, and the treatment is T. Sometimes an instrumental variable Z is also introduced (again, to reduce selection bias).

Covariates are all variables other than the cause T and the outcome Y: in a given dataset, every variable apart from the cause and the outcome is a covariate. Confounders are the subset of covariates that affect both the cause and the outcome. In other words, covariates include both confounders and non-confounders.
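To make the distinction concrete, here is a minimal simulation (not from the original example; the variable names and coefficients are invented for illustration). W is a confounder because it drives both the treatment T and the outcome Y; regressing Y on T alone gives a biased effect estimate, while adjusting for W recovers the true one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# W is a confounder: it influences both the treatment T and the outcome Y.
W = rng.normal(size=n)
T = 0.8 * W + rng.normal(size=n)             # treatment depends on W
Y = 2.0 * T + 1.5 * W + rng.normal(size=n)   # true causal effect of T on Y is 2.0

# Naive regression of Y on T alone is biased upward by the confounder.
naive = np.polyfit(T, Y, 1)[0]

# Adjusting for W (including it as a regressor) recovers the true effect.
X = np.column_stack([T, W, np.ones(n)])
adjusted = np.linalg.lstsq(X, Y, rcond=None)[0][0]

print(f"naive estimate:    {naive:.2f}")     # noticeably above 2.0
print(f"adjusted estimate: {adjusted:.2f}")  # close to 2.0
```

A covariate that affects only Y (but not T) would not bias the naive estimate; only the "affects both" property makes W a confounder that must be adjusted for.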

The following walks through the causal-inference process with the DoWhy framework, using the effect of assigning a different hotel room on booking cancellation as the example.

I. Create the model (build an initial causal graph from prior knowledge, then create the causal model from the dataset, the treatment, and that graph). The initial graph may be incomplete, but DoWhy will complete it automatically.

1. Prepare the dataset, including feature engineering

```python
# Prepare the dataset
import dowhy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import logging

logging.getLogger("dowhy").setLevel(logging.INFO)

dataset = pd.read_csv('https://raw.githubusercontent.com/Sid-darthvader/DoWhy-The-Causal-Story-Behind-Hotel-Booking-Cancellations/master/hotel_bookings.csv')
dataset.columns

# Total stay in nights
dataset['total_stay'] = dataset['stays_in_week_nights'] + dataset['stays_in_weekend_nights']
# Total number of guests
dataset['guests'] = dataset['adults'] + dataset['children'] + dataset['babies']

# Creating the different_room_assigned feature
dataset['different_room_assigned'] = 0
slice_indices = dataset['reserved_room_type'] != dataset['assigned_room_type']
dataset.loc[slice_indices, 'different_room_assigned'] = 1

# Deleting older features
dataset = dataset.drop(['stays_in_week_nights', 'stays_in_weekend_nights', 'adults',
                        'children', 'babies', 'reserved_room_type', 'assigned_room_type'], axis=1)

dataset.isnull().sum()  # country, agent, company contain 488, 16340, 112593 missing entries
dataset = dataset.drop(['agent', 'company'], axis=1)
# Replacing missing countries with the most frequently occurring country
dataset['country'] = dataset['country'].fillna(dataset['country'].mode()[0])

dataset = dataset.drop(['reservation_status', 'reservation_status_date', 'arrival_date_day_of_month'], axis=1)
dataset = dataset.drop(['arrival_date_year'], axis=1)

# Replacing 1 by True and 0 by False for the treatment and outcome variables
dataset['different_room_assigned'] = dataset['different_room_assigned'].replace(1, True)
dataset['different_room_assigned'] = dataset['different_room_assigned'].replace(0, False)
dataset['is_canceled'] = dataset['is_canceled'].replace(1, True)
dataset['is_canceled'] = dataset['is_canceled'].replace(0, False)
dataset.dropna(inplace=True)  # drop remaining rows with NA values
dataset.columns
```

2. Determine the causal relationships between variables:

A very simple check: in random samples, how often do Y and X take the same value? If they are equal 100% of the time, X → Y is very likely; if only about 50% of the time, the causal relationship is uncertain.

Sample 1,000 rows at random (repeated 10,000 times) and count how often is_canceled equals different_room_assigned.

```python
# different_room_assigned: ~518 matches on average, so the causal relationship is uncertain
counts_sum = 0
for i in range(10000):
    rdf = dataset.sample(1000)
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
print(counts_sum / 10000)

# booking_changes: ~492 matches when booking_changes == 0, still uncertain
counts_sum = 0
for i in range(10000):
    rdf = dataset[dataset["booking_changes"] == 0].sample(1000)
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
print(counts_sum / 10000)
```

3. Build the causal graph from prior knowledge (a Bayesian network, i.e. a directed acyclic graph)

```python
causal_graph = """digraph {
different_room_assigned[label="Different Room Assigned"];
is_canceled[label="Booking Cancelled"];
booking_changes[label="Booking Changes"];
previous_bookings_not_canceled[label="Previous Booking Retentions"];
days_in_waiting_list[label="Days in Waitlist"];
lead_time[label="Lead Time"];
market_segment[label="Market Segment"];
country[label="Country"];
U[label="Unobserved Confounders"];
is_repeated_guest;
total_stay;
guests;
meal;
hotel;
U -> different_room_assigned; U -> is_canceled; U -> required_car_parking_spaces;
market_segment -> lead_time;
lead_time -> is_canceled; country -> lead_time;
different_room_assigned -> is_canceled;
country -> meal;
lead_time -> days_in_waiting_list;
days_in_waiting_list -> is_canceled;
previous_bookings_not_canceled -> is_canceled;
previous_bookings_not_canceled -> is_repeated_guest;
is_repeated_guest -> is_canceled;
total_stay -> is_canceled;
guests -> is_canceled;
booking_changes -> different_room_assigned; booking_changes -> is_canceled;
hotel -> is_canceled;
required_car_parking_spaces -> is_canceled;
total_of_special_requests -> is_canceled;
country -> {hotel, required_car_parking_spaces, total_of_special_requests, is_canceled};
market_segment -> {hotel, required_car_parking_spaces, total_of_special_requests, is_canceled};
}"""
```

4. Create the causal model (this in effect states a hypothesis, which the identification and estimation steps will then test)

```python
model = dowhy.CausalModel(
    data=dataset,
    graph=causal_graph.replace("\n", " "),
    treatment='different_room_assigned',
    outcome='is_canceled')
model.view_model()
```

II. Identify the causal effect (involving the average treatment effect ATE, the frontdoor criterion, the backdoor criterion, and instrumental variables iv)

```python
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
```

III. Estimate the causal effect (compute the expectation; EconML can also be used here, as it offers many methods and supports extension with new ones)

DoWhy's methods:

Linear regression: backdoor.linear_regression (fast)

```python
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.linear_regression",
                                 control_value=0,
                                 treatment_value=1,
                                 confidence_intervals=True,
                                 test_significance=True)
print(estimate)
```

Propensity score matching: backdoor.propensity_score_matching (slow)

Propensity score stratification: backdoor.propensity_score_stratification (slow)

Propensity score weighting: backdoor.propensity_score_weighting (slow)
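As background on what propensity-score weighting does (this is a toy sketch of the idea, not DoWhy's internal implementation; all data are simulated, and the true propensity is known here, whereas in practice it would be estimated, e.g. with logistic regression):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Binary confounder W affects both the treatment probability and the outcome.
W = rng.binomial(1, 0.5, size=n)
p = np.where(W == 1, 0.8, 0.2)               # propensity score P(T=1 | W)
T = rng.binomial(1, p)
Y = 2.0 * T + 3.0 * W + rng.normal(size=n)   # true ATE is 2.0

# Naive difference in means is confounded (treated units have higher W).
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Inverse-propensity weighting: each unit is weighted by the inverse
# probability of the treatment it actually received, given W.
ipw = np.mean(T * Y / p) - np.mean((1 - T) * Y / (1 - p))

print(f"naive: {naive:.2f}, IPW: {ipw:.2f}")  # naive is biased high; IPW is near 2.0
```

The weighting rebalances the sample so that the distribution of W is the same in the treated and control groups, mimicking a randomized experiment.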

Instrumental variable: iv.instrumental_variable

Regression discontinuity: iv.regression_discontinuity
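To illustrate the idea behind instrumental-variable estimation, here is a toy simulation (all names and coefficients invented): Z shifts T but influences Y only through T, so the ratio cov(Z, Y) / cov(Z, T) (the Wald estimator) recovers the causal effect even when an unobserved confounder U biases ordinary regression:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# U is an unobserved confounder; Z is an instrument: it affects T,
# but affects Y only through T.
U = rng.normal(size=n)
Z = rng.normal(size=n)
T = 0.5 * Z + U + rng.normal(size=n)
Y = 2.0 * T + 2.0 * U + rng.normal(size=n)   # true effect of T on Y is 2.0

# OLS of Y on T is biased because U is unobserved.
ols = np.cov(T, Y)[0, 1] / np.var(T)

# The IV (Wald) estimator uses only the variation in T induced by Z.
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]

print(f"OLS: {ols:.2f}, IV: {iv:.2f}")  # OLS is biased high; IV is near 2.0
```

This is exactly the role of Z in the setup described earlier: a variable that reduces selection bias by isolating the "clean" part of the variation in T.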

EconML's methods:

Double machine learning: backdoor.econml.dml.* (fast)

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

dml_estimate = model.estimate_effect(identified_estimand,
                                     method_name="backdoor.econml.dml.DML",
                                     control_value=0,
                                     treatment_value=1,
                                     confidence_intervals=False,
                                     method_params={
                                         "init_params": {'model_y': GradientBoostingRegressor(),
                                                         'model_t': GradientBoostingRegressor(),
                                                         "model_final": LassoCV(fit_intercept=False),
                                                         'featurizer': PolynomialFeatures(degree=2, include_bias=True)},
                                         "fit_params": {}})
print(dml_estimate)

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.econml.dml.LinearDML",
                                 method_params={
                                     'init_params': {'model_y': GradientBoostingRegressor(),
                                                     'model_t': GradientBoostingRegressor()},
                                     'fit_params': {}})
print(estimate)
```

Doubly robust learning: backdoor.econml.drlearner.*

Orthogonal random forests: backdoor.econml.ortho_forest.*

Deep learning with instrumental variables: iv.econml.deepiv.*

Meta-learners: backdoor.econml.metalearners.*

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.econml.metalearners.SLearner",
                                 method_params={
                                     'init_params': {'overall_model': GradientBoostingRegressor()},
                                     'fit_params': {}})
print(estimate)

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.econml.metalearners.TLearner",
                                 method_params={
                                     'init_params': {'models': GradientBoostingRegressor()},
                                     'fit_params': {}})
print(estimate)

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.econml.metalearners.XLearner",
                                 method_params={
                                     'init_params': {'models': GradientBoostingRegressor(),
                                                     'propensity_model': GradientBoostingClassifier(),
                                                     'cate_models': GradientBoostingRegressor()},
                                     'fit_params': {}})
print(estimate)
```

With so many estimation methods, which one should you use? The book《原因与结果的经济学》offers some guidance. EconML's own strategy is to pick the method with the lowest score, but in practice the choice still seems hard to make.

Here we estimate with propensity score stratification:

```python
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_stratification",
                                 target_units="ate")
# ATE = Average Treatment Effect
# ATT = Average Treatment Effect on the Treated (i.e. those who were assigned a different room)
# ATC = Average Treatment Effect on the Control (i.e. those who were not assigned a different room)
print(estimate)
```

Inference: assigning a different room (the treatment) lowers the expected cancellation probability by about 32%. A guess at the reason: customers were moved to a better room after arriving at the hotel.

IV. Refute (validate the robustness/stability of the inference result against several counterfactual variations)

1. Random common cause (expected result: the new effect differs very little from the estimated effect)

```python
refute1_results = model.refute_estimate(identified_estimand, estimate,
                                        method_name="random_common_cause")
print(refute1_results)
```

2. Placebo treatment (expected result: the new effect is close to 0)

```python
refute2_results = model.refute_estimate(identified_estimand, estimate,
                                        method_name="placebo_treatment_refuter")
print(refute2_results)
```

3. Data subset (expected result: the new effect differs very little from the estimated effect)

```python
refute3_results = model.refute_estimate(identified_estimand, estimate,
                                        method_name="data_subset_refuter")
print(refute3_results)
```

These refutation tests cannot prove that the inference is correct, but they do strengthen our confidence in it.

Comments and discussion are welcome!
