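Causal inference has three levels: association (correlation), intervention (treatment), and counterfactuals (imagination). Association is the lowest level and is where today's AI (weak AI) sits; even so, it already solves many real-world problems. Intervention means acting (do): given covariates X, confounders W, and an instrumental variable Z (a variable that affects T but affects Y only through T), we intervene on T (do) and observe the outcome Y. Counterfactuals pose "what if" questions about an imagined world that parallels the real one, reasoning from the effect back to the cause. This is the level at which human intelligence operates, and it is what makes changing the world possible.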
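To make the gap between the first two levels concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the structural equation, the probabilities, and the function names `observe` and `intervene` are hypothetical and not part of DoWhy or of the hotel data. A confounder W pushes up both T and Y, so the observed difference overstates the effect that an actual intervention do(T) would produce.

```python
import random
from statistics import mean

random.seed(0)

def outcome(t, w):
    # Hypothetical structural equation: the true effect of T on Y is 1.0.
    return 1.0 * t + 2.0 * w + random.gauss(0, 1)

def observe(n):
    """Level 1 (association): T depends on the confounder W."""
    rows = []
    for _ in range(n):
        w = int(random.random() < 0.5)
        t = int(random.random() < (0.8 if w else 0.2))
        rows.append((t, outcome(t, w)))
    return rows

def intervene(n):
    """Level 2 (intervention): do(T) sets T by coin flip, cutting the W -> T arrow."""
    rows = []
    for _ in range(n):
        w = int(random.random() < 0.5)
        t = int(random.random() < 0.5)
        rows.append((t, outcome(t, w)))
    return rows

def mean_difference(rows):
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return mean(treated) - mean(control)

associational = mean_difference(observe(20_000))
interventional = mean_difference(intervene(20_000))
print(f"observed difference: {associational:.2f}")   # inflated by W (2.2 in expectation)
print(f"do(T) difference:    {interventional:.2f}")  # near the true effect of 1.0
```

The two numbers differ only because of how T was assigned; the outcome equation is the same in both functions.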
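Microsoft's main causal-inference frameworks are DoWhy and EconML. DoWhy carries out causal inference in four steps: modeling, identification, estimation, and validation. EconML is a framework for estimating expectations under the potential-outcomes model (a potential outcome may be one that cannot be observed); it provides a range of methods, built on three assumptions, for reducing selection bias (the gap between the sample used for modeling and the real world). These methods rely mainly on machine-learning models, including linear regression, random forests, decision trees, and deep learning. In its estimation step, DoWhy can call EconML's methods.
The unit of study in causal inference is U, for example hotel customers. Its attributes/features are X (covariates or confounders), its outcome is Y, and the treatment is T. Sometimes an instrumental variable Z is introduced as well, again to reduce selection bias.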
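Covariates are all variables other than the cause T and the outcome Y. In an existing dataset, every variable that is neither the cause nor the outcome is a covariate. Confounders are the covariates that affect both the cause and the outcome. In other words, the covariates consist of confounders and non-confounders.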
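The distinction can be read straight off a causal graph. The sketch below is only an illustration: the edge list is a hypothetical toy graph that borrows a few column names from the hotel example, and `descendants` and `split_covariates` are helper names invented here. A covariate counts as a confounder when it reaches the treatment and also reaches the outcome along a path that does not pass through the treatment.

```python
from collections import defaultdict

def descendants(edges, start, blocked=frozenset()):
    """Nodes reachable from `start` along directed edges, never entering `blocked`."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in children[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def split_covariates(edges, treatment, outcome):
    """Return (confounders, other covariates) for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    covariates = nodes - {treatment, outcome}
    confounders = {
        c for c in covariates
        if treatment in descendants(edges, c)
        and outcome in descendants(edges, c, blocked={treatment})
    }
    return confounders, covariates - confounders

# Hypothetical toy graph, not the full graph used later in this post.
edges = [
    ("different_room_assigned", "is_canceled"),
    ("booking_changes", "different_room_assigned"),
    ("booking_changes", "is_canceled"),
    ("lead_time", "is_canceled"),
    ("guests", "is_canceled"),
]
confounders, others = split_covariates(edges, "different_room_assigned", "is_canceled")
print(sorted(confounders))  # ['booking_changes']
print(sorted(others))       # ['guests', 'lead_time']
```

Blocking the treatment in the second search keeps a pure instrument (a variable that affects the outcome only via the treatment) from being labeled a confounder.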
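The rest of this post uses the effect of a room change on booking cancellations as a running example, to walk through causal inference with the DoWhy framework.
I. Build the model (use prior knowledge to draw an initial causal graph, then build a causal model from the dataset, the treatment, and that graph). The graph may be incomplete; DoWhy accepts a partial graph and treats the remaining variables as potential confounders.
1. Prepare the dataset, including feature engineering
- # Prepare the dataset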
- import dowhy
- import pandas as pd
- import numpy as np
- import matplotlib.pyplot as plt
- import logging
- logging.getLogger("dowhy").setLevel(logging.INFO)
-
- dataset = pd.read_csv('https://raw.githubusercontent.com/Sid-darthvader/DoWhy-The-Causal-Story-Behind-Hotel-Booking-Cancellations/master/hotel_bookings.csv')
- dataset.columns
-
- # Total stay in nights
- dataset['total_stay'] = dataset['stays_in_week_nights']+dataset['stays_in_weekend_nights']
- # Total number of guests
- dataset['guests'] = dataset['adults']+dataset['children'] +dataset['babies']
- # Creating the different_room_assigned feature
- dataset['different_room_assigned']=0
- slice_indices =dataset['reserved_room_type']!=dataset['assigned_room_type']
- dataset.loc[slice_indices,'different_room_assigned']=1
- # Deleting older features
- dataset = dataset.drop(['stays_in_week_nights','stays_in_weekend_nights','adults','children','babies'
- ,'reserved_room_type','assigned_room_type'],axis=1)
-
- dataset.isnull().sum() # Country,Agent,Company contain 488,16340,112593 missing entries
- dataset = dataset.drop(['agent','company'],axis=1)
- # Replacing missing countries with the most frequently occurring country
- dataset['country']= dataset['country'].fillna(dataset['country'].mode()[0])
-
- dataset = dataset.drop(['reservation_status','reservation_status_date','arrival_date_day_of_month'],axis=1)
- dataset = dataset.drop(['arrival_date_year'],axis=1)
-
- # Convert the treatment and outcome variables from 1/0 to True/False
- dataset['different_room_assigned'] = dataset['different_room_assigned'].astype(bool)
- dataset['is_canceled'] = dataset['is_canceled'].astype(bool)
- dataset.dropna(inplace=True)  # added: drop rows that still contain NA values
- dataset.columns
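2. Check for causal relationships between variables:
A very rough check: in random samples, count how often Y equals X. If they agree 100% of the time, X -> Y is likely; if they agree about 50% of the time, a causal relationship cannot be determined.
Draw 1,000 rows at random, 10,000 times, and see on average how many rows have the cancellation flag equal to the room-change flag.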
- # different_room_assigned: about 518 of 1,000 rows agree, so causality is undetermined
- counts_sum = 0
- for i in range(10000):  # exactly 10,000 draws, matching the divisor below
-     rdf = dataset.sample(1000)
-     counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
-     counts_sum += counts_i
- print(counts_sum / 10000)
-
- # Restricted to bookings with no changes (booking_changes == 0): about 492, also undetermined
- counts_sum = 0
- for i in range(10000):
-     rdf = dataset[dataset["booking_changes"] == 0].sample(1000)
-     counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
-     counts_sum += counts_i
- print(counts_sum / 10000)
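The resampling loops above are a Monte Carlo estimate of a quantity that can also be computed directly: the overall agreement rate between the two flags, scaled to 1,000 rows. The sketch below checks this on synthetic data. The `population` list is a hypothetical stand-in for the two boolean columns (independent coin flips with made-up rates), not the hotel dataset, and the repetition count is lowered to keep it fast.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical stand-in for (is_canceled, different_room_assigned): 5,000 rows.
population = [(random.random() < 0.37, random.random() < 0.12) for _ in range(5_000)]

def agreement_count(rows):
    """Number of rows in which the two flags are equal."""
    return sum(a == b for a, b in rows)

# Resampling estimate, in the spirit of the loops above.
repeats, sample_size = 500, 1_000
resampled = mean(
    agreement_count(random.sample(population, sample_size)) for _ in range(repeats)
)

# Direct computation: overall agreement rate scaled to the sample size.
direct = agreement_count(population) / len(population) * sample_size

print(round(resampled, 1), round(direct, 1))
```

On a pandas DataFrame the direct version would be a one-liner along the lines of `(dataset["is_canceled"] == dataset["different_room_assigned"]).mean() * 1000`. Either way, the number only measures association, which is why the post goes on to build a causal model.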
3. Build the causal graph from prior knowledge (a Bayesian network, i.e. a directed acyclic graph)
- causal_graph = """digraph {
- different_room_assigned[label="Different Room Assigned"];
- is_canceled[label="Booking Cancelled"];
- booking_changes[label="Booking Changes"];
- previous_bookings_not_canceled[label="Previous Booking Retentions"];
- days_in_waiting_list[label="Days in Waitlist"];
- lead_time[label="Lead Time"];
- market_segment[label="Market Segment"];
- country[label="Country"];
- U[label="Unobserved Confounders"];
- is_repeated_guest;
- total_stay;
- guests;
- meal;
- hotel;
- U->different_room_assigned; U->is_canceled;U->required_car_parking_spaces;
- market_segment -> lead_time;
- lead_time->is_canceled; country -> lead_time;
- different_room_assigned -> is_canceled;
- country->meal;
- lead_time -> days_in_waiting_list;
- days_in_waiting_list ->is_canceled;
- previous_bookings_not_canceled -> is_canceled;
- previous_bookings_not_canceled -> is_repeated_guest;
- is_repeated_guest -> is_canceled;
- total_stay -> is_canceled;
- guests -> is_canceled;
- booking_changes -> different_room_assigned; booking_changes -> is_canceled;
- hotel -> is_canceled;
- required_car_parking_spaces -> is_canceled;
- total_of_special_requests -> is_canceled;
- country->{hotel, required_car_parking_spaces,total_of_special_requests,is_canceled};
- market_segment->{hotel, required_car_parking_spaces,total_of_special_requests,is_canceled};
- }"""
4. Create the causal model (this in effect states a hypothesis, which the identification and estimation steps then test)
- model= dowhy.CausalModel(
- data = dataset,
- graph=causal_graph.replace("\n", " "),
- treatment='different_room_assigned',
- outcome='is_canceled')
- model.view_model()
II. Causal identification (involves the average treatment effect (ATE) and the frontdoor, backdoor, and instrumental-variable (iv) criteria)
- identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
- print(identified_estimand)
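When identification succeeds through the backdoor criterion, the estimand amounts to "compare treated and untreated units within each level of the adjustment set, then average over those levels". The following is a minimal sketch of that adjustment on simulated data. The data-generating process, the single binary confounder W, and the function name `backdoor_ate` are all assumptions made for illustration; this is not what DoWhy runs internally.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(1)

# Hypothetical data: W confounds T and Y; the true effect of T on Y is 1.0.
rows = []
for _ in range(20_000):
    w = int(random.random() < 0.5)
    t = int(random.random() < (0.8 if w else 0.2))
    y = 1.0 * t + 2.0 * w + random.gauss(0, 1)
    rows.append((w, t, y))

def backdoor_ate(rows):
    """ATE = sum over w of P(W=w) * (E[Y | T=1, W=w] - E[Y | T=0, W=w])."""
    cells = defaultdict(list)
    for w, t, y in rows:
        cells[(w, t)].append(y)
    ate = 0.0
    for w in {w for w, _, _ in rows}:
        p_w = sum(1 for r in rows if r[0] == w) / len(rows)
        ate += p_w * (mean(cells[(w, 1)]) - mean(cells[(w, 0)]))
    return ate

naive = mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)
adjusted = backdoor_ate(rows)
print(f"naive difference:    {naive:.2f}")     # biased upward by W
print(f"backdoor adjustment: {adjusted:.2f}")  # near the true effect of 1.0
```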
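III. Causal estimation (computes the expected effect; EconML can also be used here, since it offers many methods and supports adding new ones)
DoWhy methods:
Linear regression: backdoor.linear_regression (fairly fast)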
- estimate = model.estimate_effect(identified_estimand,
- method_name="backdoor.linear_regression",
- control_value=0,
- treatment_value=1,
- confidence_intervals=True,
- test_significance=True)
- print(estimate)
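Propensity score matching: backdoor.propensity_score_matching (fairly slow)
Propensity score stratification: backdoor.propensity_score_stratification (fairly slow)
Propensity score weighting: backdoor.propensity_score_weighting (fairly slow)
Instrumental variable: iv.instrumental_variable
Regression discontinuity: iv.regression_discontinuity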
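Three of the methods listed above are built on the propensity score e(w) = P(T=1 | W=w). As a rough sketch of the weighting variant, the code below estimates the propensity by the treated fraction within each level of a single binary confounder and applies inverse-propensity weights. The simulated data and the function name `ipw_ate` are hypothetical; DoWhy fits a propensity model on the real covariates instead.

```python
import random
from collections import Counter

random.seed(2)

# Hypothetical data: W confounds T and Y; the true effect of T on Y is 1.0.
rows = []
for _ in range(20_000):
    w = int(random.random() < 0.5)
    t = int(random.random() < (0.8 if w else 0.2))
    y = 1.0 * t + 2.0 * w + random.gauss(0, 1)
    rows.append((w, t, y))

def ipw_ate(rows):
    """Inverse-propensity-weighted ATE with a frequency-based propensity per W level."""
    n_w = Counter(w for w, _, _ in rows)
    n_treated_w = Counter(w for w, t, _ in rows if t == 1)
    propensity = {w: n_treated_w[w] / n_w[w] for w in n_w}
    n = len(rows)
    treated_term = sum(y / propensity[w] for w, t, y in rows if t == 1) / n
    control_term = sum(y / (1 - propensity[w]) for w, t, y in rows if t == 0) / n
    return treated_term - control_term

ate = ipw_ate(rows)
print(f"IPW estimate: {ate:.2f}")  # near the true effect of 1.0
```

With one discrete confounder and frequency-based propensities, this coincides with stratifying on W directly; the propensity score earns its keep when there are many covariates and exact stratification is no longer feasible.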
EconML methods:
Double machine learning: backdoor.econml.dml.* (fairly fast)
- from sklearn.preprocessing import PolynomialFeatures
- from sklearn.linear_model import LassoCV
- from sklearn.ensemble import GradientBoostingRegressor
- dml_estimate = model.estimate_effect(identified_estimand, method_name="backdoor.econml.dml.DML",
- control_value = 0,
- treatment_value = 1,
- confidence_intervals=False,
- method_params={"init_params":{'model_y':GradientBoostingRegressor(),
- 'model_t': GradientBoostingRegressor(),
- "model_final":LassoCV(fit_intercept=False),
- 'featurizer':PolynomialFeatures(degree=2, include_bias=True)},
- "fit_params":{}})
- print(dml_estimate)
- estimate = model.estimate_effect(identified_estimand,
- method_name="backdoor.econml.dml.LinearDML",
- method_params={
- 'init_params': {'model_y':GradientBoostingRegressor(),
- 'model_t': GradientBoostingRegressor(), },
- 'fit_params': {}
- })
- print(estimate)
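The core idea behind double machine learning is "partialling out": predict Y from the controls, predict T from the controls, then regress the Y residuals on the T residuals. The sketch below shows that idea under heavy simplifications, all of which are assumptions for illustration: the data is simulated, group means over one binary control stand in for `model_y` and `model_t`, and there is no cross-fitting, which real DML implementations do use.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(3)

# Hypothetical data: W confounds T and Y; the true effect of T on Y is 1.0.
rows = []
for _ in range(20_000):
    w = int(random.random() < 0.5)
    t = int(random.random() < (0.8 if w else 0.2))
    y = 1.0 * t + 2.0 * w + random.gauss(0, 1)
    rows.append((w, t, y))

def group_mean_residuals(values, groups):
    """Residuals after subtracting each group's mean (a stand-in nuisance model)."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    means = {g: mean(vs) for g, vs in by_group.items()}
    return [v - means[g] for v, g in zip(values, groups)]

w = [r[0] for r in rows]
t = [r[1] for r in rows]
y = [r[2] for r in rows]

t_res = group_mean_residuals(t, w)  # T - E[T | W]
y_res = group_mean_residuals(y, w)  # Y - E[Y | W]

# Final stage: least-squares slope (no intercept) of the Y residuals on the T residuals.
theta = sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)
print(f"partialled-out effect: {theta:.2f}")  # near the true effect of 1.0
```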
Doubly robust learning: backdoor.econml.drlearner.*
Orthogonal forests: backdoor.econml.ortho_forest.*
Deep instrumental variables: iv.econml.deepiv.*
Meta-learners: backdoor.econml.metalearners.*
- estimate = model.estimate_effect(identified_estimand,
- method_name="backdoor.econml.metalearners.SLearner",
- method_params={
- 'init_params': {'overall_model':GradientBoostingRegressor(),
- },
- 'fit_params': {}
- })
- print(estimate)
- estimate = model.estimate_effect(identified_estimand,
- method_name="backdoor.econml.metalearners.TLearner",
- method_params={
- 'init_params': {'models':GradientBoostingRegressor(),
- },
- 'fit_params': {}
- })
- print(estimate)
- from sklearn.ensemble import GradientBoostingClassifier  # needed for propensity_model below
- estimate = model.estimate_effect(identified_estimand,
- method_name="backdoor.econml.metalearners.XLearner",
- method_params={
- 'init_params': { 'models': GradientBoostingRegressor(),
- 'propensity_model': GradientBoostingClassifier(),
- 'cate_models': GradientBoostingRegressor()
- },
- 'fit_params': {}
- })
- print(estimate)
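To show what a meta-learner does, here is a small T-learner sketch: fit one outcome model on the treated rows, another on the control rows, and take the difference of their predictions as the conditional effect. The data is simulated, and a per-feature-value mean (`fit_group_means`, a name invented here) stands in for the GradientBoostingRegressor used above, so treat it as an illustration of the idea, not of EconML's implementation.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(4)

# Hypothetical data: binary feature X; the effect of T is 1.0 when X=0 and 2.0 when X=1.
rows = []
for _ in range(20_000):
    x = int(random.random() < 0.5)
    t = int(random.random() < (0.7 if x else 0.3))
    y = (1.0 + x) * t + 0.5 * x + random.gauss(0, 1)
    rows.append((x, t, y))

def fit_group_means(pairs):
    """Stand-in 'model': predicts the mean outcome seen for each feature value."""
    by_x = defaultdict(list)
    for x, y in pairs:
        by_x[x].append(y)
    return {x: mean(ys) for x, ys in by_x.items()}

# T-learner: one outcome model per treatment arm.
mu1 = fit_group_means((x, y) for x, t, y in rows if t == 1)
mu0 = fit_group_means((x, y) for x, t, y in rows if t == 0)

# Conditional effect per feature value, then the average over the sample.
cate = {x: mu1[x] - mu0[x] for x in mu1}
ate = mean(cate[x] for x, _, _ in rows)
print({x: round(v, 2) for x, v in sorted(cate.items())}, round(ate, 2))
```

An S-learner would instead fit a single model on (X, T) and compare its predictions at T=1 and T=0; an X-learner adds a second stage that models imputed individual effects and blends them using the propensity score.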
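With so many estimation methods, which one should you use? The book 《原因与结果的经济学》 offers some guidance. EconML's strategy is to pick the method with the lowest score, but in practice the choice still seems hard to make.
Estimating with propensity score stratification: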
- estimate = model.estimate_effect(identified_estimand,
- method_name="backdoor.propensity_score_stratification",target_units="ate")
- # ATE = Average Treatment Effect
- # ATT = Average Treatment Effect on Treated (i.e. those who were assigned a different room)
- # ATC = Average Treatment Effect on Control (i.e. those who were not assigned a different room)
- print(estimate)
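Inference: being assigned a different room (the treatment) lowers the expected value of is_canceled by about 0.32, i.e. 32 percentage points. A possible reason: the customer was moved to a better room on arrival at the hotel.
IV. Validation (check the robustness/stability of the inferred result against several counterfactual, deliberately perturbed samples)
1. Random common cause: add a randomly generated covariate (expected result: the new effect differs little from the estimated effect)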
- refute1_results=model.refute_estimate(identified_estimand, estimate,
- method_name="random_common_cause")
- print(refute1_results)
2. Placebo treatment (expected result: the new effect is close to 0)
- refute2_results=model.refute_estimate(identified_estimand, estimate,
- method_name="placebo_treatment_refuter")
- print(refute2_results)
3. Data subset (expected result: the new effect differs little from the estimated effect)
- refute3_results=model.refute_estimate(identified_estimand, estimate,
- method_name="data_subset_refuter")
- print(refute3_results)
Counterfactual validation cannot prove that the inference is correct, but it can increase confidence in it.
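The logic of the placebo and data-subset checks can be sketched without DoWhy. The example below is an illustration under assumptions: the data is simulated, `stratified_ate` is a hypothetical stand-in estimator, and the placebo is built by permuting the treatment column, which is only one possible variant and may differ from DoWhy's refuters in detail.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(5)

# Hypothetical data: W confounds T and Y; the true effect of T on Y is 1.0.
rows = []
for _ in range(20_000):
    w = int(random.random() < 0.5)
    t = int(random.random() < (0.8 if w else 0.2))
    y = 1.0 * t + 2.0 * w + random.gauss(0, 1)
    rows.append((w, t, y))

def stratified_ate(rows):
    """Stand-in estimator: treated-minus-control difference within W, averaged over W."""
    cells = defaultdict(list)
    for w, t, y in rows:
        cells[(w, t)].append(y)
    ate = 0.0
    for w in {w for w, _, _ in rows}:
        p_w = sum(1 for r in rows if r[0] == w) / len(rows)
        ate += p_w * (mean(cells[(w, 1)]) - mean(cells[(w, 0)]))
    return ate

original = stratified_ate(rows)

# Placebo treatment: shuffle T across rows so that it can no longer cause Y.
shuffled_t = [t for _, t, _ in rows]
random.shuffle(shuffled_t)
placebo = stratified_ate([(w, t_new, y) for (w, _, y), t_new in zip(rows, shuffled_t)])

# Data subset: re-estimate on a random 80% of the rows.
subset = stratified_ate(random.sample(rows, int(0.8 * len(rows))))

print(f"original {original:.2f}, placebo {placebo:.2f}, subset {subset:.2f}")
```

An estimate that survives both checks is not thereby proven correct; an estimate that fails either one is a clear warning sign.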
Comments and discussion are welcome!