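1. Background
Scorecards are widely used by internet finance companies, insurers, and banks for credit risk control: based on existing data, they predict the probability of behaviours such as default or delinquency.
An application scorecard controls risk at the pre-loan stage, reducing the financial loss caused by customer default. It is an important part of risk control, so building an accurate application scorecard can effectively lower a financial institution's risk of loss.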
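2. Objectives
Using loan default data, build an auto finance (loan default) classification model and a customer application scorecard to judge how likely a customer is to default.
When a new customer applies, the scorecard produces a score directly, indicating whether the customer is relatively safe. Combined with the model's performance, it also gives a rough estimate of the default probability and so supports the decision on whether to lend.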
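3. Content
Starting from the auto finance loan default data, use the features of each customer group (accepted and rejected loan applicants) together with the default flag to build classification models with the LR and LRCV algorithms, and evaluate their performance with the ROC curve and the KS curve.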
2.1 Importing modules
import warnings
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
from fancyimpute import KNN
from PyWoE import woe
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier  # needed by the random forest reject inference
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import confusion_matrix, recall_score, classification_report
from sklearn.metrics import roc_curve, auc

warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = 'Microsoft YaHei'
plt.rcParams['figure.dpi'] = 100
rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', 100)
2.2 Loading the data
The full data set contains information on accepted customers as well as on rejected customers.
accepts = pd.read_csv(r'C:\accepts.csv')
rejects= pd.read_csv(r'C:\rejects.csv')
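Overview of the data
accepts.info()

Data columns (total 25 columns):   # 25 features in total, some with missing values
application_id    5845 non-null  int64    # applicant ID
account_number    5845 non-null  int64    # account number
bad_ind           5845 non-null  int64    # whether the customer defaulted
vehicle_year      5844 non-null  float64  # vehicle purchase year
vehicle_make      5546 non-null  object   # vehicle manufacturer
bankruptcy_ind    5628 non-null  object   # prior bankruptcy flag
tot_derog         5632 non-null  float64  # derogatory credit events in the past five years (e.g. a phone number cancelled for unpaid bills)
tot_tr            5632 non-null  float64  # total number of accounts
age_oldest_tr     5629 non-null  float64  # age of the oldest account (months)
tot_open_tr       4426 non-null  float64  # number of open accounts
tot_rev_tr        5207 non-null  float64  # number of open revolving accounts (e.g. credit cards)
tot_rev_debt      5367 non-null  float64  # balance on open revolving accounts (e.g. credit card debt)
tot_rev_line      5367 non-null  float64  # revolving credit limit (authorised credit card limit)
rev_util          5845 non-null  int64    # revolving utilisation (balance / limit)
fico_score        5531 non-null  float64  # FICO score
purch_price       5845 non-null  float64  # vehicle purchase price (yuan)
msrp              5844 non-null  float64  # suggested retail price
down_pyt          5845 non-null  float64  # down payment on the instalment purchase
loan_term         5845 non-null  int64    # loan term (months)
loan_amt          5845 non-null  float64  # loan amount
ltv               5844 non-null  float64  # loan amount / suggested retail price * 100
tot_income        5840 non-null  float64  # average monthly income (yuan)
veh_mileage       5844 non-null  float64  # mileage (miles)
used_ind          5845 non-null  int64    # whether the vehicle is used
weight            5845 non-null  float64  # sample weight
dtypes: float64(17), int64(6), object(2)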
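Next, take a quick look at the rejected customers. They have 22 features: compared with accepts, the three columns bad_ind (default flag), account_number and weight (sample weight) are missing. The default flag is already given in the data, so there is no need to define default ourselves.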
rejects.info()
2.3 Data visualisation
# Proportion of missing values per variable; all are below 30%
num_of_null = (accepts.isnull().sum() / accepts.shape[0]).sort_values(ascending=False)
num_of_null.plot(kind='bar')
# Share of good and bad customers in accepts, roughly 4:1
num_1=accepts.bad_ind.sum()
num_0=accepts.bad_ind.count()-num_1
ratio_1=num_1/(num_1+num_0)
ratio_0=1-ratio_1
plt.subplot(121)
sns.barplot(x=['good','bad'],y=[num_0,num_1])
plt.text(0,num_0+50,num_0,ha='center',va='baseline',fontsize=10)
plt.text(1,num_1+50,num_1,ha='center',va='baseline',fontsize=10)
plt.subplot(122)
plt.pie([num_0,num_1],labels=['good','bad'],autopct='%2.1f%%',startangle=90,explode=[0,0.1],textprops={'fontsize':10})
plt.axis('equal')
plt.suptitle('Share of good and bad customers')
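2.4 Reject inference
The full customer population contains two groups (accepted and rejected customers). To avoid selection bias the analysis should cover all customers, so the default label of the rejected customers has to be inferred.
Two methods of reject inference are used here: KNN and proportional assignment.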
2.4.1 KNN reject inference
# KNN works on distances, so a subset of the continuous variables in accepts is used as features to predict the target bad_ind.
# Variables are screened by correlation coefficient, keeping those most correlated with bad_ind.
corr = accepts.select_dtypes(include='number').corr()
# sns.heatmap(corr)
# plt.title('correlation')
corr.bad_ind.abs().sort_values(ascending=False).head(10)

bad_ind          1.000000
weight           1.000000
fico_score       0.328627
tot_rev_line     0.193812
age_oldest_tr    0.178258
tot_derog        0.160237
ltv              0.152730
tot_tr           0.119947
rev_util         0.112972
veh_mileage      0.064927

# Leaving out bad_ind and weight: tot_tr, age_oldest_tr and tot_rev_line are strongly linearly correlated with one another,
# so the variables finally chosen are "tot_derog", "age_oldest_tr", "rev_util", "fico_score" and "ltv".
def KNN_RI(accepts, rejects, n_neighbors=5, weights='distance'):
    # Reject inference with KNN; the KNN parameters are set through the arguments
    features = ["tot_derog", "age_oldest_tr", "rev_util", "fico_score", "ltv"]  # continuous feature variables
    accepts_x = accepts[features].copy()
    accepts_y = accepts['bad_ind']
    rejects_x = rejects[features].copy()
    # Fill missing values with the column mean
    accepts_x = accepts_x.fillna(accepts_x.mean())
    rejects_x = rejects_x.fillna(rejects_x.mean())
    # Standardise before KNN (each data set is scaled separately, as in the original analysis)
    accepts_x_std = StandardScaler().fit_transform(accepts_x)
    rejects_x_std = StandardScaler().fit_transform(rejects_x)
    # weights and p are passed by keyword, as recent scikit-learn versions require
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, p=1)
    knn.fit(accepts_x_std, accepts_y)
    rejects_y = knn.predict(rejects_x_std)  # inferred labels for the rejects
    rejects['bad_ind'] = rejects_y          # add the labels to rejects
    print(rejects.groupby('bad_ind').application_id.count())
    # ratio_accepts is the bad/good ratio among accepts. The rejects are resampled so that their
    # bad/good ratio is 3 to 5 times ratio_accepts; 505 good rejects are sampled here.
    accepts_counts = accepts.groupby('bad_ind').application_id.count()
    rejects_counts = rejects.groupby('bad_ind').application_id.count()
    ratio_accepts = accepts_counts[1] / accepts_counts[0]
    number_3 = rejects_counts[1] / (ratio_accepts * 3)
    number_5 = rejects_counts[1] / (ratio_accepts * 5)
    print('resample_range:', number_3, '~', number_5)
    rejects_0 = rejects[rejects['bad_ind'] == 0].sample(505, random_state=11)
    rejects = pd.concat([rejects_0, rejects[rejects['bad_ind'] == 1]])
    return rejects
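The resampling range printed by KNN_RI is plain arithmetic, and a small worked example may make it easier to follow. The sketch below uses made-up counts (they are illustrative assumptions, not the values of this data set), and the helper name good_rejects_to_keep is hypothetical. If the bad/good ratio among accepts is r and the rejects contain n_bad inferred bad customers, keeping n_bad / (k * r) good rejects makes the bad/good ratio among rejects k times r.

```python
def good_rejects_to_keep(accept_good, accept_bad, reject_bad, multiple):
    """Number of inferred-good rejects to keep so that the bad/good ratio
    among rejects is `multiple` times the ratio among accepts."""
    ratio_accepts = accept_bad / accept_good
    return reject_bad / (ratio_accepts * multiple)

# Hypothetical counts, chosen only to keep the arithmetic simple.
accept_good, accept_bad = 4000, 1000   # bad/good ratio among accepts = 0.25
reject_bad = 300                       # rejects labelled bad by the KNN

upper = good_rejects_to_keep(accept_good, accept_bad, reject_bad, 3)
lower = good_rejects_to_keep(accept_good, accept_bad, reject_bad, 5)
print('keep between', lower, 'and', upper, 'good rejects')
```

Any sample size between the two bounds satisfies the 3x to 5x rule; the fixed value 505 used in KNN_RI was presumably picked from the range printed for the real data.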
2.4.2 Proportional-assignment reject inference
def RF_RI(accepts, rejects, n_estimators=150, min_samples_split=4, min_samples_leaf=2):
    # Infer labels with a model that outputs probabilities; a random forest is used here
    accepts_x = accepts.drop(['bad_ind', 'account_number', 'weight'], axis=1)
    accepts_y = accepts['bad_ind']
    rejects_x = rejects.copy()
    # Encode the categorical variables
    accepts_x['bankruptcy_ind'] = accepts_x.bankruptcy_ind.map({'N': 0, 'Y': 1})
    accepts_x['vehicle_make'] = pd.Categorical(accepts_x.vehicle_make).codes
    rejects_x['bankruptcy_ind'] = rejects_x.bankruptcy_ind.map({'N': 0, 'Y': 1})
    rejects_x['vehicle_make'] = pd.Categorical(rejects_x.vehicle_make).codes
    # Fill missing values with the column median
    accepts_x = accepts_x.fillna(accepts_x.median())
    rejects_x = rejects_x.fillna(rejects_x.median())
    # Predict default probabilities for the rejects
    alg = RandomForestClassifier(n_estimators=n_estimators,
                                 min_samples_split=min_samples_split,
                                 min_samples_leaf=min_samples_leaf,
                                 random_state=1)
    alg.fit(accepts_x, accepts_y)
    predictions = alg.predict_proba(rejects_x)
    rejects_x['bad_ind_proba'] = predictions[:, 1]
    # The 1845 rejects with the highest predicted probability are labelled bad, the rest good
    bad_index = rejects_x.sort_values(by='bad_ind_proba', ascending=False).head(1845).index
    rejects['bad_ind'] = rejects.index.isin(bad_index).astype('int32')
    return rejects
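The core of proportional assignment is a ranking step: sort the rejects by predicted default probability and label a fixed number at the top as bad. A minimal sketch of that step on toy numbers follows; the helper label_top_k and the probabilities are hypothetical and only illustrate the idea, they are not taken from the data above.

```python
def label_top_k(probabilities, k):
    """Return 0/1 labels in which the k highest probabilities are labelled 1 (bad)."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    bad = set(order[:k])
    return [1 if i in bad else 0 for i in range(len(probabilities))]

# Hypothetical predicted default probabilities for six rejected applicants.
proba = [0.10, 0.80, 0.35, 0.60, 0.05, 0.90]
labels = label_top_k(proba, k=3)
print(labels)  # → [0, 1, 0, 1, 0, 1]
```

In RF_RI the same ranking is done with sort_values and head(1845), so the share of rejects labelled bad is fixed in advance rather than decided by a probability threshold.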
2.5 Data preprocessing and feature selection
# Choose KNN reject inference or RF proportional assignment
# rejects_new = KNN_RI(accepts, rejects)
rejects_new = RF_RI(accepts, rejects)
# Merge accepts and rejects, leaving out the ID columns and weight
data = pd.concat([accepts.iloc[:, 2:-1], rejects_new.iloc[:, 1:]], axis=0, ignore_index=True)
data.drop(['vehicle_make'], axis=1, inplace=True)
data.vehicle_year = data.vehicle_year.map(lambda x: 2018 - x)
data.bankruptcy_ind = data.bankruptcy_ind.map({'N': 0, 'Y': 1})  # encode the categorical feature
# Cap outliers at the 1st and 99th percentiles
for i in ['age_oldest_tr', 'down_pyt', 'fico_score', 'loan_amt', 'loan_term', 'ltv', 'msrp',
          'purch_price', 'rev_util', 'tot_derog', 'tot_income', 'tot_open_tr', 'tot_rev_debt',
          'tot_rev_line', 'tot_rev_tr', 'tot_tr', 'used_ind', 'veh_mileage', 'vehicle_year']:
    i_min = data[i].quantile(0.01)
    i_max = data[i].quantile(0.99)
    data[i] = data[i].clip(i_min, i_max)
# Fill missing values with KNN
# data_new = pd.DataFrame(KNN(k=3).fit_transform(data))
# Fill missing values with the column mean
data_new = data.fillna(data.mean())
data_new.columns = data.columns  # data_new is the preprocessed data set

# Features are selected by information value (IV), keeping IV > 0.02
def get_iv_values(data_new, col0):
    iv = {}
    for i in col0:
        try:
            iv[i] = woe.WoE(v_type='c').fit(data_new[i], data_new['bad_ind']).optimize().iv
        except Exception:
            print(i)  # variables whose IV could not be computed
    iv = pd.Series(iv)
    iv = iv[iv.values > 0.02]
    return iv.sort_values(ascending=False)

col0 = data_new.columns.drop('bad_ind')   # all candidate variables
iv = get_iv_values(data_new, col0)        # variables with IV > 0.02
col = iv.index                            # features selected by IV
# Convert the selected features to WoE values, which are the model inputs
woe_c = data_new[col].apply(lambda x: woe.WoE(v_type='c').fit(x, data_new['bad_ind']).optimize().fit_transform(x, data_new['bad_ind']))
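To make the IV > 0.02 filter less of a black box, here is a minimal hand calculation of WoE and IV for an already binned variable. It assumes the common definitions WoE = ln(bad share / good share) per bin and IV = sum of (bad share - good share) * WoE. The function woe_iv and the counts are hypothetical, and PyWoE's own binning and sign convention may differ (flipping the sign of WoE leaves IV unchanged).

```python
import math

def woe_iv(bins):
    """bins: list of (good_count, bad_count) per bin.
    Returns (WoE per bin, total IV) with WoE = ln(bad share / good share)."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good   # share of all good customers in this bin
        bad_share = bad / total_bad      # share of all bad customers in this bin
        woe = math.log(bad_share / good_share)
        woes.append(woe)
        iv += (bad_share - good_share) * woe
    return woes, iv

# Hypothetical variable with two bins: (good, bad) counts per bin.
bins = [(100, 100), (300, 100)]
woes, iv = woe_iv(bins)
print([round(w, 4) for w in woes], round(iv, 4))
```

Every term of the IV sum is non-negative, so IV grows as the good and bad distributions across bins move apart; a variable whose bins all have the same bad rate has IV equal to 0 and would be dropped by the filter.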
# Predict with the WoE-transformed data
# Split the data set
X = woe_c
Y = data_new['bad_ind']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Model training and evaluation
def model_fit(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)                # predicted classes
    y_score = model.predict_proba(x_test)[:, 1]   # predicted probability of the positive class
    print(classification_report(y_test, y_pred))
    fpr, tpr, threshold = roc_curve(y_test, y_score)
    print('auc=%.2f' % auc(fpr, tpr))
    return fpr, tpr, threshold

# Evaluation curves: ROC curve and KS curve
def plot_roc_curve(fpr, tpr):
    AUC = auc(fpr, tpr)
    plt.figure(figsize=(7, 7))
    plt.plot(fpr, tpr, label='AUC=%.2f' % AUC)
    plt.plot([0, 1], [0, 1], '--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate', fontdict={'fontsize': 12}, labelpad=10)
    plt.ylabel('True Positive Rate', fontdict={'fontsize': 12}, labelpad=10)
    plt.title('ROC curve', fontdict={'fontsize': 20})
    plt.legend(loc=0, fontsize=11)

def plot_ks_curve(tpr, fpr, thresholds):
    # roc_curve prepends an artificial first threshold above every score; skip it
    tpr, fpr, thresholds = tpr[1:], fpr[1:], thresholds[1:]
    diff = tpr - fpr
    best = diff.argmax()
    KS = diff[best]                 # KS value
    thr_best = thresholds[best]     # probability threshold at which KS is reached
    plt.figure(figsize=(7, 7))
    plt.plot(thresholds, tpr)
    plt.plot(thresholds, fpr)
    plt.plot([thr_best, thr_best], [fpr[best], tpr[best]], '--')
    plt.xlim([thresholds.min(), thresholds.max()])
    plt.ylim([0, 1])
    plt.text(0.6, 0.75, s='KS=%.2f' % KS, fontdict={'fontsize': 12})
    plt.text(0.6, 0.7, s='thr_best=%.3f' % thr_best, fontdict={'fontsize': 12})
    plt.xlabel('Threshold', fontdict={'fontsize': 12}, labelpad=10)
    plt.ylabel('Rate', fontdict={'fontsize': 12}, labelpad=10)
    plt.title('K-S curve', fontdict={'fontsize': 20})
    plt.legend(['True Positive Rate', 'False Positive Rate'], loc=3)

# Train logistic regression models
# LR model (the L1 penalty needs a solver that supports it, such as liblinear)
lr = LogisticRegression(C=1, penalty='l1', solver='liblinear')
# LRCV model
clf = LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1, 10, 100, 1000],  # candidate regularisation strengths
                           cv=5,                       # cross-validation folds
                           class_weight='balanced',    # adjust class weights automatically
                           penalty='l2',               # L2 regularisation
                           random_state=0,             # fixed random seed
                           )                           # all other parameters keep their defaults
# Print the results of each model
model_fit(lr, x_train, y_train, x_test, y_test)
fpr, tpr, threshold = model_fit(clf, x_train, y_train, x_test, y_test)
# The LRCV model gives the better result

              precision    recall  f1-score   support
           0       0.91      0.62      0.74      1544
           1       0.40      0.81      0.54       488
auc=0.78

              precision    recall  f1-score   support
         0.0       0.89      0.72      0.80      2114
         1.0       0.55      0.79      0.65       910
auc=0.83

# Plot the ROC curve and the KS curve
plot_roc_curve(fpr, tpr)
plot_ks_curve(tpr, fpr, threshold)
# The AUC is 0.83, so the model ranks customers well; the KS value is 0.53, so good and bad customers are separated well.
# The corresponding best threshold is 0.315. Treating bad customers as defaulters, the best decision rule is:
# estimated default probability >= 0.315 is classified as default, estimated default probability < 0.315 as normal.
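The KS value and its threshold can also be computed directly from labels and scores, which shows where the decision rule "probability >= threshold means default" comes from. The sketch below is a plain-Python illustration on made-up data; ks_statistic is a hypothetical helper, not part of scikit-learn, and the labels and scores are invented for the example.

```python
def ks_statistic(y_true, y_score):
    """Return (KS, threshold), where KS = max(TPR - FPR) over all score thresholds
    and a sample is predicted positive when its score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_ks, best_thr = 0.0, None
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        ks = tp / n_pos - fp / n_neg
        if ks > best_ks:
            best_ks, best_thr = ks, thr
    return best_ks, best_thr

# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
ks, thr = ks_statistic(y_true, y_score)
print('KS=%.2f at threshold %.2f' % (ks, thr))
```

With these toy values the gap between the true positive rate and the false positive rate is widest at a threshold of 0.4, which plays the same role as the 0.315 threshold reported for the LRCV model.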
# Build the bin-level score table for each variable
def get_scorecard_name(x, data, label='bad_ind'):
    # x: WoE values of the feature variables; data: training data; label: name of the target column
    frames = []
    for i in x.columns:
        temp = woe.WoE(v_type='c').fit(data[i], data[label]).optimize().bins
        temp['name'] = [i] * len(temp)
        frames.append(temp)
    return pd.concat(frames, axis=0)

# Bin-level results for the feature variables
scorecard = get_scorecard_name(X, data_new, label='bad_ind')
# Create the scorecard
# Assumption: odds of 1/20 correspond to 600 points, and doubling the odds lowers the score by 20 points
# 600 = A - B*log(1/20)
# 600 - 20 = A - B*log(2*1/20)
# Solving gives A and B, from which the total Score follows
# b = 20/(np.log(2))          # 28.8539
# a = 600 + b*np.log(1/20)    # 513.561
model = clf
base_score = int(np.ceil(28.8539 * model.intercept_[0] + 513.561))  # base score
print('base score is {}'.format(base_score))
scorecard['score'] = scorecard['woe'].map(lambda x: -int(np.ceil(28.8539 * x)))  # score of each bin from its WoE
df_scorecard = pd.DataFrame(['score_base', '-', '%.f' % base_score], index=['name', 'bins', 'score']).T
score_card = pd.concat([df_scorecard, scorecard[['name', 'bins', 'score']]], ignore_index=True)  # full scorecard
# Bin-level scores
print(score_card)

base score is 514

# Compute the score of every sample in the data table
# Score conversion functions, one per variable
def fico_score0(x):
    if x < 602.5: return 42
    elif x < 638.5: return 29
    elif x < 653.5: return 18
    elif x < 689.749: return 5
    elif x < 699.5: return -8
    elif x < 721.5: return -21
    elif x < 761.5: return -42
    else: return -69

def tot_rev_line0(x):
    return 9 if x < 17414 else -29

def tot_derog0(x):
    return -18 if x < 0.5 else 12

def age_oldest_tr0(x):
    return 9 if x < 94.5 else -19

def rev_util0(x):
    return -7 if x < 73.5 else 19

def tot_tr0(x):
    return 7 if x < 16.9274 else -12

def ltv0(x):
    return -19 if x < 89.5 else 4

def tot_income0(x):
    return 4 if x < 4493.33 else -10

def loan_term0(x):
    return -18 if x < 40.5 else 1

def bankruptcy_ind0(x):
    return -3 if x < 0.0411602 else 13

def down_pyt0(x):
    return 0 if x < 5924.92 else -28

def veh_mileage0(x):
    return -3 if x < 33783.5 else 7

def tot_rev_debt0(x):
    return 1 if x < 10118 else -12

def msrp0(x):
    return 5 if x < 15560.5 else -4

func = [fico_score0, tot_rev_line0, tot_derog0, age_oldest_tr0, rev_util0, tot_tr0, ltv0,
        tot_income0, loan_term0, bankruptcy_ind0, down_pyt0, veh_mileage0, tot_rev_debt0, msrp0]
# Convert feature values to scores
X_score_dict = {i: j for i, j in zip(X.columns, func)}
X_score = data_new[X.columns].copy()  # score the imputed data that the bins were fitted on
X_score_new = pd.DataFrame(columns=X_score.columns)
for i in X_score.columns:
    get_func = X_score_dict.get(i)  # conversion function for this variable
    X_score_new[i] = X_score[i].apply(get_func)
# Add the base score to the variable scores to get the final score
X_score_new['SCORE'] = X_score_new[X.columns].sum(axis=1) + 514
X_score_label = pd.concat([X_score_new, data['bad_ind']], axis=1)
X_score_label.head()

   fico_score  tot_rev_line  tot_derog  age_oldest_tr  rev_util  tot_tr  ltv
0          18             9         12              9        19       7    4
1          18           -29        -18            -19        -7     -12    4
2          29           -29         12              9        -7       7    4
3          29             9         12              9        -7       7    4
4         -69             9        -18            -19        -7       7    4

   tot_income  loan_term  bankruptcy_ind  down_pyt  veh_mileage  tot_rev_debt
0         -10        -18              -3         0           -3             1
1         -10          1              -3         0           -3           -12
2           4          1              -3         0           -3           -12
3           4          1              -3         0           -3             1
4           4          1              -3         0           -3             1

   msrp  SCORE
0    -4    555
1    -4    420
2     5    531
3     5    582
4    -4    417

# Compare the score distributions of overdue and non-overdue customers. The distribution of the positive
# samples is not normal and is less clean than that of the negative samples, so the scorecard still needs improvement.
fig, ax = plt.subplots()
ax1 = sns.kdeplot(X_score_label[X_score_label['bad_ind'] == 1]['SCORE'], label='1')
ax2 = sns.kdeplot(X_score_label[X_score_label['bad_ind'] == 0]['SCORE'], label='0')
plt.legend()
plt.show()
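The two scaling constants used for the scorecard follow from the stated assumptions (600 points at odds of 1/20, and 20 points fewer when the odds double). The short sketch below solves the two equations score = A - B*ln(odds) for A and B; the function name scaling_constants is hypothetical and introduced only for this illustration.

```python
import math

def scaling_constants(base_score, base_odds, points_to_double):
    """Solve score = A - B*ln(odds) for A and B, given one anchor point
    (base_score at base_odds) and the points lost when the odds double."""
    b = points_to_double / math.log(2)
    a = base_score + b * math.log(base_odds)
    return a, b

a, b = scaling_constants(base_score=600, base_odds=1 / 20, points_to_double=20)
print(round(a, 3), round(b, 4))

# Check the two anchor conditions from the assumptions.
print(a - b * math.log(1 / 20))   # score at odds 1/20
print(a - b * math.log(2 / 20))   # score at doubled odds
```

Subtracting the second equation from the first gives B*ln(2) = 20, which is where b = 20/ln(2) comes from; substituting b back into the first equation gives a.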