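**Abstract:** The competition is set in the context of personal credit in financial risk control. Participants must predict, from a loan applicant's data, whether the applicant is likely to default, and use that to decide whether to approve the loan. This is a typical classification problem. The task is to predict whether a user will default on a loan. The dataset becomes visible and downloadable after registration. It comes from the loan records of a credit platform, contains more than 1.2 million rows in total, and has 47 columns of variables, 15 of which are anonymous. To keep the competition fair, 800,000 rows are drawn as the training set, 200,000 as test set A, and 200,000 as test set B, and fields such as employmentTitle, purpose, postCode, and title are desensitized. Competition link: Tianchi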
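Partial field table; for full details, see the competition's task and data page.
Evaluation metric:
The competition uses AUC to evaluate model performance (higher is better). Let's look at what AUC is.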
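Confusion Matrix
ROC space defines the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis.
TPR: among all samples that are actually positive, the fraction correctly predicted as positive.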
$$TPR = \frac{TP}{TP + FN}$$
FPR: among all samples that are actually negative, the fraction wrongly predicted as positive.
$$FPR = \frac{FP}{FP + TN}$$
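This is a figure from the Baidu Baike entry on AUC.
AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes, so its value cannot exceed 1. Because the ROC curve generally lies above the line y = x, AUC usually falls between 0.5 and 1. The closer AUC is to 1.0, the more reliable the detection method. An AUC of 0.5 means the lowest reliability (no better than random guessing) and no practical value.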
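To make the "area under the ROC curve" definition concrete, here is a minimal sketch that builds the ROC points by hand and integrates them with the trapezoid rule. The helper names `roc_points` and `auc_trapezoid` are made up for illustration, and the sketch assumes binary 0/1 labels with no tied scores.

```python
def roc_points(y_true, y_score):
    """Return the (FPR, TPR) points obtained by lowering the threshold
    one sample at a time. Assumes 0/1 labels and no tied scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Visit samples from the highest score to the lowest
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_trapezoid(roc_points(y_true, y_scores))
print(auc_value)  # 0.75
```

On real data with tied scores, a library routine is the safer choice; this sketch only illustrates the geometry.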
Result submission:
Below, a few functions from the sklearn library give a quick demonstration of the confusion matrix, ROC, and AUC mentioned above.
## Confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 0]
print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
# Output:
# Confusion matrix:
#  [[1 1]
#  [1 1]]
Here [0, 0] is TN, [0, 1] is FP, [1, 0] is FN, and [1, 1] is TP.
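As a small sanity check on these four cells, the sketch below counts them by hand for the same labels and plugs them into the TPR and FPR formulas. The helper `confusion_counts` is a hypothetical name, not an sklearn function, and it assumes labels are 0 or 1.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate
print(tn, fp, fn, tp)  # 1 1 1 1
print(tpr, fpr)        # 0.5 0.5
```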
## ROC curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
y_pred = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
FPR, TPR, thresholds = roc_curve(y_true, y_pred)
plt.title('ROC')
plt.plot(FPR, TPR, 'b')
# Diagonal reference line (random guessing)
plt.plot([0, 1], [0, 1], 'r--')
plt.ylabel('TPR')
plt.xlabel('FPR')
plt.show()
## AUC
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print('AUC score:', roc_auc_score(y_true, y_scores))
# AUC score: 0.75
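The AUC value is the area between the blue ROC curve above and the horizontal axis.
First, import the libraries used for data analysis and visualization.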
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')
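Read the data in train.csv and testA.csv, then check its size, features, missing values, and so on.
train = pd.read_csv('data/train.csv')
testA = pd.read_csv('data/testA.csv')
# info() prints its summary directly and returns None
train.info()
testA.info()
The training set train has 800,000 rows and testA has 200,000 rows.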
train.describe()
There are many features, so only part of the output can be shown.
missing = train.isnull().sum()
missing = missing[missing > 0]
# Plot the missing rate of each column that has missing values
missing_rate = missing/len(train)
missing_rate.plot.bar()
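As the plot shows, train has quite a few columns with missing values, but the missing rates are not high. For tree models such as xgb and lgb, missing values can be left as they are because the model handles them itself. With other models, they still need to be processed.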
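For models that cannot accept missing values, one common option is to fill numeric columns with their median. The sketch below shows this on a toy frame. The column `amount` is a hypothetical stand-in, and median imputation is only one possible choice, not something the competition prescribes.

```python
import numpy as np
import pandas as pd

# Toy data; "amount" is a made-up column used only for illustration
toy = pd.DataFrame({
    "amount": [1000.0, np.nan, 3000.0],
    "employmentLength": [2.0, 10.0, np.nan],
})

# Select the numeric columns and fill their gaps with the column median
num_cols = toy.select_dtypes(exclude=["object"]).columns
filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())

print(filled)
```

When a separate test set exists, the medians are normally computed on the training data and reused for the test data.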
one_value_fea = [col for col in train.columns if train[col].nunique() <= 1]
# 'policyCode'
one_value_fea_test = [col for col in testA.columns if testA[col].nunique() <= 1]
# 'policyCode'
We find that policyCode has only a single value in both the training set and the test set, so it can simply be dropped. It carries no information for the model.
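Dropping such single-valued columns could look like the following sketch on a toy frame. The toy values are invented, and only the column names `policyCode` and `grade` come from the dataset.

```python
import pandas as pd

# Toy data: policyCode has one distinct value, grade has two
toy = pd.DataFrame({
    "policyCode": [1, 1, 1],
    "grade": ["A", "B", "A"],
})

# Same test as above: a column with at most one distinct value is constant
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)      # ['policyCode']
print(list(toy.columns))  # ['grade']
```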
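Check which features are numeric and which are of object type:
Numeric features can usually be fed into a model directly, whereas object-type features need to be processed first.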
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(train.select_dtypes(include=['object']).columns)
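# Visualize the distribution of each numerical feature
f = pd.melt(train, value_vars=numerical_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
# Note: sns.distplot is deprecated in recent seaborn releases
g = g.map(sns.distplot, "value")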
First, import the libraries used in this step and read in the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')

train = pd.read_csv('data/train.csv')
testA = pd.read_csv('data/testA.csv')
To make the later feature operations easier, the training and test data are concatenated here (they are split apart again later).
data = pd.concat([train, testA], axis=0, ignore_index=True)
data['employmentLength'].value_counts(dropna=False).sort_index()
1 year 65671
10+ years 328525
2 years 90565
3 years 80163
4 years 59818
5 years 62645
6 years 46582
7 years 44230
8 years 45168
9 years 37866
< 1 year 80226
NaN 58541
Name: employmentLength, dtype: int64
The output above shows that employmentLength is of object type, so we can convert the strings to numbers.
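# Assign the result back instead of using inplace=True on a selected column
data['employmentLength'] = data['employmentLength'].replace(to_replace='10+ years', value='10 years')
data['employmentLength'] = data['employmentLength'].replace('< 1 year', '0 years')

def employmentLength_to_int(s):
    # Keep missing values as they are; otherwise take the leading number of years
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)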
10.0 328525
2.0 90565
0.0 80226
3.0 80163
1.0 65671
5.0 62645
4.0 59818
6.0 46582
8.0 45168
7.0 44230
9.0 37866
Name: employmentLength, dtype: int64
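The same conversion can also be written without a per-row Python function. The sketch below is one possible vectorized variant on a toy Series (the sample values mirror the categories listed above). It assumes every non-missing value contains a number of years.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["10+ years", "< 1 year", "2 years", np.nan])

years = (
    raw.replace("< 1 year", "0 years")      # treat "< 1 year" as 0
       .str.extract(r"(\d+)", expand=False)  # keep the leading digits
       .astype(float)                         # float so NaN can be kept
)

print(years.tolist())  # [10.0, 0.0, 2.0, nan]
```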
Now let's see what the earliesCreditLine column looks like.
data['earliesCreditLine']
0 Aug-2001
1 May-2002
2 May-2006
3 May-1999
4 Aug-1977
...
999995 Nov-2005
999996 Oct-2006
999997 Dec-2001
999998 Aug-2005
999999 Aug-2002
Name: earliesCreditLine, Length: 1000000, dtype: object
We can also drop the month and keep only the year, as follows:
data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
The result is now numeric data.
0 2001
1 2002
2 2006
3 1999
4 1977
...
999995 2005
999996 2006
999997 2001
999998 2005
999999 2002
Name: earliesCreditLine, Length: 1000000, dtype: int64
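If the month should be kept as well, an alternative is to parse the strings as dates. This is a sketch under the assumption that every value follows the `Mon-YYYY` pattern shown above and that month abbreviations are parsed in an English locale.

```python
import pandas as pd

raw = pd.Series(["Aug-2001", "May-2002", "Aug-1977"])

# %b is the abbreviated month name, %Y the four-digit year
parsed = pd.to_datetime(raw, format="%b-%Y")
year = parsed.dt.year
month = parsed.dt.month

print(year.tolist())   # [2001, 2002, 1977]
print(month.tolist())  # [8, 5, 8]
```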
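Some categorical features are handled as follows:
# Features with more than 2 categories that are not high-dimensional and sparse: one-hot encode
data = pd.get_dummies(data, columns=['grade', 'subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)
# High-cardinality categorical features need to be transformed
for f in ['employmentTitle', 'postCode', 'title']:
    # Number of rows sharing the same category value
    data[f+'_cnts'] = data.groupby([f])['id'].transform('count')
    # Rank of each row's id within its category (descending)
    data[f+'_rank'] = data.groupby([f])['id'].rank(ascending=False).astype(int)
    del data[f]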
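To see what the `_cnts` and `_rank` columns contain, here is the same transformation on a five-row toy frame. The `postCode` values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "postCode": [10, 10, 20, 10, 30],
})

# How many rows share each postCode
toy["postCode_cnts"] = toy.groupby("postCode")["id"].transform("count")
# Rank of id inside each postCode group, largest id first
toy["postCode_rank"] = toy.groupby("postCode")["id"].rank(ascending=False).astype(int)

print(toy["postCode_cnts"].tolist())  # [3, 3, 1, 3, 1]
print(toy["postCode_rank"].tolist())  # [3, 2, 1, 1, 1]
```

The count column replaces a high-cardinality category with its frequency, which keeps the feature numeric without creating thousands of one-hot columns.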
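We find that the columns n2.2 and n2.3 are absent from the training set but present in the test set, so we can choose to drop them (they could also be kept: a tree model handles the missing values on its own).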
del data['n2.2']
del data['n2.3']
# The id column is not useful as a feature either, so drop it as well
del data['id']
Now split the data back into the training set and the test set.
features = [f for f in data.columns if f not in ['id','issueDate','isDefault']]
train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train['isDefault']
# Note: the training calls below follow older LightGBM/XGBoost APIs
# (verbose_eval, early_stopping_rounds, ntree_limit); newer releases may need adjustments.
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 10
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    # Out-of-fold predictions for the training set, averaged predictions for the test set
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs': 24,
                'silent': True,
                'verbose': -1,
            }

            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)

            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }

            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]

            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}

            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)

            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        # Accumulate so the result is the average over all folds, not just the last fold
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))

        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
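lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)
# Blend the test predictions of the two models with equal weights
rh_test = 0.5*lgb_test+0.5*cat_test
testA['isDefault'] = rh_test
testA[['id','isDefault']].to_csv('test_sub.csv', index=False)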
Because the xgboost model is much slower than the other two, I did not train with it.
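The cross-validation pattern used in cv_model (out-of-fold predictions for the training rows, fold-averaged predictions for the test rows) can be seen in isolation in the sketch below. It swaps in scikit-learn's `LogisticRegression` and synthetic data from `make_classification` purely to keep the example small and fast. Neither is part of the baseline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the competition data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

kf = KFold(n_splits=5, shuffle=True, random_state=2020)
oof = np.zeros(len(X_train))       # one out-of-fold prediction per training row
test_pred = np.zeros(len(X_test))  # average of the per-fold test predictions

for trn_idx, val_idx in kf.split(X_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[trn_idx], y_train[trn_idx])
    oof[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
    # "+=" accumulates across folds; dividing by n_splits turns the sum into a mean
    test_pred += model.predict_proba(X_test)[:, 1] / kf.n_splits

oof_auc = roc_auc_score(y_train, oof)
print("OOF AUC:", oof_auc)
```

Because every training row is predicted by a model that never saw it, the out-of-fold AUC gives a less optimistic estimate than scoring on the rows used for fitting.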
baseline.py
# Import the required libraries
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

# Load the data
train = pd.read_csv('data/train.csv')
testA = pd.read_csv('data/testA.csv')

data = pd.concat([train, testA], axis=0, ignore_index=True)

# Convert employmentLength from strings to numbers
data['employmentLength'] = data['employmentLength'].replace(to_replace='10+ years', value='10 years')
data['employmentLength'] = data['employmentLength'].replace('< 1 year', '0 years')

def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)

# Keep only the year of earliesCreditLine
data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))

# One-hot encode the low-cardinality categorical features
data = pd.get_dummies(data, columns=['grade', 'subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)

# Count and rank encode the high-cardinality categorical features
for f in ['employmentTitle', 'postCode', 'title']:
    data[f+'_cnts'] = data.groupby([f])['id'].transform('count')
    data[f+'_rank'] = data.groupby([f])['id'].rank(ascending=False).astype(int)
    del data[f]

del data['n2.2']
del data['n2.3']
del data['id']

features = [f for f in data.columns if f not in ['id','issueDate','isDefault']]

train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)

x_train = train[features]
x_test = test[features]

y_train = train['isDefault']

# Note: the training calls below follow older LightGBM/XGBoost APIs
# (verbose_eval, early_stopping_rounds, ntree_limit); newer releases may need adjustments.
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 10
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs': 24,
                'silent': True,
                'verbose': -1,
            }

            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)

            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }

            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]

            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}

            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)

            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        # Accumulate so the result is the average over all folds, not just the last fold
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))

        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test

lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)

# Blend the test predictions of the two models with equal weights
rh_test = 0.5*lgb_test+0.5*cat_test
testA['isDefault'] = rh_test
testA[['id','isDefault']].to_csv('test_sub.csv', index=False)