
[Tutorial Competition] Financial Data Analysis Problem 1: Predicting Bank Customer Product Subscription (0.9676)

Predicting bank customer product subscription

This post documents my solution to the Tianchi tutorial competition on predicting whether a bank customer will subscribe to a product. The competition page:

【教学赛】金融数据分析赛题1:银行客户认购产品预测_学习赛_天池大赛-阿里云天池

1. Reading the data

    import pandas as pd

    # Load the data
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
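As a quick check (my addition, assuming the filenames above; the row counts follow from the outputs later in the post), the shapes should show 22,500 labelled training rows and 7,500 unlabelled test rows:

    # Quick look at what was loaded (not in the original post)
    print(train.shape, test.shape)  # expect 22500 and 7500 rows respectively
    train.head()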

2. Data processing

2.1 Merging the data

    # Concatenate the training and test sets so the features can be processed together
    df = pd.concat([train, test], axis=0)  # stack the two sets along the row axis
    df

Output:

          id  age  job           marital   education            default  housing  loan  contact    month  ...  campaign  pdays  previous  poutcome     emp_var_rate  cons_price_index  cons_conf_index  lending_rate3m  nr_employed  subscribe
    0      1   51  admin.        divorced  professional.course  no       yes      yes   cellular   aug    ...  1         112    2         failure       1.4          90.81             -35.53           0.69            5219.74      no
    1      2   50  services      married   high.school          unknown  yes      no    cellular   may    ...  1         412    2         nonexistent  -1.8          96.33             -40.58           4.05            4974.79      yes
    2      3   48  blue-collar   divorced  basic.9y             no       no       no    cellular   apr    ...  0         1027   1         failure      -1.8          96.33             -44.74           1.50            5022.61      no
    3      4   26  entrepreneur  single    high.school          yes      yes      yes   cellular   aug    ...  26        998    0         nonexistent   1.4          97.08             -35.55           5.11            5222.87      yes
    4      5   45  admin.        single    university.degree    no       no       no    cellular   nov    ...  1         240    4         success      -3.4          89.82             -33.83           1.17            4884.70      no
    ...   ...  ..  ...           ...       ...                  ...      ...      ...   ...        ...    ...  ...       ...    ...       ...           ...           ...               ...              ...             ...          ...
    7495  29996  49  admin.       unknown   university.degree    unknown  yes      yes   telephone  apr    ...  50        302    1         failure      -1.8          95.77             -40.50           3.86            5058.64      NaN
    7496  29997  34  blue-collar  married   basic.4y             no       no       no    cellular   jul    ...  8         440    3         failure       1.4          90.59             -47.29           1.77            5156.70      NaN
    7497  29998  50  retired      single    basic.4y             no       yes      no    cellular   jun    ...  3         997    0         nonexistent  -2.9          97.42             -39.69           1.29            5116.80      NaN
    7498  29999  31  technician   married   professional.course  no       no       no    cellular   aug    ...  3         1028   0         nonexistent   1.4          96.90             -37.68           5.18            5144.45      NaN
    7499  30000  46  admin.       divorced  university.degree    no       yes      no    cellular   aug    ...  2         387    3         success       1.4          97.49             -31.54           3.79            5082.25      NaN

    30000 rows × 22 columns

The data contains both numeric and text columns, so the text features need to be converted to numbers.

2.2 Converting non-numeric features to numbers

    # First select all features of dtype object (i.e. non-numeric)
    cat_columns = df.select_dtypes(include='object').columns  # the non-numeric columns to process
    df[cat_columns]

    # Encode the non-numeric features
    from sklearn.preprocessing import LabelEncoder

    job_le = LabelEncoder()
    df['job'] = job_le.fit_transform(df['job'])
    df['marital'] = df['marital'].map({'unknown': 0, 'single': 1, 'married': 2, 'divorced': 3})
    df['education'] = df['education'].map({'unknown': 0, 'basic.4y': 1, 'basic.6y': 2, 'basic.9y': 3,
                                           'high.school': 4, 'university.degree': 5,
                                           'professional.course': 6, 'illiterate': 7})
    df['housing'] = df['housing'].map({'unknown': 0, 'no': 1, 'yes': 2})
    df['loan'] = df['loan'].map({'unknown': 0, 'no': 1, 'yes': 2})
    df['contact'] = df['contact'].map({'cellular': 0, 'telephone': 1})
    df['day_of_week'] = df['day_of_week'].map({'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4})
    df['poutcome'] = df['poutcome'].map({'nonexistent': 0, 'failure': 1, 'success': 2})
    df['default'] = df['default'].map({'unknown': 0, 'no': 1, 'yes': 2})
    df['month'] = df['month'].map({'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
                                   'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12})
    df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
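A small sanity check (my addition, not in the original post) is worthwhile here, since `Series.map` silently produces NaN for any category missing from the dictionary:

    # Verify that no object columns remain and that no feature gained NaNs
    # from an unmapped category ('subscribe' is legitimately NaN on test rows).
    feature_cats = [c for c in cat_columns if c != 'subscribe']
    print(df.select_dtypes(include='object').columns)  # expect an empty Index
    print(df[feature_cats].isnull().sum())             # expect all zeros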

2.3 Splitting the data back apart

    # Re-split df into training and test sets, judged by whether 'subscribe' is null
    train = df[df['subscribe'].notnull()]
    test = df[df['subscribe'].isnull()]

    # Check the label balance in the training set: the classes are imbalanced,
    # with roughly 6.6 zeros for every one
    train['subscribe'].value_counts()

Output:

    0.0    19548
    1.0     2952
    Name: subscribe, dtype: int64
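The imbalance ratio quoted above can be computed directly (my addition):

    # Negative-to-positive ratio in the training labels
    counts = train['subscribe'].value_counts()
    print(counts[0.0] / counts[1.0])  # ~6.62, i.e. roughly 6.6 : 1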

2.4 Exploring the data

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline

    # Box-plot every numeric feature to look for outliers
    num_features = [x for x in train.columns if x not in cat_columns and x != 'id']
    fig = plt.figure(figsize=(80, 60))
    for i in range(len(num_features)):
        plt.subplot(7, 2, i + 1)
        sns.boxplot(train[num_features[i]])
        plt.ylabel(num_features[i], fontsize=36)
    plt.show()

The boxplots show outliers, which are handled next.

2.5 Handling outliers

    # Clip each numeric feature to within 10 IQRs of its quartiles
    for colum in num_features:
        temp = train[colum]
        q1 = temp.quantile(0.25)
        q2 = temp.quantile(0.75)
        delta = (q2 - q1) * 10
        train[colum] = np.clip(temp, q1 - delta, q2 + delta)

2.6 Other processing

Data balancing and feature selection were also tried, but both made the final classification worse, so they are skipped here. The original code is included below for reference.

    '''# Oversampling with SMOTE/ADASYN: training metrics improved, but the final
    # classification score actually dropped, so oversampling is not used here.
    from imblearn.over_sampling import SMOTE
    from imblearn.over_sampling import ADASYN
    #smo = SMOTE(random_state=0, k_neighbors=10)
    adasyn = ADASYN()
    X_smo, y_smo = adasyn.fit_resample(train.iloc[:, :-1], train.iloc[:, -1])
    train_smo = pd.concat([X_smo, y_smo], axis=1)
    train_smo['subscribe'].value_counts()'''

    '''# Feature selection with SelectFromModel, using a tree model
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel

    # Split out the training features and labels
    train_X = train.iloc[:, :-1]
    train_y = train.iloc[:, -1]

    # clf_etc is the base model; FeaSel is the feature-selection model
    clf_etc = ExtraTreesClassifier(n_estimators=50)
    clf_etc = clf_etc.fit(train_X, train_y)
    FeaSel = SelectFromModel(clf_etc, prefit=True)
    train_sel = FeaSel.transform(train_X)
    test_sel = FeaSel.transform(test.iloc[:, :-1])

    # Recover the selected feature names and write them back onto the data
    train_new = pd.DataFrame(train_sel)
    feature_idx = FeaSel.get_support()                # mask of the selected columns
    train_new.columns = train_X.columns[feature_idx]  # restore the column names
    train_new = pd.concat([train_new, train_y], axis=1)
    test_new = pd.DataFrame(test_sel)
    test_new.columns = train_X.columns[feature_idx]'''

Note: there may be variable-naming clashes in this section (e.g. train_new/test_new are reused below).

2.7 Saving the data

    # Write the processed data out as train_new.csv and test_new.csv
    train_new = train
    test_new = test
    train_new.to_csv('train_new.csv', index=False)
    test_new.to_csv('test_new.csv', index=False)

3. Model training

3.1 Importing packages and data

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from xgboost import XGBRFClassifier
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score
    import time

    # Instantiate the candidate models
    clf_lr = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
    clf_dt = DecisionTreeClassifier()
    clf_rf = RandomForestClassifier()
    clf_gb = GradientBoostingClassifier()
    clf_adab = AdaBoostClassifier()
    clf_xgbrf = XGBRFClassifier()
    clf_lgb = LGBMClassifier()

    from sklearn.model_selection import train_test_split

    # Reload the processed data and separate features from the target
    train_new = pd.read_csv('train_new.csv')
    test_new = pd.read_csv('test_new.csv')
    feature_columns = [col for col in train_new.columns if col not in ['subscribe']]
    train_data = train_new[feature_columns]
    target_data = train_new['subscribe']
3.2 Hyperparameter tuning

    from lightgbm import LGBMClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        train_data, target_data, test_size=0.2, shuffle=True, random_state=2023)
    #X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, shuffle=True, random_state=2023)

    n_estimators = [300]
    learning_rate = [0.02]        # 0.02 was best among the values tried
    subsample = [0.6]
    colsample_bytree = [0.7]      # 0.6 was best in [0.5, 0.6, 0.7]
    max_depth = [9, 11, 13]       # 11 was best in [7, 9, 11, 13]
    is_unbalance = [False]
    early_stopping_rounds = [300]
    num_boost_round = [5000]
    metric = ['binary_logloss']
    feature_fraction = [0.6, 0.75, 0.9]
    bagging_fraction = [0.6, 0.75, 0.9]
    bagging_freq = [2, 4, 5, 8]   # defined but not included in the grid below
    lambda_l1 = [0, 0.1, 0.4, 0.5]
    lambda_l2 = [0, 10, 15, 35]
    cat_smooth = [1, 10, 15, 20]

    param = {'n_estimators': n_estimators,
             'learning_rate': learning_rate,
             'subsample': subsample,
             'colsample_bytree': colsample_bytree,
             'max_depth': max_depth,
             'is_unbalance': is_unbalance,
             'early_stopping_rounds': early_stopping_rounds,
             'num_boost_round': num_boost_round,
             'metric': metric,
             'feature_fraction': feature_fraction,
             'bagging_fraction': bagging_fraction,
             'lambda_l1': lambda_l1,
             'lambda_l2': lambda_l2,
             'cat_smooth': cat_smooth}

    model = LGBMClassifier()
    clf = GridSearchCV(model, param, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
    clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])
    print(clf.best_params_, clf.best_score_)

Parameter lists containing a single value were already fixed at the optimum found by earlier GridSearchCV runs; the run shown here searches over the last six parameters. Tuning everything at once would take far too long, so the search was done in stages.

Output:

    Early stopping, best iteration is:
    [287]  training's binary_logloss: 0.22302  valid_1's binary_logloss: 0.253303
    {'bagging_fraction': 0.6, 'cat_smooth': 1, 'colsample_bytree': 0.7, 'early_stopping_rounds': 300, 'feature_fraction': 0.75, 'is_unbalance': False, 'lambda_l1': 0.4, 'lambda_l2': 10, 'learning_rate': 0.02, 'max_depth': 11, 'metric': 'binary_logloss', 'n_estimators': 300, 'num_boost_round': 5000, 'subsample': 0.6} 0.8853333333333334
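The staged search described above could be organised roughly like this (a sketch under my own assumptions about the stage grouping; early stopping is omitted here for brevity, and the original tuned its groups manually):

    # Tune parameter groups one stage at a time, carrying best values forward
    best_params = {'n_estimators': 300, 'learning_rate': 0.02, 'subsample': 0.6,
                   'colsample_bytree': 0.7, 'metric': 'binary_logloss'}
    stages = [
        {'max_depth': [7, 9, 11, 13]},
        {'feature_fraction': [0.6, 0.75, 0.9], 'bagging_fraction': [0.6, 0.75, 0.9]},
        {'lambda_l1': [0, 0.1, 0.4, 0.5], 'lambda_l2': [0, 10, 15, 35]},
        {'cat_smooth': [1, 10, 15, 20]},
    ]
    for grid in stages:
        search = GridSearchCV(LGBMClassifier(**best_params), grid, cv=3,
                              scoring='accuracy', n_jobs=-1)
        search.fit(X_train, y_train)
        best_params.update(search.best_params_)  # freeze this stage's winners
        print(best_params, search.best_score_)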

3.3 Prediction results

    # Evaluate on the hold-out split
    y_true, y_pred = y_test, clf.predict(X_test)
    accuracy = accuracy_score(y_true, y_pred)
    print(classification_report(y_true, y_pred))
    print('Accuracy', accuracy)

Output:

                  precision    recall  f1-score   support

             0.0       0.91      0.97      0.94      3933
             1.0       0.60      0.32      0.42       567

        accuracy                           0.89      4500
       macro avg       0.75      0.64      0.68      4500
    weighted avg       0.87      0.89      0.87      4500

    Accuracy 0.8875555555555555

Plotting the confusion matrix:

    from sklearn import metrics

    confusion_matrix_result = metrics.confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
    plt.xlabel('predict')
    plt.ylabel('true')
    plt.show()

4. Generating the submission

    # Predict on the test set and write the submission file
    test_x = test[feature_columns]
    pred_test = clf.predict(test_x)
    result = pd.read_csv('./submission.csv')
    subscribe_map = {1: 'yes', 0: 'no'}
    result['subscribe'] = [subscribe_map[x] for x in pred_test]
    result.to_csv('./baseline_lgb1.csv', index=False)
    result['subscribe'].value_counts()

Output:

    no     6987
    yes     513
    Name: subscribe, dtype: int64

5. Submitting the results

6. Summary

My approach only reached a score of 0.9676. I hope you can build on this program to get a better result; if you find a better method, please leave a comment so we can discuss it.

Ideas for improvement:

1. Data processing: when I balanced the data, training metrics looked good but the final score got worse, most likely overfitting; the outlier handling could also be thought through further.

2. Modelling: I compared lr, dt, rf, gb, adab, xgbrf and lgb, and lgb performed best, so it was the one tuned. Combining several methods in an ensemble is worth trying (see the sketch after this list).

3. Further tuning on top of lgb: the least inventive route, but with enough time it should beat my result.
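For idea 2, a soft-voting ensemble of a few of the stronger models is one simple starting point (my sketch, reusing X_train/X_test from section 3.2; the member models and settings are illustrative, not tuned):

    # Combine several of the models compared earlier via soft voting
    from sklearn.ensemble import VotingClassifier
    ensemble = VotingClassifier(
        estimators=[('rf', RandomForestClassifier()),
                    ('gb', GradientBoostingClassifier()),
                    ('lgb', LGBMClassifier())],
        voting='soft')  # average predicted probabilities across members
    ensemble.fit(X_train, y_train)
    print(accuracy_score(y_test, ensemble.predict(X_test)))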
