
[Data Analysis] Bank Product Subscription Prediction Based on XGBoost (Decision Trees) -- 小林月


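Contents

1. Data Exploration

1.1 Reading the data

1.2 Inspecting the data

1.3 Data preprocessing

2. Field Description

2.1 Non-discrete data

2.2 Discrete numerical fields

3. Modeling

4. Evaluation Metrics

4.1 Confusion matrix

4.2 Accuracy, recall, F1

5. Test-Set Accuracy

6. Model Optimization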


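Environment: Python with Jupyter Notebook

Data: the data comes from the 2023 [Teaching Competition] Financial Data Analysis Task 1: Bank Customer Product Subscription Prediction

Competition (data) page: [Teaching Competition] Financial Data Analysis Task 1: Bank Customer Product Subscription Prediction, Tianchi Competition, Alibaba Cloud Tianchi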

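1. Data Exploration

   1.1 Reading the data

Required libraries:

    import pandas as pd
    import numpy as np

    trian = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    # df is used by the exploration and modeling code below but is never defined
    # in the listings; it is assumed here to be a working copy of the training set
    df = trian.copy()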

   1.2 Inspecting the data

Check whether the data looks normal and whether it contains outliers.

Summary statistics:

    print(df.describe().T)
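
The quartiles reported by describe() are enough for a quick outlier screen. The following is a minimal sketch, assuming the common 1.5 x IQR rule (the article itself does not say which rule it uses); the helper name `iqr_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy values standing in for a numeric column such as duration (hypothetical)
toy = pd.Series([10, 12, 11, 13, 12, 11, 500], name="duration")
mask = iqr_outlier_mask(toy)
print(toy[mask])  # rows flagged as outliers
```

On the real data the same helper could be applied to each numeric column in turn, for example `iqr_outlier_mask(df["duration"]).sum()` to count flagged rows.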

Data distribution (scatter plots):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Separate numerical and categorical variables
    Nu_feature = list(df.select_dtypes(exclude=['object']).columns)
    Ca_feature = list(df.select_dtypes(include=['object']).columns)
    Ca_feature.remove('subscribe')  # the target is not a feature

    # Scatter each categorical feature against the row index:
    # training set in red (subplots 1-10), test set in cyan (subplots 11-20)
    plt.figure(figsize=(20, 10))
    for j, col in enumerate(Ca_feature, start=1):
        plt.subplot(4, 5, j)
        plt.scatter(x=range(len(df)), y=df[col], color='red')
        plt.title(col)
    for k, col in enumerate(Ca_feature, start=11):
        plt.subplot(4, 5, k)
        plt.scatter(x=range(len(test)), y=test[col], color='cyan')
        plt.title(col)
    plt.subplots_adjust(wspace=0.4, hspace=0.3)
    plt.show()

Correlation plot (heatmap):

    from sklearn.preprocessing import LabelEncoder

    # Encode each categorical column; fit on train and test together so that
    # both share one category-to-integer mapping
    for m in Ca_feature:
        lb = LabelEncoder()
        lb.fit(pd.concat([df[m], test[m]]))
        df[m] = lb.transform(df[m])
        test[m] = lb.transform(test[m])

    df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
    correlation_matrix = df.corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, vmax=0.9, linewidths=0.05, cmap="RdGy")
    plt.show()
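
A heatmap is easy to scan but hard to rank by eye. As a complement, the correlation of each feature with the target can be listed numerically. This is a small sketch on a toy, already-numeric frame (the values are made up for illustration and are not from the competition data).

```python
import pandas as pd

# Toy, already-numeric frame standing in for the encoded training data
toy = pd.DataFrame({
    "duration":  [100, 200, 300, 400, 500, 600],
    "campaign":  [6, 5, 4, 3, 2, 1],
    "age":       [30, 45, 30, 45, 30, 45],
    "subscribe": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first)
target_corr = (
    toy.corr()["subscribe"]
    .drop("subscribe")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative features near the top, which a plain descending sort would push to the bottom.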

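Check whether the data contains null values or 'unknown' entries:

    # The data has no NA values, but it does contain 'unknown' values.
    # Run this on the raw string columns, before any label encoding.
    print(trian.isin(['unknown']).mean() * 100)  # share of 'unknown' per column, in percent
    print(test.isin(['unknown']).mean() * 100)
    # The 'unknown' values are then filled with the column mode (section 1.3)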

1.3 Data preprocessing

Fill the 'unknown' values in the training and test sets:

    # Replace 'unknown' with each column's most frequent value (mode),
    # computed separately on the training set and on the test set
    fill_cols = ['default', 'job', 'education', 'marital', 'housing', 'loan']
    for col in fill_cols:
        trian[col] = trian[col].replace('unknown', trian[col].mode()[0])
        test[col] = test[col].replace('unknown', test[col].mode()[0])
    print(trian["job"].value_counts())
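
The two steps above (measuring the share of 'unknown' and filling it with the mode) can be tried on a toy frame. The sketch below is a variation rather than the article's exact code: the hypothetical helper `fill_unknown_with_mode` computes the mode among the known values only, so a column dominated by 'unknown' is still filled with a real category.

```python
import pandas as pd

def fill_unknown_with_mode(frame: pd.DataFrame, cols, token: str = "unknown") -> pd.DataFrame:
    """Return a copy where `token` in each column is replaced by the most
    frequent value among that column's non-`token` entries."""
    out = frame.copy()
    for col in cols:
        known = out.loc[out[col] != token, col]
        if known.empty:  # nothing to learn the mode from; leave the column as-is
            continue
        out[col] = out[col].replace(token, known.mode()[0])
    return out

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "job":     ["admin.", "unknown", "admin.", "student"],
    "housing": ["unknown", "unknown", "unknown", "yes"],
})

unknown_pct = toy.isin(["unknown"]).mean() * 100  # percent of 'unknown' per column
filled = fill_unknown_with_mode(toy, ["job", "housing"])
print(unknown_pct)
print(filled)
```

Returning a copy keeps the raw frame available, which is convenient in a notebook where cells may be re-run out of order.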

2. Field Description

        2.1 Non-discrete data

    # Bar chart: subscription rate by job
    trian['subscribe'] = trian['subscribe'].map({'no': 0, 'yes': 1})  # encode the target as 0/1 (run once)
    plt.figure(figsize=[15, 10])  # canvas size
    sns.barplot(x="job", y="subscribe", data=trian)
    # Tick labels come from the data, so they always match the bars
    plt.xticks(fontsize=20, rotation=45)
    plt.yticks(fontsize=15)
    plt.xlabel("Customer job", fontsize=20)
    plt.ylabel("Subscription rate", fontsize=20)
    plt.title("Customer intent to subscribe to the bank product, by job", fontdict={"size": 25})
    plt.tight_layout()
    plt.show()

    # Categorical column names
    object_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
    # Continuous column names
    num_columns = ['age', 'duration', 'campaign', 'pdays', 'previous', 'cons_conf_index', 'emp_var_rate', 'cons_price_index', 'lending_rate3m', 'nr_employed']

    # Bar chart: subscription rate by marital status
    plt.figure(figsize=[10, 10])  # canvas size
    sns.barplot(x="marital", y="subscribe", data=trian)
    # Tick labels come from the data, so they always match the bars
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=15)
    plt.xlabel("Marital status", fontsize=20)
    plt.ylabel("Subscription rate", fontsize=20)
    plt.title("Customer intent to subscribe to the bank product, by marital status", fontdict={"size": 25})
    plt.tight_layout()
    plt.show()

    # Bar chart: subscription rate by education level
    plt.figure(figsize=[10, 8])  # canvas size
    sns.barplot(x="education", y="subscribe", data=trian)
    # Tick labels come from the data, so they always match the bars
    plt.xticks(fontsize=25, rotation=45)
    plt.yticks(fontsize=15)
    plt.xlabel("Education level", fontsize=20)
    plt.ylabel("Subscription rate", fontsize=20)
    plt.title("Customer intent to subscribe to the bank product, by education level", fontdict={"size": 25})
    plt.tight_layout()
    plt.show()
    print(trian["education"].value_counts())

    # Bar chart: subscription rate by month of last contact
    plt.figure(figsize=[8, 8])  # canvas size
    sns.barplot(x="month", y="subscribe", data=trian)
    plt.xticks(fontsize=25)
    plt.yticks(fontsize=15)
    plt.xlabel("Month", fontsize=20)
    plt.ylabel("Subscription rate", fontsize=20)
    plt.title("Customer intent to subscribe to the bank product, by month of last contact", fontdict={"size": 25})
    plt.tight_layout()
    plt.show()

The remaining fields can be plotted in much the same way.
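
Because the remaining categorical fields follow the same pattern, the numbers behind such bar charts can be produced in a loop. The following is a minimal sketch on toy data; the helper name `subscribe_rate_by` is hypothetical, and it assumes the target has already been encoded as 0/1.

```python
import pandas as pd

def subscribe_rate_by(frame: pd.DataFrame, col: str, target: str = "subscribe") -> pd.DataFrame:
    """Mean of the 0/1 target and the row count for each category of `col`."""
    return (
        frame.groupby(col)[target]
        .agg(rate="mean", n="count")
        .sort_values("rate", ascending=False)
    )

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "marital":   ["married", "married", "single", "single", "divorced", "single"],
    "contact":   ["cellular", "telephone", "cellular", "cellular", "telephone", "cellular"],
    "subscribe": [0, 1, 1, 1, 0, 0],
})

for col in ["marital", "contact"]:
    print(subscribe_rate_by(toy, col), end="\n\n")
```

Reporting the count next to the rate makes it easier to spot categories whose rate rests on only a handful of rows.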

Next, the subscription rate is analyzed in combination with the marital status field:

    marital_colum = ["married", "single", "divorced"]
    # Select the rows whose marital value is "married"
    trian1 = trian[trian['marital'] == marital_colum[0]]
    trian1 = trian1.dropna(axis=0, how='any')  # drop rows with missing values
    print(trian1["marital"].value_counts())
    plt.figure(figsize=[10, 10])
    sns.barplot(x="default", y="subscribe", hue="education", data=trian1, palette="muted")
    # Tick labels come from the data, so they always match the bars
    plt.xticks(fontsize=25)
    plt.yticks(fontsize=15)
    plt.xlabel("Has a default record", fontsize=20)
    plt.ylabel("Subscription rate", fontsize=20)
    plt.title("Married", fontdict={"size": 25})
    plt.legend(prop={'size': 18})
    plt.tight_layout()
    plt.show()

    marital_colum = ["married", "single", "divorced"]
    # Select the rows whose marital value is "single"
    trian1 = trian[trian['marital'] == marital_colum[1]]
    trian1 = trian1.dropna(axis=0, how='any')  # drop rows with missing values
    print(trian1["marital"].value_counts())
    plt.figure(figsize=[10, 10])
    sns.barplot(x="default", y="subscribe", hue="education", data=trian1, palette="muted")
    # Tick labels come from the data, so they always match the bars
    plt.xticks(fontsize=25)
    plt.yticks(fontsize=15)
    plt.xlabel("Has a default record", fontsize=20)
    plt.ylabel("Subscription rate", fontsize=20)
    plt.title("Single", fontdict={"size": 25})
    plt.legend(prop={'size': 18})
    plt.tight_layout()
    plt.show()
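
The grouped bar charts above cross three fields (marital status, default record, education). The same cross-tabulation can be read as a table with `pd.pivot_table`. This is a sketch on toy data; the category values are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values); subscribe is already encoded as 0/1
toy = pd.DataFrame({
    "marital":   ["married"] * 4 + ["single"] * 4,
    "default":   ["no", "no", "yes", "no", "no", "no", "no", "yes"],
    "education": ["high.school", "university.degree"] * 4,
    "subscribe": [0, 1, 0, 1, 1, 1, 0, 0],
})

# Rows: (marital, default); columns: education; cells: mean of subscribe
rates = pd.pivot_table(
    toy,
    index=["marital", "default"],
    columns="education",
    values="subscribe",
    aggfunc="mean",
)
print(rates)
```

Cells with no matching rows come out as NaN, which is a useful reminder that some bars in the grouped chart may be based on very few customers or none at all.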

2.2 Discrete numerical fields

    # One histogram per numerical column, colored by the value of subscribe
    for col in num_columns:
        f = pd.melt(trian[[col, 'subscribe']], value_vars=col, id_vars='subscribe')
        g = sns.FacetGrid(f, col='variable', hue='subscribe')
        g.map(sns.histplot, "value", bins=20)  # histplot replaces the deprecated distplot
        g.add_legend()
        plt.show()
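
Histograms show where each class is concentrated; binning a numeric field and computing the subscription rate per bin gives the same picture as numbers. The sketch below uses `pd.cut` on toy values; the bin edges are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values); subscribe is already encoded as 0/1
toy = pd.DataFrame({
    "duration":  [30, 80, 150, 220, 400, 650, 900, 1200],
    "subscribe": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Assumed bin edges: (0, 100], (100, 300], (300, 600], (600, inf)
bins = [0, 100, 300, 600, float("inf")]
toy["duration_bin"] = pd.cut(toy["duration"], bins=bins)

# Mean of the 0/1 target inside each bin
rate_by_bin = toy.groupby("duration_bin", observed=True)["subscribe"].mean()
print(rate_by_bin)
```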

       

3. Modeling

    import time
    from sklearn.model_selection import KFold
    from sklearn.metrics import roc_auc_score
    import xgboost as xgb

    X = df.drop(columns=['id', 'subscribe'])
    Y = df['subscribe']
    testA = test.drop(columns='id')

    model = xgb.XGBClassifier()

    # 10-fold cross-validation
    result1 = []     # test-set probabilities predicted by each fold's model
    mean_score1 = 0  # running mean of the validation AUC
    n_folds = 10
    start = time.time()
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=2022)
    for train_index, test_index in kf.split(X):
        x_train, y_train = X.iloc[train_index], Y.iloc[train_index]
        x_test, y_test = X.iloc[test_index], Y.iloc[test_index]
        model.fit(x_train, y_train)
        y_pred1 = model.predict_proba(x_test)[:, 1]
        fold_auc = roc_auc_score(y_test, y_pred1)
        print('Validation AUC: {}'.format(fold_auc))
        mean_score1 += fold_auc / n_folds
        # Predict the competition test set with this fold's model
        result1.append(model.predict_proba(testA)[:, 1])
    end = time.time()
    print('Mean validation AUC: {}'.format(mean_score1))
    print('Elapsed time: %s seconds' % (end - start))

The model is evaluated with the validation-set AUC:

 ROC curve: (image not preserved)

AUC value: (image not preserved)
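
For readers who want to reproduce an ROC curve and an AUC value, here is a minimal sketch with scikit-learn on four hand-made scores. The numbers are illustrative and unrelated to the article's results.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hand-made labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve: false-positive rate vs. true-positive rate
fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Area under that curve
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)
```

AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score; here 3 of the 4 pairs are ordered correctly. With matplotlib, `plt.plot(fpr, tpr)` draws the curve.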

4. Evaluation Metrics

4.1 Confusion matrix

4.2 Accuracy, recall, F1
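
As an illustration of these metrics, the sketch below computes a confusion matrix, accuracy, precision, recall and F1 with scikit-learn on ten hand-made predictions. The labels, the probabilities and the 0.5 threshold are assumptions for the example, not the article's results.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hand-made ground truth and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.1, 0.4, 0.7, 0.9, 0.45, 0.2])
y_pred = (y_prob > 0.5).astype(int)  # assumed decision threshold of 0.5

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(cm)
print(accuracy, precision, recall, f1)
```

When one class is much rarer than the other, accuracy can look high even for a model that rarely predicts the rare class, so recall and F1 are worth reading alongside it.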

5. Test-Set Accuracy

The output file is generated as follows:

    # Average the test-set probabilities over all folds
    cat_pre1 = sum(result1) / n_folds
    ret1 = pd.DataFrame(cat_pre1, columns=['subscribe'])
    # Probabilities above 0.5 become 'yes', the rest 'no'
    ret1['subscribe'] = np.where(ret1['subscribe'] > 0.5, 'yes', 'no')
    ret1.to_csv('./XGB预测.csv', index=False)
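
The averaging and thresholding step can be checked by hand on a small example. The sketch below uses three hypothetical folds and four test rows; the probabilities are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities from 3 folds for the same 4 test rows
fold_preds = [
    np.array([0.9, 0.2, 0.6, 0.4]),
    np.array([0.8, 0.1, 0.4, 0.5]),
    np.array([0.7, 0.3, 0.2, 0.9]),
]

mean_prob = np.mean(fold_preds, axis=0)          # average across folds, per row
labels = np.where(mean_prob > 0.5, "yes", "no")  # same 0.5 cut-off as above
submission = pd.DataFrame({"subscribe": labels})
print(submission)
```

Averaging probabilities before thresholding lets a confident fold outweigh two lukewarm ones, which a majority vote over hard labels would not do.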

Final submission result: (image not preserved)

6. Model Optimization

 No hyperparameter tuning was done in this article; tuning with a grid search may improve the model further.
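
As a rough idea of what such a grid search could look like, here is a sketch with scikit-learn's GridSearchCV on synthetic data. To keep the example free of extra dependencies it uses GradientBoostingClassifier as a stand-in; xgboost's XGBClassifier follows the same scikit-learn estimator interface and accepts the same three parameter names, so it could be swapped in. The grid values are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data (stand-in for the encoded training set)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Assumed, deliberately small grid: 2 x 2 x 2 = 8 candidates
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for xgb.XGBClassifier()
    param_grid=param_grid,
    scoring="roc_auc",  # same metric as the cross-validation above
    cv=3,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

The cost grows with the product of the grid sizes times the number of folds, so it is usually practical to start with a coarse grid and refine around the best candidate.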
