This post documents my attempt at the Tianchi teaching competition on predicting whether bank customers will subscribe to a product. The competition page:
【教学赛】金融数据分析赛题1:银行客户认购产品预测_学习赛_天池大赛-阿里云天池
- import pandas as pd
-
- # Load the data
- train = pd.read_csv('train.csv')
- test = pd.read_csv('test.csv')
-
- # Concatenate the training and test sets along the row axis,
- # so that feature processing can be applied to both at once
- df = pd.concat([train, test], axis=0)
- df
Output:
- id age job marital education default housing loan contact month ... campaign pdays previous poutcome emp_var_rate cons_price_index cons_conf_index lending_rate3m nr_employed subscribe
- 0 1 51 admin. divorced professional.course no yes yes cellular aug ... 1 112 2 failure 1.4 90.81 -35.53 0.69 5219.74 no
- 1 2 50 services married high.school unknown yes no cellular may ... 1 412 2 nonexistent -1.8 96.33 -40.58 4.05 4974.79 yes
- 2 3 48 blue-collar divorced basic.9y no no no cellular apr ... 0 1027 1 failure -1.8 96.33 -44.74 1.50 5022.61 no
- 3 4 26 entrepreneur single high.school yes yes yes cellular aug ... 26 998 0 nonexistent 1.4 97.08 -35.55 5.11 5222.87 yes
- 4 5 45 admin. single university.degree no no no cellular nov ... 1 240 4 success -3.4 89.82 -33.83 1.17 4884.70 no
- ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
- 7495 29996 49 admin. unknown university.degree unknown yes yes telephone apr ... 50 302 1 failure -1.8 95.77 -40.50 3.86 5058.64 NaN
- 7496 29997 34 blue-collar married basic.4y no no no cellular jul ... 8 440 3 failure 1.4 90.59 -47.29 1.77 5156.70 NaN
- 7497 29998 50 retired single basic.4y no yes no cellular jun ... 3 997 0 nonexistent -2.9 97.42 -39.69 1.29 5116.80 NaN
- 7498 29999 31 technician married professional.course no no no cellular aug ... 3 1028 0 nonexistent 1.4 96.90 -37.68 5.18 5144.45 NaN
- 7499 30000 46 admin. divorced university.degree no yes no cellular aug ... 2 387 3 success 1.4 97.49 -31.54 3.79 5082.25 NaN
- 30000 rows × 22 columns
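A side note (my observation, not in the original): `concat` keeps each frame's own row index, which is why row id 29996 appears at index 7495 above. The boolean-mask split used later works regardless, but a continuous 0..29999 index could be obtained with:
- df = pd.concat([train, test], axis=0, ignore_index=True)  # optional: renumber the rows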
The data mixes numeric and text columns, so the text features need to be converted to numbers.
- # First select all columns whose dtype is object (non-numeric)
- cat_columns = df.select_dtypes(include='object').columns  # these are the columns to encode
- df[cat_columns]
- # Encode the non-numeric features
- from sklearn.preprocessing import LabelEncoder
-
- job_le = LabelEncoder()
- df['job'] = job_le.fit_transform(df['job'])
- df['marital'] = df['marital'].map({'unknown':0, 'single':1, 'married':2, 'divorced':3})
- df['education'] = df['education'].map({'unknown':0, 'basic.4y':1, 'basic.6y':2, 'basic.9y':3, 'high.school':4, 'university.degree':5, 'professional.course':6, 'illiterate':7})
- df['housing'] = df['housing'].map({'unknown': 0, 'no': 1, 'yes': 2})
- df['loan'] = df['loan'].map({'unknown': 0, 'no': 1, 'yes': 2})
- df['contact'] = df['contact'].map({'cellular': 0, 'telephone': 1})
- df['day_of_week'] = df['day_of_week'].map({'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4})
- df['poutcome'] = df['poutcome'].map({'nonexistent': 0, 'failure': 1, 'success': 2})
- df['default'] = df['default'].map({'unknown': 0, 'no': 1, 'yes': 2})
- df['month'] = df['month'].map({'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, \
- 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12})
- df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
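Since `Series.map` returns NaN for any category missing from its dictionary, a quick sanity check that the mappings covered every value is worthwhile (this check is my addition, not in the original):
- # After encoding, only subscribe should contain NaNs (from the test rows);
- # a nonzero count anywhere else would mean a category was missed by its map
- df[cat_columns].isnull().sum()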
- # Split back into training and test sets: subscribe is non-null only for training rows
- train = df[df['subscribe'].notnull()]
- test = df[df['subscribe'].isnull()]
-
- # Check the label balance in the training set: it is skewed, with about 6.6 times as many 0s as 1s
- train['subscribe'].value_counts()
Output:
- 0.0 19548
- 1.0 2952
- Name: subscribe, dtype: int64
- import numpy as np
- import matplotlib.pyplot as plt
- import seaborn as sns
- import warnings
-
- warnings.filterwarnings('ignore')
- %matplotlib inline
- # cat_columns was captured before the encoding above, so this keeps
- # only the originally numeric columns (and drops id)
- num_features = [x for x in train.columns if x not in cat_columns and x!='id']
-
- fig = plt.figure(figsize=(80,60))
-
- for i in range(len(num_features)):
-     plt.subplot(7, 2, i+1)
-     sns.boxplot(train[num_features[i]])
-     plt.ylabel(num_features[i], fontsize=36)
- plt.show()
The boxplots show outliers, which are handled next.
- # Clip each numeric feature to within 10 times the IQR beyond the quartiles
- for colum in num_features:
-     temp = train[colum]
-     q1 = temp.quantile(0.25)
-     q3 = temp.quantile(0.75)
-     delta = (q3 - q1) * 10
-     train[colum] = np.clip(temp, q1 - delta, q3 + delta)
I also tried class balancing and feature selection, but both made the final classification worse, so they are skipped here; the code is kept below for reference.
- '''# Oversampling with SMOTE/ADASYN: the training metrics improved, but the
- # final classification score dropped, so oversampling is not used here
- from imblearn.over_sampling import SMOTE
- from imblearn.over_sampling import ADASYN
- #smo = SMOTE(random_state=0, k_neighbors=10)
- adasyn = ADASYN()
- X_smo, y_smo = adasyn.fit_resample(train.iloc[:,:-1], train.iloc[:,-1])
- train_smo = pd.concat([X_smo, y_smo], axis=1)
- train_smo['subscribe'].value_counts()'''
- '''# Feature selection with SelectFromModel, using a tree model as the estimator
- from sklearn.ensemble import ExtraTreesClassifier
- from sklearn.feature_selection import SelectFromModel
-
- # Split out the training features and labels
- train_X = train.iloc[:,:-1]
- train_y = train.iloc[:,-1]
-
- # clf_etc is the estimator; FeaSel is the feature-selection model
- clf_etc = ExtraTreesClassifier(n_estimators=50)
- clf_etc = clf_etc.fit(train_X, train_y)
- FeaSel = SelectFromModel(clf_etc, prefit=True)
- train_sel = FeaSel.transform(train_X)
- test_sel = FeaSel.transform(test.iloc[:,:-1])
-
- # Recover the selected feature names and write them back
- train_new = pd.DataFrame(train_sel)
- feature_idx = FeaSel.get_support()  # boolean mask of the selected columns
- train_new.columns = train_X.columns[feature_idx]
- train_new = pd.concat([train_new, train_y], axis=1)
- test_new = pd.DataFrame(test_sel)
- test_new.columns = train_X.columns[feature_idx]'''
The commented-out code above may have some variable-naming inconsistencies.
- train_new = train
- test_new = test
-
- # Save the processed data as train_new.csv and test_new.csv
- train_new.to_csv('train_new.csv', index=False)
- test_new.to_csv('test_new.csv', index=False)
- from sklearn.model_selection import GridSearchCV
- from sklearn.linear_model import LogisticRegression
- from sklearn.tree import DecisionTreeClassifier
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.ensemble import GradientBoostingClassifier
- from sklearn.ensemble import AdaBoostClassifier
- from xgboost import XGBRFClassifier
- from lightgbm import LGBMClassifier
- from sklearn.model_selection import cross_val_score
- import time
-
- clf_lr = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
- clf_dt = DecisionTreeClassifier()
- clf_rf = RandomForestClassifier()
- clf_gb = GradientBoostingClassifier()
- clf_adab = AdaBoostClassifier()
- clf_xgbrf = XGBRFClassifier()
- clf_lgb = LGBMClassifier()
-
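The seven classifiers above are instantiated, but the comparison loop itself isn't shown in the original. A minimal sketch of how they could be scored with the imported cross_val_score and time (my illustration; the 5-fold CV and accuracy scoring are assumptions):
- # Hypothetical comparison loop: cross-validate each classifier and time it
- X = train_new.drop(columns=['subscribe'])
- y = train_new['subscribe']
- models = {'lr': clf_lr, 'dt': clf_dt, 'rf': clf_rf, 'gb': clf_gb,
-           'adab': clf_adab, 'xgbrf': clf_xgbrf, 'lgb': clf_lgb}
- for name, model in models.items():
-     start = time.time()
-     scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
-     print(f'{name}: accuracy {scores.mean():.4f} +/- {scores.std():.4f}, {time.time()-start:.1f}s')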
- from sklearn.model_selection import train_test_split
- train_new = pd.read_csv('train_new.csv')
- test_new = pd.read_csv('test_new.csv')
- feature_columns = [col for col in train_new.columns if col not in ['subscribe']]
- train_data = train_new[feature_columns]
- target_data = train_new['subscribe']
- from lightgbm import LGBMClassifier
- from sklearn.metrics import classification_report
- from sklearn.model_selection import GridSearchCV
- from sklearn.metrics import accuracy_score
- from sklearn.model_selection import train_test_split
-
- X_train, X_test, y_train, y_test = train_test_split(train_data, target_data, test_size=0.2,shuffle=True, random_state=2023)
- #X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5,shuffle=True,random_state=2023)
-
- n_estimators = [300]
- learning_rate = [0.02]  # fixed at the best value from an earlier search
- subsample = [0.6]
- colsample_bytree = [0.7]  ## 0.6 was best among [0.5, 0.6, 0.7]
- max_depth = [9, 11, 13]  ## 11 was best among [7, 9, 11, 13]
- is_unbalance = [False]
- early_stopping_rounds = [300]
- num_boost_round = [5000]
- metric = ['binary_logloss']
- feature_fraction = [0.6, 0.75, 0.9]
- bagging_fraction = [0.6, 0.75, 0.9]
- bagging_freq = [2, 4, 5, 8]  # note: defined but never added to the param grid below
- lambda_l1 = [0, 0.1, 0.4, 0.5]
- lambda_l2 = [0, 10, 15, 35]
- cat_smooth = [1, 10, 15, 20]
-
-
- param = {'n_estimators': n_estimators,
-          'learning_rate': learning_rate,
-          'subsample': subsample,
-          'colsample_bytree': colsample_bytree,
-          'max_depth': max_depth,
-          'is_unbalance': is_unbalance,
-          'early_stopping_rounds': early_stopping_rounds,
-          'num_boost_round': num_boost_round,
-          'metric': metric,
-          'feature_fraction': feature_fraction,
-          'bagging_fraction': bagging_fraction,
-          'lambda_l1': lambda_l1,
-          'lambda_l2': lambda_l2,
-          'cat_smooth': cat_smooth}
-
- model = LGBMClassifier()
-
- clf = GridSearchCV(model, param, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
- clf.fit(X_train, y_train, eval_set=[(X_train, y_train),(X_test, y_test)])
-
- print(clf.best_params_, clf.best_score_)
Parameters whose list holds a single value were already fixed at the optimum found by earlier GridSearchCV runs; the run shown here searches over the remaining six parameters. Tuning everything at once would take far too long, so I searched in stages.
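A minimal sketch of that staged strategy (my own illustration, not the author's exact runs; the parameter grouping and candidate values are assumptions drawn from the lists above):
- # Hypothetical staged tuning: search one small parameter group at a time,
- # folding each stage's winners into the fixed base parameters
- base_params = {'n_estimators': 300, 'learning_rate': 0.02, 'subsample': 0.6,
-                'colsample_bytree': 0.7, 'metric': 'binary_logloss'}
- stages = [
-     {'max_depth': [7, 9, 11, 13]},
-     {'feature_fraction': [0.6, 0.75, 0.9], 'bagging_fraction': [0.6, 0.75, 0.9]},
-     {'lambda_l1': [0, 0.1, 0.4, 0.5], 'lambda_l2': [0, 10, 15, 35]},
- ]
- for grid in stages:
-     stage_clf = GridSearchCV(LGBMClassifier(**base_params), grid,
-                              cv=3, scoring='accuracy', n_jobs=-1)
-     stage_clf.fit(X_train, y_train)
-     base_params.update(stage_clf.best_params_)  # freeze this stage's best values
-     print(list(grid), stage_clf.best_params_, stage_clf.best_score_)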
Output:
- Early stopping, best iteration is:
- [287] training's binary_logloss: 0.22302 valid_1's binary_logloss: 0.253303
- {'bagging_fraction': 0.6, 'cat_smooth': 1, 'colsample_bytree': 0.7, 'early_stopping_rounds': 300, 'feature_fraction': 0.75, 'is_unbalance': False, 'lambda_l1': 0.4, 'lambda_l2': 10, 'learning_rate': 0.02, 'max_depth': 11, 'metric': 'binary_logloss', 'n_estimators': 300, 'num_boost_round': 5000, 'subsample': 0.6} 0.8853333333333334
- y_true, y_pred = y_test, clf.predict(X_test)
- accuracy = accuracy_score(y_true,y_pred)
- print(classification_report(y_true, y_pred))
- print('Accuracy',accuracy)
Output:
- precision recall f1-score support
-
- 0.0 0.91 0.97 0.94 3933
- 1.0 0.60 0.32 0.42 567
-
- accuracy 0.89 4500
- macro avg 0.75 0.64 0.68 4500
- weighted avg 0.87 0.89 0.87 4500
-
- Accuracy 0.8875555555555555
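The recall on the positive class is only 0.32, a typical symptom of the imbalance noted earlier. One option worth trying (my addition, not part of the original pipeline) is to lower the 0.5 decision threshold on the predicted probabilities and trade some precision for recall:
- # Hypothetical threshold sweep on the validation predictions
- from sklearn.metrics import f1_score
-
- proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of subscribe = yes
- for t in [0.3, 0.4, 0.5]:
-     pred_t = (proba >= t).astype(float)
-     print(f'threshold {t}: positive-class F1 = {f1_score(y_test, pred_t):.3f}')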
Inspect the confusion matrix:
- from sklearn import metrics
- confusion_matrix_result = metrics.confusion_matrix(y_true, y_pred)
- plt.figure(figsize=(8,6))
- sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
- plt.xlabel('predict')
- plt.ylabel('true')
- plt.show()
- test_x = test[feature_columns]
- pred_test = clf.predict(test_x)
- result = pd.read_csv('./submission.csv')
- subscribe_map ={1: 'yes', 0: 'no'}
- result['subscribe'] = [subscribe_map[x] for x in pred_test]
- result.to_csv('./baseline_lgb1.csv', index=False)
- result['subscribe'].value_counts()
Output:
- no 6987
- yes 513
- Name: subscribe, dtype: int64
My approach only scored 0.9676. I hope you can build on this code and get a better result; if you find an improvement, please leave a comment so we can discuss it.
Ideas for improvement:
1. Data processing: when I balanced the classes the training metrics looked great but the final score dropped, which suggests overfitting to the resampled data; the outlier handling could also be refined further.
2. Modelling: I compared lr, dt, rf, gb, adab, xgbrf and lgb; lgb performed best, so I tuned it. Combining several models could be worth trying (see the sketch after this list).
3. More tuning on top of lgb: the least inventive route, but with enough time it should beat my result.
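As a starting point for idea 2, a minimal soft-voting sketch (my illustration; the member models and the soft-voting choice are assumptions, not something the original tried):
- # Hypothetical soft-voting ensemble over three of the compared model families
- from sklearn.ensemble import VotingClassifier
-
- voting = VotingClassifier(
-     estimators=[('rf', RandomForestClassifier()),
-                 ('gb', GradientBoostingClassifier()),
-                 ('lgb', LGBMClassifier())],
-     voting='soft')  # average the predicted class probabilities
- voting.fit(X_train, y_train)
- print('voting accuracy:', accuracy_score(y_test, voting.predict(X_test)))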