Method 2: Using SMOTE + undersampling to handle imbalanced data
We have a batch of transaction data, about 280,000 records in total, of which normal transactions make up 99.83% and fraudulent transactions only 0.17%.
The goal is to train a model that can tell whether a transaction is normal or fraudulent.
- import pandas as pd
- import numpy as np
- import seaborn as sns
- import matplotlib.pyplot as plt
- import time
- import warnings
- warnings.filterwarnings('ignore')
-
- # Load the data file
- df = pd.read_csv("creditcard.csv")
-
- # Inspect the data
- df.head()
From the output above we can see that all columns except Amount (and Time) have already been standardized.
- # Check for missing values
- df.isnull().sum()
There are no missing values.
- # Check the data types
- df.dtypes
- # Look at the Class distribution
- labels = ['Not Fraud', 'Fraud']  # labels
- size = df['Class'].value_counts()  # count the samples per class
- colors = ['lightgreen', 'orange']  # colors
- explode = [0, 0.1]  # pull the fraud slice out of the pie
- plt.figure(figsize=(9,9))  # figure size
- plt.pie(size, colors=colors, explode=explode, labels=labels, shadow=True, autopct='%.2f%%')  # pie chart settings
- plt.axis('off')  # hide the axes
- plt.title("Data Distribution")  # title
- plt.legend()  # show the legend
- plt.show()  # render
Here we can see that fraud makes up a tiny fraction of the data. If we trained a model with undersampling alone, there is a very high chance of overfitting, since far too little data would be left, leading to poor generalization; so the first approach stops here.
And if we do nothing about the imbalance and train on the raw data, the model will mostly learn to recognize the majority class; the minority class has too few samples to be predicted accurately.
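To see why accuracy alone is misleading here, a minimal sketch (using the df loaded above): a trivial baseline that labels every transaction as normal already scores about 99.83%.
- # "Predict everything as normal" baseline: accuracy equals the majority-class share
- baseline_acc = (df['Class'] == 0).mean()
- print(f"All-normal baseline accuracy: {baseline_acc:.4%}")  # ~99.83%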
Now standardize Amount (and Time along with it):
- from sklearn.preprocessing import StandardScaler
-
- df['scaled_amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1,1))  # standardize the amount
- df['scaled_time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1,1))  # standardize the time
-
- df.drop(['Amount', 'Time'], axis=1, inplace=True)  # drop the original columns
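One caveat worth flagging (my note, not part of the original flow): fitting the scaler on the full dataset leaks test-set statistics into training. A leakage-free variant would fit the scaler on the training split only, roughly like this (X_train and X_test here are hypothetical names for splits made before scaling):
- # Hypothetical leakage-free variant: fit on the training split, apply to both splits
- scaler = StandardScaler().fit(X_train[['Amount']])
- X_train['scaled_amount'] = scaler.transform(X_train[['Amount']])
- X_test['scaled_amount'] = scaler.transform(X_test[['Amount']])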
Here is the idea: whether we oversample or undersample, the goal is the same, to balance the two classes. So we will build two datasets, one with plain SMOTE, and a second where we oversample the minority class and undersample the majority class, then train on each separately and compare the results.
- from imblearn.over_sampling import SMOTE
-
- # Split features and labels
- X = df.drop('Class', axis=1)
- y = df['Class']
-
- # With default settings, SMOTE oversamples the minority class up to a 1:1 ratio
- X_new_1, y_new_1 = SMOTE().fit_resample(X, y)
-
- # New class distribution
- y_new_1.value_counts()
- from imblearn.over_sampling import SMOTE
- from imblearn.under_sampling import RandomUnderSampler
- from imblearn.pipeline import Pipeline  # chains the resampling steps together
-
- # Oversample: SMOTE the minority class up to 10% of the majority class
- over = SMOTE(sampling_strategy=0.1)
-
- # Undersample: shrink the majority class until the minority is 50% of it
- under = RandomUnderSampler(sampling_strategy=0.5)
-
- # Pipeline
- steps = [('o', over), ('u', under)]
-
- pipeline = Pipeline(steps=steps)
-
- X_new_2, y_new_2 = pipeline.fit_resample(X, y)
-
- # New class distribution
- y_new_2.value_counts()

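As a quick sanity check (my addition), collections.Counter shows the ratios the two strategies produce:
- from collections import Counter
-
- print(Counter(y_new_1))  # roughly 1:1 after plain SMOTE
- print(Counter(y_new_2))  # minority about half the majority after SMOTE(0.1) + undersampling(0.5)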
- from sklearn.model_selection import train_test_split
-
- X_new_1_train, X_new_1_test, y_new_1_train, y_new_1_test = train_test_split(X_new_1, y_new_1)
-
- # Features
- X_new_1_train = X_new_1_train.values
- X_new_1_test = X_new_1_test.values
-
- # Labels
- y_new_1_train = y_new_1_train.values
- y_new_1_test = y_new_1_test.values
-
- X_new_2_train, X_new_2_test, y_new_2_train, y_new_2_test = train_test_split(X_new_2, y_new_2)
-
- # Features
- X_new_2_train = X_new_2_train.values
- X_new_2_test = X_new_2_test.values
-
- # Labels
- y_new_2_train = y_new_2_train.values
- y_new_2_test = y_new_2_test.values
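A methodological caveat (my note, not the original's): because we resampled before splitting, synthetic SMOTE points end up in the test sets too, which tends to inflate the scores. A leakage-free sketch would split the original data first and resample only the training fold:
- # Hypothetical variant: split the raw data first, resample the training fold only
- X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y)
- X_tr_res, y_tr_res = pipeline.fit_resample(X_tr, y_tr)
- # ...then evaluate on the untouched, still-imbalanced X_te / y_te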
Since this is a binary classification problem, we pick logistic regression, which performs well on binary tasks, the powerful SVM, and several ensemble algorithms to train models. Using cross-validation and grid search, we then select the best model and the best hyperparameters.
- # Simple classifier comparison
- from sklearn.linear_model import LogisticRegression  # logistic regression
- from sklearn.linear_model import SGDClassifier  # stochastic gradient descent
-
- from sklearn.svm import SVC  # support vector machine
-
- from sklearn.ensemble import RandomForestClassifier  # random forest
- from sklearn.model_selection import cross_val_score  # cross-validated accuracy
- from sklearn.model_selection import GridSearchCV  # grid search for the best hyperparameters
- from sklearn.model_selection import StratifiedKFold  # stratified cross-validation
- from collections import Counter
- # Evaluation metrics
- from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
-
- from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier  # ensemble learning
- from xgboost import XGBClassifier  # extreme gradient boosting
-
- classifiers = {
-     'LogisticRegression': LogisticRegression(),  # logistic regression
-     # "SVC": SVC(),  # support vector machine (skipped: too slow on this data)
-     'RFC': RandomForestClassifier(),  # random forest
-     'Bagging': BaggingClassifier(),  # bagging ensemble
-     'SGD': SGDClassifier(),  # stochastic gradient descent
-     'GBC': GradientBoostingClassifier(),  # gradient boosting ensemble
-     'xgb': XGBClassifier()  # extreme gradient boosting
- }

- def cv_accuracy(X_train, y_train):
-     # Train each candidate classifier and report its cross-validated accuracy
-     # (renamed from accuracy_score to avoid shadowing sklearn.metrics.accuracy_score)
-     for key, classifier in classifiers.items():
-         classifier.fit(X_train, y_train)
-         training_score = cross_val_score(classifier, X_train, y_train, cv=5)  # 5-fold cross-validation
-         print("Classifier Name : ", classifier.__class__.__name__, " Training Score :", round(training_score.mean() * 100, 2), '%')
Result 1:
- # 1.1 SMOTE
- cv_accuracy(X_new_1_train, y_new_1_train)
SVM would also take a very long time here, so I skipped it.
Result 2:
- # 1.2 SMOTE + undersampling
- cv_accuracy(X_new_2_train, y_new_2_train)
Both approaches give decent results, so we keep tuning. We can also see that some of the ensemble algorithms are overfitting, so the tuning should move toward reducing model complexity.
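One simple way to spot the overfitting mentioned above (a sketch, my addition) is to compare each classifier's training accuracy with its cross-validated accuracy; a large gap suggests overfitting:
- for key, clf in classifiers.items():
-     clf.fit(X_new_1_train, y_new_1_train)
-     train_acc = clf.score(X_new_1_train, y_new_1_train)  # accuracy on the data it was fit on
-     cv_acc = cross_val_score(clf, X_new_1_train, y_new_1_train, cv=5).mean()  # held-out accuracy
-     print(f"{key}: train={train_acc:.4f}, cv={cv_acc:.4f}")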
- # Grid search: find the best hyperparameters
- # 1 LR
- def LR_gs(X_train, y_train):
-     # liblinear supports both the l1 and l2 penalties
-     LR_param = {
-         'penalty': ['l1', 'l2'],
-         'C': [0.001, 0.01, 0.1, 1, 10]
-     }
-
-     LR_gs = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=LR_param, n_jobs=-1, scoring='accuracy')
-     LR_gs.fit(X_train, y_train)
-
-     LR_estimators = LR_gs.best_estimator_  # estimator refit with the best parameters
-
-     return LR_estimators
-
-
- # 2 RFC
- def RFC_gs(X_train, y_train):
-     RFC_param = {
-         'n_estimators': [100, 150, 200],  # number of trees
-         'criterion': ['gini', 'entropy'],  # split criterion
-         'max_depth': list(range(2, 5, 1)),  # tree depth
-     }
-
-     RFC_gs = GridSearchCV(RandomForestClassifier(), param_grid=RFC_param, n_jobs=-1, scoring='accuracy')
-     RFC_gs.fit(X_train, y_train)
-
-     RFC_estimators = RFC_gs.best_estimator_
-
-     return RFC_estimators
-
-
- # 3 Bagging
- def BAG_gs(X_train, y_train):
-     BAG_param = {
-         'n_estimators': [10, 15, 20]
-     }
-
-     BAG_gs = GridSearchCV(BaggingClassifier(), param_grid=BAG_param, n_jobs=-1, scoring='accuracy')
-     BAG_gs.fit(X_train, y_train)
-
-     BAG_estimators = BAG_gs.best_estimator_
-
-     return BAG_estimators
-
-
- # 4 SGD
- def SGD_gs(X_train, y_train):
-     SGD_param = {
-         'penalty': ['l2', 'l1'],
-         'max_iter': [1000, 1500, 2000]
-     }
-
-     SGD_gs = GridSearchCV(SGDClassifier(), param_grid=SGD_param, n_jobs=-1, scoring='accuracy')
-     SGD_gs.fit(X_train, y_train)
-
-     SGD_estimators = SGD_gs.best_estimator_
-
-     return SGD_estimators
-
-
- # 5 XGB
- def XGB_gs(X_train, y_train):
-     XGB_param = {
-         'n_estimators': [60, 80, 100, 200],
-         'max_depth': [3, 4, 5, 6],
-         'learning_rate': [0.1, 0.2, 0.3, 0.4]
-     }
-
-     XGB_gs = GridSearchCV(XGBClassifier(), param_grid=XGB_param, n_jobs=-1, scoring='accuracy')
-     XGB_gs.fit(X_train, y_train)
-
-     XGB_estimators = XGB_gs.best_estimator_
-
-     return XGB_estimators

Call the functions above:
- # Using the first dataset: X_new_1_train, y_new_1_train
- # Cross-validate, train, and get the best hyperparameters for each model
-
- LR_best_estimator = LR_gs(X_new_1_train, y_new_1_train)
-
- # KNN_best_estimator = KNN_gs(X_new_1_train, y_new_1_train)
- # SVC_best_estimator = SVC_gs(X_new_1_train, y_new_1_train)
- # DT_best_estimator = DT_gs(X_new_1_train, y_new_1_train)
-
- RFC_best_estimator = RFC_gs(X_new_1_train, y_new_1_train)
-
- BAG_best_estimator = BAG_gs(X_new_1_train, y_new_1_train)
-
- SGD_best_estimator = SGD_gs(X_new_1_train, y_new_1_train)
-
- XGB_best_estimator = XGB_gs(X_new_1_train, y_new_1_train)

Note: this step took more than three hours on my machine.
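If that is too slow, a common alternative (my suggestion, not part of the original) is RandomizedSearchCV, which tries a fixed number of randomly sampled parameter combinations instead of the full grid, for example for the XGB search:
- from sklearn.model_selection import RandomizedSearchCV
-
- # Sample 10 random combinations instead of all 4 * 4 * 4 = 64
- XGB_rs = RandomizedSearchCV(
-     XGBClassifier(),
-     param_distributions={'n_estimators': [60, 80, 100, 200],
-                          'max_depth': [3, 4, 5, 6],
-                          'learning_rate': [0.1, 0.2, 0.3, 0.4]},
-     n_iter=10, n_jobs=-1, scoring='accuracy')
- XGB_rs.fit(X_new_1_train, y_new_1_train)
- print(XGB_rs.best_params_)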
- # Using the second dataset: X_new_2_train, y_new_2_train
- # Cross-validate, train, and get the best hyperparameters for each model
- # (suffixed _2 so the dataset-1 estimators above are not overwritten)
-
- LR_best_estimator_2 = LR_gs(X_new_2_train, y_new_2_train)
-
- # KNN_best_estimator_2 = KNN_gs(X_new_2_train, y_new_2_train)
- # SVC_best_estimator_2 = SVC_gs(X_new_2_train, y_new_2_train)
- # DT_best_estimator_2 = DT_gs(X_new_2_train, y_new_2_train)
-
- RFC_best_estimator_2 = RFC_gs(X_new_2_train, y_new_2_train)
-
- BAG_best_estimator_2 = BAG_gs(X_new_2_train, y_new_2_train)
-
- SGD_best_estimator_2 = SGD_gs(X_new_2_train, y_new_2_train)
-
- XGB_best_estimator_2 = XGB_gs(X_new_2_train, y_new_2_train)

Once all the steps above finish, we have the best hyperparameters found by grid search. Using the resulting estimators, we compute accuracy, precision, recall, F1-score, and ROC AUC.
(Since this model's job is to detect anomalies, i.e. the minority class, we care most about how completely the frauds are caught: recall, which is TP / (TP + FN), and the AUC.)
- # Evaluate on the new test sets
- from sklearn.metrics import precision_recall_fscore_support
- from sklearn.metrics import accuracy_score
-
- result_df = pd.DataFrame(columns=['Accuracy', 'F1-score', 'Recall', 'Precision', 'AUC_ROC'],
-                          index=['LR', 'RFC', 'Bagging', 'SGD', 'XGB'])
-
- def calculate(models, X_test, y_test):
-     # Compute the metric values for each model
-     accuracy_results = []
-     F1_score_results = []
-     Recall_results = []
-     Precision_results = []
-     AUC_ROC_results = []
-
-     for model in models:
-         y_pred = model.predict(X_test)
-         accuracy = accuracy_score(y_test, y_pred)  # accuracy
-         # precision, recall, and F1 for the positive (fraud) class
-         precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
-         AUC_ROC = roc_auc_score(y_test, y_pred)  # ROC AUC
-
-         # Store the values
-         accuracy_results.append(accuracy)
-         F1_score_results.append(f1)
-         Recall_results.append(recall)
-         AUC_ROC_results.append(AUC_ROC)
-         Precision_results.append(precision)
-
-     return accuracy_results, F1_score_results, Recall_results, AUC_ROC_results, Precision_results

Plug in dataset 1:
- # Collect the tuned models
- best_models = [LR_best_estimator, RFC_best_estimator,
-                BAG_best_estimator, SGD_best_estimator, XGB_best_estimator]
-
- # Compute all the metric values
- accuracy_results, F1_score_results, Recall_results, AUC_ROC_results, Precision_results = calculate(best_models, X_new_1_test, y_new_1_test)
-
- # Put the values into the DataFrame
- result_df['Accuracy'] = accuracy_results
- result_df['F1-score'] = F1_score_results
- result_df['Recall'] = Recall_results
- result_df['Precision'] = Precision_results
- result_df['AUC_ROC'] = AUC_ROC_results
-
- result_df  # show the results
Visualization:
- # Visualize the AUC scores
- g = sns.barplot(x='AUC_ROC', y=result_df.index, data=result_df, palette='hsv', orient='h')
Plug in the second dataset:
- result_df_2 = pd.DataFrame(columns=['Accuracy', 'F1-score', 'Recall', 'Precision', 'AUC_ROC'],
-                            index=['LR', 'RFC', 'Bagging', 'SGD', 'XGB'])
-
- # Collect the models tuned on the second dataset
- best_models_2 = [LR_best_estimator_2, RFC_best_estimator_2,
-                  BAG_best_estimator_2, SGD_best_estimator_2, XGB_best_estimator_2]
-
- # Compute all the metric values
- accuracy_results, F1_score_results, Recall_results, AUC_ROC_results, Precision_results = calculate(best_models_2, X_new_2_test, y_new_2_test)
-
- # Put the values into the DataFrame
- result_df_2['Accuracy'] = accuracy_results
- result_df_2['F1-score'] = F1_score_results
- result_df_2['Recall'] = Recall_results
- result_df_2['Precision'] = Precision_results
- result_df_2['AUC_ROC'] = AUC_ROC_results
-
- result_df_2

Visualization:
- # Visualize the AUC scores
- g = sns.barplot(x='AUC_ROC', y=result_df_2.index, data=result_df_2, palette='hsv', orient='h')
Looking at the AUC, the second dataset scores slightly higher than the first, so from here on we could work with the second dataset only (the ensembling below is still demonstrated on the first dataset's split).
We can also see that Bagging and XGBoost both perform well, so we pick these two algorithms and ensemble them into a single model to see whether that gives even better results.
- # Ensemble learning
- # Based on the AUC results above, choose Bagging and XGB as the base models
-
- # KNN_test = pd.Series(KNN_best_estimator.predict(X_new_1_test), name='KNN')
-
- Bagging_test = pd.Series(BAG_best_estimator.predict(X_new_1_test), name='Bagging')
-
- XGB_test = pd.Series(XGB_best_estimator.predict(X_new_1_test), name='XGB')
-
- # Put the two models' predictions side by side
- ensemble_results = pd.concat([Bagging_test, XGB_test], axis=1)
-
- ensemble_results
Ensemble and train:
- # Combine the two models above into a single voting classifier
- from sklearn.ensemble import VotingClassifier
-
- voting_clf = VotingClassifier(estimators=[('BAG', BAG_best_estimator),
-                                           ('XGB', XGB_best_estimator)], n_jobs=-1)
-
- # Train
- voting_clf.fit(X_new_1_train, y_new_1_train)
-
- # Predict
- y_final_pred = voting_clf.predict(X_new_1_test)
-
- # Evaluate: the ensemble's predictions clearly beat each model's individual results
- print(classification_report(y_new_1_test, y_final_pred))

As we can see, the final ensembled model's predictions are clearly better than each model's individual predictions. Let's save this combined model:
- import pickle
-
- # Serialize the trained voting classifier to disk
- pickle.dump(voting_clf, open('./bag_xgb.dat', 'wb'))
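To reuse the saved model later, a minimal sketch:
- # Load the saved model back and predict with it
- loaded_clf = pickle.load(open('./bag_xgb.dat', 'rb'))
- print(loaded_clf.predict(X_new_1_test[:5]))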
You could also plot the ROC curve afterwards, but I don't think it's necessary here. Done!
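For readers who do want the curve anyway, a minimal sketch (my addition; it uses XGB_best_estimator rather than voting_clf, because a hard-voting classifier does not expose predict_proba):
- from sklearn.metrics import roc_curve
-
- # Probability of the positive (fraud) class from the tuned XGB model
- y_score = XGB_best_estimator.predict_proba(X_new_1_test)[:, 1]
- fpr, tpr, _ = roc_curve(y_new_1_test, y_score)
-
- plt.plot(fpr, tpr, label='XGB')
- plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
- plt.xlabel('False Positive Rate')
- plt.ylabel('True Positive Rate')
- plt.legend()
- plt.show()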
If you need the source code, give this a like and send me a private message.