
Diabetes Genetic Risk Detection Challenge: Competition Study Notes (5-Fold CV Training)


Table of Contents

1: Signing Up for the Competition

2: Exploring the Competition Data

3: A Logistic Regression Baseline

5: Feature Selection

6: Advanced Tree Models

7: Multi-Fold Training and Ensembling

!!!Stacking!!!


Task 1: Signing Up for the Competition

The Diabetes Genetic Risk Detection Challenge (糖尿病遗传风险检测挑战赛) is an algorithm competition hosted on the iFLYTEK Open Platform (讯飞开放平台). The dataset is modest in size and the features are not complicated, which makes it well suited as an introduction to classification algorithms. I joined this competition study through Coggle, hoping to further sharpen my data mining and modeling skills.

Downloading the competition data requires real-name verification. After importing the dependencies and loading the dataset, the data looks like this:

    import pandas as pd
    import lightgbm as lgb
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from hyperopt import hp, fmin, tpe

    # Use a Chinese-capable font so the Chinese column names render in plots
    sns.set_style({'font.sans-serif': ['SimHei']})
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False  # keep minus signs readable with SimHei

    train_df = pd.read_csv('比赛训练集.csv', encoding='gbk')
    test_df = pd.read_csv('比赛测试集.csv', encoding='gbk')
    print('Training set shape:', train_df.shape)
    print('Test set shape:', test_df.shape)
    train_df.head()

Task 2: Exploring the Competition Data

1. Inspect the field types

    print(train_df.dtypes)
    print(test_df.dtypes)

From the data table and the dtypes we can make an initial judgment: the float64 fields 体重指数 (BMI), 舒张压 (diastolic blood pressure), 口服耐糖量测试 (oral glucose tolerance test), 胰岛素实验 (insulin test), and 肱三头肌皮褶厚度 (triceps skinfold thickness) are numeric, and most likely continuous, while 性别 (sex) and 糖尿病家族史 (diabetes family history) are categorical.
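As a quick cross-check, the same split can be read off programmatically with pandas select_dtypes; a minimal sketch (integer-coded fields such as 性别 will land in neither list and still need manual judgment):

    # float64 columns are the numeric candidates;
    # object-dtype columns are the categorical candidates
    numeric_cols = train_df.select_dtypes(include='float64').columns.tolist()
    categorical_cols = train_df.select_dtypes(include='object').columns.tolist()
    print('numeric:', numeric_cols)
    print('categorical:', categorical_cols)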

2. Count missing values

    print(train_df.isnull().mean())
    print(test_df.isnull().mean())

 


We can see that only the 舒张压 field has missing values, in both the training and test sets, and its missing rate is under 5%, which is low enough that the column does not need to be dropped.
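In the baseline below the gaps are simply filled with 0; an alternative worth experimenting with is median imputation, shown here as a sketch only (it is not used in the rest of this write-up):

    # Hypothetical alternative to the zero-fill used later: median imputation
    median_szy = train_df['舒张压'].median()
    train_filled = train_df['舒张压'].fillna(median_szy)
    test_filled = test_df['舒张压'].fillna(median_szy)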

3. Correlation analysis

3.1 Overall feature correlations: a heatmap makes the strength of each pairwise correlation easier to read.

    train_df.corr()
    sns.heatmap(train_df.corr())

 

3.2 Correlations between specific features and the label: countplot shows a single feature against the label, while boxplot/violinplot show two features against the label. These visualizations can spark feature engineering ideas.

    sns.countplot(x='患有糖尿病标识', hue='性别', data=train_df)
    sns.boxplot(y='出生年份', x='患有糖尿病标识', hue='性别', data=train_df)
    sns.violinplot(y='体重指数', x='患有糖尿病标识', hue='性别', data=train_df)

 

These plots look impressively polished, especially the violin plot; seaborn really delivers~~~

Task 3: A Logistic Regression Baseline

First do some simple feature processing, then try a logistic regression:

    # Map 糖尿病家族史 to ordinal codes (two spellings of the same category both map to 1)
    dict_糖尿病家族史 = {
        '无记录': 0,
        '叔叔或姑姑有一方患有糖尿病': 1,
        '叔叔或者姑姑有一方患有糖尿病': 1,
        '父母有一方患有糖尿病': 2
    }
    train_df['糖尿病家族史'] = train_df['糖尿病家族史'].map(dict_糖尿病家族史)
    test_df['糖尿病家族史'] = test_df['糖尿病家族史'].map(dict_糖尿病家族史)
    train_df['舒张压'].fillna(0, inplace=True)
    test_df['舒张压'].fillna(0, inplace=True)

    fea = list(train_df.columns)
    fea.remove('患有糖尿病标识')
    label = '患有糖尿病标识'

    # Logistic regression from sklearn
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    model.fit(train_df[fea], train_df[label])

    # Train on the full training set, then predict on the test set
    test_df['label'] = model.predict(test_df[fea])
    test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)

Submitting the prediction file gives an initial competition score of 0.72468.

Next, hold out 20% of the training set as a local test split for training and tuning:

    # Split the data
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(train_df[fea], train_df[label],
                                                        test_size=0.2, random_state=2022)
    train_x.shape, test_x.shape, train_y.shape, test_y.shape

    # Train
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(max_iter=1000, C=0.1)
    model.fit(train_x, train_y)

    # Evaluate
    from sklearn.metrics import f1_score
    pre_test_y = model.predict(test_x)
    score = f1_score(test_y, pre_test_y)
    print(score)

After repeated tuning, lowering max_iter seemed to help slightly, but the resulting F1 score is still low, only 0.757.
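Instead of tuning max_iter and C by hand, a small grid search with F1 as the scoring metric makes the comparison systematic; a minimal sketch, with illustrative grid values:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Illustrative grid around the values tried by hand above
    param_grid = {'C': [0.01, 0.1, 1, 10], 'max_iter': [40, 100, 1000]}
    search = GridSearchCV(LogisticRegression(), param_grid, scoring='f1', cv=5)
    search.fit(train_x, train_y)
    print(search.best_params_, search.best_score_)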

    # Mean 体重指数 and 舒张压 per sex
    train_df.groupby('性别')['舒张压'].mean()
    train_df.groupby('性别')['体重指数'].mean()

    # Each patient's absolute deviation from their sex's mean
    train_df['sex_szy'] = abs(train_df['舒张压'] - train_df.groupby('性别')['舒张压'].transform('mean'))
    train_df['sex_tzzs'] = abs(train_df['体重指数'] - train_df.groupby('性别')['体重指数'].transform('mean'))

    # Features and label
    fea = list(train_df.columns)
    fea.remove('患有糖尿病标识')
    label = '患有糖尿病标识'

    # Split the data
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(train_df[fea], train_df[label],
                                                        test_size=0.2, random_state=2022)
    train_x.shape, test_x.shape, train_y.shape, test_y.shape

    # Logistic regression training
    from sklearn.linear_model import LogisticRegression
    model_lr = LogisticRegression(max_iter=40, C=1)
    model_lr.fit(train_x, train_y)

    # Logistic regression evaluation
    from sklearn.metrics import f1_score
    test_y_lr = model_lr.predict(test_x)
    score = f1_score(test_y, test_y_lr)
    print(score)

This returns 0.774468085106383, up from the earlier 0.757. Evidently the distributions of diastolic blood pressure and BMI differ between men and women, so the same feature value carries different predictive power for the label depending on sex and should be treated separately. Because the group means do not differ dramatically, the improvement is modest.
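Note that the snippet above adds sex_szy and sex_tzzs to train_df only; for an actual submission the same features must exist on test_df. A small helper, as a sketch (the function name add_group_mean_diff is mine; the group means come from the training set and are applied to both frames):

    def add_group_mean_diff(train, test, group_col, value_col, new_col):
        # Compute group means on the training set only, then map onto both frames
        group_means = train.groupby(group_col)[value_col].mean()
        train[new_col] = (train[value_col] - train[group_col].map(group_means)).abs()
        test[new_col] = (test[value_col] - test[group_col].map(group_means)).abs()

    add_group_mean_diff(train_df, test_df, '性别', '舒张压', 'sex_szy')
    add_group_mean_diff(train_df, test_df, '性别', '体重指数', 'sex_tzzs')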

Exploring other features:

    train_df['sex_gstj'] = abs(train_df['肱三头肌皮褶厚度'] - train_df.groupby('性别')['肱三头肌皮褶厚度'].transform('mean'))

Applying the same trick to 肱三头肌皮褶厚度 first: the score does not improve and in fact drops slightly.

    train_df['age'] = 2022 - train_df['出生年份']
    test_df['age'] = 2022 - test_df['出生年份']

Converting 出生年份 (birth year) into age gives another tiny improvement, scoring 0.7751060820367751.

Task 5: Feature Selection

1. Decision tree training

    # Decision tree training
    from sklearn.tree import DecisionTreeClassifier
    model_dtc = DecisionTreeClassifier()
    model_dtc.fit(train_x, train_y)

    # Inspect the feature importances
    fea_imp = pd.DataFrame([*zip(fea, model_dtc.feature_importances_)])
    fea_imp = fea_imp.rename({0: 'fea', 1: 'imp'}, axis=1).sort_values(by='imp', ascending=False)
    fea_imp

We train a decision tree on the data and pick out the five features with the highest feature_importances_. Each of the top five has an importance above 0.07; none of the hand-crafted features made the cut.

    # Keep the five most important features
    fea_top5 = list(fea_imp[:5]['fea'])

    # Logistic regression training on the top-5 features
    from sklearn.linear_model import LogisticRegression
    model_lr = LogisticRegression(max_iter=40, C=1)
    model_lr.fit(train_x[fea_top5], train_y)

    # Logistic regression evaluation on the top-5 features
    from sklearn.metrics import f1_score
    test_y_lr = model_lr.predict(test_x[fea_top5])
    score = f1_score(test_y, test_y_lr)
    print(score)

Training on only the top-5 features scores 0.7418899858956276, slightly below the full-feature model's 0.775 but close, which suggests the top-5 features already capture most of the useful signal.
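scikit-learn can automate this kind of importance-based selection with SelectFromModel; a minimal sketch under the same train/validation split:

    from sklearn.feature_selection import SelectFromModel
    from sklearn.tree import DecisionTreeClassifier

    # Keep features whose importance exceeds the mean importance
    selector = SelectFromModel(DecisionTreeClassifier(), threshold='mean')
    selector.fit(train_x, train_y)
    selected = [f for f, keep in zip(fea, selector.get_support()) if keep]
    print(selected)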

Task 6: Advanced Tree Models

    # Install LightGBM if needed
    # !pip install lightgbm

    # Split the data
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(train_df[fea], train_df[label],
                                                        test_size=0.2, random_state=2022)
    train_x.shape, test_x.shape, train_y.shape, test_y.shape

    # LightGBM model training
    import lightgbm as lgb
    model_lgb = lgb.LGBMClassifier()
    model_lgb.fit(train_x, train_y)

    # LightGBM model evaluation
    test_y_lgb = model_lgb.predict(test_x)
    score = f1_score(test_y, test_y_lgb)
    print(score)

After installing LightGBM, re-splitting into training and validation sets, and training and predicting with an LGBM model, the improvement is substantial: the local score is 0.9492847854356306. The prediction file was submitted to the competition as well.

Tuning with a GridSearchCV parameter search:

    # Parameter search with GridSearchCV
    from sklearn.model_selection import GridSearchCV
    params_lgb = {'learning_rate': [0.005, 0.01, 0.05],
                  'n_estimators': [100, 300, 500],
                  'max_depth': [9, 11, 13],
                  'num_leaves': [31, 35, 39]}
    best_model = GridSearchCV(model_lgb, param_grid=params_lgb, refit=True, cv=5).fit(train_x, train_y)
    print('best parameters:', best_model.best_params_)

The search returns the best parameters {'learning_rate': 0.005, 'max_depth': 13, 'n_estimators': 300}. Plugging these back into the model, predicting, and resubmitting improves the score by another small margin (a sketch of this step follows).
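The refit-and-resubmit step is not shown above; a minimal sketch, reusing the submission format from the logistic regression baseline:

    # Refit on the full training set with the best-found parameters
    best_lgb = lgb.LGBMClassifier(learning_rate=0.005, max_depth=13, n_estimators=300)
    best_lgb.fit(train_df[fea], train_df[label])
    test_df['label'] = best_lgb.predict(test_df[fea])
    test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)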

Task 7: Multi-Fold Training and Ensembling

First, try KFold for splitting the data; note that kf.split yields row indices:

    # Split the data with KFold
    from sklearn.model_selection import KFold, StratifiedKFold
    kf = KFold(n_splits=2)
    for train_index, test_index in kf.split(train_df):
        train_x, test_x = train_df.loc[train_index][fea], train_df.loc[test_index][fea]
        train_y, test_y = train_df.loc[train_index][label], train_df.loc[test_index][label]
    train_x.shape, test_x.shape, train_y.shape, test_y.shape

Splitting the data with StratifiedKFold:

    # Split the data with StratifiedKFold (preserves the label ratio in each fold)
    from sklearn.model_selection import KFold, StratifiedKFold
    skf = StratifiedKFold(n_splits=2)
    for train_index, test_index in skf.split(train_df, train_df[label]):
        train_x, test_x = train_df.loc[train_index][fea], train_df.loc[test_index][fea]
        train_y, test_y = train_df.loc[train_index][label], train_df.loc[test_index][label]
    train_x.shape, test_x.shape, train_y.shape, test_y.shape

Training and evaluating LightGBM on the StratifiedKFold split:

    # LightGBM model training
    import lightgbm as lgb
    model_lgb = lgb.LGBMClassifier(learning_rate=0.005, max_depth=13, n_estimators=500)
    model_lgb.fit(train_x, train_y)

    # LightGBM model evaluation
    test_y_lgb = model_lgb.predict(test_x)
    score = f1_score(test_y, test_y_lgb)
    print(score)

The validation score here is 0.9406867845993757, which honestly does not feel like much of an improvement...

After repeated testing and tuning, the best parameters came out as {'learning_rate': 0.01, 'max_depth': 9, 'n_estimators': 300, 'num_leaves': 31}. Plugging them in, predicting, and resubmitting (see the per-fold sketch below):
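For reference, the snippet above only ends up training on whatever the last fold of the loop produced. What the task title calls for, training one model per fold and averaging the test predictions, would look roughly like this sketch (using the parameters found above):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score
    import lightgbm as lgb

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
    test_pred = np.zeros(len(test_df))
    scores = []
    for trn_idx, val_idx in skf.split(train_df[fea], train_df[label]):
        clf = lgb.LGBMClassifier(learning_rate=0.01, max_depth=9,
                                 n_estimators=300, num_leaves=31)
        clf.fit(train_df[fea].iloc[trn_idx], train_df[label].iloc[trn_idx])
        scores.append(f1_score(train_df[label].iloc[val_idx],
                               clf.predict(train_df[fea].iloc[val_idx])))
        # Average the positive-class probability over the five fold models
        test_pred += clf.predict_proba(test_df[fea])[:, 1] / 5
    print(scores, sum(scores) / 5)
    test_df['label'] = (test_pred > 0.5).astype(int)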

!!!Stacking!!!

First, train five different models with 5-fold cross-validation:

    def train_predict(train, test, best_clf, features, label):
        print('train_predict...')
        prediction_test = 0
        cv_score = []
        prediction_train = pd.Series([], dtype='float64')
        kf = KFold(n_splits=5, random_state=22, shuffle=True)
        for train_part_index, eval_index in kf.split(train[features], train[label]):
            best_clf.fit(train[features].loc[train_part_index].values,
                         train[label].loc[train_part_index].values)
            # Accumulate test-set votes across the five folds
            prediction_test += best_clf.predict(test[features].values)
            # Out-of-fold predictions on the held-out part
            eval_pre = best_clf.predict(train[features].loc[eval_index].values)
            score = f1_score(train[label].loc[eval_index].values, eval_pre)
            cv_score.append(score)
            print(score)
            prediction_train = pd.concat([prediction_train,
                                          pd.Series(eval_pre, index=eval_index)])
        print(cv_score, sum(cv_score) / 5)
        # Return column vectors so they can be hstacked for stacking
        return (prediction_train.sort_index().values.reshape(-1, 1),
                (prediction_test / 5).reshape(-1, 1))
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression

    model = {}
    model['rfc'] = RandomForestClassifier(max_depth=17, min_samples_leaf=2, n_estimators=100)
    model['gdbt'] = GradientBoostingClassifier(learning_rate=0.005, max_depth=13, n_estimators=500)
    model['cart'] = DecisionTreeClassifier(max_depth=7, min_samples_leaf=2)
    model['knn'] = KNeighborsClassifier()
    model['lr'] = LogisticRegression(max_iter=40)

    # Collect each model's out-of-fold and test predictions for stacking
    oof, predictions = {}, {}
    for name in model:
        oof[name], predictions[name] = train_predict(train_df, test_df, model[name], fea, label)

Stack the five models' out-of-fold predictions and test-set predictions, then fit a second-level model to produce the final predictions:

    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import BayesianRidge

    def stack_model(oof_1, oof_2, oof_3, oof_4, oof_5,
                    predictions_1, predictions_2, predictions_3, predictions_4, predictions_5,
                    label):
        # Out-of-fold predictions become the second-level training features,
        # averaged test predictions become the second-level test features
        train_stack = np.hstack([oof_1, oof_2, oof_3, oof_4, oof_5])
        test_stack = np.hstack([predictions_1, predictions_2, predictions_3,
                                predictions_4, predictions_5])
        oof = np.zeros(train_stack.shape[0])
        predictions = np.zeros(test_stack.shape[0])
        cv_score = []
        folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2020)
        for foldn, (trn_idx, val_idx) in enumerate(folds.split(train_stack, label)):
            print("-" * 10 + "Stacking fold " + str(foldn + 1) + "-" * 10)
            trn_data, trn_y = train_stack[trn_idx], label[trn_idx]
            val_data, val_y = train_stack[val_idx], label[val_idx]
            clf = BayesianRidge()
            clf.fit(trn_data, trn_y)
            oof[val_idx] = clf.predict(val_data)
            predictions += clf.predict(test_stack) / (5 * 2)
            eval_pre = oof[val_idx] > 0.5
            score = f1_score(val_y, eval_pre)
            cv_score.append(score)
            print(score)
        print(cv_score, sum(cv_score) / 10)
        return predictions

    label_value = train_df['患有糖尿病标识'].values
    predictions_stack = stack_model(oof['cart'], oof['gdbt'], oof['knn'], oof['lr'], oof['rfc'],
                                    predictions['cart'], predictions['gdbt'], predictions['knn'],
                                    predictions['lr'], predictions['rfc'], label_value)

Time to submit the final result~~~ (see the sketch below for the file write-out)
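The write-out of the final file is not shown in the original; a minimal sketch, thresholding the stacked predictions at 0.5 just as stack_model does during evaluation:

    # Binarize the stacked predictions and write the submission file
    test_df['label'] = (predictions_stack > 0.5).astype(int)
    test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)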

The score is 0.96069, a bit better than the previous attempts, but there still seems to be room for improvement, so the research continues!

And that wraps up this round of tasks~~ time to celebrate~~
