The Diabetes Genetic Risk Detection Challenge (糖尿病遗传风险检测挑战赛) is an algorithm competition on the iFLYTEK open platform. The dataset is not large and the feature set is not complicated, which makes it a good entry point for learning classification algorithms. I took part through the Coggle learning program, hoping to further sharpen my data mining and modeling skills.
Downloading the competition data requires real-name verification. After importing the dependencies and loading the dataset, the data looks like this:
- import pandas as pd
- import lightgbm as lgb
- import numpy as np
- import matplotlib.pyplot as plt
- import seaborn as sns
-
- from hyperopt import hp, fmin, tpe
-
- sns.set_style({'font.sans-serif':['SimHei']})
- plt.rcParams['font.sans-serif'] = ['SimHei']   # use a Chinese-capable font
- plt.rcParams['axes.unicode_minus'] = False     # render minus signs correctly
-
- train_df=pd.read_csv('比赛训练集.csv',encoding='gbk')
- test_df=pd.read_csv('比赛测试集.csv',encoding='gbk')
-
- print('train set shape:', train_df.shape)
- print('test set shape:', test_df.shape)
-
- train_df.head()
- print(train_df.dtypes)
- print(test_df.dtypes)
From the table preview and the dtypes we can make an initial judgment: the 'float64' features — 体重指数 (BMI), 舒张压 (diastolic blood pressure), 口服耐糖量测试 (oral glucose tolerance test), 胰岛素实验 (insulin test), and 肱三头肌皮褶厚度 (triceps skinfold thickness) — are numeric and most likely continuous, while 性别 (gender) and 糖尿病家族史 (diabetes family history) are categorical.
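As a quick sanity check (my own addition, not in the original run), counting distinct values per column separates the continuous columns from the categorical ones:
- # continuous features have many unique values; categorical ones only a few
- for col in train_df.columns:
-     print(col, train_df[col].dtype, train_df[col].nunique())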
- print(train_df.isnull().mean())
- print(test_df.isnull().mean())
We can see that only the 舒张压 (diastolic blood pressure) field has missing values, in both the train and test sets, and the missing rate is under 5% — low enough that the column does not need to be dropped.
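For reference, the preprocessing further down fills the missing 舒张压 with 0; a training-set median fill is a common alternative worth trying — a sketch, not what this run used:
- # alternative imputation: training-set median instead of 0
- szy_median = train_df['舒张压'].median()
- train_df['舒张压'] = train_df['舒张压'].fillna(szy_median)
- test_df['舒张压'] = test_df['舒张压'].fillna(szy_median)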
3.1 Overall feature correlations — a heatmap makes the strength of each correlation easy to read
- train_df.corr()
- sns.heatmap(train_df.corr())
3.2 Correlations between specific features and the label — countplot shows how a single feature relates to the label, while boxplot/violinplot show a second feature's distribution against the label. This kind of visualization is a good source of feature engineering ideas.
- sns.countplot(x='患有糖尿病标识', hue='性别', data=train_df)
- sns.boxplot(y='出生年份', x='患有糖尿病标识', hue='性别', data=train_df)
- sns.violinplot(y='体重指数', x='患有糖尿病标识', hue='性别', data=train_df)
These plots look surprisingly slick, especially the violin plot — seaborn is awesome~~~
First do some simple feature processing, then try a logistic regression baseline.
- # map 糖尿病家族史 to ordinal codes; the raw data spells the
- # uncle/aunt category two different ways, so both map to 1
- dict_糖尿病家族史 = {
-     '无记录': 0,
-     '叔叔或姑姑有一方患有糖尿病': 1,
-     '叔叔或者姑姑有一方患有糖尿病': 1,
-     '父母有一方患有糖尿病': 2
- }
- train_df['糖尿病家族史'] = train_df['糖尿病家族史'].map(dict_糖尿病家族史)
- test_df['糖尿病家族史'] = test_df['糖尿病家族史'].map(dict_糖尿病家族史)
-
- train_df['舒张压'].fillna(0, inplace=True)
- test_df['舒张压'].fillna(0, inplace=True)
-
- fea = list(train_df.columns)
- fea.remove('患有糖尿病标识')
- label = '患有糖尿病标识'
-
- # import logistic regression from sklearn
- from sklearn.linear_model import LogisticRegression
- model = LogisticRegression()
- model.fit(train_df[fea],train_df[label])
-
- # predict on the competition test set and write the submission file
- test_df['label'] = model.predict(test_df[fea])
- test_df.rename({'编号':'uuid'},axis=1)[['uuid','label']].to_csv('submit.csv',index=None)
Submitting this prediction file to the competition gives an initial score of 0.72468.
Next, hold out 20% of the training data as a local test set for training and tuning.
- # split off a local test set
- from sklearn.model_selection import train_test_split
- train_x,test_x,train_y,test_y = train_test_split(train_df[fea],train_df[label],test_size=0.2,random_state=2022)
- train_x.shape,test_x.shape,train_y.shape,test_y.shape
-
- # train
- from sklearn.linear_model import LogisticRegression
- model = LogisticRegression(max_iter=1000,C=0.1)
- model.fit(train_x,train_y)
- # evaluate with F1 on the holdout
- from sklearn.metrics import f1_score
- pre_test_y = model.predict(test_x)
- score = f1_score(test_y,pre_test_y)
- print(score)
After some repeated tuning, lowering max_iter seemed to help a little, but the resulting F1 score is still low: only 0.757.
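The "repeated tuning" can be made systematic with a small manual grid over C and max_iter, scored with F1 on the holdout — my own sketch of that loop:
- # sweep C and max_iter on the holdout and print each combination's F1
- for c in [0.01, 0.1, 1, 10]:
-     for iters in [40, 100, 1000]:
-         m = LogisticRegression(max_iter=iters, C=c)
-         m.fit(train_x, train_y)
-         print(c, iters, f1_score(test_y, m.predict(test_x)))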
- # mean 体重指数 and 舒张压 per gender
- train_df.groupby('性别')['舒张压'].mean()
- train_df.groupby('性别')['体重指数'].mean()
- # per-patient absolute deviation from the gender mean
- train_df['sex_szy'] = abs(train_df['舒张压'] - train_df.groupby('性别')['舒张压'].transform('mean'))
- train_df['sex_tzzs'] = abs(train_df['体重指数'] - train_df.groupby('性别')['体重指数'].transform('mean'))
- # the test set needs the same features before the final submission
- test_df['sex_szy'] = abs(test_df['舒张压'] - test_df.groupby('性别')['舒张压'].transform('mean'))
- test_df['sex_tzzs'] = abs(test_df['体重指数'] - test_df.groupby('性别')['体重指数'].transform('mean'))
- # feature list and label
- fea = list(train_df.columns)
- fea.remove('患有糖尿病标识')
- label = '患有糖尿病标识'
-
- # split off a local test set
- from sklearn.model_selection import train_test_split
- train_x,test_x,train_y,test_y = train_test_split(train_df[fea],train_df[label],test_size=0.2,random_state=2022)
- train_x.shape,test_x.shape,train_y.shape,test_y.shape
-
- # logistic regression training
- from sklearn.linear_model import LogisticRegression
- model_lr = LogisticRegression(max_iter=40,C=1)
- model_lr.fit(train_x,train_y)
-
- # logistic regression evaluation
- from sklearn.metrics import f1_score
- test_y_lr = model_lr.predict(test_x)
- score = f1_score(test_y,test_y_lr)
- print(score)
This returns 0.774468085106383, an improvement over the earlier 0.757. Evidently the distributions of diastolic blood pressure and BMI differ between men and women, so the same feature value carries different predictive information for the label depending on gender and is worth conditioning on. Because the group means do not differ that much, the gain is modest.
Exploring other features:
- train_df['sex_gstj'] = abs(train_df['肱三头肌皮褶厚度'] - train_df.groupby('性别')['肱三头肌皮褶厚度'].transform('mean'))
I first tried the same trick on 肱三头肌皮褶厚度 (triceps skinfold thickness), but the score dropped slightly instead of improving.
- train_df['age'] = 2022 - train_df['出生年份']
- test_df['age'] = 2022 - test_df['出生年份']
Converting 出生年份 (birth year) into age then gives another tiny improvement: 0.7751060820367751.
- # decision tree training
- from sklearn.tree import DecisionTreeClassifier
- model_dtc = DecisionTreeClassifier()
- model_dtc.fit(train_x,train_y)
-
- # inspect feature importances
- fea_imp = pd.DataFrame([*zip(fea,model_dtc.feature_importances_)])
- fea_imp = fea_imp.rename({0:'fea',1:'imp'},axis=1).sort_values(by='imp',ascending=False)
- fea_imp
A decision tree is trained on the data and the top five features by feature_importances_ are selected. Each of the top five has an importance above 0.07; none of our hand-crafted features has made the top five yet.
- # take the five most important features
- fea_top5 = list(fea_imp[:5]['fea'])
-
- # logistic regression on the top-5 features only
- from sklearn.linear_model import LogisticRegression
- model_lr = LogisticRegression(max_iter=40,C=1)
- model_lr.fit(train_x[fea_top5],train_y)
-
- # evaluate the top-5 model
- from sklearn.metrics import f1_score
- test_y_lr = model_lr.predict(test_x[fea_top5])
- score = f1_score(test_y,test_y_lr)
- print(score)
Training on only the top-5 features scores 0.7418899858956276, somewhat below the full model's 0.775 but in the same ballpark, suggesting the top-5 features already carry most of the useful signal.
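One way to quantify "carry most of the useful signal" is the cumulative importance of the ranked features — a quick check I added on top of the fea_imp table:
- # cumulative share of importance captured by the top-k features
- print(fea_imp['imp'].cumsum().head(5))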
- # install lightgbm if needed
- # !pip install lightgbm
-
- # split the data
- from sklearn.model_selection import train_test_split
- train_x,test_x,train_y,test_y = train_test_split(train_df[fea],train_df[label],test_size=0.2,random_state=2022)
- train_x.shape,test_x.shape,train_y.shape,test_y.shape
-
- # LightGBM training
- import lightgbm as lgb
- model_lgb = lgb.LGBMClassifier()
- model_lgb.fit(train_x,train_y)
-
- # LightGBM evaluation
- test_y_lgb = model_lgb.predict(test_x)
- score = f1_score(test_y,test_y_lgb)
- print(score)
After installing lightgbm, re-splitting into train and validation sets, and training an LGBM model, the improvement is substantial: the local score is 0.9492847854356306. I then submitted the new prediction file to the competition.
Hyperparameter search with GridSearchCV
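The submission step is not shown above; assuming test_df has gone through the same feature processing as the training set, it would look roughly like this sketch, mirroring the earlier logistic regression submission:
- # predict on the competition test set and write the submission file
- test_df['label'] = model_lgb.predict(test_df[fea])
- test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)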
- # GridSearchCV parameter search
-
- from sklearn.model_selection import GridSearchCV
-
- params_lgb = {'learning_rate':[0.005,0.01,0.05],
-               'n_estimators':[100,300,500],
-               'max_depth':[9,11,13],
-               'num_leaves':[31,35,39]}
-
- best_model = GridSearchCV(model_lgb,param_grid=params_lgb,refit=True,cv=5).fit(train_x,train_y)
- print('best parameters:',best_model.best_params_)
The search returns best parameters {'learning_rate': 0.005, 'max_depth': 13, 'n_estimators': 300}; plugging them into the model, predicting, and resubmitting improves the score by a hair.
Next, try KFold for the data split; note that kf.split yields index arrays:
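Because refit=True, best_model is already refit with the best parameters and can be evaluated on the holdout directly — a quick check:
- # the refitted best estimator can predict directly
- test_y_best = best_model.predict(test_x)
- print(f1_score(test_y, test_y_best))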
- # split the data with KFold
- from sklearn.model_selection import KFold,StratifiedKFold
-
- kf = KFold(n_splits=2)
- for train_index, test_index in kf.split(train_df):
-     train_x, test_x = train_df.loc[train_index][fea], train_df.loc[test_index][fea]
-     train_y, test_y = train_df.loc[train_index][label], train_df.loc[test_index][label]
- train_x.shape,test_x.shape,train_y.shape,test_y.shape
Split the data with StratifiedKFold
- # split the data with StratifiedKFold, preserving the label ratio in each fold
- from sklearn.model_selection import KFold,StratifiedKFold
- skf = StratifiedKFold(n_splits=2)
- for train_index, test_index in skf.split(train_df,train_df[label]):
-     train_x, test_x = train_df.loc[train_index][fea], train_df.loc[test_index][fea]
-     train_y, test_y = train_df.loc[train_index][label], train_df.loc[test_index][label]
- train_x.shape,test_x.shape,train_y.shape,test_y.shape
Train and evaluate LightGBM on the StratifiedKFold split
- # LightGBM training with the searched parameters
- import lightgbm as lgb
- model_lgb = lgb.LGBMClassifier(learning_rate=0.005, max_depth=13, n_estimators=500)
- model_lgb.fit(train_x,train_y)
-
- # LightGBM evaluation
- test_y_lgb = model_lgb.predict(test_x)
- score = f1_score(test_y,test_y_lgb)
- print(score)
The test score here is 0.9406867845993757 — not obviously better, honestly...
After more rounds of testing and tuning, the best parameters came out as {'learning_rate': 0.01, 'max_depth': 9, 'n_estimators': 300, 'num_leaves': 31}; plug them into the model, predict, and submit again.
First, train five different models with 5-fold cross-validation
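Plugging the retuned parameters back in looks like this (a sketch; the submission file is then written the same way as before):
- # retrain with the retuned parameters
- model_lgb = lgb.LGBMClassifier(learning_rate=0.01, max_depth=9, n_estimators=300, num_leaves=31)
- model_lgb.fit(train_x, train_y)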
- def train_predict(train, test, best_clf, features, label):
-
-     print('train_predict...')
-     prediction_test = 0
-     cv_score = []
-     oof_parts = []
-     kf = KFold(n_splits=5, random_state=22, shuffle=True)
-     for train_part_index, eval_index in kf.split(train[features], train[label]):
-         best_clf.fit(train[features].loc[train_part_index].values, train[label].loc[train_part_index].values)
-         prediction_test += best_clf.predict(test[features].values)
-         eval_pre = best_clf.predict(train[features].loc[eval_index].values)
-         score = f1_score(train[label].loc[eval_index].values, eval_pre)
-         cv_score.append(score)
-         print(score)
-         # collect out-of-fold predictions (Series.append was removed in pandas 2.0)
-         oof_parts.append(pd.Series(eval_pre, index=eval_index))
-
-     print(cv_score, sum(cv_score) / 5)
-
-     # return OOF train predictions and fold-averaged test predictions
-     # as column vectors, ready to be hstack-ed for stacking
-     oof = pd.concat(oof_parts).sort_index().values.reshape(-1, 1)
-     return oof, (prediction_test / 5).reshape(-1, 1)
- from sklearn.model_selection import cross_val_score
- from sklearn.model_selection import train_test_split
-
- from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
- from sklearn.tree import DecisionTreeClassifier
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.linear_model import LogisticRegression
-
-
- model = {}
- model['rfc'] = RandomForestClassifier(max_depth=17,min_samples_leaf=2,n_estimators=100)
- model['gdbt'] = GradientBoostingClassifier(learning_rate=0.005,max_depth=13,n_estimators=500)
- model['cart'] = DecisionTreeClassifier(max_depth=7,min_samples_leaf=2)
- model['knn'] = KNeighborsClassifier()
- model['lr'] = LogisticRegression(max_iter=40)
-
- # keep each model's OOF train predictions and test predictions for stacking
- oof, pred = {}, {}
- for name in model:
-     oof[name], pred[name] = train_predict(train_df, test_df, model[name], fea, label)
Stack the five models' out-of-fold predictions on the training set together with their averaged test-set predictions, then generate the final prediction:
- # the second-level model and the repeated splitter need importing
- from sklearn.linear_model import BayesianRidge
- from sklearn.model_selection import RepeatedKFold
-
- def stack_model(oof_1,oof_2,oof_3,oof_4,oof_5,predictions_1,predictions_2,predictions_3,predictions_4,predictions_5,label):
-
-     # second-level inputs: one column per first-level model
-     train_stack = np.hstack([oof_1, oof_2, oof_3, oof_4, oof_5])
-     test_stack = np.hstack([predictions_1, predictions_2, predictions_3, predictions_4, predictions_5])
-     oof = np.zeros(train_stack.shape[0])
-     predictions = np.zeros(test_stack.shape[0])
-
-     cv_score = []
-     folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2020)
-
-     for foldn, (trn_idx, val_idx) in enumerate(folds.split(train_stack, label)):
-         print("fold {}".format(foldn+1))
-         trn_data, trn_y = train_stack[trn_idx], label[trn_idx]
-         val_data, val_y = train_stack[val_idx], label[val_idx]
-         print("-" * 10 + "Stacking " + str(foldn+1) + "-" * 10)
-         clf = BayesianRidge()
-         clf.fit(trn_data, trn_y)
-         oof[val_idx] = clf.predict(train_stack[val_idx])
-         predictions += clf.predict(test_stack) / (5 * 2)
-         eval_pre = oof[val_idx] > 0.5
-         score = f1_score(label[val_idx], eval_pre)
-         cv_score.append(score)
-         print(score)
-     print(cv_score, sum(cv_score) / 10)
-
-     return predictions
-
- label_value = train_df['患有糖尿病标识'].values
- predictions_stack = stack_model(oof['cart'], oof['gdbt'], oof['knn'], oof['lr'], oof['rfc'],
-                                 pred['cart'], pred['gdbt'], pred['knn'], pred['lr'], pred['rfc'], label_value)
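stack_model returns continuous BayesianRidge outputs, so they need the same 0.5 threshold used inside the function before submission — my sketch of the final step, assuming test_df row order matches the predictions:
- # threshold the stacked regression outputs into 0/1 labels and write the submission
- test_df['label'] = (predictions_stack > 0.5).astype(int)
- test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)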
Submit the final result~~~
The score is 0.96069, a bit better than the previous submissions, but there still feels like room for improvement — more experimenting needed.
And that's a wrap for this task ~~ confetti ~~