
Food and Disease Relationship Prediction Algorithm Track: Baseline

Food and Disease Relationship Prediction Algorithm Track

This article comes from a featured project in the AI Studio community.

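1. Competition Background

Electronic medical records, artificial intelligence, IoT devices and 5G are evolving rapidly and, together with large-scale data resources, now underpin digital healthcare.
For pharmaceutical companies, medical institutions and technology companies, a shared concern is how to combine these new technical capabilities, talent and data resources to drive innovation in digital healthcare.

The Digital Healthcare Algorithm Application Innovation Competition is an active exploration of this topic. It aims to make the most of today's fast-developing algorithms and data resources, to train digital healthcare talent and to encourage innovation. The competition has two tracks: the Food and Disease Relationship Prediction Algorithm track and the Bio-integration and Digital Therapeutics Application track.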

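1.1 Task

This track provides de-identified food and disease features. Working with highly sparse data, teams are expected to mine and fuse features further and to design models that predict the relationship between a food and a disease. The preliminary round is a binary classification problem with labels 0 (unrelated) and 1 (a positive or negative effect exists).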

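1.2 Data Overview

The competition provides more than 235,000 food-disease pairs with quantified scores. There are more than 200 food features. Disease features are extracted in 3 different ways and total more than 4,000. The preliminary round is a 0/1 binary prediction task that provides food features, disease features and the food-disease relationship labels.

The semi-final evaluates both the 0/1 classification and a relevance rating. For that stage, relevance rating labels for food-disease pairs are added to the original training set.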

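1.2.1 Training Set

The training set contains the disease features, the food features (348 foods in total) and the food-disease relationships used for model training:

  • Disease feature sets: disease_feature1.csv, disease_feature2.csv, disease_feature3.csv
  • Food feature set: train_food.csv
  • Food-disease relationships: train_answer.csv
  • (Semi-final stage) relevance ratings of food-disease relationships: semi_train_answer.csv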

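1.2.2 Test Set

The preliminary test set is released in two stages (leaderboards A and B), without labels:

  • Stage 1, leaderboard A test set: 12:00:00 noon on 22 February 2023 to 12:00:00 noon on 20 March 2023. It contains the stage A food features (115 foods in total) and the stage A submission sample, used to validate model results:
    preliminary_a_food.csv
    preliminary_a_submit_sample.csv
  • Stage 2, leaderboard B test set: 12:00:00 noon on 20 March 2023 to 12:00:00 noon on 22 March 2023. It contains the stage B food features (116 foods in total) and the stage B submission sample, used to validate model results:
    preliminary_b_food.csv
    preliminary_b_submit_sample.csv
  • The stage 2 (B) test set has the same distribution and scale as the stage 1 (A) test set and becomes available for download on the competition home page once B submissions open. The final preliminary ranking is based on the leaderboard B score.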

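1.3 Field Description

1.3.1 Disease Features

In total there are 4,630 features for 407 diseases. The three extraction methods split the disease features into three feature sets, and the data are highly sparse.

| Field | Type | Description | Range, feature set 1 | Range, feature set 2 | Range, feature set 3 |
| --- | --- | --- | --- | --- | --- |
| disease_id | string | disease id | 220 diseases | 301 diseases | 392 diseases |
| F_x | float | disease feature value | F_0 ~ F_4629, field names not contiguous, 996 disease features | F_0 ~ F_4629, field names not contiguous, 3181 disease features | F_1 ~ F_4627, field names not contiguous, 1453 disease features |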
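
Because the three feature sets cover different subsets of diseases (220, 301 and 392 of the 407), it can be useful to check coverage before merging them. The following is a minimal sketch on toy frames: `f1`, `f2`, `f3` and all their values are invented for illustration, and only the `disease_id` column name is taken from the field description above.

```python
import pandas as pd

# Toy stand-ins for disease_feature1/2/3.csv (all values are invented).
f1 = pd.DataFrame({"disease_id": ["disease_0", "disease_1"], "F_0": [1.0, None]})
f2 = pd.DataFrame({"disease_id": ["disease_1", "disease_2"], "F_3": [0.5, 2.0]})
f3 = pd.DataFrame({"disease_id": ["disease_0", "disease_2", "disease_3"],
                   "F_1": [None, 1.0, 3.0]})

sets = {"set1": f1, "set2": f2, "set3": f3}

# Union of all disease ids that appear in at least one feature set.
all_ids = sorted(set().union(*(df["disease_id"] for df in sets.values())))

# One row per disease, one boolean column per feature set.
coverage = pd.DataFrame(
    {name: pd.Series(all_ids).isin(df["disease_id"]).values for name, df in sets.items()},
    index=all_ids,
)

print(coverage.sum())                       # number of diseases covered by each set
print((coverage.sum(axis=1) == 3).sum())    # diseases present in all three sets
```

Diseases missing from a feature set end up with NaN in that set's columns after a left merge, so a table like `coverage` makes it easier to see how many rows are affected.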

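1.3.2 Food Features

| No. | Field | Type | Description | Example |
| --- | --- | --- | --- | --- |
| 1 | food_id | string | food id | food_0 |
| 2~213 | N_x | float | 212 food features, field names from N_0 to N_211 | 0.123 |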

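1.3.3 Food-Disease Relationships

| No. | Field | Type | Description | Example |
| --- | --- | --- | --- | --- |
| 1 | food_id | string | food id | food_0 |
| 2 | disease_id | string | disease id | disease_0 |
| 3 | related | integer | whether the food and the disease are related: 0 (unrelated), 1 (a positive or negative effect exists) | 0 |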

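1.4 Evaluation Method

1.5 Submission Format

| No. | Field | Type | Description | Example |
| --- | --- | --- | --- | --- |
| 1 | food_id | string | food id | food_0 |
| 2 | disease_id | string | disease id | disease_0 |
| 3 | related_prob | float | predicted probability that the food-disease pair belongs to class 1; if related_prob >= 0.5, the pair is treated as class 1 when the F1 score is computed | 0.1 |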
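
The thresholding rule in the table can be illustrated as follows. This is a sketch under an assumption: the exact scoring formula is not reproduced in this article, so the standard binary F1 on class 1 from scikit-learn is used here, and the labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented ground truth and predicted probabilities for six food-disease pairs.
y_true = np.array([0, 0, 1, 1, 1, 0])
related_prob = np.array([0.10, 0.60, 0.80, 0.40, 0.50, 0.20])

# Probabilities >= 0.5 count as class 1, as stated in the submission format.
y_pred = (related_prob >= 0.5).astype(int)

# Assumption: plain binary F1 on the positive class.
score = f1_score(y_true, y_pred)
print(y_pred, score)
```

Here two positives are found, one is missed and one negative is flagged by mistake, so precision and recall are both 2/3 and so is the F1 score. Because only the 0.5 cut-off matters for F1, it can pay to tune the probabilities on a validation fold so that the best decision threshold lands at 0.5.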

1.6 Competition Link

Competition link

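2. Project Overview

2.1 Motivation

In this food-disease relationship prediction challenge, teams receive a large de-identified dataset. The data cover many kinds of food and related health indicators, such as blood pressure and cholesterol, released as anonymized fields, which help us understand how different foods affect health. In practice the data are large and highly sparse, so conventional feature extraction struggles to pull out useful information. Teams therefore need to mine and fuse features further and design models that predict the relationship between food and disease.

In this baseline we try combining more elaborate features with a traditional machine learning model. When solving a machine learning problem, we generally follow the workflow below: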

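2.2 Model

LightGBM (Light Gradient Boosting Machine) is an open-source gradient boosting framework from Microsoft built on decision trees (GBDT, Gradient Boosting Decision Tree). It is designed for large datasets and supports classification, regression and ranking tasks. Its core idea is to add trees one at a time, each fitted to the gradient of the loss with respect to the current predictions, so that the error of the ensemble shrinks step by step. Its main characteristics are:

  • Lightweight: continuous features are bucketed into histograms, which lowers memory use and the cost of finding splits.
  • Efficient: trees grow leaf-wise, always splitting the leaf with the largest gain, and techniques such as Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) reduce the number of samples and features that have to be scanned.
  • Interpretable: as a tree model it reports feature importance (by split count or by gain), and it handles missing values natively, so features do not need to be normalized.
  • Easy to use: it offers a native Python API and a scikit-learn compatible interface, plus plotting helpers for feature importance and individual trees.
  • General: it supports custom objectives and metrics, categorical features, and parallel and GPU training.
  • In short, LightGBM is an efficient and easy-to-use gradient boosting framework that suits classification and regression on large, sparse tabular data.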
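
To make the boosting idea concrete, here is a minimal sketch of gradient boosting for binary classification, written with scikit-learn regression trees. It is not LightGBM's implementation (there is no histogram binning, GOSS, EFB or second-order information); the function names `fit_boosting` and `predict_proba` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_boosting(X, y, n_rounds=50, learning_rate=0.1, max_leaf_nodes=8):
    """Add trees one by one, each fitted to the negative gradient of the log loss."""
    prior = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    base_score = np.log(prior / (1 - prior))     # constant starting raw score
    raw = np.full(len(y), base_score)
    trees = []
    for _ in range(n_rounds):
        negative_gradient = y - sigmoid(raw)     # -d(log loss) / d(raw score)
        # max_leaf_nodes makes scikit-learn grow the tree best-first.
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
        tree.fit(X, negative_gradient)
        raw += learning_rate * tree.predict(X)
        trees.append(tree)
    return base_score, trees


def predict_proba(X, base_score, trees, learning_rate=0.1):
    raw = base_score + learning_rate * sum(tree.predict(X) for tree in trees)
    return sigmoid(raw)


# Synthetic data: the label depends on the sum of the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

base_score, trees = fit_boosting(X, y)
initial_loss = log_loss(y, np.full(len(y), y.mean()))
final_loss = log_loss(y, predict_proba(X, base_score, trees))
print(initial_loss, final_loss)
```

Each round moves the raw scores a small step in the direction that lowers the loss, which is why the training loss after 50 rounds is lower than that of the constant prediction.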

3. Detailed Implementation

3.1 Data Analysis

3.1.1 Unzipping the Data

!unzip  -d /home/aistudio/data/data200766/  /home/aistudio/data/data200766/初赛A榜测试集.zip
!unzip  -d /home/aistudio/data/data200766/  /home/aistudio/data/data200766/初赛B榜测试集.zip
!unzip  -d /home/aistudio/data/data200766/   /home/aistudio/data/data200766/训练集.zip
Archive:  /home/aistudio/data/data200766/初赛A榜测试集.zip
   creating: /home/aistudio/data/data200766/初赛A榜测试集/
  inflating: /home/aistudio/data/data200766/初赛A榜测试集/preliminary_a_submit_sample.csv  
  inflating: /home/aistudio/data/data200766/初赛A榜测试集/preliminary_a_food.csv  
Archive:  /home/aistudio/data/data200766/初赛B榜测试集.zip
   creating: /home/aistudio/data/data200766/初赛B榜测试集/
  inflating: /home/aistudio/data/data200766/初赛B榜测试集/preliminary_b_submit_sample.csv  
  inflating: /home/aistudio/data/data200766/初赛B榜测试集/preliminary_b_food.csv  
Archive:  /home/aistudio/data/data200766/训练集.zip
   creating: /home/aistudio/data/data200766/训练集/
  inflating: /home/aistudio/data/data200766/训练集/disease_feature3.csv  
  inflating: /home/aistudio/data/data200766/训练集/disease_feature2.csv  
  inflating: /home/aistudio/data/data200766/训练集/disease_feature1.csv  
  inflating: /home/aistudio/data/data200766/训练集/train_answer.csv  
  inflating: /home/aistudio/data/data200766/训练集/train_food.csv  

3.1.2 Installing and Importing Libraries

!pip install --upgrade pip
!git clone --recursive https://github.com/Microsoft/LightGBM
!cd LightGBM && rm -rf build && mkdir build && cd build && cmake -DUSE_GPU=1 ../../LightGBM && make -j4 && cd ../python-package && python3 setup.py install --precompile --gpu;


import pandas as pd
import os
import gc
import lightgbm as lgb
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler

import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

3.1.3 Loading the Data

disease_feature1 = pd.read_csv("/home/aistudio/data/data200766/训练集/disease_feature1.csv")
disease_feature2 = pd.read_csv("/home/aistudio/data/data200766/训练集/disease_feature2.csv")
disease_feature3 = pd.read_csv("/home/aistudio/data/data200766/训练集/disease_feature3.csv")

train_answer = pd.read_csv("/home/aistudio/data/data200766/训练集/train_answer.csv")
train_food = pd.read_csv("/home/aistudio/data/data200766/训练集/train_food.csv")

# Note: the variable names say "a", but the files loaded here are the leaderboard B test set
preliminary_a_food = pd.read_csv("/home/aistudio/data/data200766/初赛B榜测试集/preliminary_b_food.csv")
preliminary_a_submit_sample = pd.read_csv("/home/aistudio/data/data200766/初赛B榜测试集/preliminary_b_submit_sample.csv")
pd.set_option('display.max_columns', None)
del preliminary_a_submit_sample['related_prob']
data = pd.concat([train_answer, preliminary_a_submit_sample], axis = 0).reset_index(drop=True)
data.head()
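|   | food_id | disease_id | related |
| --- | --- | --- | --- |
| 0 | food_0 | disease_998 | 0.0 |
| 1 | food_0 | disease_861 | 0.0 |
| 2 | food_0 | disease_559 | 0.0 |
| 3 | food_0 | disease_841 | 0.0 |
| 4 | food_0 | disease_81 | 0.0 |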
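
The pattern used above, stacking labelled training rows on top of unlabelled test rows so that features can be built once for both, can be sketched on toy data. The frames `train` and `test` and their values are invented; only the column names follow the tables in section 1.

```python
import pandas as pd

# Invented stand-ins for train_answer.csv and a submission sample.
train = pd.DataFrame({
    "food_id": ["food_0", "food_1"],
    "disease_id": ["disease_5", "disease_7"],
    "related": [0, 1],
})
test = pd.DataFrame({
    "food_id": ["food_9"],
    "disease_id": ["disease_5"],
    "related_prob": [0.5],
})

# Drop the placeholder probability so the test rows carry no label-like column.
test = test.drop(columns=["related_prob"])
data = pd.concat([train, test], axis=0).reset_index(drop=True)

# Test rows get related = NaN after the concat, which separates the two parts later.
is_test = data["related"].isnull()
print(data)
print(is_test.tolist())   # [False, False, True]
```

The NaN values also force the `related` column to a float dtype, which is why the labels in the output above appear as 0.0 rather than 0.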

Here the number that follows each id prefix is used directly as the encoding. A LabelEncoder would work as well.

data['food'] = data['food_id'].apply(lambda x : int(x.split('_')[1]))
data['disease'] = data['disease_id'].apply(lambda x : int(x.split('_')[1]))
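
The two encoding options mentioned above behave slightly differently, which a toy series shows. The ids below are made up; `LabelEncoder` is the scikit-learn class.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ids = pd.Series(["food_0", "food_4", "food_10", "food_4"])

# Option 1: take the integer after the underscore (what the baseline does).
numeric_suffix = ids.str.split("_").str[1].astype(int)

# Option 2: LabelEncoder assigns consecutive codes in sorted (lexicographic) order,
# so "food_10" sorts before "food_4".
label_codes = LabelEncoder().fit_transform(ids)

print(numeric_suffix.tolist())   # [0, 4, 10, 4]
print(label_codes.tolist())      # [0, 2, 1, 2]
```

The numeric suffix keeps the original id numbers and needs no fitting step, while LabelEncoder produces dense codes whose order follows string sorting rather than the numbers.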
food = pd.concat([train_food, preliminary_a_food], axis = 0).reset_index(drop=True)
food.head()
(Output of food.head(): 5 rows, for food_0, food_1, food_4, food_5 and food_6, with the columns food_id and N_0 to N_211. Most values are NaN.)

3.1.4 EDA

# Inspect the missing rate of each column, highest first
pd.set_option('display.max_rows', None)
((food.isnull().sum())/food.shape[0]).sort_values(ascending=False).map(lambda x:"{:.2%}".format(x))


# Keep only the columns whose missing rate is below 10%
food=food[['N_198','N_33','N_211','N_82','N_101','N_42','N_111','N_165','N_177','N_146','N_17','N_113','N_106','N_14','N_74','N_209','N_188','food_id' ]] 
food.head(5)
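|   | N_198 | N_33 | N_211 | N_82 | N_101 | N_42 | N_111 | N_165 | N_177 | N_146 | N_17 | N_113 | N_106 | N_14 | N_74 | N_209 | N_188 | food_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.2 | 0.157 | 0.92 | 27.0 | 0.481 | 23.0 | 79.0 | 6.0 | 0.076 | 0.126 | 2.10 | 3.99 | 70.0 | 32.0 | 0.96 | 92.82 | 0.69 | food_0 |
| 1 | 0.0 | 1.099 | 3.31 | 279.0 | 3.637 | 598.0 | 713.0 | 3.0 | 0.077 | 1.197 | 21.01 | 20.96 | 471.0 | 268.0 | 3.73 | 2.41 | 52.54 | food_1 |
| 2 | 2.3 | 0.272 | 0.36 | 36.0 | 0.766 | 299.0 | 744.0 | 26.0 | 0.106 | 0.125 | 79.32 | 3.30 | 98.0 | 62.0 | 1.79 | 15.46 | 0.25 | food_4 |
| 3 | 10.0 | 0.078 | 0.20 | 10.0 | 0.600 | 124.5 | 259.0 | 1.0 | 0.030 | 0.040 | 11.12 | 1.40 | 23.0 | 13.0 | 0.39 | 86.35 | 0.39 | food_5 |
| 4 | 5.6 | 0.189 | 0.54 | 14.0 | 0.978 | 52.5 | 202.0 | 2.0 | 0.143 | 0.141 | 3.88 | 2.20 | 52.0 | 24.0 | 2.14 | 93.22 | 0.12 | food_6 |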
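
The column list above is hard-coded from the missing-rate table. As an alternative, the same selection can be derived from the threshold directly. This is a sketch on an invented frame `food_toy`; the 10% threshold is the one stated in the code comment.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the food feature table.
food_toy = pd.DataFrame({
    "food_id": ["food_0", "food_1", "food_2", "food_3"],
    "N_0": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "N_1": [0.1, 0.2, 0.3, 0.4],            # 0% missing
    "N_2": [1.0, np.nan, 2.0, 3.0],         # 25% missing
})

max_missing = 0.10
# isnull().mean() gives the same ratio as isnull().sum() / number of rows.
missing_rate = food_toy.isnull().mean()
keep_cols = missing_rate[missing_rate < max_missing].index.tolist()
food_small = food_toy[keep_cols]

print(keep_cols)   # ['food_id', 'N_1']
```

Deriving the list this way keeps the selection in sync with the data if the test set or the threshold changes. `food_id` has no missing values, so it is kept automatically.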

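3.2 Data Processing

3.2.1 Target Encoding

This task has only two categorical variables, food_id and disease_id, and every food_id in the test set is new. Target encoding is therefore applied to the disease only: the mean of related per disease is computed out-of-fold for the training rows and from the full training set for the test rows.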

cat_list = ['disease']
def stat(df, df_merge, group_by, agg):
    group = df.groupby(group_by).agg(agg)

    columns = []
    for on, methods in agg.items():
        for method in methods:
            columns.append('{}_{}_{}'.format('_'.join(group_by), on, method))
    group.columns = columns
    group.reset_index(inplace=True)
    df_merge = df_merge.merge(group, on=group_by, how='left')

    del (group)
    gc.collect()
    return df_merge


def statis_feat(df_know, df_unknow,cat_list):
    for f in tqdm(cat_list):
        df_unknow = stat(df_know, df_unknow, [f], {'related': ['mean']})

    return df_unknow


df_train = data[~data['related'].isnull()]
df_train = df_train.reset_index(drop=True)
df_test = data[data['related'].isnull()]

df_stas_feat = None
kf = StratifiedKFold(n_splits=5, random_state=2020, shuffle=True)
for train_index, val_index in kf.split(df_train, df_train['related']):
    df_fold_train = df_train.iloc[train_index]
    df_fold_val = df_train.iloc[val_index]

    df_fold_val = statis_feat(df_fold_train, df_fold_val,cat_list)
    df_stas_feat = pd.concat([df_stas_feat, df_fold_val], axis=0)

    del (df_fold_train)
    del (df_fold_val)
    gc.collect()

df_test = statis_feat(df_train, df_test,cat_list)
data = pd.concat([df_stas_feat, df_test], axis=0)
data = data.reset_index(drop=True)

del (df_stas_feat)
del (df_train)
del (df_test)
100%|██████████| 1/1 [00:00<00:00,  7.57it/s]
100%|██████████| 1/1 [00:00<00:00,  9.00it/s]
100%|██████████| 1/1 [00:00<00:00,  9.45it/s]
100%|██████████| 1/1 [00:00<00:00,  8.77it/s]
100%|██████████| 1/1 [00:00<00:00,  8.65it/s]
100%|██████████| 1/1 [00:00<00:00,  9.16it/s]
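
The out-of-fold scheme used above can be seen more easily on a tiny example. This is a simplified sketch, not the baseline code: the frame `df` is invented, four folds are used instead of five, and the mean is mapped directly rather than merged.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Invented labelled rows: two diseases with four rows each.
df = pd.DataFrame({
    "disease": [1, 1, 1, 1, 2, 2, 2, 2],
    "related": [1, 0, 1, 1, 0, 0, 1, 0],
})

df["disease_related_mean"] = np.nan
kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2020)
for fit_idx, val_idx in kf.split(df, df["related"]):
    # Means come from the other folds only, so a row never sees its own label.
    fold_means = df.iloc[fit_idx].groupby("disease")["related"].mean()
    df.loc[df.index[val_idx], "disease_related_mean"] = (
        df.iloc[val_idx]["disease"].map(fold_means).values
    )

# Leaky alternative for comparison: the mean over all rows, the row itself included.
df["leaky_mean"] = df.groupby("disease")["related"].transform("mean")
print(df)
```

For training rows the out-of-fold value differs from the leaky mean because the row's own label is excluded, which is what keeps the encoded feature from leaking the target. Test rows have no labels, so they can safely use the mean over the whole training set, as the baseline does.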

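3.2.2 Disease Feature Processing

TruncatedSVD is used to reduce the dimensionality of the disease features. Each of the three feature sets is reduced to 128 dimensions.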

# Names of the feature columns (not used below)
f_col = [col for col in disease_feature1.columns if 'F' in col]
from sklearn.decomposition import TruncatedSVD

# Missing values are treated as 0 before the decomposition
disease_feature_1_ = disease_feature1.copy()
disease_feature_1_ = disease_feature_1_.fillna(0)
decom = TruncatedSVD(n_components=128, n_iter=20, random_state=2023)

# Assumes the first column is disease_id, so only the remaining columns are decomposed
decom_x = decom.fit_transform(disease_feature_1_.iloc[:, 1:])
decom_feas = pd.DataFrame(decom_x)
decom_feas.columns = ['disease1_svd_' + str(i) for i in range(decom_x.shape[1])]

# Keep disease_id and attach the 128 SVD components
disease_feature1 = disease_feature1[['disease_id']].copy()
for col in decom_feas:
    disease_feature1[col] = decom_feas[col]
# Same procedure for feature set 2
f_col = [col for col in disease_feature2.columns if 'F' in col]
disease_feature_2_ = disease_feature2.copy()
disease_feature_2_ = disease_feature_2_.fillna(0)
decom = TruncatedSVD(n_components=128, n_iter=20, random_state=2023)

decom_x = decom.fit_transform(disease_feature_2_.iloc[:, 1:])
decom_feas = pd.DataFrame(decom_x)
decom_feas.columns = ['disease2_svd_' + str(i) for i in range(decom_x.shape[1])]

disease_feature2 = disease_feature2[['disease_id']].copy()
for col in decom_feas:
    disease_feature2[col] = decom_feas[col]
# Same procedure for feature set 3
f_col = [col for col in disease_feature3.columns if 'F' in col]
disease_feature_3_ = disease_feature3.copy()
disease_feature_3_ = disease_feature_3_.fillna(0)
decom = TruncatedSVD(n_components=128, n_iter=20, random_state=2023)

decom_x = decom.fit_transform(disease_feature_3_.iloc[:, 1:])
decom_feas = pd.DataFrame(decom_x)
decom_feas.columns = ['disease3_svd_' + str(i) for i in range(decom_x.shape[1])]

disease_feature3 = disease_feature3[['disease_id']].copy()
for col in decom_feas:
    disease_feature3[col] = decom_feas[col]
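
The three blocks above differ only in the input frame and the column prefix, so they could be folded into one helper. The following is a sketch, not the baseline's code: `svd_reduce` is a hypothetical name, and the toy matrix uses 8 components instead of 128 because it has only 50 features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD


def svd_reduce(df, prefix, n_components, id_col="disease_id", seed=2023):
    """Return id_col plus n_components SVD columns computed from the other columns."""
    features = df.drop(columns=[id_col]).fillna(0)
    svd = TruncatedSVD(n_components=n_components, n_iter=20, random_state=seed)
    reduced = svd.fit_transform(features)
    out = pd.DataFrame(
        reduced,
        columns=[f"{prefix}_svd_{i}" for i in range(reduced.shape[1])],
        index=df.index,
    )
    out.insert(0, id_col, df[id_col])
    return out, svd


# Invented sparse matrix: 30 diseases, 50 features, about 80% missing.
rng = np.random.default_rng(0)
values = rng.random((30, 50))
values[rng.random((30, 50)) < 0.8] = np.nan
toy = pd.DataFrame(values, columns=[f"F_{i}" for i in range(50)])
toy.insert(0, "disease_id", [f"disease_{i}" for i in range(30)])

reduced, svd = svd_reduce(toy, prefix="disease1", n_components=8)
print(reduced.shape)                          # (30, 9)
print(svd.explained_variance_ratio_.sum())    # share of variance kept by the components
```

Returning the fitted `svd` object makes it possible to inspect `explained_variance_ratio_`, which helps judge whether the chosen number of components keeps enough of the signal.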
data = data.merge(food, on = 'food_id', how = 'left')
data = data.merge(disease_feature1, on = 'disease_id', how = 'left')
data = data.merge(disease_feature2, on = 'disease_id', how = 'left')
data = data.merge(disease_feature3, on = 'disease_id', how = 'left')
data.head()
(Output of data.head(): the first rows of the merged table, all for food_0. The columns are food_id, disease_id, related, food, disease, disease_related_mean, the 17 retained food features from N_198 to N_188, and the SVD components disease1_svd_0 to disease1_svd_127, disease2_svd_0 to disease2_svd_127 and disease3_svd_0 to disease3_svd_127. In the rows shown, every disease1_svd_* value is NaN, while the disease2_svd_* and disease3_svd_* columns are filled.)
.0832310.0627590.1069860.043856-0.294729-0.0217930.070910-0.007633-0.0709040.0971670.1627140.0523390.1022750.007705-0.093869-0.0084470.0463570.154401-0.124018-0.0491330.308927-0.1013890.1074640.126912-0.0873740.120835-0.130838-0.116480-0.070292-0.0493930.0286960.1049590.1818120.097010-0.0754810.084535-0.1225470.0467160.049885-0.0948280.062870-0.0629860.0305450.001237-0.047724-0.0229910.0045140.0558430.0488170.0448050.058769-0.0707730.0885680.020527-0.0167020.049144-0.057848-0.151351-0.1532550.0078000.0521510.0533660.039268-0.0067720.077348-0.028753-0.1188660.028853-0.078241-0.035130-0.033633-0.015799-0.0429210.0350310.071032-0.0972310.081445-0.0147320.010713

  • 1

3.2.3 Cross features

Here the features that rank highest by feature importance are selected and crossed pairwise.

topn = ['N_33', 'N_198', 'N_74','N_188','N_82','N_42','N_111','disease','food']
for i in range(len(topn)):
    for j in range(i + 1, len(topn)):
        data[f'{topn[i]}+{topn[j]}'] = data[topn[i]] + data[topn[j]]
        data[f'{topn[i]}-{topn[j]}'] = data[topn[i]] - data[topn[j]]
        data[f'{topn[i]}*{topn[j]}'] = data[topn[i]] * data[topn[j]]
        data[f'{topn[i]}/{topn[j]}'] = data[topn[i]] / (data[topn[j]]+1e-5)
drop_cols = ['disease_id', 'food_id', 'related']
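
To make the crossing loop above concrete, here is a minimal sketch on a toy DataFrame. The column names `a`, `b`, `c` and the helper `add_pairwise_crosses` are hypothetical and chosen only for illustration; the arithmetic mirrors the four operations used above, including the `1e-5` offset in the denominator.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_crosses(df, cols, eps=1e-5):
    """Return a copy of df with +, -, *, / crosses for every pair in cols.

    Hypothetical helper for illustration; it mirrors the nested loop above.
    """
    out = df.copy()
    for left, right in combinations(cols, 2):
        out[f'{left}+{right}'] = out[left] + out[right]
        out[f'{left}-{right}'] = out[left] - out[right]
        out[f'{left}*{right}'] = out[left] * out[right]
        # eps keeps the ratio finite when the denominator is exactly zero
        out[f'{left}/{right}'] = out[left] / (out[right] + eps)
    return out


toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 0.0], 'c': [5.0, 6.0]})
crossed = add_pairwise_crosses(toy, ['a', 'b', 'c'])
print(crossed.shape)  # 3 original columns + 3 pairs * 4 operations
```

With the 9 columns listed in `topn`, the same scheme yields 36 pairs and therefore 144 new columns.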

3.2.3 Feature selection

Remove features that take only a single value.

for f in data.columns:
    if data[f].nunique() < 2:
        drop_cols.append(f)
# Rows without a label form the test set; labelled rows form the training set
test_df = data[data["related"].isnull()].copy().reset_index(drop=True)
train_df = data[data["related"].notnull()].copy().reset_index(drop=True)
feature_name = [f for f in train_df.columns if f not in drop_cols]
X_train = train_df[feature_name].reset_index(drop=True)
X_test = test_df[feature_name].reset_index(drop=True)
y = train_df['related'].reset_index(drop=True)
print(len(feature_name))
print(feature_name)
548
['food', 'disease', 'disease_related_mean', 'N_198', 'N_33', 'N_211', 'N_82', 'N_101', 'N_42', 'N_111', 'N_165', 'N_177', 'N_146', 'N_17', 'N_113', 'N_106', 'N_14', 'N_74', 'N_209', 'N_188', 'disease1_svd_0', 'disease1_svd_1', 'disease1_svd_2', 'disease1_svd_3', 'disease1_svd_4', 'disease1_svd_5', 'disease1_svd_6', 'disease1_svd_7', 'disease1_svd_8', 'disease1_svd_9', 'disease1_svd_10', 'disease1_svd_11', 'disease1_svd_12', 'disease1_svd_13', 'disease1_svd_14', 'disease1_svd_15', 'disease1_svd_16', 'disease1_svd_17', 'disease1_svd_18', 'disease1_svd_19', 'disease1_svd_20', 'disease1_svd_21', 'disease1_svd_22', 'disease1_svd_23', 'disease1_svd_24', 'disease1_svd_25', 'disease1_svd_26', 'disease1_svd_27', 'disease1_svd_28', 'disease1_svd_29', 'disease1_svd_30', 'disease1_svd_31', 'disease1_svd_32', 'disease1_svd_33', 'disease1_svd_34', 'disease1_svd_35', 'disease1_svd_36', 'disease1_svd_37', 'disease1_svd_38', 'disease1_svd_39', 'disease1_svd_40', 'disease1_svd_41', 'disease1_svd_42', 'd
print(test_df.shape)
(47212, 551)
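
One detail of the single-value filter above is worth spelling out: `nunique()` ignores missing values by default, so a column that is entirely NaN has zero unique values and is dropped together with the constant columns. A minimal sketch on toy data (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'constant': [1.0, 1.0, 1.0],              # one distinct value
    'all_missing': [np.nan, np.nan, np.nan],  # no non-missing value at all
    'const_or_nan': [2.0, np.nan, 2.0],       # one distinct value plus NaN
    'varying': [0.1, 0.2, 0.3],               # informative column
})

# Same rule as above: keep only columns with at least two distinct non-missing values
to_drop = [col for col in toy.columns if toy[col].nunique() < 2]
kept = [col for col in toy.columns if col not in to_drop]
print(to_drop)
print(kept)
```

If the mere presence of a missing value should count as information, `nunique(dropna=False)` would keep a column such as `const_or_nan`.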

3.2.3 Model training

Only a LightGBM model is trained this time.

train_pred = {}
test_pred = {}
seeds = [2]
num_model_seed = 1
oof = np.zeros(X_train.shape[0])        # out-of-fold predictions on the training set
prediction = np.zeros(X_test.shape[0])  # averaged predictions on the test set
feat_imp_df = pd.DataFrame({'feats': feature_name, 'imp': 0})
parameters = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 63,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'seed': 2022,
    'bagging_seed': 1,
    'feature_fraction_seed': 7,
    'min_data_in_leaf': 20,
    'verbose': -1,
    'n_jobs': 8,
}
fold = 5
for model_seed in range(num_model_seed):
    print(seeds[model_seed],"--------------------------------------------------------------------------------------------")
    oof_cat = np.zeros(X_train.shape[0])
    prediction_cat = np.zeros(X_test.shape[0])
    skf = StratifiedKFold(n_splits=fold, random_state=seeds[model_seed], shuffle=True)
    for index, (train_index, test_index) in enumerate(skf.split(X_train, y)):
        train_x, test_x, train_y, test_y = X_train[feature_name].iloc[train_index], X_train[feature_name].iloc[test_index], y.iloc[train_index], y.iloc[test_index]
        dtrain = lgb.Dataset(train_x, label=train_y)
        dval = lgb.Dataset(test_x, label=test_y)
        lgb_model = lgb.train(
            parameters,
            dtrain,
            num_boost_round=10000,
            valid_sets=[dval],
            # LightGBM 4.x no longer accepts early_stopping_rounds / verbose_eval
            # in lgb.train; these callbacks are the equivalent.
            callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
        )
        # Out-of-fold prediction for the held-out rows of this fold
        oof_cat[test_index] += lgb_model.predict(test_x, num_iteration=lgb_model.best_iteration)
        # Test prediction, averaged over the folds
        prediction_cat += lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration) / fold
        feat_imp_df['imp'] += lgb_model.feature_importance()

        del train_x
        del test_x
        del train_y
        del test_y
        del lgb_model
    # Average over the model seeds
    oof += oof_cat / num_model_seed
    prediction += prediction_cat / num_model_seed
gc.collect()
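
The loop above follows the usual out-of-fold (OOF) scheme: each training row is predicted by a model that did not see it, and test predictions are averaged over the folds. The sketch below shows the same bookkeeping on synthetic data, with scikit-learn's `LogisticRegression` swapped in for LightGBM so that it runs quickly. The data and the function name `oof_predict` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oof_predict(X, y, X_test, n_splits=5, seed=2):
    """Return (oof, test_pred) using stratified K-fold cross-validation."""
    oof = np.zeros(X.shape[0])
    test_pred = np.zeros(X_test.shape[0])
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        # Each validation row is filled exactly once, by a model that never saw it
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        # Test predictions are the mean over the fold models
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
X_test = rng.normal(size=(100, 5))

oof, test_pred = oof_predict(X, y, X_test)
print('OOF AUC:', roc_auc_score(y, oof))
```

Because the OOF vector is built only from held-out predictions, its AUC is a fairer estimate than a score computed on rows the model was trained on.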

3.2.4 Result visualization

train_pred['lgb'] = oof
test_pred['lgb'] = prediction
print("lgb train auc: ", roc_auc_score(y, train_pred['lgb']))
lgb train auc:  0.9778226537246766
scores = []
thresholds = []
best_score = 0
best_threshold = 0

# Scan thresholds from 0.10 to 0.89 and keep the one with the best out-of-fold F1
for threshold in np.arange(0.1, 0.9, 0.01):
    print(f'{threshold:.02f}, ', end='')
    preds = (train_pred['lgb'].reshape((-1)) > threshold).astype('int')
    m = f1_score(y.values.reshape((-1)), preds, average='binary')
    scores.append(m)
    thresholds.append(threshold)
    if m > best_score:
        best_score = m
        best_threshold = threshold
0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 
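
The threshold scan can also be wrapped in a small reusable function. The sketch below runs on made-up scores, not the competition data. `f1_binary` and `best_f1_threshold` are hypothetical helper names, and F1 is computed directly from its definition, 2·TP / (2·TP + FP + FN).

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 score of the positive class, computed from TP, FP and FN."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_f1_threshold(y_true, scores, grid):
    """Return (threshold, f1) for the first grid value with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in grid:
        f1 = f1_binary(y_true, (scores > t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.055, 0.105, 0.205, 0.305, 0.355, 0.605, 0.705, 0.905])
threshold, f1 = best_f1_threshold(y_true, scores, np.arange(0.1, 0.9, 0.01))
print(f'best threshold = {threshold:.2f}, F1 = {f1:.3f}')
# prints: best threshold = 0.21, F1 = 0.889
```

Because the threshold is chosen on the same out-of-fold predictions it is evaluated on, the resulting F1 is somewhat optimistic.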
import matplotlib.pyplot as plt

# PLOT THRESHOLD VS. F1_SCORE
plt.figure(figsize=(20,5))
plt.plot(thresholds,scores,'-o',color='blue')
plt.scatter([best_threshold], [best_score], color='blue', s=300, alpha=1)
plt.xlabel('Threshold',size=14)
plt.ylabel('Validation F1 Score',size=14)
plt.title(f'Threshold vs. F1_Score with Best F1_Score = {best_score:.3f} at Best Threshold = {best_threshold:.3}',size=18)
plt.show()

[Figure: threshold vs. validation F1 score; the image failed to load in the source (main_files/main_77_0.png)]

auc = roc_auc_score(y, train_pred['lgb'])
f1 = best_score
print((auc + f1) / 2)
0.8939347040505519

3.2.5 Generating the submission

Keep the number of predicted 1s at about 4,100. Final result:

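# Earlier attempt, kept for reference:
# label=[1 if x >= 0.265+0.235 else 0 for x in prediction+0.235]
# np.sum(label)

# Adding 0.25 to the predictions makes the fixed 0.5 cut-off used for F1 scoring
# equivalent to a threshold of 0.25 on the raw predictions.
label = [1 if x >= 0.26 + 0.24 else 0 for x in prediction + 0.25]
np.sum(label)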
4032
preliminary_a_submit_sample['related_prob'] = prediction+0.25
preliminary_a_submit_sample.to_csv('submit.csv', index=False)
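
The constant 0.25 above appears to have been tuned by hand so that roughly 4,100 rows end up at or above 0.5. As a sketch of how such a shift could be derived from a target count instead, the hypothetical helper `threshold_for_positive_count` below takes the k-th largest prediction as the cut-off. The random scores are a stand-in for `prediction`, and ties at the cut-off would make the count exceed k.

```python
import numpy as np


def threshold_for_positive_count(pred, k):
    """Return the k-th largest prediction, so pred >= threshold selects k rows (absent ties)."""
    return np.sort(pred)[-k]


rng = np.random.default_rng(0)
pred = rng.random(1000)  # stand-in for the model's test predictions
threshold = threshold_for_positive_count(pred, k=100)
n_positive = int(np.sum(pred >= threshold))

# Shift that maps this threshold onto the fixed 0.5 cut-off used for scoring
offset = 0.5 - threshold
print(n_positive, round(float(offset), 3))
```

Shifted values such as `prediction + 0.25` can exceed 1. Clipping them would create ties among the highest scores, which may matter if a rank-based metric such as AUC is part of the evaluation.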

4. Project summary

4.1 Submission results

4.2 Optimization ideas

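  1. Since the food IDs seen in training do not appear in the test set, could the food-side features be used to build a similarity model, for example with a co-occurrence matrix, TF-IDF, or embeddings?

  2. After adding target encoding, the offline score rises a lot while the online score rises only slightly, so overfitting is still severe. Clustering on the disease features might mitigate this. Likewise, the food-side features could be clustered to handle the fact that every ID in the test set is unseen in the training set.

  3. The cross features use only a few food-side features; more could be included. (By selecting food features, this baseline can be lifted further to a score of 7976, but the result is very unstable.)

  4. Feature selection does not remove features with a high missing rate, nor does it filter features by adversarial validation; both may be worth trying.

  5. In the training parameters, the learning rate is too low and the number of leaves too high, so the model overfits fairly badly. Tuning these parameters is worth considering (it can raise the score).

  6. Only LightGBM is used so far. XGBoost and CatBoost could be added for a model ensemble (XGBoost seems to do quite well, but it needs well-tested "heirloom" parameters).

  7. This baseline predicts about 4,100 ones; other counts can be tested. In my experiments, the optimal number of 1s differs for every feature combination, even when only a single feature is added.

  8. The SVD dimension can be adjusted and need not be the same for every feature set. PCA is another option, although it requires normalization first.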
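
As a rough illustration of ideas 1 and 2, food-side features can be clustered so that a test food with an unseen ID still maps to a group learned from the training foods. The sketch below uses scikit-learn's `KMeans` on synthetic data. The feature matrix, the number of clusters and the column name `food_cluster` are assumptions, and missing values are filled with 0 only because `KMeans` does not accept NaN.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = [f'N_{i}' for i in range(5)]
train_food = pd.DataFrame(rng.normal(size=(60, 5)), columns=cols)
test_food = pd.DataFrame(rng.normal(size=(20, 5)), columns=cols)
train_food.iloc[0, 0] = np.nan  # the real food features contain missing values

# Fit the scaler and the clusters on training foods only
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_food.fillna(0))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
train_food['food_cluster'] = kmeans.fit_predict(train_scaled)

# Unseen test foods are assigned to the nearest training cluster
test_scaled = scaler.transform(test_food.fillna(0))
test_food['food_cluster'] = kmeans.predict(test_scaled)
print(test_food['food_cluster'].value_counts().sort_index())
```

The cluster id could then be target-encoded or crossed in place of the raw `food` id, which never overlaps between training and test data.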

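4.3 About the author

I am a trainee in the third session of the AI达人特训营 (AI Expert Training Camp) program, and I am glad to share my thoughts with everyone here.

Author: 范远展. Mentor: 黄灿桦.

This article is a repost.
Original project link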
