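Competition Background
Artificial intelligence (AI) is reaching into scientific research with unprecedented depth and breadth, and it shows particular promise in chemistry and drug discovery. Accurately predicting molecular properties makes it possible to screen efficiently for drug candidates with strong performance. Take PROTACs as an example: a PROTAC is a three-part molecule made up of a target-protein ligand, a linker, and an E3 ligase ligand, and it works by targeted degradation of the target protein. This competition focuses on using advanced AI algorithms to predict the degradation efficacy of PROTACs. It aims to stimulate innovative thinking among participants, promote deeper integration of AI with chemical biology, and further improve the efficiency and success rate of drug development. Through the competition, the organizers hope to see and incubate more accurate and efficient models for molecular property prediction.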
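Task and Data
Participants start from the provided demo dataset. They may expand it through data augmentation, by collecting additional data themselves, or by other means, and they split the data themselves. Using deep learning, reinforcement learning, or other more capable AI methods, they predict the degradation ability of PROTACs. If DC50 > 100 nM and Dmax < 80%, degradation ability is considered poor (Label=0 in the demo dataset); if DC50 <= 100 nM or Dmax >= 80%, it is considered good (Label=1 in the demo dataset).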
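In plain terms:
[Train a classification model for a molecular property] Use deep learning, reinforcement learning, or other more capable AI methods to predict the degradation ability of PROTACs, classifying each molecule as either a poor degrader or a good degrader.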
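The labeling rule stated above can be written as a one-line function. This is a minimal sketch of the rule as described in the task statement; the function name and the sample measurements are made up for illustration and do not come from the dataset.

```python
def degradation_label(dc50_nm: float, dmax_pct: float) -> int:
    """Return 1 (good degrader) if DC50 <= 100 nM or Dmax >= 80 %, else 0."""
    return 1 if (dc50_nm <= 100 or dmax_pct >= 80) else 0


# Hypothetical (DC50 in nM, Dmax in %) pairs, not taken from the competition data.
examples = [(50.0, 60.0), (500.0, 90.0), (500.0, 60.0), (100.0, 10.0)]
labels = [degradation_label(dc50, dmax) for dc50, dmax in examples]
print(labels)
```

Only the third pair fails both conditions, so it is the only one labeled 0. Note that the two branches of the rule are exact complements, so every molecule with both measurements gets exactly one label.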
Evaluation Metric
Submissions are scored with f1_score; the higher the score, the better.
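For reference, the F1 score is the harmonic mean of precision and recall on the positive class. The sketch below computes it by hand for a binary problem; it is meant to illustrate the definition, not to replace `sklearn.metrics.f1_score`, and the sample labels are invented.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1) of a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_binary(y_true, y_pred)
print(score)
```

Because F1 is computed on hard 0/1 predictions, a model that outputs probabilities has to be thresholded before scoring.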
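Approach
The task is to build a model on the training samples that predicts the property of the molecules in the test set. This is a binary classification task: the goal is to predict each molecule's label from features such as its assay-related information and its structural information. Concretely, participants perform feature engineering, model selection, and training on the given dataset, then use the trained model to predict the molecules in the test set and generate the corresponding predictions.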
import numpy as np
import pandas as pd
import joblib
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import f1_score
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, GraphDescriptors, Lipinski
from rdkit.Chem.rdMolDescriptors import CalcMolFormula, CalcTPSA
from rdkit.Chem.Crippen import MolLogP
from sklearn.feature_extraction.text import TfidfVectorizer
from openfe import OpenFE, tree_to_formula, transform, TwoStageFeatureSelector
from gensim.models import Word2Vec
import tqdm, sys, os, gc, re, argparse, warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
train = pd.read_excel('./dataset-new/traindata-new.xlsx')
test = pd.read_excel('./dataset-new/testdata-new.xlsx')
# The test data does not contain DC50 (nM) or Dmax (%)
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)
# Collect the names of columns that have fewer than 10 non-null values in the test set
drop_cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        drop_cols.append(f)
# Drop those columns from both train and test so that mostly-missing columns are not used downstream
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)
# Concatenate the cleaned train and test sets into one DataFrame, data, for unified feature engineering
data = pd.concat([train, test], axis=0, ignore_index=True)
cols = data.columns[2:]
train_label = train.copy()
# Ordinal (integer) encoding: map each distinct value to an integer, in order of first appearance
def label_encode(series):
    unique = list(series.unique())
    return series.map(dict(zip(
        unique, range(series.nunique())
    )))

object_cols = train_label.select_dtypes(include=['object']).columns
for col in object_cols:
    train_label[col] = label_encode(train_label[col])
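To make the encoding step concrete, here is a small pure-Python sketch of the same idea: each distinct value gets the next free integer the first time it is seen. The helper name `ordinal_encode` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def ordinal_encode(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


# Made-up categorical column for illustration.
codes, mapping = ordinal_encode(["VHL", "CRBN", "VHL", "IAP", "CRBN"])
print(codes)
print(mapping)
```

The integers carry no real ordering, which is usually acceptable for tree-based models such as the ones used later, but would be a questionable choice for linear models.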
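# Rank the features by the absolute value of their Pearson correlation with the label
features = train_label.columns[1:]
corr = []
for feat in features:
    corr.append(abs(train_label[[feat, "Label"]].fillna(0).corr().values[0][1]))
se = pd.Series(corr, index=features).sort_values(ascending=False)
se
# Drop the 6 features that are least correlated with the label
data = data.drop(se[-6:].index, axis=1)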
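The correlation filter above can be illustrated without pandas. The following sketch computes the Pearson correlation by hand and ranks a few toy columns by its absolute value; the column names and numbers are invented, and treating a constant column as having zero correlation is a choice made here for the example (pandas would report NaN instead).

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy data: feat_a tracks the label closely, feat_c weakly, feat_b is constant.
label = [0, 0, 1, 1]
columns = {
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [5.0, 5.0, 5.0, 5.0],
    "feat_c": [1.0, 3.0, 2.0, 4.0],
}
ranking = sorted(columns, key=lambda name: abs(pearson(columns[name], label)), reverse=True)
weakest = ranking[-1:]  # candidates to drop
print(ranking, weakest)
```

A linear correlation with a 0/1 label is only a rough relevance signal, so it is safest to drop just a handful of the weakest columns, as the code above does.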
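DeepChem is a machine learning library for scientific research. It originally focused on chemistry and molecules, but over successive releases it has come to support a much broader range of scientific applications. Here its RDKitDescriptors featurizer is used to turn the SMILES strings into descriptor features.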
import deepchem as dc
dc_smiles = data['Smiles']
rdkit_featurizer = dc.feat.RDKitDescriptors()
rdkit_feature = rdkit_featurizer.featurize(dc_smiles)
dc_feature = pd.DataFrame(rdkit_feature)
dc_feature.columns = [f'smiles_dc_{i}' for i in range(dc_feature.shape[1])]
zeros_count = dc_feature.eq(0).sum()
columns_to_drop = zeros_count[zeros_count >= 704].index.tolist()
smiles_feature = dc_feature.drop(columns=columns_to_drop)
atomic_masses = {
    'H': 1.008, 'He': 4.002602, 'Li': 6.94, 'Be': 9.0122, 'B': 10.81,
    'C': 12.01, 'N': 14.01, 'O': 16.00, 'F': 19.00, 'Ne': 20.180,
    'Na': 22.990, 'Mg': 24.305, 'Al': 26.982, 'Si': 28.085, 'P': 30.97,
    'S': 32.07, 'Cl': 35.45, 'Ar': 39.95, 'K': 39.10, 'Ca': 40.08,
    'Sc': 44.956, 'Ti': 47.867, 'V': 50.942, 'Cr': 52.00, 'Mn': 54.938,
    'Fe': 55.845, 'Co': 58.933, 'Ni': 58.69, 'Cu': 63.55, 'Zn': 65.38
}

# Parse a single InChI string
def parse_inchi(row):
    inchi_str = row['InChI']
    formula = ''
    molecular_weight = 0
    element_counts = {}
    # Extract the molecular formula
    formula_match = re.search(r"InChI=1S/([^/]+)/c", inchi_str)
    if formula_match:
        formula = formula_match.group(1)
    # Compute the molecular weight and the per-element atom counts
    for element, count in re.findall(r"([A-Z][a-z]*)([0-9]*)", formula):
        count = int(count) if count else 1
        # Look the symbol up as written ('Cl', not 'CL') so two-letter elements match the tables
        element_mass = atomic_masses.get(element, 0)
        molecular_weight += element_mass * count
        element_counts[element] = element_counts.get(element, 0) + count
    return pd.Series({'ElementCounts': element_counts})

# Apply the function to every row of the DataFrame
data[['ElementCounts']] = data.apply(parse_inchi, axis=1)
# Elements to expand into columns
keys = list(atomic_masses)
# One row per molecule and one column per element; elements that are absent count as 0
df_expanded = pd.DataFrame(
    [{key: item.get(key, 0) for key in keys} for item in data['ElementCounts']]
)
# Drop element columns that are zero for (almost) every molecule
zeros_count = df_expanded.eq(0).sum()
columns_to_drop = zeros_count[zeros_count >= 704].index.tolist()
inchi_keys = df_expanded.drop(columns=columns_to_drop)
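To show what the formula layer of an InChI string yields, here is a standalone sketch that uses only the standard library. It assumes a standard InChI (prefix `InChI=1S/`) whose first layer is a plain molecular formula; the function name is made up for the example, and symbols are kept in their original case because 'Cl' and 'CL' are not the same key.

```python
import re
from collections import Counter


def element_counts_from_inchi(inchi: str) -> dict:
    """Count atoms per element from the formula layer of a standard InChI string."""
    match = re.search(r"InChI=1S/([^/]+)/", inchi)
    if not match:
        return {}
    counts = Counter()
    # An element symbol is one capital letter plus an optional lowercase letter.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(1)):
        counts[element] += int(number) if number else 1
    return dict(counts)


ethanol = element_counts_from_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
chloroform = element_counts_from_inchi("InChI=1S/CHCl3/c2-1(3)4/h1H")
print(ethanol, chloroform)
```

This sketch does not handle leading multipliers in multi-component formulas (for example a salt written as `2ClH`), so it should be read as an illustration of the parsing idea rather than a full InChI parser.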
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, GraphDescriptors, Lipinski

def calculate_descriptors(inchi):
    # Parse the InChI string into an RDKit molecule
    mol = Chem.MolFromInchi(inchi)
    return {
        # Hydrogen-bond donors and acceptors
        "H-Bond Donors": Descriptors.NumHDonors(mol),
        "H-Bond Acceptors": Descriptors.NumHAcceptors(mol),
        # Rotatable bonds and aromatic rings
        "Rotatable Bonds": Descriptors.NumRotatableBonds(mol),
        "Aromatic Ring Count": Descriptors.NumAromaticRings(mol),
        # Topological polar surface area (TPSA)
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        # LogP (RDKit's Crippen estimate)
        "XLogP": Descriptors.MolLogP(mol),
        # Number of valence electrons
        "Num Valence Electrons": Descriptors.NumValenceElectrons(mol),
        # Average information content, Balaban's J, BertzCT complexity
        "Average Information Content": GraphDescriptors.AvgIpc(mol),
        "Balaban's J": GraphDescriptors.BalabanJ(mol),
        "BertzCT Complexity": GraphDescriptors.BertzCT(mol),
        # Molecular weight of the heavy atoms
        "Heavy Atom Molecular Weight": Descriptors.HeavyAtomMolWt(mol),
        # Partial-charge extremes
        "Max Absolute Partial Charge": Descriptors.MaxAbsPartialCharge(mol),
        "Max Partial Charge": Descriptors.MaxPartialCharge(mol),
        "Min Absolute Partial Charge": Descriptors.MinAbsPartialCharge(mol),
        "Min Partial Charge": Descriptors.MinPartialCharge(mol),
        # Kappa shape indices
        "Kappa1": rdMolDescriptors.CalcKappa1(mol),
        "Kappa2": rdMolDescriptors.CalcKappa2(mol),
        "Kappa3": rdMolDescriptors.CalcKappa3(mol),
        # Labute approximate surface area
        "Labute Accessible Surface Area": rdMolDescriptors.CalcLabuteASA(mol),
        # Kier's Phi flexibility index (CalcPhi)
        "Kier Phi Flexibility Index": rdMolDescriptors.CalcPhi(mol),
        # Saturated carbocycles, heterocycles, and rings
        "Saturated Carbocycles": rdMolDescriptors.CalcNumSaturatedCarbocycles(mol),
        "Saturated Heterocycles": rdMolDescriptors.CalcNumSaturatedHeterocycles(mol),
        "Saturated Rings": rdMolDescriptors.CalcNumSaturatedRings(mol),
        # Spiro atoms
        "Spiro Atoms": rdMolDescriptors.CalcNumSpiroAtoms(mol),
        # Fraction of sp3 carbons, NH/OH count, N/O count, heteroatom count
        "CSP3 Fraction": Lipinski.FractionCSP3(mol),
        "NHOH Count": Lipinski.NHOHCount(mol),
        "NO Count": Lipinski.NOCount(mol),
        "Heteroatoms": Lipinski.NumHeteroatoms(mol),
        # Aliphatic (non-aromatic) carbocycles, heterocycles, and rings
        "Aliphatic Carbocycles": Lipinski.NumAliphaticCarbocycles(mol),
        "Aliphatic Heterocycles": Lipinski.NumAliphaticHeterocycles(mol),
        "Aliphatic Rings": Lipinski.NumAliphaticRings(mol),
        # Aromatic carbocycles and heterocycles
        "Aromatic Carbocycles": Lipinski.NumAromaticCarbocycles(mol),
        "Aromatic Heterocycles": Lipinski.NumAromaticHeterocycles(mol),
        # Molar refractivity
        "Molar Refractivity": Descriptors.MolMR(mol),
    }

# Compute the descriptors for every molecule and collect them into a DataFrame
features_list = []
for inchi in data['InChI']:
    features_list.append(calculate_descriptors(inchi))
inchi_features = pd.DataFrame(features_list)
# Append the extracted features to the original dataset
data = pd.concat([data, smiles_feature, inchi_keys, inchi_features], axis=1)
data[:4]
data = data.drop(['ElementCounts'], axis=1)
# Ordinal (integer) encoding: map each distinct value to an integer, in order of first appearance
def label_encode(series):
    unique = list(series.unique())
    return series.map(dict(zip(
        unique, range(series.nunique())
    )))

object_cols = data.select_dtypes(include=['object']).columns
for col in object_cols:
    data[col] = label_encode(data[col])
train = data[data.Label.notnull()].reset_index(drop=True)
test = data[data.Label.isnull()].reset_index(drop=True)
# Rank the features again by absolute correlation with the label, now including the new descriptors
features1 = train.columns[1:]
corr1 = []
for feat in features1:
    corr1.append(abs(train[[feat, "Label"]].fillna(0).corr().values[0][1]))
se1 = pd.Series(corr1, index=features1).sort_values(ascending=False)
drop_se1 = se1.index[-4:]
# Drop the 4 features least correlated with the label from both train and test
train = train.drop(drop_se1, axis=1)
test = test.drop(drop_se1, axis=1)
train[:3]
# Feature selection
features = [f for f in train.columns if f not in ['uuid','Label']]
# Build the training and test feature matrices
x_train = train[features]
x_test = test[features]
# Training labels
y_train = train['Label'].astype(int)
x_train.info()
train.rename(columns=lambda x: re.sub(r'[^\w\s]', '_', x), inplace=True)
test.rename(columns=lambda x: re.sub(r'[^\w\s]', '_', x), inplace=True)
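OpenFE, short for Open Feature Engineering, is an open-source Python library designed to simplify and automate feature engineering. It provides a set of tools and functions that help data scientists and machine learning engineers create, test, and deploy features more efficiently.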
ofe = OpenFE()
features = ofe.fit(data=x_train, label=y_train, n_jobs=6)
joblib.dump(ofe, "ofe.pkl")
for feature in ofe.new_features_list:
    print(tree_to_formula(feature))
x_train, x_test = transform(x_train, x_test, features, n_jobs=6)
cat_columns = x_train.select_dtypes(include=['category']).columns
x_train[cat_columns] = x_train[cat_columns].astype(np.int32)
cat_columns = x_test.select_dtypes(include=['category']).columns
x_test[cat_columns] = x_test[cat_columns].astype(np.int32)
The code in this part borrows from the book 《机器学习算法竞赛实战》 (Machine Learning Algorithm Competitions in Practice).
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from hyperopt import hp, fmin, tpe
from numpy.random import RandomState
from sklearn.metrics import mean_squared_error,f1_score
def feature_select_wrapper(train, test):
    """
    Rank features by LightGBM importance summed over 5 folds and keep the top 200.
    :param train: training DataFrame containing 'uuid' and 'Label'
    :param test: test DataFrame containing 'uuid'
    :return: train and test restricted to the selected features
    """
    print('feature_select_wrapper...')
    label = 'Label'
    features = train.columns.tolist()
    features.remove('uuid')
    features.remove('Label')

    # Training parameters for the model
    params_initial = {
        'num_leaves': 31,
        'learning_rate': 0.1,
        'boosting': 'gbdt',
        'min_child_samples': 20,
        'bagging_seed': 2020,
        'bagging_fraction': 0.7,
        'bagging_freq': 1,
        'feature_fraction': 0.7,
        'max_depth': -1,
        'metric': 'auc',
        'reg_alpha': 0,
        'reg_lambda': 1,
        'objective': 'binary'
    }
    ESR = 30     # early-stopping rounds
    NBR = 10000  # maximum number of boosting rounds
    VBE = 50     # logging period
    kf = KFold(n_splits=5, random_state=2020, shuffle=True)
    fse = pd.Series(0, index=features)
    callbacks = [lgb.early_stopping(stopping_rounds=ESR), lgb.log_evaluation(period=VBE)]

    for train_part_index, eval_index in kf.split(train[features], train[label]):
        # Train the model on this fold
        train_part = lgb.Dataset(train[features].loc[train_part_index],
                                 train[label].loc[train_part_index])
        eval1 = lgb.Dataset(train[features].loc[eval_index],
                            train[label].loc[eval_index])
        bst = lgb.train(params_initial, train_part, num_boost_round=NBR,
                        valid_sets=[train_part, eval1],
                        valid_names=['train', 'valid'],
                        callbacks=callbacks)
        # Accumulate the feature importances across folds
        fse += pd.Series(bst.feature_importance(), features)

    feature_select = ['uuid'] + fse.sort_values(ascending=False).index.tolist()[:200]
    print('done')
    return train[feature_select + ['Label']], test[feature_select]
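def params_append(params):
    """
    Add the fixed LightGBM settings to a parameter dict.
    :param params: dict of tunable parameters
    :return: the same dict with objective, metric, and seed set
    """
    params['objective'] = 'binary'
    params['metric'] = 'auc'
    params['bagging_seed'] = 2020
    return params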
from hyperopt import space_eval

def param_hyperopt(train):
    """
    Search LightGBM hyperparameters with hyperopt (TPE).
    :param train: training DataFrame containing 'uuid' and 'Label'
    :return: dict with the best parameter values found
    """
    label = 'Label'
    features = train.columns.tolist()
    features.remove('uuid')
    features.remove('Label')
    params1 = {'feature_pre_filter': False}
    train_data = lgb.Dataset(train[features], train[label], params=params1)
    callbacks1 = [lgb.early_stopping(stopping_rounds=20, verbose=False),
                  lgb.log_evaluation(show_stdv=False)]

    def hyperopt_objective(params):
        """
        :param params: one sampled parameter set
        :return: value for fmin to minimize
        """
        params = params_append(params)
        print(params)
        res = lgb.cv(params, train_data, 1000,
                     nfold=2,
                     stratified=False,
                     shuffle=True,
                     metrics='auc',
                     seed=2020,
                     callbacks=callbacks1)
        # fmin minimizes and a higher AUC is better, so return the negated best AUC
        return -max(res['valid auc-mean'])

    params_space = {
        'learning_rate': hp.uniform('learning_rate', 1e-2, 5e-1),
        'bagging_fraction': hp.uniform('bagging_fraction', 0.5, 1),
        'feature_fraction': hp.uniform('feature_fraction', 0.5, 1),
        'num_leaves': hp.choice('num_leaves', list(range(10, 300, 10))),
        'reg_alpha': hp.randint('reg_alpha', 0, 10),
        'reg_lambda': hp.uniform('reg_lambda', 0, 10),
        'bagging_freq': hp.randint('bagging_freq', 1, 10),
        'min_child_samples': hp.choice('min_child_samples', list(range(1, 30, 5)))
    }
    params_best = fmin(
        hyperopt_objective,
        space=params_space,
        algo=tpe.suggest,
        max_evals=100,
        rstate=np.random.default_rng(2020))
    # fmin reports hp.choice parameters as list indices; space_eval maps them back to the actual values
    return space_eval(params_space, params_best)
def train_predict(train, test, params):
    """
    Train LightGBM with 5-fold cross-validation and write the predictions to disk.
    :param train: training DataFrame containing 'uuid' and 'Label'
    :param test: test DataFrame containing 'uuid'
    :param params: tuned LightGBM parameters
    :return: None
    """
    label = 'Label'
    features = train.columns.tolist()
    features.remove('uuid')
    features.remove('Label')
    params = params_append(params)
    kf = KFold(n_splits=5, random_state=2020, shuffle=True)
    prediction_test = 0
    cv_score = []
    oof_parts = []
    ESR = 30     # early-stopping rounds
    NBR = 10000  # maximum number of boosting rounds
    VBE = 50     # logging period
    callbacks = [lgb.early_stopping(stopping_rounds=ESR), lgb.log_evaluation(period=VBE)]

    for train_part_index, eval_index in kf.split(train[features], train[label]):
        # Train the model on this fold
        train_part = lgb.Dataset(train[features].loc[train_part_index],
                                 train[label].loc[train_part_index])
        eval_part = lgb.Dataset(train[features].loc[eval_index],
                                train[label].loc[eval_index])
        bst = lgb.train(params, train_part, num_boost_round=NBR,
                        valid_sets=[train_part, eval_part],
                        valid_names=['train', 'valid'],
                        callbacks=callbacks)
        prediction_test += bst.predict(test[features])
        eval_prob = bst.predict(train[features].loc[eval_index])
        oof_parts.append(pd.Series(eval_prob, index=eval_index))
        # Threshold the predicted probabilities at 0.5 to get the class labels that F1 needs
        eval_pre = (eval_prob > 0.5).astype(int)
        score = f1_score(train[label].loc[eval_index].values, eval_pre)
        cv_score.append(score)

    print(cv_score, sum(cv_score) / 5)
    prediction_train = pd.concat(oof_parts)
    pd.Series(prediction_train.sort_index().values).to_csv("train_lightgbm.csv", index=False)
    pd.Series(prediction_test / 5).to_csv("test_lightgbm.csv", index=False)
    # Submit 0/1 labels: average the fold probabilities, then threshold at 0.5
    test = test.copy()
    test['Label'] = (prediction_test / 5 > 0.5).astype(int)
    test[['uuid', 'Label']].to_csv("submit_lightgbm.csv", index=False)
    return
train_select, test_select = feature_select_wrapper(train, test)
best_clf = param_hyperopt(train_select)
joblib.dump(best_clf,"best_clf.pkl")
best_clf = joblib.load('best_clf.pkl')
train_predict(train_select, test_select, best_clf)
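Since the metric is F1, the averaged fold probabilities have to become 0/1 labels at some point. The sketch below shows averaging across folds followed by thresholding, and contrasts it with plain integer truncation, which turns every probability below 1 into 0. The helper names, the 0.5 threshold, and the numbers are assumptions made for illustration.

```python
def average_folds(fold_probs):
    """Average per-sample probabilities across folds (one inner list per fold)."""
    n_folds = len(fold_probs)
    return [sum(column) / n_folds for column in zip(*fold_probs)]


def to_labels(probs, threshold=0.5):
    """Turn probabilities into 0/1 labels with a fixed threshold."""
    return [1 if p > threshold else 0 for p in probs]


# Invented predictions from three folds for three test molecules.
fold_probs = [
    [0.9, 0.2, 0.7],
    [0.8, 0.4, 0.4],
    [0.7, 0.3, 0.7],
]
mean_probs = average_folds(fold_probs)
labels = to_labels(mean_probs)
truncated = [int(p) for p in mean_probs]  # truncation, not rounding
print(labels, truncated)
```

The threshold does not have to be 0.5. Because F1 is sensitive to the balance between precision and recall, tuning the threshold on the out-of-fold predictions is a common refinement.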
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import KFold
from hyperopt import hp, fmin, tpe
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn.feature_selection import f_regression,f_classif
from numpy.random import RandomState
from sklearn.metrics import mean_squared_error,f1_score
from bayes_opt import BayesianOptimization
def read_data1(debug=True):
    # Convert the feature columns of train and test into sparse CSR matrices
    features = train.columns.tolist()
    features.remove('uuid')
    features.remove('Label')
    train_x = csr_matrix(train[features].astype(pd.SparseDtype("float64",0)).sparse.to_coo()).tocsr()
    test_x = csr_matrix(test[features].astype(pd.SparseDtype("float64",0)).sparse.to_coo()).tocsr()
    print("done")
    return train_x, test_x
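def params_append1(params):
    """
    Add the fixed XGBoost settings and cast the integer-valued parameters.
    :param params: dict of tuned parameters
    :return: the same dict, ready for xgb.train
    """
    params['objective'] = 'binary:hinge'
    params['eval_metric'] = 'auc'
    params["min_child_weight"] = int(params["min_child_weight"])
    params['max_depth'] = int(params['max_depth'])
    return params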
def param_beyesian1(train):
    """
    Tune XGBoost hyperparameters with Bayesian optimization.
    :param train: sparse training feature matrix
    :return: dict with the best parameters found
    """
    train_y = pd.read_excel("dataset-new/traindata-new.xlsx")['Label'].values
    train_data = xgb.DMatrix(train, train_y)

    def xgb_cv(colsample_bytree, subsample, min_child_weight, max_depth,
               reg_alpha, eta, reg_lambda):
        """
        Cross-validate one parameter set and return the score to maximize.
        """
        params = {'objective': 'binary:hinge', 'eval_metric': 'auc'}
        params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
        params['subsample'] = max(min(subsample, 1), 0)
        params["min_child_weight"] = int(min_child_weight)
        params['max_depth'] = int(max_depth)
        params['eta'] = float(eta)
        params['reg_alpha'] = max(reg_alpha, 0)
        params['reg_lambda'] = max(reg_lambda, 0)
        print(params)
        cv_result = xgb.cv(params, train_data,
                           num_boost_round=10000,
                           nfold=5, seed=2,
                           stratified=False,
                           shuffle=True,
                           early_stopping_rounds=30,
                           verbose_eval=False)
        # BayesianOptimization maximizes its target and a higher AUC is better,
        # so return the best mean AUC directly
        return cv_result['test-auc-mean'].max()

    xgb_bo = BayesianOptimization(
        xgb_cv,
        {'colsample_bytree': (0.5, 1),
         'subsample': (0.5, 1),
         'min_child_weight': (1, 30),
         'max_depth': (5, 12),
         'reg_alpha': (0, 5),
         'eta': (0.02, 1),
         'reg_lambda': (0, 5)}
    )
    # init_points is the number of initial random probes; n_iter is the number of optimization steps
    xgb_bo.maximize(init_points=21, n_iter=10)
    print(xgb_bo.max['target'], xgb_bo.max['params'])
    return xgb_bo.max['params']
def train_predict1(train, test, params):
    """
    Train XGBoost with 5-fold cross-validation and write the predictions to disk.
    :param train: sparse training feature matrix
    :param test: sparse test feature matrix
    :param params: tuned XGBoost parameters
    :return: None
    """
    train_y = pd.read_excel("dataset-new/traindata-new.xlsx")['Label']
    test_data = xgb.DMatrix(test)
    params = params_append1(params)
    kf = KFold(n_splits=5, random_state=2020, shuffle=True)
    prediction_test = 0
    cv_score = []
    oof_parts = []
    ESR = 30     # early-stopping rounds
    NBR = 10000  # maximum number of boosting rounds
    VBE = 50     # logging period

    for train_part_index, eval_index in kf.split(train, train_y):
        # Train the model on this fold
        train_part = xgb.DMatrix(train.tocsr()[train_part_index, :],
                                 train_y.loc[train_part_index])
        eval2 = xgb.DMatrix(train.tocsr()[eval_index, :],
                            train_y.loc[eval_index])
        # AUC is a higher-is-better metric, so early stopping must maximize it
        bst = xgb.train(params, train_part, NBR,
                        [(train_part, 'train'), (eval2, 'eval')],
                        verbose_eval=VBE,
                        maximize=True,
                        early_stopping_rounds=ESR)
        prediction_test += bst.predict(test_data)
        eval_pre = bst.predict(eval2)
        oof_parts.append(pd.Series(eval_pre, index=eval_index))
        # binary:hinge already predicts 0/1 labels, so F1 can be computed on them directly
        score = f1_score(train_y.loc[eval_index].values, eval_pre.astype(int))
        cv_score.append(score)

    print(cv_score, sum(cv_score) / 5)
    prediction_train = pd.concat(oof_parts)
    pd.Series(prediction_train.sort_index().values).to_csv("train_xgboost.csv", index=False)
    pd.Series(prediction_test / 5).to_csv("test_xgboost.csv", index=False)
    test = pd.read_excel('dataset-new/testdata-new.xlsx')
    # Majority vote over the 5 folds gives the submitted 0/1 label
    test['Label'] = (prediction_test / 5 > 0.5).astype(int)
    test[['uuid', 'Label']].to_csv("submission_xgboost.csv", index=False)
    return
train1, test1 = read_data1(debug=False)
best_clf1 = param_beyesian1(train1)
train_predict1(train1, test1, best_clf1)
def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2024):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} {}************************************'.format(str(i+1), str(seed)))
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]

        params = {'learning_rate': 0.1, 'depth': 6, 'l2_leaf_reg': 10,
                  'bootstrap_type': 'Bernoulli', 'random_seed': seed,
                  'od_type': 'Iter', 'od_wait': 100,
                  'allow_writing_files': False, 'task_type': 'CPU'}

        # CatBoost expects the metric name 'AUC'
        model = clf(iterations=20000, **params, eval_metric='AUC')
        model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                  metric_period=100,
                  cat_features=[],
                  use_best_model=True,
                  verbose=1)

        val_pred = model.predict_proba(val_x)[:, 1]
        test_pred = model.predict_proba(test_x)[:, 1]

        # Store the out-of-fold predictions and accumulate the averaged test predictions
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(f1_score(val_y, np.where(val_pred > 0.5, 1, 0)))

        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")

# Threshold the averaged probabilities at 0.5 and write the submission file
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': np.where(cat_test > 0.5, 1, 0)
    }
).to_csv('submit_v4.csv', index=False)
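All three models above follow the same cross-validation pattern: every training sample is predicted exactly once by a model that did not see it (the out-of-fold prediction). The sketch below shows only the index bookkeeping behind that pattern using the standard library. It is not how sklearn's `KFold` is implemented, and the helper name and fold-assignment scheme are assumptions for illustration.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=2024):
    """Yield (train_indices, valid_indices) pairs for a shuffled K-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, round-robin.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        valid = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, valid


seen = []
for train_idx, valid_idx in kfold_indices(10, n_splits=5):
    # A sample is never in both the training part and the validation part of a fold.
    assert not set(train_idx) & set(valid_idx)
    seen.extend(valid_idx)
print(sorted(seen))
```

Because each sample appears in exactly one validation fold, the out-of-fold predictions cover the whole training set and can be scored, or used to tune a decision threshold, without leaking training labels.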
To be continued...