
Paper Code Walkthrough and Partial Reproduction: Revisiting Deep Learning Models for Tabular Data


Paper: https://arxiv.org/pdf/2106.11959.pdf

Project: GitHub - yandex-research/rtdl-revisiting-models: (NeurIPS 2021) Revisiting Deep Learning Models for Tabular Data

Data: https://www.dropbox.com/s/o53umyg6mn3zhxy/

Update (February 11, 2024):

The "LassoNet" model used here is really just an MLP with a skip layer. The actual LassoNet, when doing feature selection, also updates lambda and related values along a regularization path, which takes much longer; here I only took the inner training loop.

I. Paper Overview

There is no shortage of deep learning models proposed for tabular data, but the authors argue that, because these models were evaluated against different baselines and in different experimental setups, they have never been compared properly. The paper therefore surveys the main families of models and also proposes FT-Transformer, a simple adaptation of the Transformer. ResNet-like models, Transformer-like models and other MLP-style models are trained and compared on a range of datasets, which yields a solid benchmark for deep learning on tabular data. Compared with gradient-boosted decision trees, however, there is still no clearly better deep learning model.

II. Models Used

1. MLP: the familiar multilayer perceptron, using ReLU activations and Dropout layers.

2. ResNet: a residual network built from residual blocks (ResNetBlock). A block can be written as H(x) = x + F(x), where F(x) = Dropout(Linear(Dropout(ReLU(Linear(BatchNorm(x)))))).

3. FT-Transformer: the authors' simple Transformer variant. In short, a Feature Tokenizer is placed in front of the Transformer: continuous features get a linear transformation, categorical features get an embedding, and a CLS token is added to serve as the output representation; the tokenized data is then fed into an encoder-only Transformer (a minimal sketch follows after this list).

4. SNN: the self-normalizing neural network (not to be confused with a spiking neural network), an MLP that uses the SELU activation, which makes deeper models trainable.

5. NODE: Neural Oblivious Decision Ensembles, which bring decision-tree principles into a neural network.

6. TabNet: like NODE, it also incorporates decision-tree ideas into a neural network.

7. GrowNet: applies the idea of boosting to neural networks.

8. DCN: an improvement on Wide & Deep that replaces the linear part with a Cross Network, in which each layer's output is multiplied by the original input features. Like the other models, it handles categorical features with embeddings at the input.

9. AutoInt: argues that shallow models are limited by the order of feature crosses while DNNs model high-order implicit interactions poorly, so it adds an attention mechanism. At the input, both categorical and continuous features are embedded and projected into three matrices: Query, Key and Value. The inner product of Query and Key measures similarity, a Softmax turns it into attention weights, and multiplying the attention by Value gives the output of one head.

10. CatBoost: uses ordered target statistics to handle categorical variables, avoiding the dimensionality explosion that one-hot encoding would cause.

11. XGBoost: applies a second-order Taylor expansion to the loss function so that second-order derivatives can be used during training. As the library has evolved, current versions of XGBoost can also handle categorical variables.
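To make the FT-Transformer description in item 3 concrete, below is a minimal sketch of a Feature Tokenizer written by me for illustration (a simplification, not the authors' rtdl implementation; the CLS token is prepended here and initialization details are ignored):

import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    # maps (x_num, x_cat) to tokens of shape (batch, 1 + n_num + n_cat, d_token)
    def __init__(self, n_num, cardinalities, d_token):
        super().__init__()
        self.num_weight = nn.Parameter(torch.randn(n_num, d_token) * 0.02)  # one linear map per numerical feature
        self.num_bias = nn.Parameter(torch.zeros(n_num, d_token))
        self.cat_embeddings = nn.ModuleList([nn.Embedding(c, d_token) for c in cardinalities])
        self.cls = nn.Parameter(torch.zeros(1, 1, d_token))  # CLS token, read out after the encoder

    def forward(self, x_num, x_cat):
        tokens = [x_num.unsqueeze(-1) * self.num_weight + self.num_bias]                 # (batch, n_num, d_token)
        tokens += [emb(x_cat[:, i]).unsqueeze(1) for i, emb in enumerate(self.cat_embeddings)]
        tokens = torch.cat(tokens, dim=1)
        cls = self.cls.expand(tokens.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1)   # fed into an encoder-only Transformer

tok = FeatureTokenizer(n_num=3, cardinalities=[5, 7], d_token=8)
out = tok(torch.randn(4, 3), torch.randint(0, 5, (4, 2)))
print(out.shape)   # torch.Size([4, 6, 8]): CLS + 3 numerical + 2 categorical tokens

The encoder-only Transformer then attends over these tokens, and the final prediction is read off the CLS token.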

III. Experimental Protocol

First, the same preprocessing is applied to every model on a given dataset: most datasets use a quantile transformation, while Helena and ALOI use standardization, and Epsilon uses no preprocessing at all. For regression tasks, the targets are also standardized.
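For reference, the quantile transformation is sklearn's QuantileTransformer; a small self-contained sketch (the n_quantiles formula matches the normalize function shown further below, the toy data is my own):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X_train = rng.exponential(size=(1000, 4))            # skewed toy features

qt = QuantileTransformer(
    output_distribution='normal',                    # map each feature to a roughly normal shape
    n_quantiles=max(min(len(X_train) // 30, 1000), 10),
    random_state=0,
)
X_train_t = qt.fit_transform(X_train)                # fit on the training split only
print(X_train_t.mean(axis=0).round(2), X_train_t.std(axis=0).round(2))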

For each model, Optuna's Bayesian optimization is first run to find a "best" hyperparameter configuration on the validation set; the model is then retrained with 15 different random seeds, and these 15 runs are split into 3 groups of 5, with the predictions of the single models averaged within each group to form an ensemble.
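As I read it, the evaluation therefore reports both single-model and ensemble results; the sketch below only illustrates the grouping arithmetic with fake predictions (predict_with_seed is a placeholder, not the paper's code):

import numpy as np

rng = np.random.default_rng(0)
n_test = 1000
y_test = rng.integers(0, 2, n_test)

def predict_with_seed(seed):
    # placeholder for "retrain the tuned model with this seed and predict probabilities";
    # it returns noisy copies of the labels so that the example actually runs
    noise = np.random.default_rng(seed).normal(0, 0.3, n_test)
    return np.clip(y_test + noise, 0, 1)

all_preds = np.stack([predict_with_seed(s) for s in range(15)])               # (15, n_test)
ensembles = [all_preds[g * 5:(g + 1) * 5].mean(axis=0) for g in range(3)]     # 3 ensembles of 5 runs
scores = [((p > 0.5).astype(int) == y_test).mean() for p in ensembles]
print(np.round(scores, 3))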

Tuning Workflow

The authors put the whole tuning procedure into a single file, tune.py; to tune a model you just run this file with the corresponding configuration file (toml). The example from the GitHub README is:

python bin/tune.py output/california_housing/mlp/tuning/reproduced.toml

Take the toml configuration of FT-Transformer as an example:

program = 'bin/ft_transformer.py'

[base_config]
seed = 0

[base_config.data]
normalization = 'quantile'
path = 'data/california_housing'
y_policy = 'mean_std'

[base_config.model]
activation = 'reglu'
initialization = 'kaiming'
n_heads = 8
prenormalization = true

[base_config.training]
batch_size = 256
eval_batch_size = 8192
n_epochs = 1000000000
optimizer = 'adamw'
patience = 16

[optimization.options]
n_trials = 100

[optimization.sampler]
seed = 0

[optimization.space.model]
attention_dropout = [ 'uniform', 0.0, 0.5 ]
d_ffn_factor = [ '$d_ffn_factor', 1.0, 4.0 ]
d_token = [ '$d_token', 64, 512 ]
ffn_dropout = [ 'uniform', 0.0, 0.5 ]
n_layers = [ 'int', 1, 4 ]
residual_dropout = [ '?uniform', 0.0, 0.0, 0.2 ]

[optimization.space.training]
lr = [ 'loguniform', 1e-05, 0.001 ]
weight_decay = [ 'loguniform', 1e-06, 0.001 ]

Here, program points to the file containing the model definition and its training routine, while base_config holds the fixed settings used when training on this dataset. It contains the following entries:

seed: the random seed used for training.

data: everything related to data preprocessing, such as the normalization method, the data path and the y-value preprocessing policy.

model: the fixed model parameters, usually the ones that determine the model's structure and depth; these are not tuned.

training: the parameters used for training and evaluation, such as batch_size, the number of epochs, the optimizer and patience. Here patience is the number of consecutive epochs without improvement on the validation set after which training is stopped.

optimization: the parameters used when tuning with Optuna. options and sampler configure the Bayesian optimization and its sampling, space.model is the model's search space in which Optuna looks for the best combination, and space.training is the search space for training parameters such as the learning rate and weight decay, which Optuna samples as well.

Reading the toml file turns it into a nested dictionary.
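For example, with pytomlpp (imported as toml in the reproduction code in Part V), the file above becomes a plain dict of dicts; the path below is only a placeholder:

import pytomlpp as toml

args = toml.load('path/to/tuning_config.toml')              # e.g. the FT-Transformer toml shown above
print(args['base_config']['training']['batch_size'])        # 256
print(args['optimization']['space']['model']['n_layers'])   # ['int', 1, 4]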

Below is the tuning code.

program = lib.get_path(args['program'])
program_copy = program.with_name(
    program.stem + '___' + str(uuid.uuid4()).replace('-', '') + program.suffix
)
shutil.copyfile(program, program_copy)
atexit.register(lambda: program_copy.unlink())

checkpoint_path = output / 'checkpoint.pt'
if checkpoint_path.exists():
    checkpoint = torch.load(checkpoint_path)
    trial_configs, trial_stats, study, stats, timer = (
        checkpoint['trial_configs'],
        checkpoint['trial_stats'],
        checkpoint['study'],
        checkpoint['stats'],
        checkpoint['timer'],
    )
    zero.set_random_state(checkpoint['random_state'])
    if 'n_trials' in args['optimization']['options']:
        args['optimization']['options']['n_trials'] -= len(study.trials)
    if 'timeout' in args['optimization']['options']:
        args['optimization']['options']['timeout'] -= timer()
    stats.setdefault('continuations', []).append(len(study.trials))
    print(f'Loading checkpoint ({len(study.trials)})')
else:
    stats = lib.load_json(output / 'stats.json')
    trial_configs = []
    trial_stats = []
    timer = zero.Timer()
    study = optuna.create_study(
        direction='maximize',
        sampler=optuna.samplers.TPESampler(**args['optimization']['sampler']),
    )

timer.run()
# ignore the progress bar warning
warnings.filterwarnings('ignore', category=optuna.exceptions.ExperimentalWarning)
study.optimize(
    objective,
    **args['optimization']['options'],
    callbacks=[save_checkpoint],
    show_progress_bar=True,
)
best_trial_id = study.best_trial.number
lib.dump_toml(trial_configs[best_trial_id], output / 'best.toml')
stats['best_stats'] = trial_stats[best_trial_id]
stats['time'] = lib.format_seconds(timer())
lib.dump_stats(stats, output, True)
lib.backup_output(output)
The program first makes a temporary copy of the model definition file and deletes it when the process exits; my guess is that this avoids several processes touching the same .py file at once.

It then checks whether a checkpoint exists. If so, tuning resumes from the checkpoint and the number of already-finished trials is subtracted; otherwise tuning starts from scratch.

Finally study.optimize is called to run the search, where objective is the function we define as the optimization target.

  1. def sample_parameters(
  2. trial: optuna.trial.Trial,
  3. space: ty.Dict[str, ty.Any],
  4. base_config: ty.Dict[str, ty.Any],
  5. ) -> ty.Dict[str, ty.Any]:
  6. def get_distribution(distribution_name):
  7. return getattr(trial, f'suggest_{distribution_name}')
  8. result = {}
  9. for label, subspace in space.items():
  10. if isinstance(subspace, dict):
  11. result[label] = sample_parameters(trial, subspace, base_config)
  12. else:
  13. assert isinstance(subspace, list)
  14. distribution, *args = subspace
  15. if distribution.startswith('?'):
  16. default_value = args[0]
  17. result[label] = (
  18. get_distribution(distribution.lstrip('?'))(label, *args[1:])
  19. if trial.suggest_categorical(f'optional_{label}', [False, True])
  20. else default_value
  21. )
  22. elif distribution == '$mlp_d_layers':
  23. min_n_layers, max_n_layers, d_min, d_max = args
  24. n_layers = trial.suggest_int('n_layers', min_n_layers, max_n_layers)
  25. suggest_dim = lambda name: trial.suggest_int(name, d_min, d_max) # noqa
  26. d_first = [suggest_dim('d_first')] if n_layers else []
  27. d_middle = (
  28. [suggest_dim('d_middle')] * (n_layers - 2) if n_layers > 2 else []
  29. )
  30. d_last = [suggest_dim('d_last')] if n_layers > 1 else []
  31. result[label] = d_first + d_middle + d_last
  32. elif distribution == '$d_token':
  33. assert len(args) == 2
  34. try:
  35. n_heads = base_config['model']['n_heads']
  36. except KeyError:
  37. n_heads = base_config['model']['n_latent_heads']
  38. for x in args:
  39. assert x % n_heads == 0
  40. result[label] = trial.suggest_int('d_token', *args, n_heads) # type: ignore[code]
  41. elif distribution in ['$d_ffn_factor', '$d_hidden_factor']:
  42. if base_config['model']['activation'].endswith('glu'):
  43. args = (args[0] * 2 / 3, args[1] * 2 / 3)
  44. result[label] = trial.suggest_uniform('d_ffn_factor', *args)
  45. else:
  46. result[label] = get_distribution(distribution)(label, *args)
  47. return result
  48. def merge_sampled_parameters(config, sampled_parameters):
  49. for k, v in sampled_parameters.items():
  50. if isinstance(v, dict):
  51. merge_sampled_parameters(config.setdefault(k, {}), v)
  52. else:
  53. assert k not in config
  54. config[k] = v
  55. def objective(trial: optuna.trial.Trial) -> float:
  56. config = deepcopy(args['base_config'])
  57. merge_sampled_parameters(
  58. config, sample_parameters(trial, args['optimization']['space'], config)
  59. )
  60. if args.get('config_type') in ['trv2', 'trv4']:
  61. config['model']['d_token'] -= (
  62. config['model']['d_token'] % config['model']['n_heads']
  63. )
  64. if args.get('config_type') == 'trv4':
  65. if config['model']['activation'].endswith('glu'):
  66. # This adjustment is needed to keep the number of parameters roughly in the
  67. # same range as for non-glu activations
  68. config['model']['d_ffn_factor'] *= 2 / 3
  69. trial_configs.append(config)
  70. with tempfile.TemporaryDirectory() as dir_:
  71. dir_ = Path(dir_)
  72. out = dir_ / f'trial_{trial.number}'
  73. config_path = out.with_suffix('.toml')
  74. lib.dump_toml(config, config_path)
  75. python = Path('/miniconda3/envs/main/bin/python')
  76. subprocess.run(
  77. [
  78. str(python) if python.exists() else "python",
  79. str(program_copy),
  80. str(config_path),
  81. ],
  82. check=True,
  83. ) # training; subprocess.run can also return the exit status and other information
  84. stats = lib.load_json(out / 'stats.json')
  85. stats['algorithm'] = stats['algorithm'].rsplit('___', 1)[0]
  86. trial_stats.append(
  87. {
  88. **stats,
  89. 'trial_id': trial.number,
  90. 'tuning_time': lib.format_seconds(timer()),
  91. }
  92. )
  93. lib.dump_json(trial_stats, output / 'trial_stats.json', indent=4)
  94. lib.backup_output(output)
  95. print(f'Time: {lib.format_seconds(timer())}')
  96. return stats['metrics'][lib.VAL]['score']

trv2 and trv4 are not used in this project; they are presumably config types the authors used in other work.

The sample_parameters function first converts the parameter spaces from the toml file into suggestions on the Optuna trial. Note that suggest_uniform is deprecated starting with Optuna 3.0 (use suggest_float instead), so the last two branches need to be changed to:

            result[label] = trial.suggest_float('d_ffn_factor', args[0], args[1])  # trial.suggest_uniform('d_ffn_factor', *args)
        else:
            if distribution == "uniform":
                result[label] = trial.suggest_float(label, args[0], args[1])
            else:
                result[label] = get_distribution(distribution)(label, *args)
    return result

These are ultimately turned into the parameters that the trial will tune.

merge_sampled_parameters then merges all parameters into a single config dictionary, and subprocess.run passes this config to the .py file that defines the model, where training happens; the results are saved and returned. That is how the objective function used by Optuna is built.

Note that some parameter spaces receive special treatment:

If a parameter is marked with "?" in the toml file, e.g. dropout = [ '?uniform', 0.0, 0.0, 0.5 ], it is an "optional" tuning parameter. For this dropout entry that means: first sample whether to tune the parameter at all; if not, assign the default value 0.0; if yes, sample from the uniform distribution over 0 to 0.5.

For the special $mlp_d_layers space, search spaces are created for four things: the number of layers, and the widths of the first, middle and last layers.

The $d_token parameter usually appears in Transformer-like models, so the code checks that it is divisible by the number of attention heads (n_heads).

The remaining two parameters, $d_ffn_factor and $d_hidden_factor, are treated specially when a GLU-style activation is used: their bounds are multiplied by 2/3.
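The "?" convention maps onto Optuna's API roughly as follows; this is a small self-contained sketch of my own, not the authors' code (the dummy objective only exists so the example runs):

import optuna

def sample_optional_uniform(trial, label, default, low, high):
    # 'optional_<label>' decides whether the parameter is tuned at all
    if trial.suggest_categorical(f'optional_{label}', [False, True]):
        return trial.suggest_float(label, low, high)
    return default

def objective(trial):
    dropout = sample_optional_uniform(trial, 'residual_dropout', 0.0, 0.0, 0.2)
    return (dropout - 0.1) ** 2   # dummy objective

study = optuna.create_study(direction='minimize',
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)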

The data preprocessing and the training pipeline are described in detail below.

Data Preprocessing

First a Dataset object is created, which reads the .npy files from the data directory.

@dc.dataclass
class Dataset:
    N: ty.Optional[ArrayDict]
    C: ty.Optional[ArrayDict]
    y: ArrayDict
    info: ty.Dict[str, ty.Any]
    folder: ty.Optional[Path]

    @classmethod
    def from_dir(cls, dir_: ty.Union[Path, str]) -> 'Dataset':
        dir_ = Path(dir_)

        def load(item) -> ArrayDict:
            return {
                x: ty.cast(np.ndarray, np.load(dir_ / f'{item}_{x}.npy'))  # type: ignore[code]
                for x in ['train', 'val', 'test']
            }

        return Dataset(
            load('N') if dir_.joinpath('N_train.npy').exists() else None,
            load('C') if dir_.joinpath('C_train.npy').exists() else None,
            load('y'),
            util.load_json(dir_ / 'info.json'),
            dir_,
        )

Here C holds the categorical features, N the numerical features, and y the targets. X and y are then preprocessed separately.
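Based on from_dir above, a minimal usage sketch (it assumes the Dropbox archive has been unpacked so that data/adult contains info.json plus N_*.npy, C_*.npy and y_*.npy files):

D = Dataset.from_dir('data/adult')
print(D.info['task_type'])                           # e.g. 'binclass'
print(None if D.N is None else D.N['train'].shape)   # numerical features
print(None if D.C is None else D.C['train'].shape)   # categorical features
print(D.y['train'].shape)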

  1. def normalize(
  2. X: ArrayDict, normalization: str, seed: int, noise: float = 1e-3
  3. ) -> ArrayDict:
  4. X_train = X['train'].copy()
  5. if normalization == 'standard':
  6. normalizer = sklearn.preprocessing.StandardScaler()
  7. elif normalization == 'quantile':
  8. normalizer = sklearn.preprocessing.QuantileTransformer(
  9. output_distribution='normal',
  10. n_quantiles=max(min(X['train'].shape[0] // 30, 1000), 10),
  11. subsample=1e9,
  12. random_state=seed,
  13. )
  14. if noise:
  15. stds = np.std(X_train, axis=0, keepdims=True)
  16. noise_std = noise / np.maximum(stds, noise) # type: ignore[code]
  17. X_train += noise_std * np.random.default_rng(seed).standard_normal( # type: ignore[code]
  18. X_train.shape
  19. )
  20. else:
  21. util.raise_unknown('normalization', normalization)
  22. normalizer.fit(X_train)
  23. return {k: normalizer.transform(v) for k, v in X.items()} # type: ignore[code]
  24. def build_X(
  25. self,
  26. *,
  27. normalization: ty.Optional[str],
  28. num_nan_policy: str,
  29. cat_nan_policy: str,
  30. cat_policy: str,
  31. cat_min_frequency: float = 0.0,
  32. seed: int,
  33. ) -> ty.Union[ArrayDict, ty.Tuple[ArrayDict, ArrayDict]]:
  34. cache_path = (
  35. self.folder
  36. / f'build_X__{normalization}__{num_nan_policy}__{cat_nan_policy}__{cat_policy}__{seed}.pickle' # noqa
  37. if self.folder
  38. else None
  39. )
  40. if cache_path and cat_min_frequency:
  41. cache_path = cache_path.with_name(
  42. cache_path.name.replace('.pickle', f'__{cat_min_frequency}.pickle')
  43. )
  44. if cache_path and cache_path.exists():
  45. print(f'Using cached X: {cache_path}')
  46. with open(cache_path, 'rb') as f:
  47. return pickle.load(f)
  48. def save_result(x):
  49. if cache_path:
  50. with open(cache_path, 'wb') as f:
  51. pickle.dump(x, f)
  52. if self.N:
  53. N = deepcopy(self.N)
  54. num_nan_masks = {k: np.isnan(v) for k, v in N.items()}
  55. if any(x.any() for x in num_nan_masks.values()): # type: ignore[code]
  56. if num_nan_policy == 'mean':
  57. num_new_values = np.nanmean(self.N['train'], axis=0)
  58. else:
  59. util.raise_unknown('numerical NaN policy', num_nan_policy)
  60. for k, v in N.items():
  61. num_nan_indices = np.where(num_nan_masks[k])
  62. v[num_nan_indices] = np.take(num_new_values, num_nan_indices[1])
  63. if normalization:
  64. N = normalize(N, normalization, seed)
  65. else:
  66. N = None
  67. if cat_policy == 'drop' or not self.C:
  68. assert N is not None
  69. save_result(N)
  70. return N
  71. C = deepcopy(self.C)
  72. cat_nan_masks = {k: v == 'nan' for k, v in C.items()}
  73. if any(x.any() for x in cat_nan_masks.values()): # type: ignore[code]
  74. if cat_nan_policy == 'new':
  75. cat_new_value = '___null___'
  76. imputer = None
  77. elif cat_nan_policy == 'most_frequent':
  78. cat_new_value = None
  79. imputer = SimpleImputer(strategy=cat_nan_policy) # type: ignore[code]
  80. imputer.fit(C['train'])
  81. else:
  82. util.raise_unknown('categorical NaN policy', cat_nan_policy)
  83. if imputer:
  84. C = {k: imputer.transform(v) for k, v in C.items()}
  85. else:
  86. for k, v in C.items():
  87. cat_nan_indices = np.where(cat_nan_masks[k])
  88. v[cat_nan_indices] = cat_new_value
  89. if cat_min_frequency:
  90. C = ty.cast(ArrayDict, C)
  91. min_count = round(len(C['train']) * cat_min_frequency)
  92. rare_value = '___rare___'
  93. C_new = {x: [] for x in C}
  94. for column_idx in range(C['train'].shape[1]):
  95. counter = Counter(C['train'][:, column_idx].tolist())
  96. popular_categories = {k for k, v in counter.items() if v >= min_count}
  97. for part in C_new:
  98. C_new[part].append(
  99. [
  100. (x if x in popular_categories else rare_value)
  101. for x in C[part][:, column_idx].tolist()
  102. ]
  103. )
  104. C = {k: np.array(v).T for k, v in C_new.items()}
  105. unknown_value = np.iinfo('int64').max - 3
  106. encoder = sklearn.preprocessing.OrdinalEncoder(
  107. handle_unknown='use_encoded_value', # type: ignore[code]
  108. unknown_value=unknown_value, # type: ignore[code]
  109. dtype='int64', # type: ignore[code]
  110. ).fit(C['train'])
  111. C = {k: encoder.transform(v) for k, v in C.items()}
  112. max_values = C['train'].max(axis=0)
  113. for part in ['val', 'test']:
  114. for column_idx in range(C[part].shape[1]):
  115. C[part][C[part][:, column_idx] == unknown_value, column_idx] = (
  116. max_values[column_idx] + 1
  117. )
  118. if cat_policy == 'indices':
  119. result = (N, C)
  120. elif cat_policy == 'ohe':
  121. ohe = sklearn.preprocessing.OneHotEncoder(
  122. handle_unknown='ignore', sparse=False, dtype='float32' # type: ignore[code]
  123. )
  124. ohe.fit(C['train'])
  125. C = {k: ohe.transform(v) for k, v in C.items()}
  126. result = C if N is None else {x: np.hstack((N[x], C[x])) for x in N}
  127. elif cat_policy == 'counter':
  128. assert seed is not None
  129. loo = LeaveOneOutEncoder(sigma=0.1, random_state=seed, return_df=False)
  130. loo.fit(C['train'], self.y['train'])
  131. C = {k: loo.transform(v).astype('float32') for k, v in C.items()} # type: ignore[code]
  132. if not isinstance(C['train'], np.ndarray):
  133. C = {k: v.values for k, v in C.items()} # type: ignore[code]
  134. if normalization:
  135. C = normalize(C, normalization, seed, inplace=True) # type: ignore[code]
  136. result = C if N is None else {x: np.hstack((N[x], C[x])) for x in N}
  137. else:
  138. util.raise_unknown('categorical policy', cat_policy)
  139. save_result(result)
  140. return result # type: ignore[code]

The cache_path at the top stores the preprocessed features for a given combination of preprocessing parameters, so later runs with the same parameters can load the result directly instead of redoing the work. Numerical and categorical features are then preprocessed separately: missing values are imputed first, then the data is normalized. For numerical features the only imputation strategy is the column mean; for categorical features there are two strategies: either treat missing values as a new category, or fill them with the most frequent category of that feature. There is also a cat_min_frequency parameter: categories whose relative frequency is below it are merged into a single '___rare___' class. An OrdinalEncoder then maps the categories to integers. Finally, cat_policy determines the output: 'indices' returns the integer indices of the categories directly, 'ohe' one-hot encodes them with OneHotEncoder, and 'counter' uses a LeaveOneOutEncoder, which encodes each category by the mean of the target over the other rows in that category (excluding the current row).
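A usage sketch of build_X, mirroring the call made in bin/ft_transformer.py further below (the parameter values are only an example, not a tuned configuration, and it assumes D was created as above):

X = D.build_X(
    normalization='quantile',
    num_nan_policy='mean',
    cat_nan_policy='new',
    cat_policy='indices',   # -> returns the tuple (N, C)
    seed=0,
)
X_num, X_cat = X if isinstance(X, tuple) else (X, None)
Y, y_info = D.build_y(None)  # pass 'mean_std' for regression tasks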

def build_y(
    self, policy: ty.Optional[str]
) -> ty.Tuple[ArrayDict, ty.Optional[ty.Dict[str, ty.Any]]]:
    if self.is_regression:
        assert policy == 'mean_std'
    y = deepcopy(self.y)
    if policy:
        if not self.is_regression:
            warnings.warn('y_policy is not None, but the task is NOT regression')
            info = None
        elif policy == 'mean_std':
            mean, std = self.y['train'].mean(), self.y['train'].std()
            y = {k: (v - mean) / std for k, v in y.items()}
            info = {'policy': policy, 'mean': mean, 'std': std}
        else:
            util.raise_unknown('y policy', policy)
    else:
        info = None
    return y, info

As for the targets, a continuous y is simply standardized (subtract the mean, divide by the standard deviation), while a categorical y is left untouched (passing a policy anyway only produces a warning, and standardizing class labels would be meaningless).

Training

The authors' FT-Transformer is used as the example here.

  1. if __name__ == "__main__":
  2. args, output = lib.load_config()
  3. args['model'].setdefault('token_bias', True)
  4. args['model'].setdefault('kv_compression', None)
  5. args['model'].setdefault('kv_compression_sharing', None)
  6. # %%
  7. zero.set_randomness(args['seed'])
  8. dataset_dir = lib.get_path(args['data']['path'])
  9. stats: ty.Dict[str, ty.Any] = {
  10. 'dataset': dataset_dir.name,
  11. 'algorithm': Path(__file__).stem,
  12. **lib.load_json(output / 'stats.json'), # ** unpacks the key-value pairs of another dict into this one
  13. }
  14. timer = zero.Timer()
  15. timer.run()
  16. D = lib.Dataset.from_dir(dataset_dir)
  17. X = D.build_X(
  18. normalization=args['data'].get('normalization'),
  19. num_nan_policy='mean',
  20. cat_nan_policy='new',
  21. cat_policy=args['data'].get('cat_policy', 'indices'),
  22. cat_min_frequency=args['data'].get('cat_min_frequency', 0.0),
  23. seed=args['seed'],
  24. )
  25. if not isinstance(X, tuple):
  26. X = (X, None)
  27. zero.set_randomness(args['seed'])
  28. Y, y_info = D.build_y(args['data'].get('y_policy'))
  29. lib.dump_pickle(y_info, output / 'y_info.pickle')
  30. X = tuple(None if x is None else lib.to_tensors(x) for x in X)
  31. Y = lib.to_tensors(Y)
  32. device = lib.get_device()
  33. if device.type != 'cpu':
  34. X = tuple(
  35. None if x is None else {k: v.to(device) for k, v in x.items()} for x in X
  36. )
  37. Y_device = {k: v.to(device) for k, v in Y.items()}
  38. else:
  39. Y_device = Y
  40. X_num, X_cat = X
  41. del X
  42. if not D.is_multiclass:
  43. Y_device = {k: v.float() for k, v in Y_device.items()}
  44. train_size = D.size(lib.TRAIN)
  45. batch_size = args['training']['batch_size']
  46. epoch_size = stats['epoch_size'] = math.ceil(train_size / batch_size)
  47. eval_batch_size = args['training']['eval_batch_size']
  48. chunk_size = None
  49. loss_fn = (
  50. F.binary_cross_entropy_with_logits
  51. if D.is_binclass
  52. else F.cross_entropy
  53. if D.is_multiclass
  54. else F.mse_loss
  55. )
  56. model = Transformer(
  57. d_numerical=0 if X_num is None else X_num['train'].shape[1],
  58. categories=lib.get_categories(X_cat),
  59. d_out=D.info['n_classes'] if D.is_multiclass else 1,
  60. **args['model'],
  61. ).to(device)
  62. if torch.cuda.device_count() > 1: # type: ignore[code]
  63. print('Using nn.DataParallel')
  64. model = nn.DataParallel(model)
  65. stats['n_parameters'] = lib.get_n_parameters(model)
  66. def needs_wd(name):
  67. return all(x not in name for x in ['tokenizer', '.norm', '.bias'])
  68. for x in ['tokenizer', '.norm', '.bias']:
  69. assert any(x in a for a in (b[0] for b in model.named_parameters()))
  70. parameters_with_wd = [v for k, v in model.named_parameters() if needs_wd(k)]
  71. parameters_without_wd = [v for k, v in model.named_parameters() if not needs_wd(k)]
  72. optimizer = lib.make_optimizer(
  73. args['training']['optimizer'],
  74. (
  75. [
  76. {'params': parameters_with_wd},
  77. {'params': parameters_without_wd, 'weight_decay': 0.0},
  78. ]
  79. ),
  80. args['training']['lr'],
  81. args['training']['weight_decay'],
  82. )
  83. stream = zero.Stream(lib.IndexLoader(train_size, batch_size, True, device))
  84. progress = zero.ProgressTracker(args['training']['patience'])
  85. training_log = {lib.TRAIN: [], lib.VAL: [], lib.TEST: []}
  86. timer = zero.Timer()
  87. checkpoint_path = output / 'checkpoint.pt'
  88. def print_epoch_info():
  89. print(f'\n>>> Epoch {stream.epoch} | {lib.format_seconds(timer())} | {output}')
  90. print(
  91. ' | '.join(
  92. f'{k} = {v}'
  93. for k, v in {
  94. 'lr': lib.get_lr(optimizer),
  95. 'batch_size': batch_size,
  96. 'chunk_size': chunk_size,
  97. 'epoch_size': stats['epoch_size'],
  98. 'n_parameters': stats['n_parameters'],
  99. }.items()
  100. )
  101. )
  102. def apply_model(part, idx):
  103. return model(
  104. None if X_num is None else X_num[part][idx],
  105. None if X_cat is None else X_cat[part][idx],
  106. )
  107. @torch.no_grad()
  108. def evaluate(parts):
  109. global eval_batch_size
  110. model.eval()
  111. metrics = {}
  112. predictions = {}
  113. for part in parts:
  114. while eval_batch_size:
  115. try:
  116. predictions[part] = (
  117. torch.cat(
  118. [
  119. apply_model(part, idx)
  120. for idx in lib.IndexLoader(
  121. D.size(part), eval_batch_size, False, device
  122. )
  123. ]
  124. )
  125. .cpu()
  126. .numpy()
  127. )
  128. except RuntimeError as err:
  129. if not lib.is_oom_exception(err):
  130. raise
  131. eval_batch_size //= 2
  132. print('New eval batch size:', eval_batch_size)
  133. stats['eval_batch_size'] = eval_batch_size
  134. else:
  135. break
  136. if not eval_batch_size:
  137. RuntimeError('Not enough memory even for eval_batch_size=1')
  138. metrics[part] = lib.calculate_metrics(
  139. D.info['task_type'],
  140. Y[part].numpy(), # type: ignore[code]
  141. predictions[part], # type: ignore[code]
  142. 'logits',
  143. y_info,
  144. )
  145. for part, part_metrics in metrics.items():
  146. print(f'[{part:<5}]', lib.make_summary(part_metrics))
  147. return metrics, predictions
  148. def save_checkpoint(final):
  149. torch.save(
  150. {
  151. 'model': model.state_dict(),
  152. 'optimizer': optimizer.state_dict(),
  153. 'stream': stream.state_dict(),
  154. 'random_state': zero.get_random_state(),
  155. **{
  156. x: globals()[x]
  157. for x in [
  158. 'progress',
  159. 'stats',
  160. 'timer',
  161. 'training_log',
  162. ]
  163. },
  164. },
  165. checkpoint_path,
  166. )
  167. lib.dump_stats(stats, output, final)
  168. lib.backup_output(output)
  169. # %%
  170. timer.run()
  171. for epoch in stream.epochs(args['training']['n_epochs']):
  172. print_epoch_info()
  173. model.train()
  174. epoch_losses = []
  175. for batch_idx in epoch:
  176. loss, new_chunk_size = lib.train_with_auto_virtual_batch( # one training step
  177. optimizer,
  178. loss_fn,
  179. lambda x: (apply_model(lib.TRAIN, x), Y_device[lib.TRAIN][x]),
  180. batch_idx,
  181. chunk_size or batch_size,
  182. )
  183. epoch_losses.append(loss.detach())
  184. if new_chunk_size and new_chunk_size < (chunk_size or batch_size):
  185. stats['chunk_size'] = chunk_size = new_chunk_size
  186. print('New chunk size:', chunk_size)
  187. epoch_losses = torch.stack(epoch_losses).tolist()
  188. training_log[lib.TRAIN].extend(epoch_losses)
  189. print(f'[{lib.TRAIN}] loss = {round(sum(epoch_losses) / len(epoch_losses), 3)}')
  190. metrics, predictions = evaluate([lib.VAL, lib.TEST])
  191. for k, v in metrics.items():
  192. training_log[k].append(v)
  193. progress.update(metrics[lib.VAL]['score'])
  194. if progress.success:
  195. print('New best epoch!')
  196. stats['best_epoch'] = stream.epoch
  197. stats['metrics'] = metrics
  198. save_checkpoint(False)
  199. for k, v in predictions.items():
  200. np.save(output / f'p_{k}.npy', v)
  201. elif progress.fail:
  202. break
  203. # %%
  204. print('\nRunning the final evaluation...')
  205. model.load_state_dict(torch.load(checkpoint_path)['model'])
  206. stats['metrics'], predictions = evaluate(lib.PARTS)
  207. for k, v in predictions.items():
  208. np.save(output / f'p_{k}.npy', v)
  209. stats['time'] = lib.format_seconds(timer())
  210. save_checkpoint(True)
  211. print('Done!')

First, the load_config function defined in util.py reads the toml file passed on the command line and a stats.json file is created; the methods in data.py then build and preprocess X and Y. Besides the model architecture defined in each .py file, a few shared helpers organize the training:

The optimizer is built with a make_optimizer helper, which applies the weight_decay policy only to selected parameters (see the sketch right below).
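A small self-contained sketch of that grouping, using a toy module of my own (the lr and weight_decay values are placeholders); parameters whose names contain 'tokenizer', '.norm' or '.bias' are excluded from weight decay, exactly as needs_wd does in the training script:

import torch.nn as nn
import torch.optim as optim

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(16)
        self.linear = nn.Linear(16, 16)

class TinyNet(nn.Module):
    # toy stand-in whose parameter names contain 'tokenizer', '.norm' and '.bias'
    def __init__(self):
        super().__init__()
        self.tokenizer = nn.Linear(8, 16)
        self.block = Block()
        self.head = nn.Linear(16, 1)

model = TinyNet()

def needs_wd(name):
    # same rule as in the training script
    return all(x not in name for x in ['tokenizer', '.norm', '.bias'])

param_groups = [
    {'params': [p for n, p in model.named_parameters() if needs_wd(n)]},
    {'params': [p for n, p in model.named_parameters() if not needs_wd(n)], 'weight_decay': 0.0},
]
optimizer = optim.AdamW(param_groups, lr=1e-4, weight_decay=1e-5)   # placeholder values
print([len(g['params']) for g in param_groups])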

The author uses the Stream class from the libzero package to drive the loop over epochs and batches. Stream can store and restore the loop state at any time and lets you define custom epochs. The ProgressTracker class from the same package implements early stopping: when the validation score has not improved for 16 consecutive epochs, the progress.fail branch is taken and training stops early.

For the training step itself, the authors wrote train_with_auto_virtual_batch; the chunk_size here is halved whenever a batch of the current size no longer fits into memory.

def train_with_auto_virtual_batch(
    optimizer,
    loss_fn,
    step,
    batch,
    chunk_size: int,
) -> ty.Tuple[Tensor, int]:
    batch_size = len(batch)
    random_state = zero.get_random_state()
    while chunk_size != 0:
        try:
            zero.set_random_state(random_state)
            optimizer.zero_grad()
            if batch_size <= chunk_size:
                loss = loss_fn(*step(batch))
                loss.backward()
            else:
                loss = None
                for chunk in zero.iter_batches(batch, chunk_size):
                    chunk_loss = loss_fn(*step(chunk))
                    chunk_loss = chunk_loss * (len(chunk) / batch_size)
                    chunk_loss.backward()
                    if loss is None:
                        loss = chunk_loss.detach()
                    else:
                        loss += chunk_loss.detach()
        except RuntimeError as err:
            if not is_oom_exception(err):
                raise
            chunk_size //= 2
        else:
            break
    if not chunk_size:
        raise RuntimeError('Not enough memory even for batch_size=1')
    optimizer.step()
    return loss, chunk_size  # type: ignore[code]

zero.get_random_state() / zero.set_random_state() set the same random state globally for numpy, torch and random. Note that these functions are gone in version 0.0.8; if you want to step through the code exactly as written, install the version pinned in requirements.txt:

pip install libzero==0.0.3.dev7

zero.iter_batches does essentially the same job as putting the data into a DataLoader, but the zero package claims iter_batches is better because it indexes whole batches at once rather than fetching items one by one like a DataLoader.
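The idea of index-based batching, in a sketch of my own (this is not zero's API): shuffle the row indices once and slice tensors with whole index blocks.

import torch

def iter_index_batches(n_rows, batch_size, shuffle=True):
    # yield blocks of row indices; a whole batch is then selected with one indexing op,
    # e.g. X_train[batch_idx], instead of collecting items one by one
    idx = torch.randperm(n_rows) if shuffle else torch.arange(n_rows)
    for start in range(0, n_rows, batch_size):
        yield idx[start:start + batch_size]

X_train = torch.randn(1000, 8)
for batch_idx in iter_index_batches(len(X_train), 256):
    batch = X_train[batch_idx]        # (<=256, 8)
print(batch.shape)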

For CatBoost and XGBoost, a fit function already exists, so the parameters from the toml file can simply be passed through. The only caveat is that XGBoost does not automatically keep the model that performed best on the validation set, so an early-stopping parameter has to be passed explicitly.
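A hedged sketch of that caveat with synthetic data (hyperparameter values are illustrative, not the tuned configuration; in xgboost >= 1.6 early_stopping_rounds goes to the constructor, in older versions to fit()):

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_val = X[:1500], X[1500:]
y_train, y_val = y[:1500], y[1500:]

model = XGBClassifier(
    n_estimators=2000,
    learning_rate=0.1,
    eval_metric='logloss',
    early_stopping_rounds=50,   # keep track of the best validation iteration
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)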

IV. Partial Reproduction of the Experiments

Since my hardware is limited, I only ran FT-Transformer, ResNet, LightGBM and XGBoost on the Adult dataset, using the hyperparameters provided by the authors. As a comparison, FT-Transformer was additionally tuned on my own machine, and LassoNet was tuned and included as another baseline. The full code is given in Part V.

Results of the individual models:

model                           accuracy      recall
FT-Transformer                  0.859034048   0.646160513
FT-Transformer (my own tuning)  0.860176484   0.609707055
LassoNet                        0.856626333   0.593898423
LightGBM                        0.86809983    0.645103137
ResNet                          0.853411953   0.639781591
XGBoost                         0.87231333    0.640006934

The results show that, in terms of accuracy, the tree ensembles (XGBoost and LightGBM) still beat the deep learning models, and among the deep models the authors' FT-Transformer performs best. In addition, since the 0/1 class ratio of Adult is roughly 77:23, I also recorded each model's recall on the positive class (1). Here, too, the tree ensembles come out ahead. Although FT-Transformer's recall is close to that of the tree models, its accuracy gap is larger and its training takes longer and consumes more energy, so it cannot be said to outperform the tree models.

V. Reproduction Code

Although the authors provide the experiment code, my Linux system has some issues, and setting up the virtual environment under Windows ran into network problems that would have taken a while to resolve, so I made small modifications to the original code instead.

Hyperparameter Tuning of FT-Transformer

  1. # The model definition is the same as above and is omitted here
  2. # read the configuration
  3. import pytomlpp as toml
  4. ArrayDict = ty.Dict[str, np.ndarray]
  5. def normalize(
  6. X, normalization,seed,noise=1e-3
  7. ):
  8. X_train = X['train'].copy()
  9. if normalization == 'standard':
  10. normalizer = sklearn.preprocessing.StandardScaler()
  11. elif normalization == 'quantile':
  12. normalizer = sklearn.preprocessing.QuantileTransformer(
  13. output_distribution='normal',
  14. n_quantiles=max(min(X['train'].shape[0] // 30, 1000), 10),
  15. subsample=int(1e9),
  16. random_state=seed,
  17. )
  18. if noise:
  19. stds = np.std(X_train, axis=0, keepdims=True)
  20. noise_std = noise / np.maximum(stds, noise) # type: ignore[code]
  21. X_train += noise_std * np.random.default_rng(seed).standard_normal( # type: ignore[code]
  22. X_train.shape
  23. )
  24. else:
  25. raise ValueError('Unknow normalization')
  26. normalizer.fit(X_train)
  27. return {k: normalizer.transform(v) for k, v in X.items()} # type: ignore[code]
  28. class CustomDataset(Dataset):
  29. def __init__(self,dir_,data_part,normalization,num_nan_policy,cat_nan_policy,
  30. cat_policy,seed,
  31. y_poicy=None,cat_min_frequency=0
  32. ):
  33. super(CustomDataset,self).__init__()
  34. dir_ = Path(dir_)
  35. def load(item) -> ArrayDict:
  36. return {
  37. x: ty.cast(np.ndarray, np.load(dir_ / f'{item}_{x}.npy')) # type: ignore[code]
  38. for x in ['train', 'val', 'test']
  39. }
  40. self.N = load('N') if dir_.joinpath('N_train.npy').exists() else None
  41. self.C = load('C') if dir_.joinpath('C_train.npy').exists() else None
  42. self.y = load('y')
  43. self.info = json.loads((dir_ / 'info.json').read_text())
  44. #pre-process
  45. cache_path = f"build_dataset_{normalization}__{num_nan_policy}__{cat_nan_policy}__{cat_policy}__{seed}.pickle"
  46. if cat_min_frequency>0:
  47. cache_path = cache_path.replace('.pickle', f'__{cat_min_frequency}.pickle')
  48. cache_path = Path(cache_path)
  49. if cache_path.exists():
  50. print("Using cache")
  51. with open(cache_path, 'rb') as f:
  52. data = pickle.load(f)
  53. self.x = data
  54. else:
  55. def save_result(x):
  56. if cache_path:
  57. with open(cache_path, 'wb') as f:
  58. pickle.dump(x, f)
  59. if self.N:
  60. N = deepcopy(self.N)
  61. num_nan_masks = {k: np.isnan(v) for k, v in N.items()}
  62. if any(x.any() for x in num_nan_masks.values()): # type: ignore[code]
  63. if num_nan_policy == 'mean':
  64. num_new_values = np.nanmean(self.N['train'], axis=0)
  65. else:
  66. raise ValueError('Unknown numerical NaN policy')
  67. for k, v in N.items():
  68. num_nan_indices = np.where(num_nan_masks[k])
  69. v[num_nan_indices] = np.take(num_new_values, num_nan_indices[1])
  70. if normalization:
  71. N = normalize(N, normalization, seed)
  72. else:
  73. N = None
  74. C = deepcopy(self.C)
  75. cat_nan_masks = {k: v == 'nan' for k, v in C.items()}
  76. if any(x.any() for x in cat_nan_masks.values()): # type: ignore[code]
  77. if cat_nan_policy == 'new':
  78. cat_new_value = '___null___'
  79. imputer = None
  80. elif cat_nan_policy == 'most_frequent':
  81. cat_new_value = None
  82. imputer = SimpleImputer(strategy=cat_nan_policy) # type: ignore[code]
  83. imputer.fit(C['train'])
  84. else:
  85. raise ValueError('Unknown categorical NaN policy')
  86. if imputer:
  87. C = {k: imputer.transform(v) for k, v in C.items()}
  88. else:
  89. for k, v in C.items():
  90. cat_nan_indices = np.where(cat_nan_masks[k])
  91. v[cat_nan_indices] = cat_new_value
  92. if cat_min_frequency:
  93. C = ty.cast(ArrayDict, C)
  94. min_count = round(len(C['train']) * cat_min_frequency)
  95. rare_value = '___rare___'
  96. C_new = {x: [] for x in C}
  97. for column_idx in range(C['train'].shape[1]):
  98. counter = Counter(C['train'][:, column_idx].tolist())
  99. popular_categories = {k for k, v in counter.items() if v >= min_count}
  100. for part in C_new:
  101. C_new[part].append(
  102. [
  103. (x if x in popular_categories else rare_value)
  104. for x in C[part][:, column_idx].tolist()
  105. ]
  106. )
  107. C = {k: np.array(v).T for k, v in C_new.items()}
  108. unknown_value = np.iinfo('int64').max - 3
  109. encoder = sklearn.preprocessing.OrdinalEncoder(
  110. handle_unknown='use_encoded_value', # type: ignore[code]
  111. unknown_value=unknown_value, # type: ignore[code]
  112. dtype='int64', # type: ignore[code]
  113. ).fit(C['train'])
  114. C = {k: encoder.transform(v) for k, v in C.items()}
  115. max_values = C['train'].max(axis=0)
  116. for part in ['val', 'test']:
  117. for column_idx in range(C[part].shape[1]):
  118. C[part][C[part][:, column_idx] == unknown_value, column_idx] = (
  119. max_values[column_idx] + 1
  120. )
  121. if cat_policy == 'indices':
  122. result = (N, C)
  123. elif cat_policy == 'ohe':
  124. ohe = sklearn.preprocessing.OneHotEncoder(
  125. handle_unknown='ignore', sparse=False, dtype='float32' # type: ignore[code]
  126. )
  127. ohe.fit(C['train'])
  128. C = {k: ohe.transform(v) for k, v in C.items()}
  129. result = C if N is None else {x: np.hstack((N[x], C[x])) for x in N}
  130. elif cat_policy == 'counter':
  131. assert seed is not None
  132. loo = LeaveOneOutEncoder(sigma=0.1, random_state=seed, return_df=False)
  133. loo.fit(C['train'], self.y['train'])
  134. C = {k: loo.transform(v).astype('float32') for k, v in C.items()} # type: ignore[code]
  135. if not isinstance(C['train'], np.ndarray):
  136. C = {k: v.values for k, v in C.items()} # type: ignore[code]
  137. if normalization:
  138. C = normalize(C, normalization, seed, inplace=True) # type: ignore[code]
  139. result = C if N is None else {x: np.hstack((N[x], C[x])) for x in N}
  140. else:
  141. raise ValueError('Unknow categorical policy')
  142. save_result(result)
  143. self.x = result
  144. self.X_num,self.X_cat = self.x
  145. self.X_num = None if self.X_num is None else self.X_num[data_part]
  146. self.X_cat = None if self.X_cat is None else self.X_cat[data_part]
  147. # build Y
  148. if self.info['task_type'] == 'regression':
  149. assert policy == 'mean_std'
  150. y = deepcopy(self.y)
  151. if y_poicy:
  152. if not self.info['task_type'] == 'regression':
  153. warnings.warn('y_policy is not None, but the task is NOT regression')
  154. info = None
  155. elif y_poicy == 'mean_std':
  156. mean, std = self.y['train'].mean(), self.y['train'].std()
  157. y = {k: (v - mean) / std for k, v in y.items()}
  158. info = {'policy': policy, 'mean': mean, 'std': std}
  159. else:
  160. raise ValueError('Unknow y policy')
  161. else:
  162. info = None
  163. self.y = y[data_part]
  164. if len(self.y.shape)==1:
  165. self.y = self.y.reshape((self.y.shape[0],1))
  166. self.y_info = info
  167. def __len__(self):
  168. X = self.X_num if self.X_num is not None else self.X_cat
  169. return len(X)
  170. def __getitem__(self,idx):
  171. return torch.FloatTensor(self.X_num[idx]).to(device),torch.IntTensor(self.X_cat[idx]).to(device),torch.FloatTensor(self.y[idx]).to(device)
  172. data_path_father = "D:/rtdl_data.tar/rtdl_data/data/"
  173. configs = toml.load("D:/rtdl-revisiting-models-main/output/adult/ft_transformer/tuning/0.toml")
  174. data_configs = configs["base_config"]["data"]
  175. configs["base_config"]["model"].setdefault('token_bias', True)
  176. configs["base_config"]["model"].setdefault('kv_compression', None)
  177. configs["base_config"]["model"].setdefault('kv_compression_sharing', None)
  178. D_train = CustomDataset(
  179. data_path_father+"adult"
  180. ,data_part="train"
  181. ,normalization=data_configs["normalization"]
  182. ,num_nan_policy="mean"
  183. ,cat_nan_policy="new"
  184. ,cat_policy=data_configs.get("cat_policy", 'indices')
  185. ,seed=configs["base_config"]["seed"]
  186. ,y_poicy=data_configs.get("y_policy"),cat_min_frequency=0
  187. )
  188. D_valid = CustomDataset(
  189. data_path_father+"adult"
  190. ,data_part="val"
  191. ,normalization=data_configs["normalization"]
  192. ,num_nan_policy="mean"
  193. ,cat_nan_policy="new"
  194. ,cat_policy=data_configs.get("cat_policy", 'indices')
  195. ,seed=configs["base_config"]["seed"]
  196. ,y_poicy=data_configs.get("y_policy"),cat_min_frequency=0
  197. )
  198. D_test = CustomDataset(
  199. data_path_father+"adult"
  200. ,data_part="test"
  201. ,normalization=data_configs["normalization"]
  202. ,num_nan_policy="mean"
  203. ,cat_nan_policy="new"
  204. ,cat_policy=data_configs.get("cat_policy", 'indices')
  205. ,seed=configs["base_config"]["seed"]
  206. ,y_poicy=data_configs.get("y_policy"),cat_min_frequency=0
  207. )
  208. dl_train = DataLoader(D_train,batch_size=configs["base_config"]["training"]["batch_size"])
  209. dl_val = DataLoader(D_valid,batch_size=len(D_valid))
  210. dl_test = DataLoader(D_test,batch_size=len(D_test))
  211. def make_optimizer(
  212. optimizer: str,
  213. parameter_groups,
  214. lr: float,
  215. weight_decay: float,
  216. ) -> optim.Optimizer:
  217. Optimizer = {
  218. 'adam': optim.Adam,
  219. 'adamw': optim.AdamW,
  220. 'sgd': optim.SGD,
  221. }[optimizer]
  222. momentum = (0.9,) if Optimizer is optim.SGD else ()
  223. return Optimizer(parameter_groups, lr, *momentum, weight_decay=weight_decay)
  224. def needs_wd(name):
  225. return all(x not in name for x in ['tokenizer', '.norm', '.bias'])
  226. import optuna
  227. def sample_parameters(trial,space,base_config):
  228. def get_distribution(distribution_name):
  229. return getattr(trial, f'suggest_{distribution_name}')
  230. result = {}
  231. for label, subspace in space.items():
  232. if isinstance(subspace, dict):
  233. result[label] = sample_parameters(trial, subspace, base_config)
  234. else:
  235. assert isinstance(subspace, list)
  236. distribution, *args = subspace
  237. if distribution.startswith('?'): # my understanding: on top of its normal tuning range this parameter gets an extra "optional_" choice: first decide whether to simply use the default value; if not, tune it within the given range to see which is better
  238. default_value = args[0]
  239. result[label] = (
  240. get_distribution(distribution.lstrip('?'))(label, *args[1:])
  241. if trial.suggest_categorical(f'optional_{label}', [False, True])
  242. else default_value
  243. )
  244. elif distribution == '$mlp_d_layers': # special list-valued format
  245. min_n_layers, max_n_layers, d_min, d_max = args
  246. n_layers = trial.suggest_int('n_layers', min_n_layers, max_n_layers)
  247. suggest_dim = lambda name: trial.suggest_int(name, d_min, d_max) # noqa
  248. d_first = [suggest_dim('d_first')] if n_layers else []
  249. d_middle = (
  250. [suggest_dim('d_middle')] * (n_layers - 2) if n_layers > 2 else []
  251. )
  252. d_last = [suggest_dim('d_last')] if n_layers > 1 else []
  253. result[label] = d_first + d_middle + d_last
  254. elif distribution == '$d_token': # with an extra divisibility check
  255. assert len(args) == 2
  256. try:
  257. n_heads = base_config['model']['n_heads']
  258. except KeyError:
  259. n_heads = base_config['model']['n_latent_heads']
  260. for x in args:
  261. assert x % n_heads == 0
  262. result[label] = trial.suggest_int('d_token', *args, n_heads) # n_heads is the step size, so d_token is always divisible by n_heads # type: ignore[code]
  263. elif distribution in ['$d_ffn_factor', '$d_hidden_factor']: # these two parameters get special handling for GLU-style activations
  264. if base_config['model']['activation'].endswith('glu'):
  265. args = (args[0] * 2 / 3, args[1] * 2 / 3)
  266. result[label] = trial.suggest_uniform('d_ffn_factor', *args)
  267. else:
  268. result[label] = get_distribution(distribution)(label, *args)
  269. return result
  270. def merge_sampled_parameters(config, sampled_parameters):
  271. for k, v in sampled_parameters.items():
  272. if isinstance(v, dict):
  273. merge_sampled_parameters(config.setdefault(k, {}), v)
  274. else:
  275. assert k not in config
  276. config[k] = v
  277. def objective(trial):
  278. config = deepcopy(configs['base_config'])
  279. merge_sampled_parameters(
  280. config, sample_parameters(trial, configs['optimization']['space'], config)
  281. )
  282. model = Transformer(
  283. d_num=0 if D_train.X_num is None else D_train.X_num.shape[1],
  284. categories = None if D_train.X_cat is None else [len(set(D_train.X_cat[:, i].tolist())) for i in range(D_train.X_cat.shape[1])],
  285. d_out=D_train.info['n_classes'] if D_train.info["task_type"]=="multiclass" else 1
  286. ,**config['model']
  287. ).to(device)
  288. parameters_with_wd = [v for k, v in model.named_parameters() if needs_wd(k)]
  289. parameters_without_wd = [v for k, v in model.named_parameters() if not needs_wd(k)]
  290. loss_fn = (
  291. F.binary_cross_entropy_with_logits
  292. if D_train.info["task_type"]=="binclass"
  293. else F.cross_entropy
  294. if D_train.info["task_type"]=="multiclass"
  295. else F.mse_loss
  296. )
  297. optimizer = make_optimizer(
  298. config["training"]["optimizer"],
  299. (
  300. [
  301. {'params': parameters_with_wd},
  302. {'params': parameters_without_wd, 'weight_decay': 0.0},
  303. ]
  304. ),
  305. config["training"]["lr"],#to be trained in optuna
  306. config["training"]["weight_decay"]#to be trained in optuna
  307. )
  308. loss_best = np.nan
  309. best_epoch = -1
  310. patience = 0
  311. def save_state():
  312. torch.save(
  313. model.state_dict(),os.path.join(os.getcwd(),"ft_transformer_state.pickle")
  314. )
  315. with open(os.path.join(os.getcwd(),"best_state_ft_transformer.json"),"w") as f:
  316. json.dump(config,f)
  317. #dl_train.batch_size=config['training']['batch_size']
  318. for epoch in range(config['training']['n_epochs']):
  319. model.train()
  320. for i,(x_num,x_cat,y) in enumerate(dl_train):
  321. optimizer.zero_grad()
  322. y_batch = model(x_num,x_cat)
  323. loss = loss_fn(y_batch.reshape((y_batch.shape[0],1)),y)
  324. loss.backward()
  325. optimizer.step()
  326. model.eval()
  327. with torch.no_grad():
  328. for i,(x_num,x_cat,y) in enumerate(dl_val): # only one iteration
  329. y_batch = model(x_num,x_cat)
  330. loss = loss_fn(y_batch.reshape((y_batch.shape[0],1)),y)
  331. new_loss = loss.detach()
  332. if np.isnan(loss_best) or new_loss.cpu().numpy() < loss_best:
  333. patience = 0
  334. best_epoch = epoch
  335. loss_best = new_loss.cpu().numpy()
  336. save_state()
  337. else:
  338. patience+=1
  339. if patience>= config['training']['patience']:
  340. break
  341. return loss_best
  342. study = optuna.create_study(
  343. direction="minimize",
  344. sampler=optuna.samplers.TPESampler(**configs['optimization']['sampler']),
  345. )
  346. study.optimize(
  347. objective,
  348. **configs['optimization']['options'],
  349. #callbacks=[save_checkpoint],
  350. show_progress_bar=True,
  351. )

Model Definition and Hyperparameter Tuning of LassoNet

First, define the toml file:

[base_config]
seed = 0

[base_config.data]
normalization = 'quantile'
path = 'data/adult'
cat_policy = 'indices'

[base_config.model]

[base_config.training]
batch_size = 256
eval_batch_size = 8192
n_epochs = 1000000000
optimizer = 'adamw'
patience = 16

[optimization.options]
n_trials = 100

[optimization.sampler]
seed = 0

[optimization.space.model]
dims = [ '$mlp_d_layers', 1, 8, 1, 512 ]
d_embedding = ['int', 64, 512]
dropout = [ '?uniform', 0.0, 0.0, 0.5 ]
gamma = [ '?loguniform', 0, 1e-08, 100.0 ]
lambda_ = [ '?loguniform', 0, 1e-08, 100.0 ]
M = [ 'int', 10, 50 ]
gamma_skip = [ '?loguniform', 0, 1e-08, 100.0 ]

[optimization.space.training]
lr = [ 'loguniform', 1e-05, 0.01 ]
weight_decay = [ '?loguniform', 0.0, 1e-06, 0.001 ]
  1. from itertools import islice
  2. def soft_threshold(l, x):
  3. return torch.sign(x) * torch.relu(torch.abs(x) - l)
  4. def sign_binary(x):
  5. ones = torch.ones_like(x)
  6. return torch.where(x >= 0, ones, -ones)
  7. def prox(v, u, *, lambda_, lambda_bar, M):
  8. """
  9. v has shape (m,) or (m, batches)
  10. u has shape (k,) or (k, batches)
  11. supports GPU tensors
  12. """
  13. onedim = len(v.shape) == 1
  14. if onedim:
  15. v = v.unsqueeze(-1)
  16. u = u.unsqueeze(-1)
  17. u_abs_sorted = torch.sort(u.abs(), dim=0, descending=True).values
  18. k, batch = u.shape
  19. s = torch.arange(k + 1.0).view(-1, 1).to(v)
  20. zeros = torch.zeros(1, batch).to(u)
  21. a_s = lambda_ - M * torch.cat(
  22. [zeros, torch.cumsum(u_abs_sorted - lambda_bar, dim=0)]
  23. )
  24. norm_v = torch.norm(v, p=2, dim=0)
  25. x = F.relu(1 - a_s / norm_v) / (1 + s * M ** 2)
  26. w = M * x * norm_v
  27. intervals = soft_threshold(lambda_bar, u_abs_sorted)
  28. lower = torch.cat([intervals, zeros])
  29. idx = torch.sum(lower > w, dim=0).unsqueeze(0)
  30. x_star = torch.gather(x, 0, idx).view(1, batch)
  31. w_star = torch.gather(w, 0, idx).view(1, batch)
  32. beta_star = x_star * v
  33. theta_star = sign_binary(u) * torch.min(soft_threshold(lambda_bar, u.abs()), w_star)
  34. if onedim:
  35. beta_star.squeeze_(-1)
  36. theta_star.squeeze_(-1)
  37. return beta_star, theta_star
  38. def inplace_prox(beta, theta, lambda_, lambda_bar, M):
  39. beta.weight.data, theta.weight.data = prox(
  40. beta.weight.data, theta.weight.data, lambda_=lambda_, lambda_bar=lambda_bar, M=M
  41. )
  42. def inplace_group_prox(groups, beta, theta, lambda_, lambda_bar, M):
  43. """
  44. groups is an iterable such that group[i] contains the indices of features in group i
  45. """
  46. beta_ = beta.weight.data
  47. theta_ = theta.weight.data
  48. beta_ans = torch.empty_like(beta_)
  49. theta_ans = torch.empty_like(theta_)
  50. for g in groups:
  51. group_beta = beta_[:, g]
  52. group_beta_shape = group_beta.shape
  53. group_theta = theta_[:, g]
  54. group_theta_shape = group_theta.shape
  55. group_beta, group_theta = prox(
  56. group_beta.reshape(-1),
  57. group_theta.reshape(-1),
  58. lambda_=lambda_,
  59. lambda_bar=lambda_bar,
  60. M=M,
  61. )
  62. beta_ans[:, g] = group_beta.reshape(*group_beta_shape)
  63. theta_ans[:, g] = group_theta.reshape(*group_theta_shape)
  64. beta.weight.data, theta.weight.data = beta_ans, theta_ans
  65. class LassoNet(nn.Module):
  66. def __init__(self,d_numerical,categories,d_out,d_embedding, dims,gamma,gamma_skip,lambda_,M, groups=None, dropout=None):
  67. """
  68. first dimension is input
  69. last dimension is output
  70. `groups` is a list of list such that `groups[i]`
  71. contains the indices of the features in the i-th group
  72. """
  73. #assert len(dims) > 2
  74. if groups is not None:
  75. n_inputs = dims[0]
  76. all_indices = []
  77. for g in groups:
  78. for i in g:
  79. all_indices.append(i)
  80. assert len(all_indices) == n_inputs and set(all_indices) == set(
  81. range(n_inputs)
  82. ), f"Groups must be a partition of range(n_inputs={n_inputs})"
  83. self.groups = groups
  84. super().__init__()
  85. # added: input handling for numerical and categorical features
  86. d_in = d_numerical
  87. if categories is not None:
  88. d_in += len(categories) * d_embedding
  89. category_offsets = torch.tensor([0] + categories[:-1]).cumsum(0)
  90. self.register_buffer('category_offsets', category_offsets)
  91. self.category_embeddings = nn.Embedding(sum(categories), d_embedding)
  92. nn.init.kaiming_uniform_(self.category_embeddings.weight, a=math.sqrt(5))
  93. print(f'{self.category_embeddings.weight.shape=}')
  94. dims = [d_in]+dims+[d_out]
  95. self.gamma = gamma
  96. self.gamma_skip = gamma_skip
  97. self.lambda_ = lambda_
  98. self.M = M
  99. # end of the added part
  100. self.dropout = nn.Dropout(p=dropout) if dropout is not None else None
  101. self.layers = nn.ModuleList(
  102. [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
  103. )
  104. self.skip = nn.Linear(dims[0], dims[-1], bias=False)
  105. def forward(self, x_num,x_cat):
  106. inp = []
  107. if x_num is not None:
  108. inp.append(x_num)
  109. if x_cat is not None:
  110. inp.append(
  111. self.category_embeddings(x_cat + self.category_offsets[None]).view(
  112. x_cat.size(0), -1
  113. )
  114. )
  115. inp = torch.cat(inp, dim=-1)
  116. current_layer = inp
  117. result = self.skip(inp)
  118. for theta in self.layers:
  119. current_layer = theta(current_layer)
  120. if theta is not self.layers[-1]:
  121. if self.dropout is not None:
  122. current_layer = self.dropout(current_layer)
  123. current_layer = F.relu(current_layer)
  124. return result + current_layer
  125. def prox(self, *, lambda_, lambda_bar=0, M=1):
  126. if self.groups is None:
  127. with torch.no_grad():
  128. inplace_prox(
  129. beta=self.skip,
  130. theta=self.layers[0],
  131. lambda_=lambda_,
  132. lambda_bar=lambda_bar,
  133. M=M,
  134. )
  135. else:
  136. with torch.no_grad():
  137. inplace_group_prox(
  138. groups=self.groups,
  139. beta=self.skip,
  140. theta=self.layers[0],
  141. lambda_=lambda_,
  142. lambda_bar=lambda_bar,
  143. M=M,
  144. )
  145. def lambda_start(
  146. self,
  147. M=1,
  148. lambda_bar=0,
  149. factor=2,
  150. ):
  151. """Estimate when the model will start to sparsify."""
  152. def is_sparse(lambda_):
  153. with torch.no_grad():
  154. beta = self.skip.weight.data
  155. theta = self.layers[0].weight.data
  156. for _ in range(10000):
  157. new_beta, theta = prox(
  158. beta,
  159. theta,
  160. lambda_=lambda_,
  161. lambda_bar=lambda_bar,
  162. M=M,
  163. )
  164. if torch.abs(beta - new_beta).max() < 1e-5:
  165. break
  166. beta = new_beta
  167. return (torch.norm(beta, p=2, dim=0) == 0).sum()
  168. start = 1e-6
  169. while not is_sparse(factor * start):
  170. start *= factor
  171. return start
  172. def l2_regularization(self):
  173. """
  174. L2 regulatization of the MLP without the first layer
  175. which is bounded by the skip connection
  176. """
  177. ans = 0
  178. for layer in islice(self.layers, 1, None):
  179. ans += (
  180. torch.norm(
  181. layer.weight.data,
  182. p=2,
  183. )
  184. ** 2
  185. )
  186. return ans
  187. def l1_regularization_skip(self):
  188. return torch.norm(self.skip.weight.data, p=2, dim=0).sum()
  189. def l2_regularization_skip(self):
  190. return torch.norm(self.skip.weight.data, p=2)
  191. def input_mask(self):
  192. with torch.no_grad():
  193. return torch.norm(self.skip.weight.data, p=2, dim=0) != 0
  194. def selected_count(self):
  195. return self.input_mask().sum().item()
  196. def cpu_state_dict(self):
  197. return {k: v.detach().clone().cpu() for k, v in self.state_dict().items()}
  198. configs = toml.load("D:/rtdl-revisiting-models-main/output/adult/lassoNet/tunning/0.toml")
  199. def objective(trial):
  200. config = deepcopy(configs['base_config'])
  201. merge_sampled_parameters(
  202. config, sample_parameters(trial, configs['optimization']['space'], config)
  203. )
  204. model = LassoNet(
  205. d_numerical=0 if D_train.X_num is None else D_train.X_num.shape[1],
  206. categories = None if D_train.X_cat is None else [len(set(D_train.X_cat[:, i].tolist())) for i in range(D_train.X_cat.shape[1])],
  207. d_out=D_train.info['n_classes'] if D_train.info["task_type"]=="multiclass" else 1
  208. ,**config['model']
  209. ).to(device)
  210. parameters_with_wd = [v for k, v in model.named_parameters() if needs_wd(k)]
  211. parameters_without_wd = [v for k, v in model.named_parameters() if not needs_wd(k)]
  212. loss_fn = (
  213. F.binary_cross_entropy_with_logits
  214. if D_train.info["task_type"]=="binclass"
  215. else F.cross_entropy
  216. if D_train.info["task_type"]=="multiclass"
  217. else F.mse_loss
  218. )
  219. optimizer = make_optimizer(
  220. config["training"]["optimizer"],
  221. (
  222. [
  223. {'params': parameters_with_wd},
  224. {'params': parameters_without_wd, 'weight_decay': 0.0},
  225. ]
  226. ),
  227. config["training"]["lr"],#to be trained in optuna
  228. config["training"]["weight_decay"]#to be trained in optuna
  229. )
  230. loss_best = np.nan
  231. best_epoch = -1
  232. patience = 0
  233. def save_state():
  234. torch.save(
  235. model.state_dict(),os.path.join(os.getcwd(),"lassonet_state.pickle")
  236. )
  237. with open(os.path.join(os.getcwd(),"best_state_lassonet.json"),"w") as f:
  238. json.dump(config,f)
  239. #dl_train.batch_size=config['training']['batch_size']
  240. for epoch in range(config['training']['n_epochs']):
  241. model.train()
  242. for i,(x_num,x_cat,y) in enumerate(dl_train):
  243. optimizer.zero_grad()
  244. # y_batch = model(x_num,x_cat)
  245. loss = 0
  246. def closure():
  247. nonlocal loss
  248. optimizer.zero_grad()
  249. ans = (
  250. loss_fn(model(x_num,x_cat), y)
  251. + model.gamma * model.l2_regularization()
  252. + model.gamma_skip * model.l2_regularization_skip()
  253. )
  254. ans.backward() # corresponds to line 7 of the LassoNet training algorithm: compute the gradient of the loss
  255. loss += ans.item()# * len(batch) / n_train
  256. return ans
  257. optimizer.step(closure)
  258. model.prox(lambda_=model.lambda_ * optimizer.param_groups[0]["lr"], M=model.M) # the Hier-Prox algorithm
  259. model.eval()
  260. with torch.no_grad():
  261. for i,(x_num,x_cat,y) in enumerate(dl_val): # only one iteration
  262. y_batch = model(x_num,x_cat)#.reshape((y_batch.shape[0],1))
  263. loss = (
  264. loss_fn(y_batch.reshape((y_batch.shape[0],1)), y).item()
  265. + model.gamma * model.l2_regularization().item()
  266. + model.gamma_skip * model.l2_regularization_skip().item()
  267. + model.lambda_ * model.l1_regularization_skip().item()
  268. )
  269. new_loss = loss#.detach()
  270. if np.isnan(loss_best) or new_loss < loss_best:
  271. patience = 0
  272. best_epoch = epoch
  273. loss_best = new_loss#.cpu().numpy()
  274. save_state()
  275. else:
  276. patience+=1
  277. if patience>= config['training']['patience']:
  278. break
  279. return loss_best
  280. study = optuna.create_study(
  281. direction="minimize",
  282. sampler=optuna.samplers.TPESampler(**configs['optimization']['sampler']),
  283. )
  284. study.optimize(
  285. objective,
  286. **configs['optimization']['options'],
  287. #callbacks=[save_checkpoint],
  288. show_progress_bar=True,
  289. )

Training with 15 Random Seeds and Recording the Results

  1. ## Model definitions are the same as above and are not repeated here
  2. ArrayDict = ty.Dict[str, np.ndarray]
  3. def normalize(
  4. X, normalization,seed,noise=1e-3
  5. ):
  6. X_train = X['train'].copy()
  7. if normalization == 'standard':
  8. normalizer = sklearn.preprocessing.StandardScaler()
  9. elif normalization == 'quantile':
  10. normalizer = sklearn.preprocessing.QuantileTransformer(
  11. output_distribution='normal',
  12. n_quantiles=max(min(X['train'].shape[0] // 30, 1000), 10),
  13. subsample=int(1e9),
  14. random_state=seed,
  15. )
  16. if noise:
  17. stds = np.std(X_train, axis=0, keepdims=True)
  18. noise_std = noise / np.maximum(stds, noise) # type: ignore[code]
  19. X_train += noise_std * np.random.default_rng(seed).standard_normal( # type: ignore[code]
  20. X_train.shape
  21. )
  22. else:
  23. raise ValueError('Unknow normalization')
  24. normalizer.fit(X_train)
  25. return {k: normalizer.transform(v) for k, v in X.items()} # type: ignore[code]
  26. class CustomDataset(Dataset):
  27. def __init__(self,dir_,data_part,normalization,num_nan_policy,cat_nan_policy,
  28. cat_policy,seed,
  29. y_poicy=None,cat_min_frequency=0
  30. ):
  31. super(CustomDataset,self).__init__()
  32. dir_ = Path(dir_)
  33. def load(item) -> ArrayDict:
  34. return {
  35. x: ty.cast(np.ndarray, np.load(dir_ / f'{item}_{x}.npy')) # type: ignore[code]
  36. for x in ['train', 'val', 'test']
  37. }
  38. self.N = load('N') if dir_.joinpath('N_train.npy').exists() else None
  39. self.C = load('C') if dir_.joinpath('C_train.npy').exists() else None
  40. self.y = load('y')
  41. self.info = json.loads((dir_ / 'info.json').read_text())
  42. #pre-process
  43. cache_path = f"build_dataset_{normalization}__{num_nan_policy}__{cat_nan_policy}__{cat_policy}__{seed}.pickle"
  44. if cat_min_frequency>0:
  45. cache_path = cache_path.replace('.pickle', f'__{cat_min_frequency}.pickle')
  46. cache_path = Path(cache_path)
  47. if cache_path.exists():
  48. print("Using cache")
  49. with open(cache_path, 'rb') as f:
  50. data = pickle.load(f)
  51. self.x = data
  52. else:
  53. def save_result(x):
  54. if cache_path:
  55. with open(cache_path, 'wb') as f:
  56. pickle.dump(x, f)
  57. if self.N:
  58. N = deepcopy(self.N)
  59. num_nan_masks = {k: np.isnan(v) for k, v in N.items()}
  60. if any(x.any() for x in num_nan_masks.values()): # type: ignore[code]
  61. if num_nan_policy == 'mean':
  62. num_new_values = np.nanmean(self.N['train'], axis=0)
  63. else:
  64. raise ValueError('Unknown numerical NaN policy')
  65. for k, v in N.items():
  66. num_nan_indices = np.where(num_nan_masks[k])
  67. v[num_nan_indices] = np.take(num_new_values, num_nan_indices[1])
  68. if normalization:
  69. N = normalize(N, normalization, seed)
  70. else:
  71. N = None
  72. C = deepcopy(self.C)
  73. cat_nan_masks = {k: v == 'nan' for k, v in C.items()}
  74. if any(x.any() for x in cat_nan_masks.values()): # type: ignore[code]
  75. if cat_nan_policy == 'new':
  76. cat_new_value = '___null___'
  77. imputer = None
  78. elif cat_nan_policy == 'most_frequent':
  79. cat_new_value = None
  80. imputer = SimpleImputer(strategy=cat_nan_policy) # type: ignore[code]
  81. imputer.fit(C['train'])
  82. else:
  83. raise ValueError('Unknown categorical NaN policy')
  84. if imputer:
  85. C = {k: imputer.transform(v) for k, v in C.items()}
  86. else:
  87. for k, v in C.items():
  88. cat_nan_indices = np.where(cat_nan_masks[k])
  89. v[cat_nan_indices] = cat_new_value
  90. if cat_min_frequency:
  91. C = ty.cast(ArrayDict, C)
  92. min_count = round(len(C['train']) * cat_min_frequency)
  93. rare_value = '___rare___'
  94. C_new = {x: [] for x in C}
  95. for column_idx in range(C['train'].shape[1]):
  96. counter = Counter(C['train'][:, column_idx].tolist())
  97. popular_categories = {k for k, v in counter.items() if v >= min_count}
  98. for part in C_new:
  99. C_new[part].append(
  100. [
  101. (x if x in popular_categories else rare_value)
  102. for x in C[part][:, column_idx].tolist()
  103. ]
  104. )
  105. C = {k: np.array(v).T for k, v in C_new.items()}
  106. unknown_value = np.iinfo('int64').max - 3
  107. encoder = sklearn.preprocessing.OrdinalEncoder(
  108. handle_unknown='use_encoded_value', # type: ignore[code]
  109. unknown_value=unknown_value, # type: ignore[code]
  110. dtype='int64', # type: ignore[code]
  111. ).fit(C['train'])
  112. C = {k: encoder.transform(v) for k, v in C.items()}
  113. max_values = C['train'].max(axis=0)
  114. for part in ['val', 'test']:
  115. for column_idx in range(C[part].shape[1]):
  116. C[part][C[part][:, column_idx] == unknown_value, column_idx] = (
  117. max_values[column_idx] + 1
  118. )
  119. if cat_policy == 'indices':
  120. result = (N, C)
  121. elif cat_policy == 'ohe':
  122. ohe = sklearn.preprocessing.OneHotEncoder(
  123. handle_unknown='ignore', sparse=False, dtype='float32' # type: ignore[code]
  124. )
  125. ohe.fit(C['train'])
  126. C = {k: ohe.transform(v) for k, v in C.items()}
  127. result = (N, C)
  128. #result = C if N is None else {x: np.hstack((N[x], C[x])) for x in N}
  129. elif cat_policy == 'counter':
  130. assert seed is not None
  131. loo = LeaveOneOutEncoder(sigma=0.1, random_state=seed, return_df=False)
  132. loo.fit(C['train'], self.y['train'])
  133. C = {k: loo.transform(v).astype('float32') for k, v in C.items()} # type: ignore[code]
  134. if not isinstance(C['train'], np.ndarray):
  135. C = {k: v.values for k, v in C.items()} # type: ignore[code]
  136. if normalization:
  137. C = normalize(C, normalization, seed)  # the normalize() defined above takes no 'inplace' argument
  138. result = (N, C)
  139. #result = C if N is None else {x: np.hstack((N[x], C[x])) for x in N}
  140. else:
  141. raise ValueError('Unknown categorical policy')
  142. save_result(result)
  143. self.x = result
  144. self.X_num,self.X_cat = self.x
  145. self.X_num = None if self.X_num is None else self.X_num[data_part]
  146. self.X_cat = None if self.X_cat is None else self.X_cat[data_part]
  147. # build Y
  148. if self.info['task_type'] == 'regression':
  149. assert y_poicy == 'mean_std'
  150. y = deepcopy(self.y)
  151. if y_poicy:
  152. if not self.info['task_type'] == 'regression':
  153. warnings.warn('y_policy is not None, but the task is NOT regression')
  154. info = None
  155. elif y_poicy == 'mean_std':
  156. mean, std = self.y['train'].mean(), self.y['train'].std()
  157. y = {k: (v - mean) / std for k, v in y.items()}
  158. info = {'policy': y_poicy, 'mean': mean, 'std': std}
  159. else:
  160. raise ValueError('Unknown y policy')
  161. else:
  162. info = None
  163. self.y = y[data_part]
  164. if len(self.y.shape)==1:
  165. self.y = self.y.reshape((self.y.shape[0],1))
  166. self.y_info = info
  167. def __len__(self):
  168. X = self.X_num if self.X_num is not None else self.X_cat
  169. return len(X)
  170. def __getitem__(self,idx):
  171. return torch.FloatTensor(self.X_num[idx]).to(device),torch.IntTensor(self.X_cat[idx]).to(device),torch.FloatTensor(self.y[idx]).to(device)
  172. ## Read the config files
  173. import pytomlpp as toml
  174. xgboost_config = toml.load("xgboost.toml")
  175. lightGBM_config = toml.load("lightgbm.toml")
  176. ft_transformer_config = toml.load("FT_TRANSFORMER.toml")
  177. ft_transformer_mine_config = toml.load("FT_TRANSFORMER_MINE.toml")
  178. resNet_config = toml.load("resnet.toml")
  179. LassoNet_config = toml.load("LassoNet.toml")
  180. def needs_wd(name):
  181. return all(x not in name for x in ['tokenizer', '.norm', '.bias'])
  182. def make_optimizer(
  183. optimizer: str,
  184. parameter_groups,
  185. lr: float,
  186. weight_decay: float,
  187. ) -> optim.Optimizer:
  188. Optimizer = {
  189. 'adam': optim.Adam,
  190. 'adamw': optim.AdamW,
  191. 'sgd': optim.SGD,
  192. }[optimizer]
  193. momentum = (0.9,) if Optimizer is optim.SGD else ()
  194. return Optimizer(parameter_groups, lr, *momentum, weight_decay=weight_decay)
  195. def train_model_xgboost(model,fit_kwargs,dataset_train,dataset_valid,dataset_test,seed):
  196. model_state_dict_path = os.path.join(os.getcwd(),f"xgboost_state_seed_{seed}.pickle")
  197. model_result_records = "xgboost_Result.txt"
  198. feature_importance_record_path = f"xgboost_feature_importance_{seed}.npy"
  199. if os.path.exists(model_state_dict_path):
  200. return
  201. X_train = dataset_train.X_cat if dataset_train.X_num is None else np.hstack((dataset_train.X_num, dataset_train.X_cat))
  202. Y_train = dataset_train.y
  203. X_valid = dataset_valid.X_cat if dataset_valid.X_num is None else np.hstack((dataset_valid.X_num, dataset_valid.X_cat))
  204. Y_valid = dataset_valid.y
  205. X_test = dataset_test.X_cat if dataset_test.X_num is None else np.hstack((dataset_test.X_num, dataset_test.X_cat))
  206. Y_test = dataset_test.y
  207. fit_kwargs['eval_set'] = [(X_valid,Y_valid)]
  208. model.fit(X_train, Y_train, **fit_kwargs)
  209. prediction = model.predict(X_test)
  210. result = skm.classification_report(Y_test, prediction, output_dict=True)
  211. model.save_model(model_state_dict_path)
  212. recall = result["1"]["recall"]
  213. acc = result['accuracy']
  214. with open(model_result_records,"a") as f:
  215. f.write(f"seed{seed} accuracy is:{acc} and the recall is :{recall}\n")
  216. np.save(feature_importance_record_path, model.feature_importances_)
  217. def train_model_lightGBM(model,fit_kwargs,dataset_train,dataset_valid,dataset_test,seed):
  218. model_state_dict_path = os.path.join(os.getcwd(),f"lightGBM_state_seed_{seed}.pickle")
  219. model_result_records = "lightGBM_Result.txt"
  220. feature_importance_record_path = f"lightGBM_feature_importance_{seed}.npy"
  221. if os.path.exists(model_state_dict_path):
  222. return
  223. X_train = dataset_train.X_cat if dataset_train.X_num is None else np.hstack((dataset_train.X_num, dataset_train.X_cat))
  224. Y_train = dataset_train.y
  225. X_valid = dataset_valid.X_cat if dataset_valid.X_num is None else np.hstack((dataset_valid.X_num, dataset_valid.X_cat))
  226. Y_valid = dataset_valid.y
  227. X_test = dataset_test.X_cat if dataset_test.X_num is None else np.hstack((dataset_test.X_num, dataset_test.X_cat))
  228. Y_test = dataset_test.y
  229. n_num_features = dataset_train.X_num.shape[1]
  230. n_features = dataset_train.X_num.shape[1]+dataset_train.X_cat.shape[1]
  231. fit_kwargs['categorical_feature'] = list(range(n_num_features, n_features))
  232. model.fit(X_train, Y_train, **fit_kwargs,eval_set=(X_valid, Y_valid))
  233. prediction = model.predict(X_test)
  234. result = skm.classification_report(Y_test, prediction, output_dict=True)
  235. recall = result["1"]["recall"]
  236. acc = result['accuracy']
  237. # joblib.dump(model, model_state_dict_path)
  238. with open(model_result_records,"a") as f:
  239. f.write(f"seed{seed} accuracy is:{acc} and the recall is :{recall}\n")
  240. np.save(feature_importance_record_path, model.feature_importances_)
  241. def train_model(model,config,dl_train,dl_valid,dl_test,seed,model_type,is_mine=False):
  242. model_state_dict_path = os.path.join(os.getcwd(),f"{model_type}_state_seed_{seed}.pickle")
  243. model_result_records = f"{model_type}_Result.txt"
  244. if is_mine:
  245. model_state_dict_path = model_state_dict_path.replace(".pickle","_mine.pickle")
  246. model_result_records = model_result_records.replace(".txt","_mine.txt")
  247. if os.path.exists(model_state_dict_path):
  248. return
  249. parameters_with_wd = [v for k, v in model.named_parameters() if needs_wd(k)]
  250. parameters_without_wd = [v for k, v in model.named_parameters() if not needs_wd(k)]
  251. loss_fn = F.binary_cross_entropy_with_logits
  252. optimizer = make_optimizer(
  253. config["training"]["optimizer"],
  254. (
  255. [
  256. {'params': parameters_with_wd},
  257. {'params': parameters_without_wd, 'weight_decay': 0.0},
  258. ]
  259. ),
  260. config["training"]["lr"],#to be trained in optuna
  261. config["training"]["weight_decay"]#to be trained in optuna
  262. )
  263. loss_best = np.nan
  264. best_epoch = -1
  265. patience = 0
  266. def save_state():
  267. torch.save(
  268. model.state_dict(),model_state_dict_path
  269. )
  270. #dl_train.batch_size=config['training']['batch_size']
  271. for epoch in range(config['training']['n_epochs']):
  272. model.train()
  273. for i,(x_num,x_cat,y) in enumerate(dl_train):
  274. optimizer.zero_grad()
  275. y_batch = model(x_num,x_cat)
  276. loss = loss_fn(y_batch.reshape((y_batch.shape[0],1)),y)
  277. loss.backward()
  278. optimizer.step()
  279. model.eval()
  280. with torch.no_grad():
  281. for i,(x_num,x_cat,y) in enumerate(dl_valid): # a single iteration: the validation loader uses the whole split as one batch
  282. y_batch = model(x_num,x_cat)
  283. loss = loss_fn(y_batch.reshape((y_batch.shape[0],1)),y)
  284. new_loss = loss.detach()
  285. if np.isnan(loss_best) or new_loss.cpu().numpy() < loss_best:
  286. patience = 0
  287. best_epoch = epoch
  288. loss_best = new_loss.cpu().numpy()
  289. save_state()
  290. else:
  291. patience+=1
  292. if patience>= config['training']['patience']:
  293. break
  294. # load the best state_dict saved during training
  295. model.load_state_dict(
  296. torch.load(model_state_dict_path)
  297. )
  298. model.eval()
  299. with torch.no_grad():
  300. for i,(x_num,x_cat,y) in enumerate(dl_test): # a single iteration: the test loader uses the whole split as one batch
  301. y_batch = model(x_num,x_cat)
  302. prediction = y_batch.cpu().numpy()
  303. prediction = np.round(scipy.special.expit(prediction)).astype('int64')
  304. result = skm.classification_report(y.cpu().numpy(), prediction, output_dict=True)
  305. recall = result["1.0"]["recall"]
  306. acc = result['accuracy']
  307. with open(model_result_records,"a") as f:
  308. f.write(f"seed{seed} accuracy is:{acc} and the recall is :{recall}\n")
  309. def train_model_LassoNet(model,config,dl_train,dl_valid,dl_test,seed):
  310. model_state_dict_path = os.path.join(os.getcwd(),f"lassoNet_state_seed_{seed}.pickle")
  311. model_result_records = f"lassoNet_Result.txt"
  312. if os.path.exists(model_state_dict_path):
  313. return
  314. parameters_with_wd = [v for k, v in model.named_parameters() if needs_wd(k)]
  315. parameters_without_wd = [v for k, v in model.named_parameters() if not needs_wd(k)]
  316. loss_fn = F.binary_cross_entropy_with_logits
  317. optimizer = make_optimizer(
  318. config["training"]["optimizer"],
  319. (
  320. [
  321. {'params': parameters_with_wd},
  322. {'params': parameters_without_wd, 'weight_decay': 0.0},
  323. ]
  324. ),
  325. config["training"]["lr"],#to be trained in optuna
  326. config["training"]["weight_decay"]#to be trained in optuna
  327. )
  328. loss_best = np.nan
  329. best_epoch = -1
  330. patience = 0
  331. def save_state():
  332. torch.save(
  333. model.state_dict(),model_state_dict_path
  334. )
  335. #dl_train.batch_size=config['training']['batch_size']
  336. for epoch in range(config['training']['n_epochs']):
  337. model.train()
  338. for i,(x_num,x_cat,y) in enumerate(dl_train):
  339. optimizer.zero_grad()
  340. # y_batch = model(x_num,x_cat)
  341. loss = 0
  342. def closure():
  343. nonlocal loss
  344. optimizer.zero_grad()
  345. ans = (
  346. loss_fn(model(x_num,x_cat).reshape((-1, 1)), y)  # reshape to match y, consistent with the validation loss below
  347. + model.gamma * model.l2_regularization()
  348. + model.gamma_skip * model.l2_regularization_skip()
  349. )
  350. ans.backward() # compute the gradient of the loss (line 7 of the LassoNet training algorithm)
  351. loss += ans.item()# * len(batch) / n_train
  352. return ans
  353. optimizer.step(closure)
  354. model.prox(lambda_=model.lambda_ * optimizer.param_groups[0]["lr"], M=model.M) # Hier-Prox proximal update
  355. model.eval()
  356. with torch.no_grad():
  357. for i,(x_num,x_cat,y) in enumerate(dl_valid): # a single iteration: the validation loader uses the whole split as one batch
  358. y_batch = model(x_num,x_cat)#.reshape((y_batch.shape[0],1))
  359. loss = (
  360. loss_fn(y_batch.reshape((y_batch.shape[0],1)), y).item()
  361. + model.gamma * model.l2_regularization().item()
  362. + model.gamma_skip * model.l2_regularization_skip().item()
  363. + model.lambda_ * model.l1_regularization_skip().item()
  364. )
  365. new_loss = loss  # loss is already a plain float here (a sum of .item() terms), so no .detach() is needed
  366. if np.isnan(loss_best) or new_loss < loss_best:
  367. patience = 0
  368. best_epoch = epoch
  369. loss_best = new_loss
  370. save_state()
  371. else:
  372. patience+=1
  373. if patience>= config['training']['patience']:
  374. break
  375. # load the best state_dict saved during training
  376. model.load_state_dict(
  377. torch.load(model_state_dict_path)
  378. )
  379. model.eval()
  380. with torch.no_grad():
  381. for i,(x_num,x_cat,y) in enumerate(dl_test): # a single iteration: the test loader uses the whole split as one batch
  382. y_batch = model(x_num,x_cat)
  383. prediction = y_batch.cpu().numpy()
  384. prediction = np.round(scipy.special.expit(prediction)).astype('int64')
  385. result = skm.classification_report(y.cpu().numpy(), prediction, output_dict=True)
  386. recall = result["1.0"]["recall"]
  387. acc = result['accuracy']
  388. with open(model_result_records,"a") as f:
  389. f.write(f"seed{seed} accuracy is:{acc} and the recall is :{recall}\n")
  390. import sklearn.metrics as skm
  391. ## Set up the training parameters
  392. def train_by_model(configs,seed,model_type):
  393. zero.set_randomness(seed)
  394. def build_dataloaders(configs):
  395. data_configs = configs["data"]
  396. D_train = CustomDataset(
  397. data_path_father+"adult"
  398. ,data_part="train"
  399. ,normalization=data_configs.get("normalization")
  400. ,num_nan_policy="mean"
  401. ,cat_nan_policy="new"
  402. ,cat_policy=data_configs.get("cat_policy", 'indices')
  403. ,seed=seed
  404. ,y_poicy=data_configs.get("y_policy"),cat_min_frequency=0
  405. )
  406. D_valid = CustomDataset(
  407. data_path_father+"adult"
  408. ,data_part="val"
  409. ,normalization=data_configs.get("normalization")
  410. ,num_nan_policy="mean"
  411. ,cat_nan_policy="new"
  412. ,cat_policy=data_configs.get("cat_policy", 'indices')
  413. ,seed=seed
  414. ,y_poicy=data_configs.get("y_policy"),cat_min_frequency=0
  415. )
  416. D_test = CustomDataset(
  417. data_path_father+"adult"
  418. ,data_part="test"
  419. ,normalization=data_configs.get("normalization")
  420. ,num_nan_policy="mean"
  421. ,cat_nan_policy="new"
  422. ,cat_policy=data_configs.get("cat_policy", 'indices')
  423. ,seed=seed
  424. ,y_poicy=data_configs.get("y_policy"),cat_min_frequency=0
  425. )
  426. dl_train = DataLoader(D_train,batch_size=256)
  427. dl_val = DataLoader(D_valid,batch_size=len(D_valid))
  428. dl_test = DataLoader(D_test,batch_size=len(D_test))
  429. return D_train,D_valid,D_test,dl_train,dl_val,dl_test
  430. D_train,D_valid,D_test,dl_train,dl_valid,dl_test = build_dataloaders(configs)
  431. if "FT_Transformer" in model_type:
  432. configs["model"].setdefault('token_bias', True)
  433. configs["model"].setdefault('kv_compression', None)
  434. configs["model"].setdefault('kv_compression_sharing', None)
  435. model = Transformer(d_numerical=0 if D_train.X_num is None else D_train.X_num.shape[1],
  436. categories = None if D_train.X_cat is None else [len(set(D_train.X_cat[:, i].tolist())) for i in range(D_train.X_cat.shape[1])],
  437. d_out=D_train.info['n_classes'] if D_train.info["task_type"]=="multiclass" else 1
  438. ,**configs['model']).to(device)
  439. is_mine = model_type.endswith("_mine")
  440. train_model(model,configs,dl_train,dl_valid,dl_test,seed,"FT_Transformer",is_mine=is_mine)
  441. elif model_type == "LassoNet":
  442. model = LassoNet(
  443. d_numerical=0 if D_train.X_num is None else D_train.X_num.shape[1],
  444. categories = None if D_train.X_cat is None else [len(set(D_train.X_cat[:, i].tolist())) for i in range(D_train.X_cat.shape[1])],
  445. d_out=D_train.info['n_classes'] if D_train.info["task_type"]=="multiclass" else 1
  446. ,**configs['model']
  447. ).to(device)
  448. train_model_LassoNet(model,configs,dl_train,dl_valid,dl_test,seed)
  449. elif model_type == "ResNet":
  450. model = ResNet(
  451. d_numerical=0 if D_train.X_num is None else D_train.X_num.shape[1],
  452. categories = None if D_train.X_cat is None else [len(set(D_train.X_cat[:, i].tolist())) for i in range(D_train.X_cat.shape[1])],
  453. d_out=D_train.info['n_classes'] if D_train.info["task_type"]=="multiclass" else 1,
  454. **configs['model'],
  455. ).to(device)
  456. train_model(model,configs,dl_train,dl_valid,dl_test,seed,"ResNet",is_mine=False)
  457. elif model_type == "xgboost":
  458. fit_kwargs = deepcopy(configs["fit"])
  459. configs["model"]['random_state'] = seed
  460. fit_kwargs['eval_metric'] = 'error'
  461. model = XGBClassifier(**configs["model"])
  462. train_model_xgboost(model,fit_kwargs,D_train,D_valid,D_test,seed)
  463. elif model_type == "lightGBM":
  464. model_kwargs = deepcopy(configs['model'])
  465. model_kwargs['random_state'] = seed
  466. fit_kwargs = deepcopy(configs['fit'])
  467. early_stop_rounds = fit_kwargs.get("early_stopping_rounds")
  468. del fit_kwargs["early_stopping_rounds"]
  469. del fit_kwargs["verbose"]
  470. fit_kwargs['eval_metric'] = 'binary_error'
  471. ES = early_stopping(early_stop_rounds)
  472. verbose = log_evaluation(10**8)
  473. fit_kwargs['callbacks'] = [ES, verbose]  # callbacks are fit() arguments in the LightGBM sklearn API, not constructor arguments
  474. model = LGBMClassifier(**model_kwargs)
  475. train_model_lightGBM(model,fit_kwargs,D_train,D_valid,D_test,seed)
  476. else:
  477. raise ValueError("model_type not recognized")
  478. seeds=[6368,1658,8366,8641,7052,7600,297,5829,9295,1698,2157,3318,8312,7741,9570]
  479. for i,seed in enumerate(seeds):
  480. train_by_model(xgboost_config,seed,"xgboost")
  481. train_by_model(lightGBM_config,seed,"lightGBM")
  482. train_by_model(ft_transformer_config,seed,"FT_Transformer")
  483. train_by_model(ft_transformer_mine_config,seed,"FT_Transformer_mine")
  484. train_by_model(resNet_config,seed,"ResNet")
  485. train_by_model(LassoNet_config,seed,"LassoNet")
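Each training function above only appends one line per seed to its `*_Result.txt` file, in the format `seed{seed} accuracy is:{acc} and the recall is :{recall}`, so the final comparison still has to be assembled from those files. The snippet below is a minimal sketch of that aggregation; it is my addition, not part of the original notebook, and the file names are assumed to follow what the functions above write.

    import re
    import numpy as np

    # Matches the exact line format written by the training functions above.
    pattern = re.compile(r"seed(\d+) accuracy is:([\d.]+) and the recall is :([\d.]+)")

    def summarize(result_file):
        accs, recalls = [], []
        with open(result_file) as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    accs.append(float(m.group(2)))
                    recalls.append(float(m.group(3)))
        return np.mean(accs), np.std(accs), np.mean(recalls), np.std(recalls)

    # File names follow the conventions used above (train_model_LassoNet writes
    # "lassoNet_Result.txt" with a lower-case l).
    result_files = [
        "xgboost_Result.txt",
        "lightGBM_Result.txt",
        "FT_Transformer_Result.txt",
        "FT_Transformer_Result_mine.txt",
        "ResNet_Result.txt",
        "lassoNet_Result.txt",
    ]
    for path in result_files:
        acc_mean, acc_std, rec_mean, rec_std = summarize(path)
        print(f"{path}: accuracy {acc_mean:.4f} ± {acc_std:.4f}, recall {rec_mean:.4f} ± {rec_std:.4f}")

This only reports the mean ± std of the per-seed metrics; averaging predictions within the three groups of five seeds, as described in the experimental setup, would additionally require reloading the saved state_dicts and averaging the per-example probabilities.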
