赞
踩
adam优化器是经常使用到的模型训练时的优化器,但是在bert的训练中不起作用,具体表现是,模型的f1上不来。
transformers
库实现了基于权重衰减的优化器,AdamW
,这个优化器初始化时有6个参数,第一个是params
,可以是torch的Parameter,也可以是一个grouped参数。betas是Adam的beta参数,b1和b2。eps也是Adam为了数值稳定的参数。correct_bias,如果应用到tf的模型上时需要设置为False
class AdamW(Optimizer): """ Implements Adam algorithm with weight decay fix as introduced in `Decoupled Weight Decay Regularization <https://arxiv.org/abs/1711.05101>`__. Parameters: params (:obj:`Iterable[torch.nn.parameter.Parameter]`): Iterable of parameters to optimize or dictionaries defining parameter groups. lr (:obj:`float`, `optional`, defaults to 1e-3): The learning rate to use. betas (:obj:`Tuple[float,float]`, `optional`, defaults to (0.9, 0.999)): Adam's betas parameters (b1, b2). eps (:obj:`float`, `optional`, defaults to 1e-6): Adam's epsilon for numerical stability. weight_decay (:obj:`float`, `optional`, defaults to 0): Decoupled weight decay to apply. correct_bias (:obj:`bool`, `optional`, defaults to `True`): Whether ot not to correct bias in Adam (for instance, in Bert TF repository they use :obj:`False`). """ def __init__( self, params: Iterable[torch.nn.parameter.Parameter], lr: float = 1e-3, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-6, weight_decay: float = 0.0, correct_bias: bool = True, ):
# Parameters: lr = 1e-3 max_grad_norm = 1.0 num_training_steps = 1000 num_warmup_steps = 100 warmup_proportion = float(num_warmup_steps) / float(num_training_steps) # 0.1 ### Previously BertAdam optimizer was instantiated like this: optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_training_steps) ### and used like this: for batch in train_data: loss = model(batch) loss.backward() optimizer.step() ### In Transformers, optimizer and schedules are splitted and instantiated like this: optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False) # To reproduce BertAdam specific behavior set correct_bias=False scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps) # PyTorch scheduler ### and used like this: for batch in train_data: model.train() loss = model(batch) loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm) # Gradient clipping is not in AdamW anymore (so you can use amp without issue) optimizer.step() scheduler.step() optimizer.zero_grad()
https://github.com/huggingface/transformers
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。