The gradient clipping function (you could also call it a form of gradient normalization) is used to prevent exploding gradients, and it is most commonly used when training RNNs.
Here is the code first:
def clip_norm(g, c, n):
    if c <= 0:  # if clipnorm == 0 no need to add ops to the graph
        return g

    # tf require using a special op to multiply IndexedSliced by scalar
    if K.backend() == 'tensorflow':
        condition = n >= c
        then_expression = tf.scalar_mul(c / n, g)
        else_expression = g

        # saving the shape to avoid converting sparse tensor to dense
        if isinstance(then_expression, tf.Tensor):
            g_shape = copy.copy(then_expression.get_shape())
        elif isinstance(then_expression, tf.IndexedSlices):
            g_shape = copy.copy(then_expression.dense_shape)
        if condition.dtype != tf.bool:
            condition = tf.cast(condition, 'bool')
        g = tf.cond(condition,
                    lambda: then_expression,
                    lambda: else_expression)
        if isinstance(then_expression, tf.Tensor):
            g.set_shape(g_shape)
        elif isinstance(then_expression, tf.IndexedSlices):
            g._dense_shape = g_shape
    else:
        g = K.switch(K.greater_equal(n, c), g * c / n, g)
    return g
Here g is the gradient to be clipped, c is the clipping threshold (set to guard against exploding gradients), and n is the global L2 norm of the gradients. When n >= c, the gradient is rescaled to g * c / n; otherwise it is returned unchanged.
clip_norm() is called from the Optimizer class, which is the parent class of all Keras optimizers. In its get_gradients() method, the line
norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))
computes the global L2 norm over all gradients; this is the n that gets passed to clip_norm().
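For context, here is a minimal sketch of how get_gradients() ties the global norm and clip_norm() together, paraphrased from the Keras 2.x Optimizer base class (clipnorm and clipvalue are the optional constructor arguments that switch clipping on):

def get_gradients(self, loss, params):
    grads = K.gradients(loss, params)
    if hasattr(self, 'clipnorm') and self.clipnorm > 0:
        # global L2 norm over all parameter gradients
        norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))
        grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
    if hasattr(self, 'clipvalue') and self.clipvalue > 0:
        # element-wise clipping to [-clipvalue, clipvalue]
        grads = [K.clip(g, -self.clipvalue, self.clipvalue) for g in grads]
    return grads

In user code, clipping is switched on simply by constructing an optimizer with, for example, SGD(learning_rate=0.01, clipnorm=1.0).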
The simplest form of SGD, without momentum or any other extras, looks like this:
class SGD(Optimizer):
    def __init__(self, learning_rate=0.01, **kwargs):
        learning_rate = kwargs.pop('lr', learning_rate)  # accept the legacy 'lr' argument
        super(SGD, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.learning_rate = K.variable(learning_rate, name='learning_rate')

    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]
        lr = self.learning_rate
        self.weights = [self.iterations]
        for p, g in zip(params, grads):
            new_p = p - lr * g  # plain gradient descent step: p <- p - lr * g
            self.updates.append(K.update(p, new_p))
        return self.updates
Parameters are updated using randomly sampled mini-batches (a standalone NumPy illustration follows the pros and cons below).
Advantages: the cost of a single update does not depend on the total number of training samples, which keeps the computational cost down.
Disadvantages: it is hard to choose a suitable learning rate, and every parameter in the network is updated with the same learning rate. Training can also get stuck at saddle points.
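As that standalone illustration of the plain SGD step (independent of Keras; the linear-regression data and loss below are made up purely for demonstration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)  # randomly sampled mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size    # gradient of the mini-batch MSE loss
    w -= lr * grad                                   # SGD step: w <- w - lr * grad
print(w)  # close to true_w

Each step touches only 32 samples, so its cost is independent of the dataset size, which is exactly the advantage noted above.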
Momentum can be added to vanilla SGD. Its main effects are:
1. It alleviates the problem of an ill-conditioned Hessian matrix (extreme sensitivity to small perturbations: even slight noise added to the input can cause a large change in the output).
2. It accelerates learning: a parameter update is influenced not only by the current gradient but also by the direction of the previous update.
3. The momentum coefficient is usually set to 0.5, 0.9, or 0.99; since the terminal velocity under a constant gradient is 1/(1 - momentum) times the plain SGD step, these values give a maximum speed-up of roughly 2x, 10x, and 100x over vanilla SGD.
Nesterov momentum can also be added on top; it is a variant of momentum. The difference is that Nesterov first computes the velocity, uses that velocity to take a provisional (look-ahead) step on the parameters, and only then evaluates the gradient at the look-ahead position:
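As a sketch in my own notation (\theta for the parameters, g_t for the current gradient, \mu for the momentum coefficient, \eta for the learning rate), consistent with the Keras get_updates() code shown below:

v_t = \mu v_{t-1} - \eta g_t
\theta_t = \theta_{t-1} + v_t                      (plain momentum)
\theta_t = \theta_{t-1} + \mu v_t - \eta g_t       (Nesterov variant, as implemented in Keras)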
Advantages: when the gradient direction changes, momentum damps the oscillations and slows the descent; when the gradient direction stays the same, it accelerates the parameter updates. It also helps escape local optima.
Disadvantages: it is still hard to choose a suitable learning rate.
Here is the complete parameter-update procedure (SGD with momentum and the optional Nesterov variant) in Keras:
def get_updates(self, loss, params):
    grads = self.get_gradients(loss, params)
    self.updates = [K.update_add(self.iterations, 1)]

    lr = self.learning_rate
    if self.initial_decay > 0:
        # decay the learning rate as the number of iterations grows
        lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                  K.dtype(self.decay))))
    # momentum
    shapes = [K.int_shape(p) for p in params]
    moments = [K.zeros(shape, name='moment_' + str(i))
               for (i, shape) in enumerate(shapes)]
    self.weights = [self.iterations] + moments
    for p, g, m in zip(params, grads, moments):
        v = self.momentum * m - lr * g  # compute the velocity
        self.updates.append(K.update(m, v))

        if self.nesterov:
            new_p = p + self.momentum * v - lr * g  # Nesterov look-ahead update
        else:
            new_p = p + v  # plain momentum update

        # Apply constraints.
        if getattr(p, 'constraint', None) is not None:
            new_p = p.constraint(new_p)

        self.updates.append(K.update(p, new_p))
    return self.updates
Next, AdaGrad as implemented in the Keras source:
class Adagrad(Optimizer):
    def __init__(self, learning_rate=0.01, **kwargs):
        self.initial_decay = kwargs.pop('decay', 0.0)
        self.epsilon = kwargs.pop('epsilon', K.epsilon())  # the epsilon in the update formula
        learning_rate = kwargs.pop('lr', learning_rate)
        super(Adagrad, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.learning_rate = K.variable(learning_rate, name='learning_rate')
            self.decay = K.variable(self.initial_decay, name='decay')
            self.iterations = K.variable(0, dtype='int64', name='iterations')

    @interfaces.legacy_get_updates_support
    @K.symbolic
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        shapes = [K.int_shape(p) for p in params]
        accumulators = [K.zeros(shape, name='accumulator_' + str(i))
                        for (i, shape) in enumerate(shapes)]
        self.weights = [self.iterations] + accumulators
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.learning_rate
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))

        for p, g, a in zip(params, grads, accumulators):
            new_a = a + K.square(g)  # update accumulator (V_t)
            self.updates.append(K.update(a, new_a))
            new_p = p - lr * g / (K.sqrt(new_a) + self.epsilon)  # per-parameter adaptive step

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, new_p))
        return self.updates

    # called when the optimizer state is restored, e.g. when a saved model is loaded
    def set_weights(self, weights):
        params = self.weights
        # Override set_weights for backward compatibility of Keras 2.2.4 optimizer
        # since it does not include iteration at head of the weight list. Set
        # iteration to 0.
        if len(params) == len(weights) + 1:
            weights = [np.array(0)] + weights
        super(Adagrad, self).set_weights(weights)

    def get_config(self):
        config = {'learning_rate': float(K.get_value(self.learning_rate)),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon}
        base_config = super(Adagrad, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
V_t is the accumulated sum of the squared gradients over all iterations (new_a in the code). AdaGrad divides each parameter's learning rate by the square root of this accumulated sum, so every parameter's effective learning rate adapts to its own history. The goal is to make larger progress along the flatter directions of the parameter space.
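In formulas (my notation; all operations are element-wise per parameter):

V_t = V_{t-1} + g_t^2 = \sum_{i=1}^{t} g_i^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{V_t} + \epsilon} \, g_t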
Advantages: early in training the accumulated gradients are small, so the effective learning rate is large and learning is fast. The method suits problems with sparse gradients, and every parameter gets its own adaptive learning rate.
Disadvantages: as training goes on, the accumulated squared gradients keep growing, so the effective learning rate shrinks toward zero; in the extreme, the parameters stop updating or update extremely slowly. Manually adjusting the learning rate at the right point in training therefore still matters, and the method is not well suited to non-convex problems.
RMSProp was proposed mainly to fix this problem of the learning rate decaying toward zero. The idea is not to accumulate the entire gradient history, but to consider only the gradients within a recent window, via an exponentially weighted moving average. The accumulator is then computed as follows:
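A sketch in my notation (\rho is the rho argument; the accumulator corresponds to new_a in the code below, and all operations are element-wise):

E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t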
This is the same idea behind the AdaDelta algorithm.
Advantages: it fixes AdaGrad's ineffective learning in the late stage of training and is suited to non-stationary and non-convex optimization problems.
Disadvantages: in the late stage of training it may still get trapped in a local minimum.
The Keras source:
class RMSprop(Optimizer):
    def __init__(self, learning_rate=0.001, rho=0.9, **kwargs):
        self.initial_decay = kwargs.pop('decay', 0.0)
        self.epsilon = kwargs.pop('epsilon', K.epsilon())
        learning_rate = kwargs.pop('lr', learning_rate)
        super(RMSprop, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.learning_rate = K.variable(learning_rate, name='learning_rate')
            self.rho = K.variable(rho, name='rho')
            self.decay = K.variable(self.initial_decay, name='decay')
            self.iterations = K.variable(0, dtype='int64', name='iterations')

    @interfaces.legacy_get_updates_support
    @K.symbolic
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        accumulators = [K.zeros(K.int_shape(p),
                                dtype=K.dtype(p),
                                name='accumulator_' + str(i))
                        for (i, p) in enumerate(params)]
        self.weights = [self.iterations] + accumulators
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.learning_rate
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))

        for p, g, a in zip(params, grads, accumulators):
            # update accumulator: exponential moving average of squared gradients,
            # the fix for the late-stage ("last mile") learning problem
            new_a = self.rho * a + (1. - self.rho) * K.square(g)
            self.updates.append(K.update(a, new_a))
            new_p = p - lr * g / (K.sqrt(new_a) + self.epsilon)

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, new_p))
        return self.updates

    def set_weights(self, weights):
        params = self.weights
        # Override set_weights for backward compatibility of Keras 2.2.4 optimizer
        # since it does not include iteration at head of the weight list. Set
        # iteration to 0.
        if len(params) == len(weights) + 1:
            weights = [np.array(0)] + weights
        super(RMSprop, self).set_weights(weights)

    def get_config(self):
        config = {'learning_rate': float(K.get_value(self.learning_rate)),
                  'rho': float(K.get_value(self.rho)),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon}
        base_config = super(RMSprop, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
Adam is another refinement of SGD that introduces a per-parameter adaptive learning rate. It combines the adaptive-learning-rate approach with momentum: estimates of the first and second moments of the gradients are used to dynamically adjust the learning rate of each parameter.
class Adam(Optimizer):
    """Adam optimizer.

    Default parameters follow those provided in the original paper.

    # Arguments
        learning_rate: float >= 0. Learning rate.
        beta_1: float, 0 < beta < 1. Generally close to 1.
        beta_2: float, 0 < beta < 1. Generally close to 1.
        amsgrad: boolean. Whether to apply the AMSGrad variant of this
            algorithm from the paper "On the Convergence of Adam and Beyond".

    # References
        - [Adam - A Method for Stochastic Optimization](
           https://arxiv.org/abs/1412.6980v8)
        - [On the Convergence of Adam and Beyond](
           https://openreview.net/forum?id=ryQu7f-RZ)
    """

    def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                 amsgrad=False, **kwargs):
        self.initial_decay = kwargs.pop('decay', 0.0)
        self.epsilon = kwargs.pop('epsilon', K.epsilon())
        learning_rate = kwargs.pop('lr', learning_rate)
        super(Adam, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.learning_rate = K.variable(learning_rate, name='learning_rate')
            self.beta_1 = K.variable(beta_1, name='beta_1')
            self.beta_2 = K.variable(beta_2, name='beta_2')
            self.decay = K.variable(self.initial_decay, name='decay')
        self.amsgrad = amsgrad

    @interfaces.legacy_get_updates_support
    @K.symbolic
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.learning_rate
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))

        t = K.cast(self.iterations, K.floatx()) + 1
        lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                     (1. - K.pow(self.beta_1, t)))

        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p), name='m_' + str(i))
              for (i, p) in enumerate(params)]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p), name='v_' + str(i))
              for (i, p) in enumerate(params)]

        if self.amsgrad:
            vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p), name='vhat_' + str(i))
                     for (i, p) in enumerate(params)]
        else:
            vhats = [K.zeros(1, name='vhat_' + str(i))
                     for i in range(len(params))]
        self.weights = [self.iterations] + ms + vs + vhats

        for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
            m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
            v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
            if self.amsgrad:
                vhat_t = K.maximum(vhat, v_t)
                p_t = p - lr_t * m_t / (K.sqrt(vhat_t) + self.epsilon)
                self.updates.append(K.update(vhat, vhat_t))
            else:
                p_t = p - lr_t * m_t / (K.sqrt(v_t) + self.epsilon)

            self.updates.append(K.update(m, m_t))
            self.updates.append(K.update(v, v_t))
            new_p = p_t

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, new_p))
        return self.updates

    def get_config(self):
        config = {'learning_rate': float(K.get_value(self.learning_rate)),
                  'beta_1': float(K.get_value(self.beta_1)),
                  'beta_2': float(K.get_value(self.beta_2)),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon,
                  'amsgrad': self.amsgrad}
        base_config = super(Adam, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
m_t is an exponentially decaying average of the past gradients (the momentum idea), and v_t is an exponentially decaying average of the past squared gradients. The final update is computed as follows:
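A sketch in my notation, matching the Keras code above (the code folds the bias correction into the effective learning rate lr_t instead of correcting m_t and v_t separately; the two forms are equivalent up to where \epsilon is applied):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2
\hat{\eta}_t = \eta \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}        (lr_t in the code)
\theta_{t+1} = \theta_t - \hat{\eta}_t \, \frac{m_t}{\sqrt{v_t} + \epsilon}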
Advantages: the descent is relatively smooth, and it works for most non-convex optimization problems on large datasets and in high-dimensional parameter spaces.
Disadvantages: in some cases it may fail to converge.
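The amsgrad=True branch in the code above implements the fix proposed in reference 10: the second-moment estimate is never allowed to shrink,

\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \qquad \theta_{t+1} = \theta_t - \hat{\eta}_t \, \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}

which keeps the effective per-coordinate step size from growing and addresses those non-convergence cases.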
References:
1. arXiv preprint arXiv:1906.06821v2.
2. H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
3. https://mlfromscratch.com/optimizers-explained/
4. Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)," Doklady Akademii Nauk SSSR, vol. 269, pp. 543–547, 1983.
5. J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
6. https://www.jiqizhixin.com/articles/2017-12-06
7. M. D. Zeiler, "AdaDelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
8. T. Tieleman and G. Hinton, "Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, pp. 26–31, 2012.
9. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2014, pp. 1–15.
10. S. J. Reddi, S. Kale, and S. Kumar, "On the Convergence of Adam and Beyond," in International Conference on Learning Representations (ICLR), 2018.