
Implementing SGD, Adam, and RMSprop Optimizers by Hand in Python

Let's get straight to the point. The three optimizers are easy to understand when read alongside their update formulas:

  1. SGD
     $$v_t = \alpha v_{t-1} - (1 - \alpha)\, g_t, \qquad x_t = x_{t-1} + \eta\, v_t$$
     Here $\eta$ is the chosen step size (the same below), and $\alpha$ is a smoothing factor, which can be understood as how much of the previous step's velocity is retained.

  2. RMSprop
     $$s_t = \gamma s_{t-1} + (1 - \gamma)\, g_t^2, \qquad x_t = x_{t-1} - \eta\, g_t / (\sqrt{s_t} + \epsilon)$$
     Its defining feature is using the second moment of the gradients to adapt the step size.

  3. Adam
     $$m_t = (1 - \beta_1)\, g_t + \beta_1 m_{t-1}, \qquad \hat{m}_t = m_t / (1 - \beta_1^t)$$
     $$v_t = (1 - \beta_2)\, g_t^2 + \beta_2 v_{t-1}, \qquad \hat{v}_t = v_t / (1 - \beta_2^t)$$
     $$x_t = x_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$$
     (A worked one-step example follows this list.)
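
Before the code, here is a small hand-computed sketch of Adam's very first step ($t = 1$) on a scalar parameter, which makes the role of the bias correction concrete. The numbers (g = 2.0, η = 0.001, β1 = 0.9, β2 = 0.999) are toy values of my own, not anything prescribed by the formulas above.

# One Adam step at t = 1, worked out by hand on a scalar (toy values).
g, eta, b1, b2, eps = 2.0, 0.001, 0.9, 0.999, 1e-8

m = (1 - b1) * g            # 0.2   -- raw first moment, shrunk toward 0
v = (1 - b2) * g**2         # 0.004 -- raw second moment, shrunk even more
mhat = m / (1 - b1**1)      # 2.0   -- bias correction recovers the full gradient
vhat = v / (1 - b2**1)      # 4.0   -- likewise recovers g**2
step = eta * mhat / (vhat**0.5 + eps)
print(step)                 # ~0.001: mhat / sqrt(vhat) is roughly sign(g), so the step is about eta

The point is that after the correction, $\hat{m}_t / \sqrt{\hat{v}_t}$ is roughly $\mathrm{sign}(g)$, so the magnitude of the first update is about $\eta$ rather than something distorted by the zero-initialized moment estimates.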

import numpy as np

@unflatten_optimizer_step
def sgd_step(value_and_grad, x, itr, state=None, step_size=0.1, mass=0.9):
    # Stochastic gradient descent with momentum.
    velocity = state if state is not None else np.zeros(len(x))
    val, g = value_and_grad(x)
    velocity = mass * velocity - (1.0 - mass) * g
    x = x + step_size * velocity
    return x, val, g, velocity

@unflatten_optimizer_step
def rmsprop_step(value_and_grad, x, itr, state=None, step_size=0.1, gamma=0.9, eps=10**-8):
    # Root mean squared prop: divide the step by a running average of squared gradients (Hinton, Coursera lecture 6e).
    avg_sq_grad = np.ones(len(x)) if state is None else state
    val, g = value_and_grad(x)
    avg_sq_grad = avg_sq_grad * gamma + g**2 * (1 - gamma)
    x = x - (step_size * g) / (np.sqrt(avg_sq_grad) + eps)
    return x, val, g, avg_sq_grad


@unflatten_optimizer_step
def adam_step(value_and_grad, x, itr, state=None, step_size=0.001, b1=0.9, b2=0.999, eps=10**-8):
    """
    Adam as described in http://arxiv.org/pdf/1412.6980.pdf.
    It's basically RMSprop with momentum and some correction terms.
    """
    m, v = (np.zeros(len(x)), np.zeros(len(x))) if state is None else state
    val, g = value_and_grad(x)
    m = (1 - b1) * g      + b1 * m    # First  moment estimate.
    v = (1 - b2) * (g**2) + b2 * v    # Second moment estimate.
    mhat = m / (1 - b1**(itr + 1))    # Bias correction.
    vhat = v / (1 - b2**(itr + 1))
    x = x - (step_size * mhat) / (np.sqrt(vhat) + eps)
    return x, val, g, (m, v)

Notes:

  1. The first argument of each step function is itself a function: it returns the value of the cost (loss) function and its gradient at the current value of the optimization variable x (a usage sketch follows this list);
  2. the state argument carries state from the previous step, such as the velocity or the moment estimates;
  3. each step function is wrapped with the decorator unflatten_optimizer_step, whose job is to guarantee that the arrays seen inside the step function are 1-D. If the parameter being trained is a matrix, the decorator flattens it to 1-D first and restores the original shape when the function returns.
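
As a concrete illustration of this interface, here is a minimal usage sketch (my own toy example, not from the original post): minimizing f(x) = sum(x**2) with sgd_step, threading the returned state back in on each iteration. The objective, starting point, and iteration count are all assumed for demonstration.

import numpy as np

def value_and_grad(x):
    # Toy objective f(x) = sum(x**2); its gradient is 2*x.
    return np.sum(x**2), 2 * x

x = np.array([1.0, -2.0, 3.0])
state = None                      # no velocity yet on the first call
for itr in range(200):
    x, val, g, state = sgd_step(value_and_grad, x, itr, state=state, step_size=0.1)

print(x)                          # very close to the zero vector by now

The same loop works unchanged with rmsprop_step or adam_step, since all three share the same signature and the same convention of passing the previous state back in.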

The implementation details of the decorator are as follows:

import numpy as np
from functools import partial

def flatten(x):
    # Return a flattened copy of x plus a function that restores its original shape.
    original_shape = x.shape
    return x.flatten(), partial(np.reshape, newshape=original_shape)

def unflatten_optimizer_step(step):
    """
    Wrap an optimizer step function that operates on flat 1-D arrays
    with a version that accepts (and returns) parameters of any array shape.
    """
    # The wrapped step function.
    def _step(value_and_grad, x, itr, state=None, *args, **kwargs):
        _x, unflatten = flatten(x)

        def _value_and_grad(x):
            # Restore the original shape before calling the user's function,
            # then flatten the returned gradient again.
            v, g = value_and_grad(unflatten(x))
            return v, flatten(g)[0]

        _next_x, _next_val, _next_g, _next_state = step(_value_and_grad, _x, itr, state=state, *args, **kwargs)
        return unflatten(_next_x), _next_val, _next_g, _next_state

    return _step
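
To see the decorator doing its job, here is a short sketch (again my own toy example) where the parameter is a 2x3 matrix. The wrapped adam_step accepts and returns the matrix shape unchanged while internally operating on a flat length-6 vector.

import numpy as np

def value_and_grad(W):
    # Toy objective f(W) = sum(W**2) on a matrix-valued parameter; gradient is 2*W.
    return np.sum(W**2), 2 * W

W = np.arange(6, dtype=float).reshape(2, 3)
state = None
for itr in range(100):
    W, val, g, state = adam_step(value_and_grad, W, itr, state=state, step_size=0.05)

print(W.shape)                    # (2, 3) -- the original shape is preserved

Note that the gradient returned by the wrapped step is the flattened one used internally; only the parameter itself is restored to its original shape.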

The flatten function was introduced in an earlier post: it returns the flattened 1-D array together with a function that restores the original shape.
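
A quick check of that behaviour (assumed 2x2 example):

import numpy as np

a = np.arange(4).reshape(2, 2)
flat, unflatten = flatten(a)
print(flat)                # [0 1 2 3] -- a 1-D copy of the data
print(unflatten(flat))     # reshaped back to the original 2x2 array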

This implementation is quite elegant and well worth a careful read.
