Let's get straight to the point. Three optimizers, each easy to understand when read against its formula:
SGD
$$v_t = \alpha v_{t-1} - (1 - \alpha)\, g_t$$
$$x_t = x_{t-1} + \eta v_t$$
where $\eta$ is the configured step size (the same below), and $\alpha$ is a smoothing factor, which can be understood as how much of the previous step's gradient information (the momentum) is retained.
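To make the recursion concrete, here is a minimal sketch that runs the two update equations on the toy objective $f(x) = x^2$ (so $g_t = 2x_t$); the values of $\eta$ and $\alpha$ are purely illustrative:

import numpy as np

# Minimal illustration of SGD with momentum on f(x) = x**2 (gradient g = 2x).
eta, alpha = 0.1, 0.9              # illustrative step size and smoothing factor
x = np.array([5.0])
v = np.zeros_like(x)

for t in range(5):
    g = 2 * x                          # g_t
    v = alpha * v - (1 - alpha) * g    # v_t = alpha * v_{t-1} - (1 - alpha) * g_t
    x = x + eta * v                    # x_t = x_{t-1} + eta * v_t
    print(t, x, v)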
RMSprop
$$s_t = \gamma s_{t-1} + (1 - \gamma)\, g_t^2$$
$$x_t = x_{t-1} - \eta\, g_t / (\epsilon + \sqrt{s_t})$$
Its main feature is that it uses a running estimate of the gradient's second moment to adapt the step size.
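The same toy objective can be pushed through the RMSprop recursion. A minimal sketch (illustrative hyperparameters, with $s_0$ initialized to ones as in the implementation further below) shows how $\sqrt{s_t}$ rescales the raw gradient step:

import numpy as np

# Minimal illustration of RMSprop on f(x) = x**2 (gradient g = 2x).
eta, gamma, eps = 0.1, 0.9, 1e-8   # illustrative hyperparameters
x = np.array([5.0])
s = np.ones_like(x)                # initialized to ones, as in the code below

for t in range(5):
    g = 2 * x                               # g_t
    s = gamma * s + (1 - gamma) * g**2      # s_t = gamma * s_{t-1} + (1 - gamma) * g_t^2
    x = x - eta * g / (np.sqrt(s) + eps)    # x_t = x_{t-1} - eta * g_t / (eps + sqrt(s_t))
    print(t, x, np.sqrt(s))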
Adam
$$m_t = (1 - \beta_1)\, g_t + \beta_1 m_{t-1}$$
$$\hat{m}_t = m_t / (1 - \beta_1^t)$$
$$v_t = (1 - \beta_2)\, g_t^2 + \beta_2 v_{t-1}$$
$$\hat{v}_t = v_t / (1 - \beta_2^t)$$
$$x_t = x_{t-1} - \eta\, \hat{m}_t / (\epsilon + \sqrt{\hat{v}_t})$$
Here $m_t$ is the first-moment (momentum) estimate of the gradient, $v_t$ is the second-moment estimate, and the hatted quantities apply bias correction; the naming matches the implementation below.
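The bias-correction terms matter mainly in the first few iterations, because $m_0 = v_0 = 0$ biases the raw moments toward zero. A small numeric sketch with a constant gradient (purely illustrative values) shows the correction recovering the true gradient exactly:

# Purely illustrative: constant gradient g = 1.0 and beta_1 = 0.9.
b1, g, m = 0.9, 1.0, 0.0
for t in range(1, 6):
    m = (1 - b1) * g + b1 * m      # raw first moment: 0.1, 0.19, 0.271, ...
    m_hat = m / (1 - b1**t)        # bias-corrected: 1.0, 1.0, 1.0, ...
    print(t, round(m, 4), round(m_hat, 4))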
The implementations of the three step functions:

import numpy as np

@unflatten_optimizer_step
def sgd_step(value_and_grad, x, itr, state=None, step_size=0.1, mass=0.9):
    # Stochastic gradient descent with momentum.
    velocity = state if state is not None else np.zeros(len(x))
    val, g = value_and_grad(x)
    velocity = mass * velocity - (1.0 - mass) * g
    x = x + step_size * velocity
    return x, val, g, velocity

@unflatten_optimizer_step
def rmsprop_step(value_and_grad, x, itr, state=None, step_size=0.1, gamma=0.9, eps=10**-8):
    # Root mean squared prop: See Adagrad paper for details.
    avg_sq_grad = np.ones(len(x)) if state is None else state
    val, g = value_and_grad(x)
    avg_sq_grad = avg_sq_grad * gamma + g**2 * (1 - gamma)
    x = x - (step_size * g) / (np.sqrt(avg_sq_grad) + eps)
    return x, val, g, avg_sq_grad

@unflatten_optimizer_step
def adam_step(value_and_grad, x, itr, state=None, step_size=0.001, b1=0.9, b2=0.999, eps=10**-8):
    """
    Adam as described in http://arxiv.org/pdf/1412.6980.pdf.
    It's basically RMSprop with momentum and some correction terms.
    """
    m, v = (np.zeros(len(x)), np.zeros(len(x))) if state is None else state
    val, g = value_and_grad(x)
    m = (1 - b1) * g + b1 * m       # First moment estimate.
    v = (1 - b2) * (g**2) + b2 * v  # Second moment estimate.
    mhat = m / (1 - b1**(itr + 1))  # Bias correction.
    vhat = v / (1 - b2**(itr + 1))
    x = x - (step_size * mhat) / (np.sqrt(vhat) + eps)
    return x, val, g, (m, v)
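To see a step function in action, here is a hypothetical driver loop (names and values chosen for illustration, assuming the definitions above and below) that minimizes a simple quadratic with adam_step, threading the returned state back into the next call:

import numpy as np

# Hypothetical usage: minimize f(x) = sum((x - 3)^2) with adam_step,
# passing the optimizer state from one iteration to the next.
def value_and_grad(x):
    diff = x - 3.0
    return np.sum(diff**2), 2.0 * diff   # (value, gradient)

x = np.zeros(4)
state = None
for itr in range(10000):
    x, val, g, state = adam_step(value_and_grad, x, itr, state=state)

print(x)   # close to [3. 3. 3. 3.] with the default step size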
Note the unflatten_optimizer_step decorator: its job is to guarantee that the gradient vectors inside each step function are 1-D arrays. If the parameter being trained is a matrix, the decorator first flattens it to 1-D and restores the original shape when the function returns. The decorator is implemented as follows:
import numpy as np
from functools import partial

def flatten(x):
    original_shape = x.shape
    return x.flatten(), partial(np.reshape, newshape=original_shape)

def unflatten_optimizer_step(step):
    """
    Wrap an optimizer step function that operates on flat 1D arrays
    with a version that handles trees of nested containers,
    i.e. (lists/tuples/dicts), with arrays/scalars at the leaves.
    """
    # The decorated step function.
    def _step(value_and_grad, x, itr, state=None, *args, **kwargs):
        _x, unflatten = flatten(x)
        def _value_and_grad(x):
            v, g = value_and_grad(unflatten(x))
            return v, flatten(g)[0]
        _next_x, _next_val, _next_g, _next_state = step(_value_and_grad, _x, itr, state=state, *args, **kwargs)
        return unflatten(_next_x), _next_val, _next_g, _next_state
    return _step
The flatten function was introduced earlier: it returns the flattened 1-D array together with a function that restores the original shape.
This implementation is quite elegant and worth studying closely!
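As a final illustrative check (a hypothetical example, assuming the definitions above), the flatten/unflatten round trip means the decorated step functions accept matrix-shaped parameters transparently:

import numpy as np

# Hypothetical check of the flatten/unflatten round trip on a 2-D parameter.
W = np.arange(6, dtype=float).reshape(2, 3)
flat, unflatten = flatten(W)
print(flat.shape)              # (6,)   -- what the step functions see internally
print(unflatten(flat).shape)   # (2, 3) -- the caller's original shape

# The decorated steps therefore accept matrix parameters directly:
def value_and_grad(W):
    return np.sum(W**2), 2.0 * W   # f(W) = sum(W^2), gradient 2W

W_next, val, g, state = sgd_step(value_and_grad, W, 0)
print(W_next.shape)            # (2, 3)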