Earlier we introduced the BP neural network and the convolutional neural network (CNN); so why do we also need the recurrent neural network (RNN)?
In BP networks and CNNs the inputs and outputs are treated as independent of one another, but in many real applications the current output is related to what came before.
BP networks and CNNs share an assumption: each input is an independent unit with no contextual connections, e.g. the input is a single image and the network decides whether it shows a dog or a cat. But for sequential inputs with obvious context, such as predicting the next frame of a video, the output clearly must depend on earlier inputs; in other words, the network needs some "memory". To give networks this memory, a specially structured neural network, the recurrent neural network (Recurrent Neural Network, RNN), came into being.
The RNN introduces the notion of "memory": "recurrent" means every element performs the same task, while the output depends on both the input and the "memory".
RNN application scenarios: natural language processing, machine translation, speech recognition, and so on.
RNN (Recurrent Neural Network)
Recurrent neural networks are a class of neural networks for processing sequential data. Just as convolutional networks are specialized for grid-structured data (such as an image), recurrent networks are specialized for processing a sequence $x^{(1)},\dots,x^{(T)}$.
The RNN network structure is as follows:
Compared with a convolutional network, the structure of a recurrent network is simpler: usually it contains only an input layer, a hidden layer, and an output layer; even counting the input and output layers, there are at most about five layers.
Unrolling the sequence over time yields the RNN structure shown in the figure below:
The network's input at a given time step, $x_t$, is an $n$-dimensional vector, just like the input of the BP network introduced earlier. The difference is that a recurrent network's input is an entire sequence, i.e. $x=[x_1,\dots,x_{t-1},x_t,x_{t+1},\dots,x_T]$. For a language model, each $x_t$ represents a word vector and the whole sequence represents a sentence.
- $h_t$ denotes the linear-transformation value of the hidden units at time $t$
- $s_t$ denotes the hidden state at time $t$, i.e. the "memory"
- $o_t$ denotes the output at time $t$
- $U$ denotes the weights between the input layer and the hidden layer
- $W$ denotes the hidden-to-hidden weights; it is the network's memory controller, responsible for scheduling the memory
- $V$ denotes the weights between the hidden layer and the output layer
BPTT
Training an RNN, like training a CNN/ANN, uses the BP (error backpropagation) algorithm.
The difference:
in an RNN the parameters $U$, $V$, $W$ are shared, and under stochastic gradient descent the output at each step depends not only on the current step's network but also on the network state from several steps back. This modified version of BP is called Backpropagation Through Time (BPTT). Like BP, BPTT can suffer from vanishing and exploding gradients when training over many steps (long-term dependencies, i.e. the current output depends on a long preceding sequence, generally more than 10 steps). BPTT follows the same idea as BP, taking partial derivatives; the difference is that it must account for the effect of time on each step.
RNN forward propagation
At time $t=1$, $U$, $V$, $W$ have been randomly initialized and $s_0$ is usually initialized to 0; then we compute:
$$h_1 = Ux_1 + Ws_0,\qquad s_1 = f(h_1),\qquad o_1 = g(Vs_1)$$
At time $t=2$, the state $s_1$, as the memory from time 1, takes part in the prediction at the next time step, that is:
$$h_2 = Ux_2 + Ws_1,\qquad s_2 = f(h_2),\qquad o_2 = g(Vs_2)$$
Continuing in this way, we obtain:
$$s_t = f(Ux_t + Ws_{t-1}),\qquad o_t = g(Vs_t)$$
where $f$ can be an activation function such as tanh, ReLU, or sigmoid, and $g$ is usually softmax but can be something else.
- Note that when we say the recurrent network has memory, this ability comes from $W$ summarizing past input states as an aid to the next step's input.
- The hidden state can be read as: $h = f(\text{current input} + \text{summary of past memory})$
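To make the forward pass concrete, here is a minimal NumPy sketch of the recurrence above. The dimensions, the random initialization, and the choice of tanh for $f$ and softmax for $g$ are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def rnn_forward(x_seq, U, W, V):
    """Vanilla RNN forward pass: s_t = f(U x_t + W s_{t-1}), o_t = g(V s_t).

    x_seq: (T, n) array, one n-dimensional input vector per time step.
    U: (H, n), W: (H, H), V: (K, H).
    """
    H = W.shape[0]
    s = np.zeros(H)                    # s_0 is usually initialized to 0
    states, outputs = [], []
    for x_t in x_seq:
        h = U @ x_t + W @ s            # h_t: linear transformation
        s = np.tanh(h)                 # s_t = f(h_t), the "memory"
        z = V @ s
        o = np.exp(z - z.max())
        o /= o.sum()                   # o_t = g(V s_t) with g = softmax
        states.append(s)
        outputs.append(o)
    return np.array(states), np.array(outputs)

# Example: T=4 steps of 3-dim inputs, hidden size 5, 2 output classes
rng = np.random.default_rng(0)
s_all, o_all = rnn_forward(rng.normal(size=(4, 3)),
                           rng.normal(size=(5, 3)) * 0.1,
                           rng.normal(size=(5, 5)) * 0.1,
                           rng.normal(size=(2, 5)) * 0.1)
```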
RNN backward propagation
As in the BP neural network, error backpropagation takes the total error at the output layer, computes the gradients $\nabla U$, $\nabla V$, $\nabla W$ with respect to each weight, and then updates the weights by gradient descent.
At every time step $t$, the RNN output $o_t$ incurs an error $e_t$; the loss function can be cross-entropy, squared error, and so on. The total error is $E=\sum_t e_t$, and our goal is to compute:
$$\nabla U = \frac{\partial E}{\partial U} = \sum_t\frac{\partial e_t}{\partial U}$$
$$\nabla V = \frac{\partial E}{\partial V} = \sum_t\frac{\partial e_t}{\partial V}$$
$$\nabla W = \frac{\partial E}{\partial W} = \sum_t\frac{\partial e_t}{\partial W}$$
Below we take $t=3$ as an example.
Assume squared error with ground truth $y_3$; then:
$$e_3 = \frac{1}{2}(o_3-y_3)^2$$
$$o_3=g(Vs_3)$$
$$e_3=\frac{1}{2}\big(g(Vs_3)-y_3\big)^2$$
$$s_3=f(Ux_3+Ws_2)$$
$$e_3=\frac{1}{2}\big(g(Vf(Ux_3+Ws_2))-y_3\big)^2$$
Solving for the partial derivative with respect to $W$:
In the last expression the term involving $W$ is $Ws_2$; this is clearly a composite function,
so we can differentiate it using the chain rule:
$$\frac{\partial e_3}{\partial W} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial W}$$
We now solve the factors one by one (using the squared-error loss):
$$e_3 = \frac{1}{2}(o_3-y_3)^2$$
$$\frac{\partial e_3}{\partial o_3} = o_3 - y_3$$
$$o_3=g(Vs_3)$$
$$\frac{\partial o_3}{\partial s_3}=g'V$$
where $g'$ denotes the derivative of the function $g$.
The first two factors are straightforward; the important one is the third.
From the formula:
$$s_t = f(Ux_t+Ws_{t-1})$$
we see that $s_3$ depends not only on $W$ directly but also on the previous state $s_2$.
Expanding $s_3$ directly gives:
$$\frac{\partial s_3}{\partial W}=\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}$$
- where $\frac{\partial s_3^+}{\partial W}$ denotes the direct (non-composite) derivative, treating everything other than $W$ as constant
- $\frac{\partial s_2}{\partial W}$ denotes the full composite derivative
Expanding $s_2$ directly gives:
$$\frac{\partial s_2}{\partial W}=\frac{\partial s_2}{\partial s_2}\frac{\partial s_2^+}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}$$
Expanding $s_1$ directly gives:
$$\frac{\partial s_1}{\partial W}=\frac{\partial s_1}{\partial s_1}\frac{\partial s_1^+}{\partial W} + \frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W}$$
Substituting the last two expansions back into the first gives:
$$\frac{\partial s_3}{\partial W}=\sum_{k=0}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
Finally:
$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
An alternative derivation (assuming we ignore $f$):
$$s_t=Ux_t+Ws_{t-1},\qquad s_3=Ux_3+Ws_2$$
$$\frac{\partial s_3}{\partial W} = s_2+W\frac{\partial s_2}{\partial W} = s_2+Ws_1+WW\frac{\partial s_1}{\partial W}$$
- $s_2 = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}$
- where $\frac{\partial s_3}{\partial s_3}=1$ and $\frac{\partial s_3^+}{\partial W}=s_2$, the derivative of $s_3$ with respect to $W$ without composite differentiation
$$s_2=Ux_2+Ws_1$$
- $Ws_1 =\frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}$
- where $\frac{\partial s_3}{\partial s_2}=W$ and $\frac{\partial s_2^+}{\partial W}=s_1$
$$s_1=Ux_1+Ws_0$$
$$WW\frac{\partial s_1}{\partial W}=\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}$$
Finally:
$$\frac{\partial s_3}{\partial W} =\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}+\frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}+\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\sum_{k=1}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
Written out with the figure above and the chain rule:
$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\Big(\prod_{j=k+1}^3\frac{\partial s_j}{\partial s_{j-1}}\Big)\frac{\partial s_k^+}{\partial W}$$
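As a sanity check, this summed form can be computed directly. Below is a minimal sketch for the scalar case; the scalar states, tanh for $f$, identity for $g$, and the squared-error loss are assumptions made purely for illustration:

```python
import numpy as np

def de_dW(x_seq, U, W, V, y):
    """Scalar BPTT for de_T/dW using
    de/dW = sum_k de/do * do/ds_T * (prod_{j>k} ds_j/ds_{j-1}) * ds_k+/dW."""
    s = [0.0]                                # s_0 = 0
    for x_t in x_seq:                        # forward pass, keep all states
        s.append(np.tanh(U * x_t + W * s[-1]))
    T = len(x_seq)
    o = V * s[T]                             # take g as the identity, so g' = 1
    de_do = o - y                            # e = (o - y)^2 / 2
    grad = 0.0
    for k in range(1, T + 1):
        prod = 1.0                           # prod_{j=k+1..T} ds_j/ds_{j-1}
        for j in range(k + 1, T + 1):
            prod *= (1 - s[j] ** 2) * W      # ds_j/ds_{j-1} = f'(h_j) * W
        dsk_dW = (1 - s[k] ** 2) * s[k - 1]  # direct part: s_{k-1} held constant
        grad += de_do * V * prod * dsk_dW
    return grad
```

The nested product over $j$ is exactly the $\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}$ factor; for tanh, $\frac{\partial s_j}{\partial s_{j-1}}=(1-s_j^2)W$, and it is this repeated multiplication by $W$ and $f'$ that makes the gradient vanish or explode.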
Solving for the partial derivative with respect to $U$ (analogous to $W$):
$$\frac{\partial e_3}{\partial U} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial U}$$
Let $a_t = Ux_t$ and $b_t=Ws_{t-1}$, so that:
$$s_t = f(a_t+b_t)$$
For the third factor, from the formula:
$$s_3 = f(Ux_3+Ws_2)$$
$$\frac{\partial s_3}{\partial U}=f' \times \Big(\frac{\partial Ux_3}{\partial U}+W\frac{\partial s_2}{\partial U}\Big)$$
$$=f' \times \Big(\frac{\partial Ux_3}{\partial U}+Wf' \times \big(\frac{\partial Ux_2}{\partial U}+W\frac{\partial s_1}{\partial U}\big)\Big)$$
$$=f' \times \Big(\frac{\partial Ux_3}{\partial U}+Wf' \times \big(\frac{\partial Ux_2}{\partial U}+Wf' \times (\frac{\partial Ux_1}{\partial U}+W\frac{\partial s_0}{\partial U})\big)\Big)$$
$$=f' \times \Bigg(\frac{\partial Ux_3}{\partial U}+Wf' \times \bigg(\frac{\partial Ux_2}{\partial U}+Wf' \times \Big(\frac{\partial Ux_1}{\partial U}+Wf' \times \frac{\partial Ux_0}{\partial U}\Big)\bigg)\Bigg)$$
$$=f' \times \frac{\partial Ux_3}{\partial U}+W(f')^2 \times \frac{\partial Ux_2}{\partial U}+W^2(f')^3 \times \frac{\partial Ux_1}{\partial U}+W^3(f')^4 \times \frac{\partial Ux_0}{\partial U}$$
$$=\sum_{k=0}^3 (f')^{4-k}\frac{\partial (W^{3-k}a_k)}{\partial U}$$
$$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}(f')^{4-k}$$
I am not sure this result is correct; if you know, please point out any mistakes. Many thanks.
Ignoring $f$:
$$s_t=Ux_t+Ws_{t-1}$$
$$s_3=Ux_3+W\Big(Ux_2+W\big(Ux_1+WUx_0\big)\Big)=Ux_3+WUx_2+W^2Ux_1+W^3Ux_0$$
$$s_3 = a_3+Wa_2+W^2a_1+W^3a_0$$
$$\frac{\partial s_3}{\partial U} =\sum_{k=0}^3 \frac{\partial (W^{3-k}a_k)}{\partial U}$$
$$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}$$
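The closed form $s_3 = a_3+Wa_2+W^2a_1+W^3a_0$ is easy to check symbolically; a small SymPy sketch (the symbol names are illustrative):

```python
import sympy as sp

U, W = sp.symbols('U W')
x = sp.symbols('x0:4')       # x0, x1, x2, x3
s = U * x[0]                 # ignoring f: s_0 = U*x0
for t in range(1, 4):
    s = U * x[t] + W * s     # s_t = U*x_t + W*s_{t-1}
print(sp.expand(s))          # U*W**3*x0 + U*W**2*x1 + U*W*x2 + U*x3
```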
Solving for the partial derivative with respect to $V$:
Since $V$ is related only to the output $o_t$:
$$\frac{\partial e_3}{\partial V} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial V}$$
RNN limitations
From the derivation above: by the time we reach $t=100$, the value from $t=0$ has been multiplied by so many factors of $W$ that the network may well have forgotten the information from $t=0$. We call this the RNN vanishing-gradient problem. It is not vanishing in the literal sense: since the gradient is an accumulated sum, it cannot be exactly 0; rather, the gradient contribution at some time step becomes so small that the content of earlier steps is forgotten.
To overcome vanishing gradients, the LSTM and GRU models were later introduced. Because they store "memory" in a special way, a "memory" that previously had a large gradient is not immediately wiped out as in a plain RNN, so the vanishing-gradient problem is mitigated to some extent.
Another simple trick, used against exploding gradients, is gradient clipping: whenever a computed gradient exceeds a threshold $c$ or falls below $-c$, set it to $c$ or $-c$.
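A minimal sketch of the clipping rule just described (NumPy assumed; note that many frameworks clip by the global norm of the gradient instead of element-wise):

```python
import numpy as np

def clip_gradient(grad, c=5.0):
    """Element-wise clipping: values above c become c, below -c become -c."""
    return np.clip(grad, -c, c)

# Usage inside a training step, with lr and dW coming from elsewhere:
# W -= lr * clip_gradient(dW)
```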
The figure below shows the RNN error surface:
As the figure shows, the RNN error surface is either very steep or very flat. Without countermeasures, a parameter update that happens to land on a steep region produces a very large gradient, hence a very large update, which easily causes oscillation. With gradient clipping, even if you are unlucky enough to hit a steep region, the gradient cannot explode, because it is capped at the threshold $c$.
LSTM (Long Short-Term Memory)
Because RNNs suffer from the long-term dependency problem, they are prone to vanishing and exploding gradients. As its name suggests, the LSTM is particularly well suited to problems requiring long-term dependencies. Compared with the RNN, the LSTM redesigns the "memory cell". The figure below shows the unrolled structure of a recurrent network:
The box labeled A represents the "memory cell".
The RNN's "memory cell" looks like this:
It is just a simple nonlinear mapping.
The LSTM's "memory cell" looks like this:
Three gates are added to control the "memory cell".
The cell state is like a conveyor belt: it runs straight along the whole chain with only a few small linear interactions, so it is easy for information to flow along it unchanged.
How does the LSTM control the "cell state"?
The LSTM removes information from or adds information to the "cell state" through gate structures. There are three main gates controlling the cell state, each built on a sigmoid operation that outputs a value between 0 and 1.
The "forget gate" decides what information to discard from the cell state; for example, in a language model the cell state may carry gender information ("he" or "she"), and when a new pronoun appears we may want to forget the old one.
The input gate: a sigmoid layer decides which values need updating, and a tanh layer creates a new candidate vector $\widetilde{C}_t$ in preparation for the state update.
After the forget gate and the information-adding (input) gate, what to delete and what to add is determined, and the "cell state" can be updated.
The output gate produces the output based on the "cell state": a sigmoid layer determines which part of the cell state will be output; a tanh maps the cell state to a value between -1 and 1, which is then multiplied by the sigmoid gate's output, so that only the chosen part is emitted.
LSTM forward propagation
$$f_t = \sigma(W_f \cdot[h_{t-1}, x_t] + b_f)$$
writing $[h_{t-1}, x_t]$ as $x_f$
$$i_t = \sigma(W_i \cdot[h_{t-1}, x_t] + b_i)$$
writing $[h_{t-1}, x_t]$ as $x_i$
$$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1},x_t]+b_C)$$
writing $[h_{t-1}, x_t]$ as $x_C$
$$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$$
$$o_t=\sigma(W_o\cdot [h_{t-1}, x_t] + b_o)$$
writing $[h_{t-1}, x_t]$ as $x_o$
$$h_t=o_t * \tanh(C_t)$$
$$\hat{y}_t=W_y \cdot h_t + b_y$$
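A minimal NumPy sketch of one forward step following the equations above; the shapes and the concatenation layout $X=[h_{t-1},x_t]$ are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo, Wy, by):
    """One LSTM step. Each gate weight has shape (H, H + n) and acts on
    the concatenation X = [h_{t-1}, x_t]."""
    X = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ X + bf)            # forget gate f_t
    i = sigmoid(Wi @ X + bi)            # input gate i_t
    C_tilde = np.tanh(Wc @ X + bc)      # candidate ~C_t
    C = f * C_prev + i * C_tilde        # cell state C_t
    o = sigmoid(Wo @ X + bo)            # output gate o_t
    h = o * np.tanh(C)                  # hidden state h_t
    y_hat = Wy @ h + by                 # linear output y^_t
    return h, C, y_hat
```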
LSTM backward propagation
Using the squared error:
$$E = \sum_{t=0}^T E_t$$
$$E_t = \frac{1}{2} (\hat{y}_t - y_t)^2$$
$$\frac{\partial E}{\partial W_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot h_t$$
$$\frac{\partial E}{\partial b_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot 1$$
Since $W_f$, $W_i$, $W_C$, $W_o$ are all related to $h_t$ or $C_t$, each derivative can be written as a chain rule through $h_t$ or $C_t$.
(1) First find the derivatives of $E$ with respect to $h_t$ and $C_t$.
From the figure above, $h_t$ and $C_t$ each lie on two paths, so each derivative has two parts:
- one from the error at the current time step
- the other from the accumulated errors from time $t+1$ through $T$
$$\frac{\partial E}{\partial h_t} =\frac{\partial E_t}{\partial h_t} + \frac{\partial (\sum_{k=t+1}^TE_k)}{\partial h_t}$$
$$\frac{\partial E}{\partial C_t} =\frac{\partial E_t}{\partial C_t} + \frac{\partial (\sum_{k=t+1}^TE_k)}{\partial C_t}$$
$$\frac{\partial E_t}{\partial h_t} =\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T$$
$$\frac{\partial E_t}{\partial C_t}=\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial C_t}= \frac{\partial E_t}{\partial h_t} \cdot o_t \cdot (1-\tanh^2(C_t))=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot o_t \cdot (1-\tanh^2(C_t))$$
The following two terms cannot be computed yet, so for now we just give them names:
$$\frac{\partial (\sum_{k=t+1}^TE_k)}{\partial h_t}=dh_{next}$$
$$\frac{\partial (\sum_{k=t+1}^TE_k)}{\partial C_t}=dC_{next}$$
(2) Partial derivative with respect to $W_o$
$$\frac{\partial E}{\partial W_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial W_o}$$
$$\frac{\partial h_t}{\partial o_t}=\tanh(C_t)$$
$$\frac{\partial o_t}{\partial W_o} = o_t \cdot (1-o_t) \cdot x_o^T$$
$$\frac{\partial E}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t) \cdot x_o^T$$
(3) Partial derivative with respect to $b_o$
$$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial b_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial b_o}$$
$$\frac{\partial h_t}{\partial o_t} = \tanh(C_t)$$
$$\frac{\partial o_t}{\partial b_o}=o_t(1-o_t)$$
$$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t)$$
(4) Partial derivative with respect to $x_o$
$$\frac{\partial E}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial x_o}$$
$$\frac{\partial o_t}{\partial x_o}=o_t(1-o_t)\cdot W_o^T$$
(5) Partial derivative with respect to $W_C$
$$\frac{\partial E}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial W_C}$$
$$\frac{\partial C_t}{\partial \widetilde{C}_t}=i_t$$
$$\frac{\partial \widetilde{C}_t}{\partial W_C}=(1-\widetilde{C}_t^2)\cdot x_C^T$$
(6) Partial derivative with respect to $b_C$
$$\frac{\partial E}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial b_C}$$
$$\frac{\partial \widetilde{C}_t}{\partial b_C}=(1-\widetilde{C}_t^2)\cdot 1$$
(7) Partial derivative with respect to $x_C$
$$\frac{\partial E}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial x_C}$$
$$\frac{\partial \widetilde{C}_t}{\partial x_C}=(1-\widetilde{C}_t^2)\cdot W_C^T$$
(8) Partial derivatives with respect to $W_i$, $b_i$, $x_i$
$$\frac{\partial E}{\partial W_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial W_i}$$
$$\frac{\partial E}{\partial b_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial b_i}$$
$$\frac{\partial E}{\partial x_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial x_i}$$
$$\frac{\partial C_t}{\partial i_t}=\widetilde{C}_t$$
$$\frac{\partial i_t}{\partial W_i}=i_t\cdot (1-i_t) \cdot x_i^T$$
$$\frac{\partial i_t}{\partial b_i}= i_t\cdot (1-i_t) \cdot 1$$
$$\frac{\partial i_t}{\partial x_i}=i_t\cdot (1-i_t) \cdot W_i^T$$
(9) Partial derivatives with respect to $W_f$, $b_f$, $x_f$
$$\frac{\partial E}{\partial W_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial W_f}$$
$$\frac{\partial E}{\partial b_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial b_f}$$
$$\frac{\partial E}{\partial x_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial x_f}$$
$$\frac{\partial C_t}{\partial f_t}=C_{t-1}$$
$$\frac{\partial f_t}{\partial W_f}=f_t\cdot (1-f_t) \cdot x_f^T$$
$$\frac{\partial f_t}{\partial b_f}= f_t\cdot (1-f_t)\cdot 1$$
$$\frac{\partial f_t}{\partial x_f}=f_t\cdot (1-f_t)\cdot W_f^T$$
(10) Partial derivative with respect to $X$
$$\frac{\partial E}{\partial X}=\frac{\partial E}{\partial x_i}+\frac{\partial E}{\partial x_f}+\frac{\partial E}{\partial x_o}+\frac{\partial E}{\partial x_C}$$
Since $X=[h_{t-1},x]$ is the concatenation of $h_{t-1}$ and $x$:
$$dh_{next}=\frac{\partial E}{\partial X}[:,:H]\quad (\text{the first } H \text{ columns})$$
(11) $dC_{next}$
$$\frac{\partial (\sum_{k=t}^TE_k)}{\partial C_{t-1}}=\frac{\partial (\sum_{k=t}^TE_k)}{\partial C_{t}}\cdot \frac{\partial C_t}{\partial C_{t-1}}=\frac{\partial E}{\partial C_t}\frac{\partial C_t}{\partial C_{t-1}}=\frac{\partial E}{\partial C_t}\cdot f_t$$
The updates proceed from the last time step backward, so:
at the final time step, $dh_{next}=0$ and $dC_{next}=0$.
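Putting the pieces together, the backward pass runs from $t=T$ down to $0$, carrying $dh_{next}$ and $dC_{next}$ between steps. The sketch below shows that control flow and how the per-gate formulas above combine; the cache layout and variable names are assumptions, the weight gradients are accumulated as outer products with $X$, and the future-error terms are folded into $dh$ and $dC$ rather than kept separate:

```python
import numpy as np

def lstm_backward(cache, dy, Wf, Wi, Wc, Wo, Wy, H):
    """cache[t] = (X, f, i, C_tilde, C, o, C_prev) from the forward pass;
    dy[t] = dE_t/dy^_t, e.g. (y^_t - y_t) for the squared error."""
    T = len(cache) - 1
    dh_next = np.zeros(H)                            # zero at the final step
    dC_next = np.zeros(H)
    grads = {}
    for t in range(T, -1, -1):
        X, f, i, C_tilde, C, o, C_prev = cache[t]
        dh = Wy.T @ dy[t] + dh_next                  # two paths into h_t
        dC = dh * o * (1 - np.tanh(C) ** 2) + dC_next
        do = dh * np.tanh(C) * o * (1 - o)           # output gate
        di = dC * C_tilde * i * (1 - i)              # input gate
        df = dC * C_prev * f * (1 - f)               # forget gate
        dCt = dC * i * (1 - C_tilde ** 2)            # candidate
        # dE/dX sums the contributions of all four gates, as in (10)
        dX = Wo.T @ do + Wi.T @ di + Wf.T @ df + Wc.T @ dCt
        dh_next = dX[:H]               # the first H entries belong to h_{t-1}
        dC_next = dC * f               # dE/dC_{t-1} = dE/dC_t * f_t, as in (11)
        for name, dg in (("Wf", df), ("Wi", di), ("Wc", dCt), ("Wo", do)):
            grads[name] = grads.get(name, 0) + np.outer(dg, X)
    return grads
```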
LSTM variants: GRU
The GRU, proposed in 2014, has a simpler structure than the LSTM.
The network structures of the RNN and LSTM and the derivations of their forward and backward propagation look complicated, especially with so many formulas, but the core is just the chain rule of differentiation: once the relationships between the variables are clear, working through it step by step feels much easier. The author is still learning too and does not yet fully understand everything, so shortcomings are inevitable; feedback is very welcome. There are several LSTM variants (and others besides) that are not covered in detail here; they will be introduced after further study.