Derivation of the forward- and back-propagation formulas.
During forward propagation, the output of the n-th convolution layer and its activation layer is:
$$
\begin{cases}
y_n = f(out_{n-1}) = w_n \cdot out_{n-1} + b_n & \text{(convolution)}, \\
out_n = \sigma(y_n) & \text{(activation function)}.
\end{cases}
$$
Only the convolution and activation operations are considered here; normalization layers are ignored for now. $y_n$ is the result of the $n$-th convolution, $out_n$ is the output of the $n$-th activation function, $out_{n-1}$ is the output of the previous activation function, and $out_0$ is the network input.
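As an illustration (not part of the original derivation), the per-layer forward step can be sketched in a few lines of NumPy; for a single output position the convolution reduces to a dot product, and the weights and bias below are made-up example values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(out_prev, w, b):
    """One forward step: y_n = w_n . out_{n-1} + b_n, then out_n = sigma(y_n)."""
    y = np.dot(w, out_prev) + b          # convolution at one output position = dot product
    return y, sigmoid(y)

out0 = np.array([0.1, 0.2, 0.3, 0.4])    # network input out_0
w1, b1 = np.full(4, 0.1), 0.0            # hypothetical weights and bias, for illustration only
y1, out1 = forward_layer(out0, w1, b1)
print("y1 = %.6f, out1 = %.6f" % (y1, out1))
```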
The derivative of the $n$-th layer's output with respect to the first layer's weight $w_1$ and bias $b_1$:
$$
\begin{cases}
grad_n^{w_1} = \dfrac{\partial out_n}{\partial w_1}
= \left( \dfrac{\partial out_n}{\partial y_n} \times \dfrac{\partial y_n}{\partial out_{n-1}} \right)
\times \left( \dfrac{\partial out_{n-1}}{\partial y_{n-1}} \times \dfrac{\partial y_{n-1}}{\partial out_{n-2}} \right)
\times \cdots \times
\left( \dfrac{\partial out_2}{\partial y_2} \times \dfrac{\partial y_2}{\partial out_1} \right)
\times \left( \dfrac{\partial out_1}{\partial y_1} \times \dfrac{\partial y_1}{\partial w_1} \right), \\[2ex]
grad_n^{b_1} = \dfrac{\partial out_n}{\partial b_1}
= \left( \dfrac{\partial out_n}{\partial y_n} \times \dfrac{\partial y_n}{\partial out_{n-1}} \right)
\times \left( \dfrac{\partial out_{n-1}}{\partial y_{n-1}} \times \dfrac{\partial y_{n-1}}{\partial out_{n-2}} \right)
\times \cdots \times
\left( \dfrac{\partial out_2}{\partial y_2} \times \dfrac{\partial y_2}{\partial out_1} \right)
\times \left( \dfrac{\partial out_1}{\partial y_1} \times \dfrac{\partial y_1}{\partial b_1} \right).
\end{cases}
\tag{1}
$$
The first factor inside each pair of parentheses is the derivative of the activation function, and the second factor is the weight of the convolution kernel, i.e. $\frac{\partial y_n}{\partial out_{n-1}} = w_n$. Therefore
$$
\begin{cases}
grad_n^{w_1} = \dfrac{\partial out_n}{\partial w_1}
= \left( \dfrac{\partial out_n}{\partial y_n} \times w_n \right)
\times \left( \dfrac{\partial out_{n-1}}{\partial y_{n-1}} \times w_{n-1} \right)
\times \cdots \times
\left( \dfrac{\partial out_2}{\partial y_2} \times w_2 \right)
\times \left( \dfrac{\partial out_1}{\partial y_1} \times out_0 \right), \\[2ex]
grad_n^{b_1} = \dfrac{\partial out_n}{\partial b_1}
= \left( \dfrac{\partial out_n}{\partial y_n} \times w_n \right)
\times \left( \dfrac{\partial out_{n-1}}{\partial y_{n-1}} \times w_{n-1} \right)
\times \cdots \times
\left( \dfrac{\partial out_2}{\partial y_2} \times w_2 \right)
\times \left( \dfrac{\partial out_1}{\partial y_1} \times 1 \right).
\end{cases}
\tag{2}
$$
Forward propagation: a layer's output equals the previous layer's output multiplied by the weights, plus the bias, then passed through the activation function:
$$
out_n = \sigma(w_n \cdot out_{n-1} + b_n).
$$
Back propagation: the gradient at one layer is obtained from the gradient of the adjacent layer by multiplying by the weight and by the derivative of the activation function; in the notation of equation (2) the recursion reads:
$$
grad_n = \frac{\partial out_n}{\partial y_n} \cdot w_n \cdot grad_{n-1}.
$$
Equation (2) shows that two factors determine the magnitude of the gradient: the derivative of the activation function and the network's weight parameters. Controlling these two factors is what suppresses gradient vanishing and gradient explosion.
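To make this concrete, here is a minimal sketch (not from the original post) that multiplies the per-layer factors of equation (2), assuming the best case for sigmoid where every activation derivative equals its maximum value 0.25; under that assumption the product vanishes when $|w| < 4$ and explodes when $|w| > 4$:

```python
import numpy as np

def chained_gradient(weight, depth, act_derivative=0.25):
    """Product of the per-layer factors (activation derivative x weight) from equation (2),
    assuming every layer contributes the same factor."""
    factors = np.full(depth, act_derivative * weight)
    return np.prod(factors)

for w in (1.0, 4.0, 8.0):
    for d in (5, 20, 50):
        print("w=%.1f  depth=%2d  gradient factor=%.3e" % (w, d, chained_gradient(w, d)))
```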
To analyze back propagation through a single neuron, a $2\times2$ receptive field is convolved once and, through training, the neuron learns that the feature of this receptive field is 1. The receptive field, i.e. the input feature $x$, and the convolution kernel $w$ are both $2\times2$ matrices. Below, the forward and backward passes through four steps on this single neuron are demonstrated: convolution, bias, sigmoid activation, and L2 loss.
The loss function:
$$
Loss = (1 - y_2)^2
$$
The activation function and its derivative:
$$
\begin{cases}
\sigma(x) = \dfrac{1}{1 + e^{-x}}, \\[2ex]
\sigma'(x) = \sigma(x)\left[1 - \sigma(x)\right].
\end{cases}
$$
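As a quick numerical sanity check (an added sketch, not part of the original post), the identity $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$ can be compared against a central-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1.0 - sigmoid(x))   # the identity stated above

eps = 1e-6
for x in (-2.0, 0.0, 0.00144646, 3.0):       # 0.00144646 is the y1 value that appears later
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)   # central difference
    print("x=%+.5f  analytic=%.8f  numeric=%.8f" % (x, sigmoid_prime(x), numeric))
```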
The forward pass is straightforward:
$$
Loss = (1 - y_2)^2 = \left[1 - \sigma(wx + b)\right]^2 = \left[\sigma(wx + b) - 1\right]^2.
$$
The derivative of the loss with respect to the weight:
$$
\begin{aligned}
\frac{\partial L}{\partial w}
&= \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial w} \\
&\overset{\text{using the forward result}}{=} 2\left[\sigma(wx+b) - 1\right] \cdot \sigma'(wx+b) \cdot x
 = 2x \cdot \left[\sigma(wx+b) - 1\right] \cdot \sigma'(wx+b) \\
&\overset{\text{using the sigmoid activation}}{=} 2x \cdot \left[\sigma(wx+b) - 1\right] \cdot \sigma(wx+b) \cdot \left[1 - \sigma(wx+b)\right] \\
&= -2x \cdot \sigma(wx+b) \cdot \left[\sigma(wx+b) - 1\right]^2.
\end{aligned}
$$
The derivative of the loss with respect to the bias:
$$
\frac{\partial L}{\partial b}
= \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial b}
= -2 \cdot \sigma(wx+b) \cdot \left[\sigma(wx+b) - 1\right]^2.
$$
Here $\sigma(wx + b) = y_2$ and $[\sigma(wx + b) - 1]^2 = Loss$.
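As a sanity check (an added sketch, not part of the original code), the two closed-form expressions can be compared against central-difference estimates of the loss; the input and the initial kernel values used here are the ones printed in the output further below:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss(w, b, x):
    return (1.0 - sigmoid(np.dot(w, x) + b)) ** 2

x = np.array([0.1, 0.2, 0.3, 0.4])                                  # flattened 2x2 receptive field
w = np.array([-0.00101103, 0.00193166, -0.01216178, 0.01202441])    # initial kernel (from the output below)
b = 0.0

s = sigmoid(np.dot(w, x) + b)
grad_w = -2.0 * x * s * (s - 1.0) ** 2        # derived dL/dw
grad_b = -2.0 * s * (s - 1.0) ** 2            # derived dL/db

eps = 1e-6
num_w = np.array([(l2_loss(w + eps * np.eye(4)[i], b, x) -
                   l2_loss(w - eps * np.eye(4)[i], b, x)) / (2 * eps) for i in range(4)])
num_b = (l2_loss(w, b + eps, x) - l2_loss(w, b - eps, x)) / (2 * eps)

print("analytic dL/dw: %s" % grad_w)          # ~ [-0.02498 -0.04996 -0.07495 -0.09993]
print("numeric  dL/dw: %s" % num_w)
print("analytic dL/db: %.8f   numeric dL/db: %.8f" % (grad_b, num_b))
```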
The padding method padding='VALID' means no padding; otherwise the kernel would be convolved at 4 positions. The optimizer is GradientDescentOptimizer, the activation function is sigmoid, and the learning rate is fixed at 0.2. The training code and its output follow:
```python
# TensorFlow 1.x, Python 2 (print statements)
import tensorflow as tf
import numpy as np


def net(input):
    # A single 2x2 kernel convolved once over the 2x2 input (padding='VALID'), then bias and sigmoid.
    global filter, bias, y1, y2
    init_random = tf.random_normal_initializer(mean=0.0, stddev=0.01, seed=None, dtype=tf.float64)
    filter = tf.get_variable('filter', shape=[2, 2, 1, 1], initializer=init_random, dtype=tf.float64)
    bias = tf.Variable([0], dtype=tf.float64, name='bias')
    y1 = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')
    y2 = tf.nn.sigmoid(y1 + bias)
    return y2


def display(sess):
    # Print the current parameters, the forward results and the gradients.
    # print '--it:%2d' % it,'loss:',loss.eval({input:data},sess)
    print
    print "--filter:", filter.eval(sess).reshape(1, 4), " bias:", bias.eval(sess)
    print "--y1:", y1.eval({input: data}, sess), " y2:", y2.eval({input: data}, sess), "loss:", loss.eval({input: data}, sess)
    print "--filter gradient:", tf.gradients(loss, filter)[0].eval({input: data}, sess).reshape(1, 4), \
        " bias gradient:", tf.gradients(loss, bias)[0].eval({input: data}, sess).reshape(1, 1)


data = np.array([[0.1, 0.2], [0.3, 0.4]])       # the 2x2 receptive field
data = np.reshape(data, (1, 2, 2, 1))           # NHWC layout
input = tf.placeholder(tf.float64, [1, 2, 2, 1])
predict = net(input)
loss = tf.reduce_mean(tf.square(1 - predict))   # L2 loss against the target feature 1

step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.2, step, 1, 1)   # decay factor 1, so the rate stays at 0.2
# optimizer = tf.train.AdadeltaOptimizer(rate)
# optimizer = tf.train.AdagradOptimizer(rate)
# optimizer = tf.train.AdamOptimizer(rate)
# optimizer = tf.train.FtrlOptimizer(rate)
optimizer = tf.train.GradientDescentOptimizer(rate)
# optimizer = tf.train.MomentumOptimizer(rate)
# optimizer = tf.train.RMSPropOptimizer(rate)
train = optimizer.minimize(loss, global_step=step)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print "--trainable variables:", tf.trainable_variables()
    for it in range(3):
        display(sess)
        train.run({input: data}, sess)
```
Output:
```
--trainable variables: [<tf.Variable 'filter:0' shape=(2, 2, 1, 1) dtype=float64_ref>, <tf.Variable 'bias:0' shape=(1,) dtype=float64_ref>]

--filter: [[-0.00101103 0.00193166 -0.01216178 0.01202441]] bias: [0.]
--y1: [[[[0.00144646]]]] y2: [[[[0.50036161]]]] loss: 0.2496385164748249
--filter gradient: [[-0.02498191 -0.04996381 -0.07494572 -0.09992762]] bias gradient: [[-0.24981906]]

--filter: [[0.00398535 0.01192442 0.00282736 0.03200994]] bias: [0.04996381]
--y1: [[[[0.0164356]]]] y2: [[[[0.51659376]]]] loss: 0.23368159536065822
--filter gradient: [[-0.02414369 -0.04828738 -0.07243107 -0.09657476]] bias gradient: [[-0.24143691]]

--filter: [[0.00881409 0.02158189 0.01731358 0.05132489]] bias: [0.0982512]
--y1: [[[[0.03092182]]]] y2: [[[[0.53224842]]]] loss: 0.21879153616090155
--filter gradient: [[-0.02329029 -0.04658058 -0.06987087 -0.09316116]] bias gradient: [[-0.2329029]]
```
The input to the neuron is a 2-D matrix:
$$
x = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}
$$
Initialize the convolution kernel and the bias, which correspond to line 3 of the network output:
$$
w = \begin{bmatrix} -0.00101103 & 0.00193166 \\ -0.01216178 & 0.01202441 \end{bmatrix}, \qquad b = [0].
$$
The forward pass computes $y_1$, $y_2$ and the loss, which correspond to line 4 of the network output:
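Substituting the values above into the forward formulas (a worked check; the results match the printed output):

$$
y_1 = w \cdot x = 0.1 \times (-0.00101103) + 0.2 \times 0.00193166 + 0.3 \times (-0.01216178) + 0.4 \times 0.01202441 \approx 0.00144646,
$$
$$
y_2 = \sigma(y_1 + b) = \sigma(0.00144646) \approx 0.50036161, \qquad Loss = (1 - y_2)^2 \approx 0.24963852.
$$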
Compute the gradients, which correspond to line 5 of the network output:
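Evaluating the derived formulas with $\sigma(wx + b) = y_2 = 0.50036161$ reproduces the printed gradients:

$$
\frac{\partial L}{\partial w_i} = -2 x_i \cdot y_2 \left( y_2 - 1 \right)^2 \approx -0.24981906 \, x_i
\;\Rightarrow\; \left[ -0.02498191,\ -0.04996381,\ -0.07494572,\ -0.09992762 \right],
$$
$$
\frac{\partial L}{\partial b} = -2 \, y_2 \left( y_2 - 1 \right)^2 \approx -0.24981906.
$$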
Update the parameters, which correspond to line 7 of the network output:
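With the fixed learning rate 0.2, one step of gradient descent reproduces the parameters printed on line 7, for example:

$$
w \leftarrow w - 0.2 \cdot \frac{\partial L}{\partial w}, \quad \text{e.g. } -0.00101103 - 0.2 \times (-0.02498191) = 0.00398535,
$$
$$
b \leftarrow b - 0.2 \cdot \frac{\partial L}{\partial b} = 0 - 0.2 \times (-0.24981906) = 0.04996381.
$$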