
Backpropagation and Gradients




1. Forward Propagation and Backpropagation


  Derivation of the forward-propagation and backpropagation formulas.

1.1 Forward Propagation


  During forward propagation, the outputs of the n-th convolution layer and its activation layer are:

$$
\left\{
\begin{aligned}
y_n &= f(out_{n-1}) = w_n \cdot out_{n-1} + b_n, && \text{(convolution)} \\
out_n &= \sigma(y_n). && \text{(activation)}
\end{aligned}
\right.
$$

  Only the convolution and the activation function are considered here; normalization layers are ignored for now. $y_n$ is the result of the n-th convolution, $out_n$ is the output of the n-th activation function, $out_{n-1}$ is the output of the previous activation layer, and $out_0$ is the network input.
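
  A minimal sketch of this layer-by-layer recursion, with the convolution reduced to a scalar linear map so the structure is easy to see (the weights, biases, and input below are made-up illustration values, not from the article):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(out0, weights, biases):
    # apply y_n = w_n * out_{n-1} + b_n, then out_n = sigmoid(y_n), layer by layer
    out = out0
    for w, b in zip(weights, biases):
        y = w * out + b      # "convolution" reduced to a scalar linear map
        out = sigmoid(y)     # activation
    return out

print(forward(0.5, weights=[0.8, 1.2, 0.6], biases=[0.1, -0.2, 0.05]))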

1.2 Backpropagation


  The derivatives of the n-th layer's output with respect to the first layer's weight $w_1$ and bias $b_1$:

$$
\left\{
\begin{aligned}
grad_n^{w_1} &= \frac{\partial out_n}{\partial w_1}
  = \left(\frac{\partial out_n}{\partial y_n} \times \frac{\partial y_n}{\partial out_{n-1}}\right)
    \times \left(\frac{\partial out_{n-1}}{\partial y_{n-1}} \times \frac{\partial y_{n-1}}{\partial out_{n-2}}\right)
    \times \cdots
    \times \left(\frac{\partial out_2}{\partial y_2} \times \frac{\partial y_2}{\partial out_1}\right)
    \times \left(\frac{\partial out_1}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}\right), \\
grad_n^{b_1} &= \frac{\partial out_n}{\partial b_1}
  = \left(\frac{\partial out_n}{\partial y_n} \times \frac{\partial y_n}{\partial out_{n-1}}\right)
    \times \left(\frac{\partial out_{n-1}}{\partial y_{n-1}} \times \frac{\partial y_{n-1}}{\partial out_{n-2}}\right)
    \times \cdots
    \times \left(\frac{\partial out_2}{\partial y_2} \times \frac{\partial y_2}{\partial out_1}\right)
    \times \left(\frac{\partial out_1}{\partial y_1} \times \frac{\partial y_1}{\partial b_1}\right).
\end{aligned}
\right.
\tag{1}
$$

  Inside each pair of parentheses, the first factor is the derivative of the activation function and the second factor is the weight of the convolution kernel, i.e. $\frac{\partial y_n}{\partial out_{n-1}} = w_n$. Therefore

$$
\left\{
\begin{aligned}
grad_n^{w_1} &= \frac{\partial out_n}{\partial w_1}
  = \left(\frac{\partial out_n}{\partial y_n} \times w_n\right)
    \times \left(\frac{\partial out_{n-1}}{\partial y_{n-1}} \times w_{n-1}\right)
    \times \cdots
    \times \left(\frac{\partial out_2}{\partial y_2} \times w_2\right)
    \times \left(\frac{\partial out_1}{\partial y_1} \times out_0\right), \\
grad_n^{b_1} &= \frac{\partial out_n}{\partial b_1}
  = \left(\frac{\partial out_n}{\partial y_n} \times w_n\right)
    \times \left(\frac{\partial out_{n-1}}{\partial y_{n-1}} \times w_{n-1}\right)
    \times \cdots
    \times \left(\frac{\partial out_2}{\partial y_2} \times w_2\right)
    \times \left(\frac{\partial out_1}{\partial y_1} \times 1\right).
\end{aligned}
\right.
\tag{2}
$$
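
  Equation (2) maps directly to code: the gradient with respect to $w_1$ (or $b_1$) is a running product of $\sigma'(y_k) \times w_k$ factors, closed off at the first layer by $out_0$ (or by 1). A scalar sketch under the same simplification as above (all values are made up):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grads_wrt_first_layer(out0, weights, biases):
    # forward pass, keeping every pre-activation y_k
    outs, ys = [out0], []
    for w, b in zip(weights, biases):
        ys.append(w * outs[-1] + b)
        outs.append(sigmoid(ys[-1]))

    # product of sigma'(y_k) * w_k from layer n down to layer 2
    prod = 1.0
    for k in range(len(weights) - 1, 0, -1):
        s = sigmoid(ys[k])
        prod *= s * (1.0 - s) * weights[k]

    s1 = sigmoid(ys[0])
    grad_w1 = prod * s1 * (1.0 - s1) * out0   # d y_1 / d w_1 = out_0
    grad_b1 = prod * s1 * (1.0 - s1) * 1.0    # d y_1 / d b_1 = 1
    return grad_w1, grad_b1

print(grads_wrt_first_layer(0.5, weights=[0.8, 1.2, 0.6], biases=[0.1, -0.2, 0.05]))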

1.3 Analysis


  Forward propagation: a layer's output equals the previous layer's output multiplied by the weight, plus the bias, and then passed through the activation function: $out_n = \sigma(w_n \cdot out_{n-1} + b_n)$.
  Backpropagation: the gradient accumulated through n layers equals the gradient through the first n − 1 layers multiplied by the weight and by the derivative of the activation function: $grad_n = \frac{\partial out_n}{\partial y_n} \cdot w_n \cdot grad_{n-1}$.

1.4 Vanishing and Exploding Gradients


  By equation (2), two factors determine the size of the gradient: the derivative of the activation function and the weight parameters of the network. Controlling these two factors suppresses vanishing and exploding gradients (a numeric sketch follows the list below):

  1. The derivative of the activation function depends on the activation function itself and on its input y. Choosing the ReLU activation function bounds the derivative of the activation function itself, and normalization layers such as BN keep the activation input in a reasonable range.
  2. Weight normalization keeps the weight parameters in a reasonable range.
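
  A quick numeric illustration of the point above, assuming the worst-case sigmoid derivative of 0.25 at every layer; the depth of 20 and the weight magnitudes are made-up values:

depth = 20
for w in (0.5, 4.0, 8.0):
    factor = 0.25 * w          # each layer contributes sigma'(y) * w, with sigma' <= 0.25
    print("|w| = %.1f -> product over %d layers ~ %.3e" % (w, depth, factor ** depth))
# 0.5 -> ~8.7e-19 (vanishing), 4.0 -> 1.0 (borderline), 8.0 -> ~1.0e+06 (exploding)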

2. Model


  To analyze the backpropagation through a single neuron, we convolve a $2\times2$ receptive field once and train the neuron to recognize the feature of this receptive field as 1. Receptive field here means that the input feature x and the convolution kernel w are both $2\times2$ matrices. Below we walk through the forward and backward propagation of four operations on one neuron: convolution, bias, the sigmoid activation function, and the L2 loss.

Figure 1: propagation through a single artificial neuron.

  Loss function:

$$Loss = (1 - y_2)^2$$

  Activation function:

$$
\left\{
\begin{aligned}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
\sigma'(x) &= \sigma(x)\,[1 - \sigma(x)]
\end{aligned}
\right.
$$
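
  The identity $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$ is used throughout the derivation below; a quick finite-difference check (the test point 0.3 is arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.3, 1e-6
analytic = sigmoid(x) * (1.0 - sigmoid(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
print(analytic, numeric)   # both ~0.2445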

2.1 Forward Propagation


  Forward propagation is straightforward:

$$
Loss = (1 - y_2)^2 = [1 - \sigma(wx + b)]^2 = [\sigma(wx + b) - 1]^2.
$$
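
  The same forward pass in NumPy on a 2×2 input: one VALID convolution step is just the elementwise product of the kernel and the receptive field, summed. The kernel values below are placeholders; Section 3 uses the actually initialized values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([[0.1, 0.2], [0.3, 0.4]])        # 2x2 receptive field (input)
w = np.array([[0.01, -0.02], [0.03, 0.04]])   # placeholder 2x2 kernel
b = 0.0

y1 = np.sum(w * x) + b       # single VALID convolution step
y2 = sigmoid(y1)             # activation
loss = (1.0 - y2) ** 2       # L2 loss against the target label 1
print(y1, y2, loss)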

2.2 Backpropagation


  The derivative of the loss with respect to the weight:

$$
\begin{aligned}
\frac{\partial L}{\partial w}
&= \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial w} \\
&= 2\,[\sigma(wx + b) - 1] \cdot \sigma'(wx + b) \cdot x
 = 2x \cdot [\sigma(wx + b) - 1] \cdot \sigma'(wx + b) && \text{(forward-pass result)} \\
&= 2x \cdot [\sigma(wx + b) - 1] \cdot \sigma(wx + b) \cdot [1 - \sigma(wx + b)] && \text{(sigmoid derivative)} \\
&= -2x \cdot \sigma(wx + b) \cdot [\sigma(wx + b) - 1]^2.
\end{aligned}
$$

  The derivative of the loss with respect to the bias:

$$
\frac{\partial L}{\partial b}
= \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial b}
= -2 \cdot \sigma(wx + b) \cdot [\sigma(wx + b) - 1]^2.
$$

  where $\sigma(wx + b) = y_2$ and $[\sigma(wx + b) - 1]^2 = Loss$.
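
  Both closed-form derivatives can be checked numerically. The sketch below evaluates them with the same placeholder kernel as above and compares the result against central finite differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x):
    return (sigmoid(np.sum(w * x) + b) - 1.0) ** 2

x = np.array([[0.1, 0.2], [0.3, 0.4]])
w = np.array([[0.01, -0.02], [0.03, 0.04]])
b = 0.0

s = sigmoid(np.sum(w * x) + b)
grad_w = -2.0 * x * s * (s - 1.0) ** 2    # dL/dw from the derivation above
grad_b = -2.0 * s * (s - 1.0) ** 2        # dL/db

# central finite-difference check on one weight entry and on the bias
eps = 1e-6
w_p, w_m = w.copy(), w.copy()
w_p[0, 0] += eps
w_m[0, 0] -= eps
print(grad_w[0, 0], (loss_fn(w_p, b, x) - loss_fn(w_m, b, x)) / (2.0 * eps))
print(grad_b, (loss_fn(w, b + eps, x) - loss_fn(w, b - eps, x)) / (2.0 * eps))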

3. Training


  The padding mode padding='VALID' means no padding; otherwise the kernel would be applied 4 times instead of once. The optimizer is GradientDescentOptimizer, the activation function is sigmoid, and the learning rate is fixed at 0.2. The training code and its output follow:

import tensorflow as tf
import numpy as np

def net(input):
    global filter, bias, y1, y2
    # initialize a 2x2x1x1 kernel with small random values and the bias with 0
    init_random = tf.random_normal_initializer(mean=0.0, stddev=0.01, seed=None, dtype=tf.float64)
    filter = tf.get_variable('filter', shape=[2, 2, 1, 1], initializer=init_random, dtype=tf.float64)
    bias = tf.Variable([0], dtype=tf.float64, name='bias')
    # one VALID convolution over the 2x2 input produces a single value y1
    y1 = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')
    y2 = tf.nn.sigmoid(y1 + bias)  # sigmoid activation
    return y2

def display(sess):
    # print('--it:%2d' % it, 'loss:', loss.eval({input: data}, sess))
    print("--filter:", filter.eval(sess).reshape(1, 4), " bias:", bias.eval(sess))
    print("--y1:", y1.eval({input: data}, sess), " y2:", y2.eval({input: data}, sess),
          "loss:", loss.eval({input: data}, sess))
    print("--filter gradient:", tf.gradients(loss, filter)[0].eval({input: data}, sess).reshape(1, 4),
          " bias gradient:", tf.gradients(loss, bias)[0].eval({input: data}, sess).reshape(1, 1))

data = np.array([[0.1, 0.2], [0.3, 0.4]])
data = np.reshape(data, (1, 2, 2, 1))

input = tf.placeholder(tf.float64, [1, 2, 2, 1])
predict = net(input)
loss = tf.reduce_mean(tf.square(1 - predict))        # L2 loss against the target label 1
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.2, step, 1, 1)   # decay_rate=1 keeps the rate constant at 0.2
# optimizer = tf.train.AdadeltaOptimizer(rate)
# optimizer = tf.train.AdagradOptimizer(rate)
# optimizer = tf.train.AdamOptimizer(rate)
# optimizer = tf.train.FtrlOptimizer(rate)
optimizer = tf.train.GradientDescentOptimizer(rate)
# optimizer = tf.train.MomentumOptimizer(rate)
# optimizer = tf.train.RMSPropOptimizer(rate)
train = optimizer.minimize(loss, global_step=step)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    print("--trainable variables:", tf.trainable_variables())
    for it in range(3):
        display(sess)                   # print parameters, activations, loss and gradients
        train.run({input: data}, sess)  # one gradient-descent step

  Output:

--trainable variables: [<tf.Variable 'filter:0' shape=(2, 2, 1, 1) dtype=float64_ref>, <tf.Variable 'bias:0' shape=(1,) dtype=float64_ref>]

--filter: [[-0.00101103  0.00193166 -0.01216178  0.01202441]]  bias: [0.]
--y1: [[[[0.00144646]]]]  y2: [[[[0.50036161]]]] loss: 0.2496385164748249
--filter gradient: [[-0.02498191 -0.04996381 -0.07494572 -0.09992762]]  bias gradient: [[-0.24981906]]

--filter: [[0.00398535 0.01192442 0.00282736 0.03200994]]  bias: [0.04996381]
--y1: [[[[0.0164356]]]]  y2: [[[[0.51659376]]]] loss: 0.23368159536065822
--filter gradient: [[-0.02414369 -0.04828738 -0.07243107 -0.09657476]]  bias gradient: [[-0.24143691]]

--filter: [[0.00881409 0.02158189 0.01731358 0.05132489]]  bias: [0.0982512]
--y1: [[[[0.03092182]]]]  y2: [[[[0.53224842]]]] loss: 0.21879153616090155
--filter gradient: [[-0.02329029 -0.04658058 -0.06987087 -0.09316116]]  bias gradient: [[-0.2329029]]

3.1 Input Data


  The neuron's input, a two-dimensional matrix:

$$
x = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}
$$

3.2 Network Initialization


  The initialized convolution kernel and bias, corresponding to line 3 of the output:

$$
w = \begin{bmatrix} -0.00101103 & 0.00193166 \\ -0.01216178 & 0.01202441 \end{bmatrix}
$$

$$b = [0]$$

3.3 Forward Propagation


  Corresponding to line 4 of the output:

  1. $y_1 = w \cdot x + b = 0.00144646$
  2. $y_2 = \mathrm{sigmoid}(y_1) = 0.50036161$
  3. $Loss = (1 - y_2)^2 = 0.2496385164748249$

3.4 Backpropagation


  Compute the gradients, corresponding to line 5 of the output:

  1. $\frac{\partial L}{\partial w_1} = -2x_1 \cdot y_2 \cdot Loss = -0.02498191$
  2. $\frac{\partial L}{\partial w_2} = -2x_2 \cdot y_2 \cdot Loss = -0.04996381$
  3. $\frac{\partial L}{\partial w_3} = -2x_3 \cdot y_2 \cdot Loss = -0.07494572$
  4. $\frac{\partial L}{\partial w_4} = -2x_4 \cdot y_2 \cdot Loss = -0.09992762$
  5. $\frac{\partial L}{\partial b} = -2 \cdot y_2 \cdot Loss = -0.24981906$

  Update the parameters with learning rate 0.2, corresponding to line 7 of the output:

  1. $w_1 = w_1 - 0.2 \cdot \frac{\partial L}{\partial w_1} = 0.00398535$
  2. $w_2 = w_2 - 0.2 \cdot \frac{\partial L}{\partial w_2} = 0.01192442$
  3. $w_3 = w_3 - 0.2 \cdot \frac{\partial L}{\partial w_3} = 0.00282736$
  4. $w_4 = w_4 - 0.2 \cdot \frac{\partial L}{\partial w_4} = 0.03200994$
  5. $b = b - 0.2 \cdot \frac{\partial L}{\partial b} = 0.04996381$
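
  These hand computations can be reproduced with a few lines of NumPy, starting from the kernel and bias in line 3 of the output; the printed gradients and updated parameters should match lines 5 and 7 up to rounding:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.1, 0.2, 0.3, 0.4])                                 # flattened 2x2 input
w = np.array([-0.00101103, 0.00193166, -0.01216178, 0.01202441])   # kernel from output line 3
b, lr = 0.0, 0.2

y1 = np.dot(w, x) + b
y2 = sigmoid(y1)
loss = (1.0 - y2) ** 2

grad_w = -2.0 * x * y2 * loss   # dL/dw_i = -2 x_i * y2 * Loss
grad_b = -2.0 * y2 * loss       # dL/db   = -2 * y2 * Loss
print(grad_w, grad_b)                      # matches output line 5
print(w - lr * grad_w, b - lr * grad_b)    # matches output line 7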
