A BP (back-propagation) neural network usually refers to a shallow neural network with a three-layer structure. A neural network is built from neurons, and each neuron consists of input, computation, and output units.
Corresponding to the figure above, the inputs are x_1, x_2, \cdots, x_n together with an intercept term +1, and the output is:
\hat y=h_{w,b}(X)=f(w^T X)=f\left(\sum_{i=1}^n w_i x_i+b\right)
where w denotes the weight values and f is the activation function. Common choices of activation function include:
\mathrm{sigmoid}:\ f(x)=\frac{1}{1+e^{-x}}

\tanh:\ f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}

\mathrm{ReLU}:\ f(x)=\max(0,x)

\mathrm{SoftPlus}:\ f(x)=\ln(1+e^x)
The corresponding plots of the four activation functions are shown in a figure (omitted in this excerpt).
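Since the figure is not reproduced here, a minimal sketch that regenerates the four curves, assuming numpy and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
activations = {
    'sigmoid': 1.0 / (1.0 + np.exp(-x)),
    'tanh': np.tanh(x),
    'ReLU': np.maximum(0, x),
    'SoftPlus': np.log(1.0 + np.exp(x)),
}

# One subplot per activation function
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (name, y) in zip(axes.ravel(), activations.items()):
    ax.plot(x, y)
    ax.set_title(name)
    ax.grid(True)
plt.tight_layout()
plt.show()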
The original article shows a structural diagram of a three-layer neural network here (figure omitted), along with an explanation of the parameters. In the notation used below, w_{ij}^l denotes the weight from unit j in layer l to unit i in layer l+1, b_i^l the corresponding bias, z_i^l the weighted input of unit i in layer l, a_i^l = f(z_i^l) its activation, S_l the number of units in layer l, and n_l the index of the last (output) layer. The parameters are related as follows:
a_2^2=f(z_2^2)=f(w_{21}^1 x_1+w_{22}^1 x_2 +w_{23}^1 x_3+b_2^1)
That is, each neuron's input is the weighted sum of the outputs of all neurons in the previous layer; passing that input through the activation function yields the neuron's output.
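As a concrete illustration of this formula, a minimal numpy sketch of one neuron's forward pass (the weights and inputs are made-up example values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example values for a = f(w^T x + b)
x = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer (x1, x2, x3)
w = np.array([0.1, 0.4, -0.2])   # weights w_21, w_22, w_23
b = 0.3                          # bias b_2

z = np.dot(w, x) + b             # weighted input z_2^2
a = sigmoid(z)                   # activation a_2^2 = f(z_2^2)
print(z, a)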
For each training sample (X, y), the loss function is:
J(W,b;X,y)=\frac{1}{2}\|h_{w,b}(X)-y\|^2
This is half the squared Euclidean distance between the output-layer prediction and the true value. The difference h_{w,b}(X)-y is a vector whose dimension equals the number of output-layer neurons; taking the squared norm sums its components into a single scalar loss.
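As a quick check of this formula, a minimal numpy sketch with made-up prediction and target vectors:

import numpy as np

y_hat = np.array([0.8, 0.2])   # example output-layer prediction h_{w,b}(X)
y = np.array([1.0, 0.0])       # example true label (one-hot)

# J = 1/2 * ||y_hat - y||^2, a single scalar
loss = 0.5 * np.sum((y_hat - y) ** 2)
print(loss)  # 0.5 * (0.04 + 0.04) = 0.04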
To minimize the loss, the parameters are first initialized to small random values close to 0. Forward propagation then produces predictions, from which the loss is computed. The parameters are adjusted using the loss via gradient descent, whose update rules are:
W_{ij}^l=W_{ij}^l-\alpha\frac{\partial J(W,b)}{\partial W_{ij}^l}

b_i^l=b_i^l-\alpha\frac{\partial J(W,b)}{\partial b_i^l}
where the partial-derivative terms are averages over the m training samples:
\frac{\partial J(W,b)}{\partial W_{ij}^l}=\frac{1}{m}\sum_{k=1}^m \frac{\partial J(W,b;x^k,y^k)}{\partial W_{ij}^l}

\frac{\partial J(W,b)}{\partial b_i^l}=\frac{1}{m}\sum_{k=1}^m \frac{\partial J(W,b;x^k,y^k)}{\partial b_i^l}
Since there is a weight matrix W between every pair of adjacent layers, and the prediction is produced at the last layer n_l, it is convenient to solve for W_{ij}^{n_l -1} first. The derivation is as follows:
\frac{\partial J(W,b)}{\partial W_{ij}^{n_l -1}}=\frac{\partial \frac{1}{2}\|a^{n_l}-y\|^2}{\partial W_{ij}^{n_l -1}}=
\frac{\partial \frac{1}{2} \sum_{k=1}^{S_{n_l}} (a_k^{n_l}-y_k)^2 }{\partial W_{ij}^{n_l -1}}=
\frac{\partial \frac{1}{2} \sum_{k=1}^{S_{n_l}} (f(z_k^{n_l})-y_k)^2 }{\partial W_{ij}^{n_l -1}}
where z_i^{n_l} equals:

z_i^{n_l}=\sum_{p=1}^{S_{n_l -1}} W_{ip}^{n_l -1}a_p^{n_l -1}+b_i^{n_l -1}

(the bias b_i^{n_l -1} is added once, outside the sum over p).
Since z_i^{n_l} is differentiable with respect to W_{ij}^{n_l -1}, the chain rule applies:
\frac{\partial J(W,b)}{\partial W_{ij}^{n_l -1}}=
\frac{\partial \frac{1}{2} \sum_{k=1}^{S_{n_l}} (f(z_k^{n_l})-y_k)^2 }{\partial z_i^{n_l}} \cdot \frac {\partial z_i^{n_l}} {\partial W_{ij}^{n_l -1}}=
[f(z_i^{n_l})-y_i]\cdot f'(z_i^{n_l})\cdot \frac {\partial z_i^{n_l}} {\partial W_{ij}^{n_l -1}}=
[f(z_i^{n_l})-y_i]\cdot f'(z_i^{n_l})\cdot a_j^{n_l -1}
The idea of the back-propagation algorithm is as follows: for a given training sample (X, y), compute every neuron's output via forward propagation. Once all outputs are available, compute a residual for each neuron; the residual of the i-th neuron in layer l is written \delta_i^l and measures that neuron's contribution to the final error. For the last layer, the residual is:
\delta_i^{n_l}=\frac{\partial J(W,b)}{\partial z_i^{n_l}}=[f(z_i^{n_l})-y_i]\cdot f'(z_i^{n_l})
Substituting \delta_i^{n_l} gives:
\frac{\partial J(W,b)}{\partial W_{ij}^{n_l -1}}=\delta_i^{n_l}\cdot a_j^{n_l -1}
where a_j^{n_l -1} is already available from forward propagation, so what remains is to compute the residuals. To obtain them for earlier layers, we derive the relationship between the residual of the second-to-last layer and that of the last layer:
\delta_i^{n_l -1}=\frac{\partial J(W,b)}{\partial z_i^{n_l -1}}=
\frac{\partial \frac{1}{2} \sum_{k=1}^{S_{n_l}} (f(z_k^{n_l})-y_k)^2 }{\partial z_i^{n_l -1}}=
\frac{1}{2} \sum_{k=1}^{S_{n_l}} \frac{\partial (f(z_k^{n_l})-y_k)^2 }{\partial z_i^{n_l -1}}=
\frac{1}{2} \sum_{k=1}^{S_{n_l}} \frac{\partial (f(z_k^{n_l})-y_k)^2 }{\partial z_k^{n_l}}\cdot \frac{\partial z_k^{n_l}}{\partial z_i^{n_l -1}}=
\sum_{k=1}^{S_{n_l}} \delta_k^{n_l}\cdot \frac{\partial z_k^{n_l}}{\partial z_i^{n_l -1}}=
\sum_{k=1}^{S_{n_l}} \delta_k^{n_l}\cdot \frac{\partial \left[\sum_{j=1}^{S_{n_l-1}} W_{kj}^{n_l-1} f(z_j^{n_l-1})+b_k^{n_l-1}\right]}{\partial z_i^{n_l -1}}=
\sum_{k=1}^{S_{n_l}} \delta_k^{n_l}\cdot W_{ki}^{n_l-1} f'(z_i^{n_l-1})=
\left[\sum_{k=1}^{S_{n_l}} \delta_k^{n_l}\cdot W_{ki}^{n_l-1}\right] \cdot f'(z_i^{n_l-1})
That is:

\delta_i^{n_l -1}=\left[\sum_{k=1}^{S_{n_l}} \delta_k^{n_l}\cdot W_{ki}^{n_l-1}\right] \cdot f'(z_i^{n_l-1})
Generalizing to an arbitrary layer l:
\delta_i^l=\left[\sum_{k=1}^{S_{l+1}} \delta_k^{l+1}\cdot W_{ki}^l\right] \cdot f'(z_i^l)

\frac{\partial J(W,b)}{\partial W_{ij}^l}=\delta_i^{l+1}\cdot a_j^l

\frac{\partial J(W,b)}{\partial b_i^l}=\delta_i^{l+1}
# Code from the book《Python机器学习算法》(Python Machine Learning Algorithms)
import numpy as np
from math import sqrt

def bp_train(feature, label, n_hidden, maxCycle, alpha, n_output):
    '''Train the BP network
    input:  feature(mat): features
            label(mat): labels
            n_hidden(int): number of hidden-layer nodes
            maxCycle(int): maximum number of iterations
            alpha(float): learning rate
            n_output(int): number of output-layer nodes
    output: w0(mat): weights between the input and hidden layers
            b0(mat): biases between the input and hidden layers
            w1(mat): weights between the hidden and output layers
            b1(mat): biases between the hidden and output layers
    '''
    m, n = np.shape(feature)
    # 1. Initialization: uniform on [-4*sqrt(6)/sqrt(fan_in+fan_out), +4*sqrt(6)/sqrt(fan_in+fan_out)]
    w0 = np.mat(np.random.rand(n, n_hidden))
    w0 = w0 * (8.0 * sqrt(6) / sqrt(n + n_hidden)) - \
         np.mat(np.ones((n, n_hidden))) * (4.0 * sqrt(6) / sqrt(n + n_hidden))
    b0 = np.mat(np.random.rand(1, n_hidden))
    b0 = b0 * (8.0 * sqrt(6) / sqrt(n + n_hidden)) - \
         np.mat(np.ones((1, n_hidden))) * (4.0 * sqrt(6) / sqrt(n + n_hidden))
    w1 = np.mat(np.random.rand(n_hidden, n_output))
    w1 = w1 * (8.0 * sqrt(6) / sqrt(n_hidden + n_output)) - \
         np.mat(np.ones((n_hidden, n_output))) * (4.0 * sqrt(6) / sqrt(n_hidden + n_output))
    b1 = np.mat(np.random.rand(1, n_output))
    b1 = b1 * (8.0 * sqrt(6) / sqrt(n_hidden + n_output)) - \
         np.mat(np.ones((1, n_output))) * (4.0 * sqrt(6) / sqrt(n_hidden + n_output))
    # 2. Training
    i = 0
    while i <= maxCycle:
        # 2.1 Forward propagation of the signal
        # 2.1.1 Input of the hidden layer
        hidden_input = hidden_in(feature, w0, b0)  # m x n_hidden
        # 2.1.2 Output of the hidden layer
        hidden_output = hidden_out(hidden_input)
        # 2.1.3 Input of the output layer
        output_in = predict_in(hidden_output, w1, b1)  # m x n_output
        # 2.1.4 Output of the output layer
        output_out = predict_out(output_in)
        # 2.2 Back-propagation of the error
        # 2.2.1 Residual between the hidden and output layers
        delta_output = -np.multiply((label - output_out), partial_sig(output_in))
        # 2.2.2 Residual between the input and hidden layers
        delta_hidden = np.multiply((delta_output * w1.T), partial_sig(hidden_input))
        # 2.3 Update the weights and biases
        w1 = w1 - alpha * (hidden_output.T * delta_output)
        b1 = b1 - alpha * np.sum(delta_output, axis=0) * (1.0 / m)
        w0 = w0 - alpha * (feature.T * delta_hidden)
        b0 = b0 - alpha * np.sum(delta_hidden, axis=0) * (1.0 / m)
        if i % 100 == 0:
            print("\t-------- iter:", i,
                  ", cost:", (1.0 / 2) * get_cost(get_predict(feature, w0, w1, b0, b1) - label))
        i += 1
    return w0, w1, b0, b1
With the number of sample features n = 2, hidden-layer nodes n_hidden = 20, and output-layer nodes (number of classes) n_output = 2, this forms a 2×20×2 three-layer network:

input-to-hidden weights w0 = np.mat(np.random.rand(n, n_hidden)), i.e. 2×20 values
input-to-hidden biases b0 = np.mat(np.random.rand(1, n_hidden)), i.e. 20 values
hidden-to-output weights w1 = np.mat(np.random.rand(n_hidden, n_output)), i.e. 20×2 values
hidden-to-output biases b1 = np.mat(np.random.rand(1, n_output)), i.e. 2 values

A more principled random scheme is then applied to turn these into the initialized w0, b0, w1, b1, as shown in the sketch below.
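The initialization in bp_train does exactly this: it rescales np.random.rand (uniform on [0, 1)) into the symmetric interval [-4√6/√(fan_in+fan_out), +4√6/√(fan_in+fan_out)], the Xavier/Glorot uniform range with the factor 4 commonly recommended for sigmoid units. A more direct, equivalent sketch using np.random.uniform:

import numpy as np

n, n_hidden, n_output = 2, 20, 2

# Same range as the book's rescaled rand(): uniform on [-limit, +limit]
limit0 = 4.0 * np.sqrt(6.0 / (n + n_hidden))
w0 = np.mat(np.random.uniform(-limit0, limit0, (n, n_hidden)))
b0 = np.mat(np.random.uniform(-limit0, limit0, (1, n_hidden)))

limit1 = 4.0 * np.sqrt(6.0 / (n_hidden + n_output))
w1 = np.mat(np.random.uniform(-limit1, limit1, (n_hidden, n_output)))
b1 = np.mat(np.random.uniform(-limit1, limit1, (1, n_output)))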
# 2.1.1 Input of the hidden layer
hidden_input = hidden_in(feature, w0, b0)  # m x n_hidden
# 2.1.2 Output of the hidden layer
hidden_output = hidden_out(hidden_input)
# 2.1.3 Input of the output layer
output_in = predict_in(hidden_output, w1, b1)  # m x n_output
# 2.1.4 Output of the output layer
output_out = predict_out(output_in)
The hidden_in function computes the input values of the hidden layer, corresponding to the formula:
z_i^l=\sum_{k=1}^{S_{l -1}} W_{ik}^{l -1}a_k^{l -1}+b_i^{l -1}
def hidden_in(feature, w0, b0):
    '''Compute the input of the hidden layer
    input:  feature(mat): features
            w0(mat): weights between the input and hidden layers
            b0(mat): biases between the input and hidden layers
    output: hidden_in(mat): input of the hidden layer
    '''
    m = np.shape(feature)[0]
    hidden_in = feature * w0
    for i in range(m):
        hidden_in[i, :] += b0  # add the bias row to every sample
    return hidden_in
The hidden_out function computes the output of the hidden layer, corresponding to the formula:
a_i^l=f(z_i^l)
def hidden_out(hidden_in):
    '''Output of the hidden layer
    input:  hidden_in(mat): input of the hidden layer
    output: hidden_output(mat): output of the hidden layer
    '''
    hidden_output = sig(hidden_in)
    return hidden_output
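The sig helper used here (the elementwise sigmoid) is not shown in this excerpt; a minimal sketch consistent with how it is called:

def sig(x):
    '''Elementwise sigmoid, f(x) = 1 / (1 + e^(-x));
    works for scalars and numpy matrices alike.'''
    return 1.0 / (1.0 + np.exp(-x))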
The predict_in function is analogous to hidden_in, and predict_out is analogous to hidden_out.
# 2.2.1 Residual between the hidden and output layers
delta_output = -np.multiply((label - output_out), partial_sig(output_in))
# 2.2.2 Residual between the input and hidden layers
delta_hidden = np.multiply((delta_output * w1.T), partial_sig(hidden_input))
The partial_sig function evaluates the derivative of the sigmoid at the given input. delta_output corresponds to the last-layer residual formula:
\delta_i^{n_l}=\frac{\partial J(W,b)}{\partial z_i^{n_l}}=[f(z_i^{n_l})-y_i]\cdot f'(z_i^{n_l})
and delta_hidden corresponds to the general residual formula:
\delta_i^l=\left[\sum_{k=1}^{S_{l+1}} \delta_k^{l+1}\cdot W_{ki}^l\right] \cdot f'(z_i^l)
def partial_sig(x):
    '''Value of the derivative of the sigmoid
    input:  x(mat/float): argument, a matrix or any real number
    output: out(mat/float): derivative of the sigmoid at x
    '''
    m, n = np.shape(x)
    out = np.mat(np.zeros((m, n)))
    for i in range(m):
        for j in range(n):
            out[i, j] = sig(x[i, j]) * (1 - sig(x[i, j]))
    return out
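The elementwise double loop is easy to follow but slow on large matrices; an equivalent vectorized sketch relying on numpy broadcasting (np.multiply is needed because * is matrix multiplication on np.mat):

def partial_sig(x):
    '''Vectorized sigmoid derivative: f'(x) = f(x) * (1 - f(x)).'''
    s = sig(x)
    return np.multiply(s, 1 - s)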
w1 = w1 - alpha * (hidden_output.T * delta_output)
b1 = b1 - alpha * np.sum(delta_output, axis=0) * (1.0 / m)
w0 = w0 - alpha * (feature.T * delta_hidden)
b0 = b0 - alpha * np.sum(delta_hidden, axis=0) * (1.0 / m)
These updates correspond to the formulas:
\frac{\partial J(W,b)}{\partial W_{ij}^l}=\delta_i^{l+1}\cdot a_j^l

\frac{\partial J(W,b)}{\partial b_i^l}=\delta_i^{l+1}
def get_predict(feature, w0, w1, b0, b1):
    '''Compute the final prediction
    input:  feature(mat): features
            w0(mat): weights between the input and hidden layers
            b0(mat): biases between the input and hidden layers
            w1(mat): weights between the hidden and output layers
            b1(mat): biases between the hidden and output layers
    output: the predicted values
    '''
    return predict_out(predict_in(hidden_out(hidden_in(feature, w0, b0)), w1, b1))
Passing in the trained parameters computes each output-layer neuron's value for the test samples; the neuron with the largest output is taken as the classification result.
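A minimal usage sketch tying the pieces together, assuming the helper functions above plus the book's get_cost (not shown in this excerpt) are in scope. The toy data here is made up; np.argmax along axis 1 picks the winning output neuron for each sample:

import numpy as np

# Toy two-class data: 2 features per sample (made-up example)
feature = np.mat(np.random.rand(100, 2))
# One-hot labels: class 1 if the two features sum to more than 1
label = np.mat(np.zeros((100, 2)))
for i in range(100):
    label[i, int(feature[i, 0] + feature[i, 1] > 1.0)] = 1.0

w0, w1, b0, b1 = bp_train(feature, label, n_hidden=20,
                          maxCycle=1000, alpha=0.1, n_output=2)
pred = get_predict(feature, w0, w1, b0, b1)
pred_class = np.argmax(pred, axis=1)  # index of the largest output neuron per sample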