In 1943, psychologist W.S. McCulloch and mathematical logician W. Pitts built a mathematical model of a single neuron (the MP model) based on the physiological characteristics of neurons:
$$y_k = \varphi\left(\sum_{i=1}^{m} \omega_{ki} x_i + b_k\right) = \varphi\left(W_k^T X + b\right)$$
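A minimal MATLAB sketch of evaluating a single MP-style neuron; here φ is taken to be the unit step function and all numbers are made up for illustration:

phi = @(v) double(v >= 0);       % step activation, one common choice for the MP model
X   = [0.5; -1.2; 0.3];          % input vector x_1..x_m (made-up values)
W_k = [0.8;  0.4; -0.6];         % weights omega_{k1}..omega_{km} (made-up values)
b_k = 0.1;                       % bias
y_k = phi(W_k' * X + b_k)        % y_k = phi(W_k^T X + b_k)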
In 1957, Frank Rosenblatt re-examined this model from a purely mathematical perspective and pointed out that the weights $W$ and the bias $b$ can be obtained from a set of input-output pairs $(X, y)$ by a learning algorithm.
Problem: given input-output pairs $(X, y)$ with $y = \pm 1$, find a function $f$ such that $f(X) = y$.
Perceptron approach: set $f(X) = \mathrm{sign}(W^T X + b)$ and learn $W$ and $b$ automatically from the input-output pairs.
Perceptron Algorithm:
(1) Choose $W$ and $b$ at random;
(2) Take a training sample $(X, y)$:
(i) if $W^T X + b > 0$ and $y = -1$, then
$$W = W - X, \quad b = b - 1;$$
(ii) if $W^T X + b < 0$ and $y = +1$, then
$$W = W + X, \quad b = b + 1;$$
(3) Take another pair $(X, y)$ and return to (2);
(4) Termination: stop when no input-output pair $(X, y)$ satisfies either (i) or (ii) in step (2).
Perceptron algorithm demonstration:
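A minimal MATLAB sketch of the update loop above (the training data X, y and their row-wise layout are assumptions made for illustration):

% X: N x d matrix, one sample per row; y: N x 1 vector of labels +1 / -1.
d = size(X,2);
W = rand(d,1); b = rand();                 % (1) random initialization
changed = true;
while changed
    changed = false;
    for n = 1:size(X,1)                    % (2)-(3) sweep over all (X, y) pairs
        v = W' * X(n,:)' + b;
        if v > 0 && y(n) == -1             % (i)  output +1 but label -1
            W = W - X(n,:)'; b = b - 1; changed = true;
        elseif v < 0 && y(n) == +1         % (ii) output -1 but label +1
            W = W + X(n,:)'; b = b + 1; changed = true;
        end
    end
end                                        % (4) stop when no pair triggers an update

As in the algorithm itself, the loop terminates only when the data are linearly separable.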
Two-layer neural network example:
$$\begin{aligned}
a_1 &= \omega_{11} x_1 + \omega_{12} x_2 + b_1 \\
a_2 &= \omega_{21} x_1 + \omega_{22} x_2 + b_2 \\
z_1 &= \varphi(a_1) \\
z_2 &= \varphi(a_2) \\
y &= \omega_1 z_1 + \omega_2 z_2 + b_3
\end{aligned}$$
where $\varphi(\cdot)$ is a nonlinear function.
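A minimal MATLAB sketch of this forward pass; φ is taken to be the sigmoid function here, and all weights and inputs are made-up values:

phi = @(v) 1 ./ (1 + exp(-v));           % nonlinear activation (sigmoid assumed)
x1 = 0.7;  x2 = -0.2;                    % inputs
w11 = 0.5; w12 = -0.3; b1 = 0.1;         % hidden unit 1
w21 = 0.2; w22 =  0.8; b2 = -0.4;        % hidden unit 2
w1  = 1.0; w2  = -1.5; b3 = 0.05;        % output layer
a1 = w11*x1 + w12*x2 + b1;
a2 = w21*x1 + w22*x2 + b2;
z1 = phi(a1);  z2 = phi(a2);
y  = w1*z1 + w2*z2 + b3                  % y = omega_1*z_1 + omega_2*z_2 + b_3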
Theorem: when $\varphi(x)$ is the step function, a three-layer network can model an arbitrary decision surface.
Example:
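For instance, with step activations, two hidden units already suffice to realize the XOR decision rule; a minimal MATLAB sketch with hand-picked weights (an illustrative construction):

step = @(v) double(v >= 0);
% hidden units: h1 fires when x1 + x2 >= 0.5, h2 fires when x1 + x2 >= 1.5;
% the output fires when h1 = 1 and h2 = 0, i.e. exactly one input is 1.
xor_net = @(x1,x2) step( step(x1 + x2 - 0.5) - step(x1 + x2 - 1.5) - 0.5 );
[xor_net(0,0), xor_net(0,1), xor_net(1,0), xor_net(1,1)]   % returns 0 1 1 0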
Drawbacks of multi-layer neural networks:
Training tips:
(1) Do not update the parameters after every single sample; instead, feed a batch of samples (called a BATCH or MINI-BATCH), compute the average gradient over these samples, and update the parameters with this average.
(2) In neural network training, the batch size is typically set somewhere between 50 and 200.
batch_size = option.batch_size;
m = size(train_x,1);                    % number of training samples
num_batches = m / batch_size;           % assumes m is divisible by batch_size
for k = 1 : iteration
    kk = randperm(m);                   % shuffle the samples each epoch
    for l = 1 : num_batches
        % take the l-th mini-batch of the shuffled data
        batch_x = train_x(kk((l - 1) * batch_size + 1 : l * batch_size), :);
        batch_y = train_y(kk((l - 1) * batch_size + 1 : l * batch_size), :);
        nn = nn_forward(nn,batch_x,batch_y);        % forward pass
        nn = nn_backpropagation(nn,batch_y);        % backward pass
        nn = nn_applygradient(nn);                  % parameter update
    end
end
m = size(batch_x,2);
Forward computation:
nn.cost(s) = 0.5 / m * sum(sum((nn.a{k} - batch_y).^2)) + 0.5 * nn.weight_decay * cost2;
Backpropagation:
nn.W_grad{nn.depth-1} = nn.theta{nn.depth}*nn.a{nn.depth-1}'/m + nn.weight_decay*nn.W{nn.depth-1};
nn.b_grad{nn.depth-1} = sum(nn.theta{nn.depth},2)/m;
Recommendation: normalize the inputs to zero mean and unit variance:
$$newX = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)}$$
[U,V] = size(xTraining);                                          % U samples (rows), V features (columns)
avgX = mean(xTraining);                                           % per-feature mean
sigma = std(xTraining);                                           % per-feature standard deviation
xTraining = (xTraining - repmat(avgX,U,1))./repmat(sigma,U,1);    % zero mean, unit variance
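One practical point worth adding: the test data should be normalized with the mean and standard deviation computed on the training set (avgX and sigma above), not with its own statistics. A minimal sketch, assuming a test matrix xTesting laid out the same way as xTraining:

Ut = size(xTesting,1);                                            % xTesting is a hypothetical test matrix, one sample per row
xTesting = (xTesting - repmat(avgX,Ut,1))./repmat(sigma,Ut,1);    % reuse the training-set avgX and sigma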
Vanishing gradients: if $W^T X + b$ starts out very large or very small, the gradient of a sigmoid-like activation is close to 0, and after backpropagation the gradients of the earlier layers connected to it are also close to 0, so training becomes very slow.
Therefore, we want $W^T X + b$ to start out near zero.
A simple and effective way to achieve this is:
nn.W{k} = 2*rand(height, width)/sqrt(width)-1/sqrt(width);   % uniform in [-1/sqrt(width), 1/sqrt(width)]
nn.b{k} = 2*rand(height, 1)/sqrt(width)-1/sqrt(width);
Parameter initialization is an active research topic; related papers include:
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)
Basic idea: since we want the values produced by each layer to stay near 0 in order to avoid vanishing gradients, why not directly normalize the values of each layer using their mean and variance?
Each fully connected (FC) layer is followed by a batch normalization (BN) layer:
$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
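In the code below, this normalization is followed by a learned scale and shift (nn.Gamma and nn.Beta), which in the paper's notation is
$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.$$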
Algorithm flow:
Forward computation:
y = nn.W{k-1} * nn.a{k-1} + repmat(nn.b{k-1},1,m);      % linear (FC) layer output
if nn.batch_normalization
    % update the running mean E and running standard deviation S
    nn.E{k-1} = nn.E{k-1}*nn.vecNum + sum(y,2);
    nn.S{k-1} = nn.S{k-1}.^2*(nn.vecNum-1) + (m-1)*std(y,0,2).^2;
    nn.vecNum = nn.vecNum + m;
    nn.E{k-1} = nn.E{k-1}/nn.vecNum;
    nn.S{k-1} = sqrt(nn.S{k-1}/(nn.vecNum-1));
    % normalize, then scale and shift with the learned Gamma and Beta
    y = (y - repmat(nn.E{k-1},1,m))./repmat(nn.S{k-1}+0.0001*ones(size(nn.S{k-1})),1,m);
    y = nn.Gamma{k-1}*y+nn.Beta{k-1};
end;
switch nn.activaton_function
    case 'sigmoid'
        nn.a{k} = sigmoid(y);
    case 'tanh'
        nn.a{k} = tanh(y);
Backpropagation:
nn.theta{k} = ((nn.W{k}'*nn.theta{k+1})) .* nn.a{k} .* (1 - nn.a{k});   % error term (sigmoid derivative)
if nn.batch_normalization
    % recompute the normalized pre-activation, then the Gamma/Beta gradients
    x = nn.W{k-1} * nn.a{k-1} + repmat(nn.b{k-1},1,m);
    x = (x - repmat(nn.E{k-1},1,m))./repmat(nn.S{k-1}+0.0001*ones(size(nn.S{k-1})),1,m);
    temp = nn.theta{k}.*x;
    nn.Gamma_grad{k-1} = sum(mean(temp,2));
    nn.Beta_grad{k-1} = sum(mean(nn.theta{k},2));
    nn.theta{k} = nn.Gamma{k-1}*nn.theta{k}./repmat((nn.S{k-1}+0.0001),1,m);   % back through the BN scaling
end;
nn.W_grad{k-1} = nn.theta{k}*nn.a{k-1}'/m + nn.weight_decay*nn.W{k-1};   % weight gradient + weight decay
nn.b_grad{k-1} = sum(nn.theta{k},2)/m;                                   % bias gradient
cost2 = cost2 + sum(sum(nn.W{k-1}.^2));                                                    % accumulate squared weights for weight decay
nn.cost(s) = 0.5 / m * sum(sum((nn.a{k} - batch_y).^2)) + 0.5 * nn.weight_decay * cost2;   % quadratic cost plus the weight-decay term
Backpropagation:
nn.W_grad{k-1} = nn.theta{k}*nn.a{k-1}'/m + nn.weight_decay*nn.W{k-1};
if strcmp(nn.objective_function,'Cross Entropy')
    nn.cost(s) = -0.5*sum(sum(batch_y.*log(nn.a{k})))/m + 0.5 * nn.weight_decay * cost2;   % cross-entropy cost (scaled by 0.5) plus the weight-decay term
Backpropagation:
case 'softmax'
    y = nn.W{nn.depth-1} * nn.a{nn.depth-1} + repmat(nn.b{nn.depth-1},1,m);
    nn.theta{nn.depth} = nn.a{nn.depth} - batch_y;   % output-layer error for softmax + cross entropy
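The last line uses a standard simplification: with a softmax output layer and the cross-entropy objective, the error term of the output layer reduces to
$$\theta^{(\mathrm{depth})} = a^{(\mathrm{depth})} - y,$$
which is exactly what nn.theta{nn.depth} = nn.a{nn.depth} - batch_y computes.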
(1) Standard update (vanilla stochastic gradient descent)
nn.W{k} = nn.W{k} - nn.learning_rate*nn.W_grad{k};
nn.b{k} = nn.b{k} - nn.learning_rate*nn.b_grad{k};
Problems with SGD:
(1) The components of the gradient with respect to $(W, b)$ can differ greatly in absolute value; in some cases this forces the optimization path into a zigzag shape.
(2) SGD's gradient estimates are highly random: consecutive updates use completely different batches, so the optimization direction itself can look random.
$$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(x_i, y_i, W), \qquad \nabla L(W) = \frac{1}{N}\sum_{i=1}^{N} \nabla_W L_i(x_i, y_i, W)$$
Ways to deal with gradients of inconsistent magnitude across directions:
(1) AdaGrad
Building on stochastic gradient descent, AdaGrad records the accumulated squared gradient of each component and uses it to adjust the step size in each component direction. Specifically, it keeps the accumulator
$$G^k = \sum_{i=1}^{k} g_i \odot g_i$$
and performs the following iteration:
$$x^{k+1} = x^k - \frac{\alpha}{\sqrt{G^k} + \epsilon\,\mathbf{1}_n} \odot g^k, \qquad G^{k+1} = G^k + g^{k+1} \odot g^{k+1}.$$
if strcmp(nn.optimization_method, 'AdaGrad')
    nn.rW{k} = nn.rW{k} + nn.W_grad{k}.^2;     % accumulate squared gradients
    nn.rb{k} = nn.rb{k} + nn.b_grad{k}.^2;
    nn.W{k} = nn.W{k} - nn.learning_rate*nn.W_grad{k}./(sqrt(nn.rW{k})+0.001);
    nn.b{k} = nn.b{k} - nn.learning_rate*nn.b_grad{k}./(sqrt(nn.rb{k})+0.001);
(2) RMSProp
if strcmp(nn.optimization_method, 'RMSProp')
    nn.rW{k} = 0.9*nn.rW{k} + 0.1*nn.W_grad{k}.^2;     % exponential moving average of squared gradients
    nn.rb{k} = 0.9*nn.rb{k} + 0.1*nn.b_grad{k}.^2;
    nn.W{k} = nn.W{k} - nn.learning_rate*nn.W_grad{k}./(sqrt(nn.rW{k})+0.001);
    nn.b{k} = nn.b{k} - nn.learning_rate*nn.b_grad{k}./(sqrt(nn.rb{k})+0.001); %rho = 0.9
Addressing the randomness of the gradient direction:
(3) Momentum
if strcmp(nn.optimization_method, 'Momentum')
    nn.vW{k} = 0.5*nn.vW{k} + nn.learning_rate*nn.W_grad{k};   % velocity: moving average of past gradients
    nn.vb{k} = 0.5*nn.vb{k} + nn.learning_rate*nn.b_grad{k};
    nn.W{k} = nn.W{k} - nn.vW{k};
    nn.b{k} = nn.b{k} - nn.vb{k}; %rho = 0.5;
Addressing both problems at once:
(4) Adam
if strcmp(nn.optimization_method, 'Adam')
    nn.sW{k} = 0.9*nn.sW{k} + 0.1*nn.W_grad{k};            % first moment (momentum term)
    nn.sb{k} = 0.9*nn.sb{k} + 0.1*nn.b_grad{k};
    nn.rW{k} = 0.999*nn.rW{k} + 0.001*nn.W_grad{k}.^2;     % second moment (squared gradients)
    nn.rb{k} = 0.999*nn.rb{k} + 0.001*nn.b_grad{k}.^2;
    nn.W{k} = nn.W{k} - 10*nn.learning_rate*nn.sW{k}./sqrt(1000*nn.rW{k}+0.00001);
    nn.b{k} = nn.b{k} - 10*nn.learning_rate*nn.sb{k}./sqrt(1000*nn.rb{k}+0.00001); %rho1 = 0.9, rho2 = 0.999, delta = 0.00001
(1) Batch Normalization works well in practice; with it, training becomes much less sensitive to the learning rate and the parameter update strategy. If you use Batch Normalization, plain SGD as the update rule is usually enough; in my experience, adding the other strategies on top can even make things worse.
(2) Without Batch Normalization, similar results can still be reached by tuning the other hyperparameters appropriately.
(3) Because of the accumulated squared gradients, the AdaGrad, RMSProp and Adam update strategies become very slow late in training; raising the learning rate can compensate for this effect.
Source: Hu Haoji (Zhejiang University), "Machine Learning: Introduction to Artificial Neural Networks" (机器学习:人工神经网络介绍).