Question: how do we compute the mean $\bar{x}$ of a set of samples?
Answer:
Method 1:
$$\mathbb{E}[X] \approx \bar{x} := \frac{1}{N} \sum_{i=1}^N x_i$$
Problem: we must wait until all the data have been collected before we can take the average.
Method 2: an incremental, iterative approach.
Let $w_{k+1}$ denote the average of the first $k$ samples and $w_k$ the average of the first $k-1$ samples:
$$w_{k+1} = \frac{1}{k} \sum_{i=1}^{k} x_i, \quad k = 1, 2, \ldots$$
$$w_k = \frac{1}{k-1} \sum_{i=1}^{k-1} x_i, \quad k = 2, 3, \ldots$$
These two consecutive running averages are related:
$$w_{k+1} = \frac{1}{k} \sum_{i=1}^{k} x_i = \frac{1}{k}\left(\sum_{i=1}^{k-1} x_i + x_k\right) = \frac{1}{k}\bigl((k-1)w_k + x_k\bigr) = w_k - \frac{1}{k}\left(w_k - x_k\right).$$
Thus we obtain the recursion
$$w_{k+1} = w_k - \frac{1}{k}\left(w_k - x_k\right).$$
Verification by example: the recursion does reproduce the running averages, so we have an iterative algorithm for computing the mean:
$$\begin{aligned}
w_1 &= x_1, \\
w_2 &= w_1 - \frac{1}{1}\left(w_1 - x_1\right) = x_1, \\
w_3 &= w_2 - \frac{1}{2}\left(w_2 - x_2\right) = x_1 - \frac{1}{2}\left(x_1 - x_2\right) = \frac{1}{2}\left(x_1 + x_2\right), \\
w_4 &= w_3 - \frac{1}{3}\left(w_3 - x_3\right) = \frac{1}{3}\left(x_1 + x_2 + x_3\right), \\
&\;\;\vdots \\
w_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} x_i.
\end{aligned}$$
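As a sanity check, here is a minimal Python sketch of this iterative mean algorithm; the Gaussian sample stream and its length are illustrative assumptions, and the final iterate matches the batch average of Method 1.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=1000)  # assumed sample stream

w = 0.0  # w_1 is arbitrary: w_2 = w_1 - (w_1 - x_1) = x_1 regardless
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)  # w_{k+1} = w_k - (1/k)(w_k - x_k)

print("incremental mean:", w)               # equals the batch mean
print("batch mean      :", samples.mean())  # Method 1, for comparison
```

Each update touches only the newest sample, so an estimate is available at every step instead of only after all data have been collected.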
Stochastic approximation (SA):
The Robbins-Monro (RM) algorithm
Problem to solve: find the root of
$$g(w) = 0$$
Iterative computation:
$$w_{k+1} = w_k - a_k \tilde{g}\left(w_k, \eta_k\right), \quad k = 1, 2, 3, \ldots$$
Here $g(w)$ is a black box: the algorithm feeds in the iterates $\left\{w_k\right\}$ and receives the noisy measurements $\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}$.
Example: $g(w) = \tanh(w - 1) = 0$.
Parameters: $w_1 = 3$, $a_k = 1/k$, $\eta_k \equiv 0$.
RM algorithm: $w_{k+1} = w_k - a_k\, g\left(w_k\right)$ (noise-free, since $\eta_k \equiv 0$).
Simulation result (figure omitted):
We observe that $w_{k+1}$ is closer to $w^*$ than $w_k$: since $w_k > w^* = 1$ implies $g(w_k) > 0$, we get $w_{k+1} = w_k - a_k g\left(w_k\right) < w_k$, so $w_{k+1}$ is smaller than $w_k$ and closer to $w^*$.
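A minimal sketch of this RM iteration under the stated parameters ($w_1 = 3$, $a_k = 1/k$, $\eta_k \equiv 0$, so the measurements are noise-free); the number of iterations is an arbitrary choice.

```python
import numpy as np

def g(w):
    """Root-finding target g(w) = tanh(w - 1), with root w* = 1."""
    return np.tanh(w - 1.0)

w = 3.0  # w_1 = 3
for k in range(1, 51):
    w = w - (1.0 / k) * g(w)  # w_{k+1} = w_k - a_k * g(w_k)

print("w after 50 iterations:", w)  # drifts toward the root w* = 1
```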
Now consider the optimization problem
$$\min_w \; J(w) = \mathbb{E}[f(w, X)].$$
Gradient descent (GD) solves it via
$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}\left[f\left(w_k, X\right)\right] = w_k - \alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right].$$
Problem: how do we compute the gradient of the expectation?
$$\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_w f\left(w_k, x_i\right),$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f\left(w_k, x_i\right).$$
Problem: every update at step $k$ requires collecting all $n$ samples.
Stochastic gradient descent (SGD) instead uses a single sample per update:
$$w_{k+1} = w_k - \alpha_k \nabla_w f\left(w_k, x_k\right),$$
Example:
$$\min_w \; J(w) = \mathbb{E}[f(w, X)] = \mathbb{E}\left[\frac{1}{2}\|w - X\|^2\right],$$
where $f(w, X) = \|w - X\|^2 / 2$ and $\nabla_w f(w, X) = w - X$.
Question 1: is the optimal solution $w^* = \mathbb{E}[X]$?
$$\nabla_w J(w) = 0 \;\Rightarrow\; \mathbb{E}[\underbrace{\nabla_w f(w, X)}_{w - X}] = 0 \;\Rightarrow\; \mathbb{E}[w - X] = 0 \;\Rightarrow\; w = \mathbb{E}[X]$$
Question 2: write the GD algorithm for this problem.
$$w_{k+1} = w_k - \alpha_k \nabla_w J\left(w_k\right) = w_k - \alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] = w_k - \alpha_k \mathbb{E}\left[w_k - X\right]$$
Question 3: write the SGD algorithm for this problem.
$$w_{k+1} = w_k - \alpha_k \nabla_w f\left(w_k, x_k\right) = w_k - \alpha_k\left(w_k - x_k\right)$$
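A minimal sketch of this SGD iteration; the 2-D Gaussian samples are an illustrative assumption. With $\alpha_k = 1/k$ it coincides exactly with the incremental mean estimator from the beginning of these notes.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=[2.0, -3.0], scale=1.0, size=(5000, 2))  # assumed data

w = np.zeros(2)
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)  # alpha_k = 1/k, grad f(w, x_k) = w - x_k

print("SGD estimate of E[X]:", w)
print("sample mean         :", samples.mean(axis=0))  # the two coincide
```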
From GD to SGD, the true gradient is replaced by a sampled one:
$$w_{k+1} = w_k - \alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]$$
$$\Downarrow$$
$$w_{k+1} = w_k - \alpha_k \nabla_w f\left(w_k, x_k\right)$$
This is where the name comes from: the distribution of $X$ is unknown, so the expectation is approximated by random sampling. The sampled gradient $\nabla_w f\left(w_k, x_k\right)$ is called the stochastic gradient, while $\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]$ is the true gradient.
$$\nabla_w f\left(w_k, x_k\right) = \mathbb{E}\left[\nabla_w f(w, X)\right] + \underbrace{\nabla_w f\left(w_k, x_k\right) - \mathbb{E}\left[\nabla_w f(w, X)\right]}_{\eta}$$
Since the stochastic gradient only approximates the true gradient, there is an error term $\eta$.
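A quick numeric check of this decomposition for the running example $\nabla_w f(w, X) = w - X$: over many samples the noise $\eta$ averages to zero, i.e., the stochastic gradient is an unbiased estimate of the true gradient. The distribution of $X$ and the query point $w$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(loc=3.0, scale=1.0, size=100_000)  # assumed samples, E[X] = 3

w = 0.5
true_grad = w - 3.0           # true gradient: E[w - X] = w - E[X]
etas = (w - xs) - true_grad   # per-sample noise eta

print("mean of eta:", etas.mean())  # ~0: the stochastic gradient is unbiased
```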
Question: given that $\nabla_w f\left(w_k, x_k\right) \neq \mathbb{E}\left[\nabla_w f(w, X)\right]$, does SGD still satisfy $w_k \rightarrow w^*$ as $k \rightarrow \infty$?
Answer: yes. SGD is a special case of the RM algorithm, so the convergence results for RM apply to SGD as well.
Question: the stochastic gradient is random, so the approximation may be inaccurate. Is the convergence of SGD therefore slow or erratic?
Answer: define the relative error between the stochastic gradient and the true gradient:
$$\delta_k \doteq \frac{\left|\nabla_w f\left(w_k, x_k\right) - \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}$$
Since $\mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right] = 0$, we can rewrite this as:
$$\delta_k = \frac{\left|\nabla_w f\left(w_k, x_k\right) - \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] - \mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right]\right|} = \frac{\left|\nabla_w f\left(w_k, x_k\right) - \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\left(w_k - w^*\right)\right]\right|}$$
Here the first equality substitutes $\mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right] = 0$ into the denominator, and the second simplifies it using the mean value theorem, where $\tilde{w}_k$ lies between $w_k$ and $w^*$. Suppose that $\nabla_w^2 f \geq c > 0$ for all $w$ and $X$. Then:
$$\left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\left(w_k - w^*\right)\right]\right| = \left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\right]\left(w_k - w^*\right)\right| = \left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\right]\right|\left|w_k - w^*\right| \geq c\left|w_k - w^*\right|$$
Substituting this bound into the denominator gives:
$$\delta_k \leq \frac{\left|\nabla_w f\left(w_k, x_k\right) - \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{c\left|w_k - w^*\right|}$$
Let us analyze this bound:
$$\delta_k \leq \frac{|\overbrace{\nabla_w f\left(w_k, x_k\right)}^{\text{stochastic gradient}} - \overbrace{\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]}^{\text{true gradient}}|}{\underbrace{c\left|w_k - w^*\right|}_{\text{distance to the optimal solution}}}.$$
The denominator is the distance to the optimal solution: when $w_k$ is far from $w^*$, the relative error $\delta_k$ is small and SGD behaves much like GD; as $w_k$ approaches $w^*$, $\delta_k$ can grow and the iterates show more randomness. Suppose the samples lie in a 20×20 region; the final convergence follows this pattern (figure omitted; a sketch reproducing the experiment follows).
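A hypothetical reconstruction of that experiment (the original figure is not recoverable, so the starting point and printout schedule are assumptions): samples drawn uniformly from a 20×20 square, with SGD mean estimation started far from $w^*$. The distance collapses quickly at first, GD-like, then improves slowly and noisily near $w^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.uniform(-10.0, 10.0, size=(200, 2))  # 20x20 square around origin
w_star = samples.mean(axis=0)                      # optimum: the sample mean

w = np.array([20.0, 20.0])  # start deliberately far from w*
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)  # SGD step for mean estimation
    if k in (1, 2, 5, 20, 200):
        print(f"k={k:3d}  |w - w*| = {np.linalg.norm(w - w_star):.3f}")
```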
Objective function: $J(w) = \mathbb{E}[f(w, X)]$. Samples of the random variable $X$: $\left\{x_i\right\}_{i=1}^n$.
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f\left(w_k, x_i\right), \quad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f\left(w_k, x_j\right), \quad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k \nabla_w f\left(w_k, x_k\right). \quad \text{(SGD)}$$
MBGD subsumes BGD and SGD: when $m$ is small it behaves like SGD, and when $m$ is large it behaves like BGD.
If $m = 1$, MBGD reduces to SGD.
If $m = n$, MBGD is still not identical to BGD, because MBGD draws its $m$ samples at random from all $n$ samples (with replacement), so some samples may never be drawn; see the sketch below.
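A minimal sketch contrasting one update of BGD, MBGD, and SGD for the running example $\nabla_w f(w, x) = w - x$; the data, step size, and batch size $m$ are illustrative assumptions. The mini-batch indices $\mathcal{I}_k$ are drawn with replacement, which is exactly why $m = n$ does not recover BGD.

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(size=(100, 2))  # n = 100 assumed samples
w = np.zeros(2)
alpha, m = 0.1, 10

grad_bgd = (w - xs).mean(axis=0)             # BGD: average over all n samples
idx = rng.integers(0, len(xs), size=m)       # I_k: m indices, with replacement
grad_mbgd = (w - xs[idx]).mean(axis=0)       # MBGD: average over the mini-batch
grad_sgd = w - xs[rng.integers(0, len(xs))]  # SGD: one random sample

print("BGD step :", w - alpha * grad_bgd)
print("MBGD step:", w - alpha * grad_mbgd)
print("SGD step :", w - alpha * grad_sgd)
```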
Mean estimation: use a set of samples $\left\{x_k\right\}$ to estimate the mean $\mathbb{E}[X]$:
$$w_{k+1} = w_k - \frac{1}{k}\left(w_k - x_k\right).$$
RM iteration: use noisy measurements $\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}$ to solve $g(w) = 0$:
$$w_{k+1} = w_k - a_k \tilde{g}\left(w_k, \eta_k\right)$$
SGD iteration: use gradient samples $\left\{\nabla_w f\left(w_k, x_k\right)\right\}$ to minimize $J(w) = \mathbb{E}[f(w, X)]$:
$$w_{k+1} = w_k - \alpha_k \nabla_w f\left(w_k, x_k\right)$$