The setup is actor + environment + reward function. The environment and the reward function are not under our control; the only thing we can change is the actor. The policy $\pi$ is a network with parameters $\theta$: its input is the current observation and its output is the action to take. In a game, for example, the input is the game frame $s_1$ and the output is the chosen action $a_1$. Once $a_1$ has been taken, the actor receives the corresponding reward $r_1$ and the screen changes accordingly to $s_2$. Repeating this process yields a trajectory $\tau = \{s_1,a_1,s_2,a_2,\cdots,s_T,a_T\}$.
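As a concrete picture of this interaction loop, here is a minimal sketch that rolls out one trajectory. The gym-style `env.reset()`/`env.step()` interface and the shape of the hypothetical `policy` network are assumptions for illustration, not part of the original notes.

```python
import torch
import torch.nn as nn

# Hypothetical policy network: observation (4-dim) -> logits over 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def rollout(env, policy):
    """Collect one trajectory tau = {s_1, a_1, ..., s_T, a_T} plus the per-step rewards."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False                       # assumed gym-like API
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        a = torch.distributions.Categorical(logits=logits).sample()
        s_next, r, done, _ = env.step(a.item())
        states.append(s); actions.append(a.item()); rewards.append(r)
        s = s_next
    return states, actions, rewards
```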
With the network parameters fixed, the probability of a particular trajectory is
$$p_\theta(\tau) = p(s_1)p_\theta(a_1|s_1)p(s_2|s_1,a_1)p_\theta(a_2|s_2)\cdots = p(s_1)\prod_{t = 1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t).$$
The reward collected along one trajectory is $R(\tau)=\sum_{t = 1}^{T}r_t$. The goal is to adjust the network parameters so that the expected reward is large: $\overline{R}_\theta = \sum_\tau R(\tau)\,p_\theta(\tau)$.
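To make these two quantities concrete, the following sketch evaluates $R(\tau)$ and the $\theta$-dependent part of $\log p_\theta(\tau)$ for one collected trajectory (the transition terms $p(s_{t+1}|s_t,a_t)$ are left out because they do not involve $\theta$). The names `policy`, `states`, `actions`, and `rewards` follow the hypothetical rollout sketch above.

```python
import torch

def trajectory_stats(policy, states, actions, rewards):
    """Return R(tau) and the theta-dependent part of log p_theta(tau)."""
    R_tau = sum(rewards)                                 # R(tau) = sum_t r_t
    obs = torch.as_tensor(states, dtype=torch.float32)
    acts = torch.as_tensor(actions)
    dist = torch.distributions.Categorical(logits=policy(obs))
    log_p_tau = dist.log_prob(acts).sum()                # sum_t log p_theta(a_t | s_t)
    return R_tau, log_p_tau
```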
How do we optimize $\theta$? Since we want to maximize $\overline{R}_\theta$, we use gradient ascent:
$$\nabla \overline{R}_\theta=\sum_\tau R(\tau)\nabla p_\theta(\tau) = \sum_\tau R(\tau)p_\theta(\tau)\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}=\sum_\tau R(\tau)p_\theta(\tau)\nabla \log p_\theta(\tau) = E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)] \approx \frac{1}{N}\sum_{n = 1}^N R(\tau^n)\nabla\log p_\theta(\tau^n) = \frac{1}{N}\sum_{n = 1}^N\sum_{t = 1}^{T_n}R(\tau^n)\nabla\log p_\theta(a^n_t|s_t^n),$$
where the last step uses the fact that the environment terms $p(s_1)$ and $p(s_{t+1}|s_t,a_t)$ inside $\log p_\theta(\tau)$ do not depend on $\theta$. The parameters are then updated by $\theta\leftarrow \theta + \eta\nabla \overline{R}_\theta$. To obtain training data, we play the game with the current network to collect different trajectories, record the $(s_i^t, a_i^t, r_i)$ data tuples, compute the gradient, update the parameters, and then sample trajectories again.
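Putting the estimator and the update rule together, here is a minimal REINFORCE-style sketch in PyTorch. It reuses the hypothetical `rollout` from the earlier sketch, takes an `optimizer` such as `torch.optim.SGD(policy.parameters(), lr=eta)`, and implements gradient ascent by minimizing the negative of the reward-weighted log-likelihood.

```python
import torch

def policy_gradient_step(env, policy, optimizer, N=8):
    """One update: theta <- theta + eta * grad R_bar (via minimizing the negative)."""
    loss = 0.0
    for _ in range(N):
        states, actions, rewards = rollout(env, policy)   # rollout() from the sketch above
        R_tau = sum(rewards)                              # whole-trajectory reward R(tau^n)
        obs = torch.as_tensor(states, dtype=torch.float32)
        acts = torch.as_tensor(actions)
        log_probs = torch.distributions.Categorical(logits=policy(obs)).log_prob(acts)
        loss = loss - R_tau * log_probs.sum() / N         # -(1/N) sum_n R(tau^n) sum_t log p_theta(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```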
Essentially this can be viewed as a classification problem: given the input $s_i^t$, the network should output $a_i^t$ so that the reward $r_i$ is large, where $r_i$ is the reward of the whole game. The objective is therefore a reward-weighted log-likelihood, and we want this weighted likelihood to be as large as possible: whenever the input $s_i^t$ led to a large reward, the probability of the corresponding $a_i^t$ is pushed up, just as classification pushes up the probability of the correct class.
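The classification analogy can be made literal: the reward-weighted log-likelihood is just a cross-entropy loss with a per-sample weight. A minimal sketch, assuming a discrete action space and per-step tensors `obs`, `acts`, and weights `returns` (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def weighted_nll_loss(policy, obs, acts, returns):
    """Reward-weighted negative log-likelihood == weighted cross-entropy.

    obs:     (B, obs_dim) float tensor of states s_i^t
    acts:    (B,) long tensor of actions a_i^t, treated as class labels
    returns: (B,) float tensor, the reward weight attached to each sample
    """
    logits = policy(obs)
    nll = F.cross_entropy(logits, acts, reduction="none")  # -log p_theta(a | s)
    return (returns * nll).mean()                          # minimize == maximize weighted likelihood
```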
Right now the reward is assigned at trajectory granularity, but not every action in a trajectory is necessarily good, so different steps should receive different credit. The weight for step $t$ therefore changes from
$$R(\tau^n)\;\rightarrow\; \sum_{t'=t}^{T_n}r_{t'}^n\;\rightarrow\;\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n \quad (\text{decayed exponentially over time}),$$
and after subtracting a baseline it is written as the advantage $A^\theta(s_t,a_t)$.
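A minimal sketch of this credit assignment: compute the discounted reward-to-go for each step, then subtract a baseline. The baseline here is just the batch mean, a stand-in assumption since the notes do not specify how the baseline is obtained (in practice it is often a learned value function).

```python
import torch

def advantages(rewards, gamma=0.99):
    """rewards: list of r_t for one trajectory -> per-step credit A(s_t, a_t)."""
    G, to_go = 0.0, []
    for r in reversed(rewards):              # sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}
        G = r + gamma * G
        to_go.append(G)
    to_go = torch.tensor(list(reversed(to_go)))
    return to_go - to_go.mean()              # subtract a (crude) baseline
```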
Importance sampling: $E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i = 1}^N f(x^i)$ when the $x^i$ are sampled from $p$. But now we cannot sample from $p$ and can only sample from $q(x)$, so we rewrite
$$E_{x\sim p}[f(x)] = \int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx = E_{x\sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right].$$
In other words, we apply a correction by multiplying with $\frac{p(x)}{q(x)}$, the importance weight. The catch is that $p$ and $q$ must not differ too much: the two expectations are equal in theory, but when $p$ and $q$ are far apart the variance of the reweighted estimator blows up and a finite sample gives a poor estimate.
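A small numerical sketch of this effect (the Gaussians and $f(x)=x^2$ are made up for illustration): both estimators target the same $E_{x\sim p}[f(x)]$, but when $q$ is far from $p$ the importance-weighted estimate becomes very noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
p_mean, q_mean, N = 0.0, 3.0, 10_000

def gauss_pdf(x, mu):                 # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x_p = rng.normal(p_mean, 1.0, N)      # sample directly from p
x_q = rng.normal(q_mean, 1.0, N)      # sample from q, then reweight by p/q
direct = f(x_p).mean()
weighted = (f(x_q) * gauss_pdf(x_q, p_mean) / gauss_pdf(x_q, q_mean)).mean()
print(direct, weighted)               # both estimate 1.0, but the IS estimate is far noisier
```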
Applying importance sampling (sampling from the old policy $\theta'$ and using the per-step advantage as above), the gradient becomes
$$\nabla \overline R_\theta = E_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)\right]=E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)\right]=E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)\right]=E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)\right],$$
where the last step drops the ratio $\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}$ on the assumption that the state distributions of the two policies are close. Using $\nabla f(x) = f(x)\nabla\log f(x)$, we can work backwards from this gradient to the objective being optimized:
$$J^{\theta'}(\theta) =E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)\right].$$
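In code, this surrogate objective is just the probability ratio times the advantage, with the old policy's log-probabilities detached so that only $\theta$ receives gradients. A minimal sketch, where `policy` (current, $\theta$) and `old_policy` ($\theta'$) are assumed networks over a discrete action space:

```python
import torch

def surrogate_objective(policy, old_policy, obs, acts, adv):
    """J^{theta'}(theta) = E[(p_theta / p_theta') * A^{theta'}] over old-policy samples."""
    logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(acts)
    with torch.no_grad():                      # theta' is fixed data, no gradient
        logp_old = torch.distributions.Categorical(logits=old_policy(obs)).log_prob(acts)
    ratio = torch.exp(logp - logp_old)         # p_theta(a|s) / p_theta'(a|s)
    return (ratio * adv).mean()                # maximize this (or minimize its negative)
```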
PPO adds a term to keep $p_\theta$ and $p_{\theta'}$ from drifting too far apart:
$$J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta)-\beta\, KL(\theta,\theta'),$$
where $\beta$ is adjusted adaptively: if $KL(\theta,\theta')>KL_{max}$, increase $\beta$; if $KL(\theta,\theta')<KL_{min}$, decrease $\beta$.
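A minimal sketch of this adaptive-KL variant (PPO-penalty): the KL term is estimated on the sampled states, and the thresholds as well as the factor of 2 used to scale $\beta$ are illustrative choices, not taken from the notes.

```python
import torch

def ppo_penalty_step(policy, old_policy, optimizer, obs, acts, adv,
                     beta, kl_min=0.003, kl_max=0.03):
    """One update on J_PPO = J^{theta'}(theta) - beta * KL(theta, theta'); returns new beta."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    with torch.no_grad():
        dist_old = torch.distributions.Categorical(logits=old_policy(obs))
    ratio = torch.exp(dist.log_prob(acts) - dist_old.log_prob(acts))
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()
    loss = -(ratio * adv).mean() + beta * kl   # minimize the negative penalized objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if kl.item() > kl_max:                     # policies drifting apart: penalize more
        beta *= 2.0
    elif kl.item() < kl_min:                   # too conservative: penalize less
        beta *= 0.5
    return beta
```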