The game model is as follows:
The policy network takes a state $s$ as input and outputs a probability distribution over actions $a$: $\pi(a|s)$.
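As a concrete illustration (my own sketch, not part of the original derivation), a minimal discrete-action policy network in PyTorch; the state dimension, action count, and hidden width are hypothetical placeholders:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state s to the distribution pi(a|s) over a discrete action set."""

    def __init__(self, state_dim: int = 4, action_dim: int = 2, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Turn the logits into a categorical distribution, i.e. pi(.|s).
        return torch.distributions.Categorical(logits=self.net(state))

# Sample an action a ~ pi(a|s) for a single (hypothetical) state.
policy = PolicyNetwork()
s = torch.randn(4)
dist = policy(s)
a = dist.sample()
log_prob = dist.log_prob(a)   # log pi(a|s)
```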
Running multiple training rollouts yields trajectories of the following form:
$$
\begin{bmatrix}
s_{11}\,a_{11}\,r_{11} & \dots & s_{1t}\,a_{1t}\,r_{1t} & \dots & s_{1T}\,a_{1T}\,r_{1T} \\
\vdots & & \vdots & & \vdots \\
s_{n1}\,a_{n1}\,r_{n1} & \dots & s_{nt}\,a_{nt}\,r_{nt} & \dots & s_{nT}\,a_{nT}\,r_{nT} \\
\vdots & & \vdots & & \vdots \\
s_{N1}\,a_{N1}\,r_{N1} & \dots & s_{Nt}\,a_{Nt}\,r_{Nt} & \dots & s_{NT}\,a_{NT}\,r_{NT}
\end{bmatrix}
$$
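For context (again my own sketch, assuming a Gymnasium-style environment with `reset()`/`step()` and the `PolicyNetwork` above), one way such a batch of $N$ trajectories of $(s_t, a_t, r_t)$ tuples could be collected:

```python
import torch

def collect_trajectories(env, policy, num_trajectories, max_steps):
    """Roll the policy out num_trajectories times, recording (s_t, a_t, r_t) per step."""
    trajectories = []
    for _ in range(num_trajectories):
        steps = []
        state, _ = env.reset()                       # Gymnasium API: (obs, info)
        for _ in range(max_steps):
            s = torch.as_tensor(state, dtype=torch.float32)
            action = policy(s).sample().item()       # a_t ~ pi(a|s_t)
            next_state, reward, terminated, truncated, _ = env.step(action)
            steps.append((state, action, reward))    # one (s_t, a_t, r_t) entry
            state = next_state
            if terminated or truncated:
                break
        trajectories.append(steps)
    return trajectories
```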
A policy trajectory $\tau = s_1 a_1, s_2 a_2, \dots, s_T a_T$ occurs with probability
$$
\begin{aligned}
P(\tau) &= P(s_1 a_1, s_2 a_2, \dots, s_T a_T) \\
&= P(s_1)\,\pi(a_1|s_1)\,P(s_2|s_1,a_1)\,\pi(a_2|s_2)\,P(s_3|s_1,a_1,s_2,a_2)\cdots \\
&= P(s_1)\prod_{t=1}^{T-1}\pi(a_t|s_t)\,P(s_{t+1}|s_1,a_1,\dots,s_t,a_t)
\end{aligned}
$$
By the Markov property, this simplifies to:
$$
P(\tau) = P(s_1)\prod_{t=1}^{T-1}\pi(a_t|s_t)\,P(s_{t+1}|s_t,a_t)
$$
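A practical note (my own addition): taking the logarithm turns this product into a sum, where $\log P(s_1)$ and $\log P(s_{t+1}|s_t,a_t)$ are determined by the environment, while the $\log \pi(a_t|s_t)$ terms come from the policy network. The policy's contribution for one recorded trajectory can be computed with the sketch below, reusing the hypothetical `PolicyNetwork` from above:

```python
import torch

def policy_log_prob(policy, trajectory):
    """Sum of log pi(a_t|s_t) over one trajectory of (s_t, a_t, r_t) tuples.

    The environment terms log P(s_1) and log P(s_{t+1}|s_t, a_t) of log P(tau)
    are fixed by the environment dynamics and are not computed here.
    """
    total = torch.tensor(0.0)
    for state, action, _ in trajectory:
        s = torch.as_tensor(state, dtype=torch.float32)
        total = total + policy(s).log_prob(torch.tensor(action))
    return total
```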
Each time the agent interacts with the environment, it receives a delayed reward:
$$
r_t = r(s_t, a_t)
$$
A single interaction trajectory