τ ∼ p(τ) is the trajectory distribution
t ∈ [0, T-1] indexes the steps within a trajectory
the policy π is a probability distribution over actions a
V^{\pi}(s_{t}) = E_{\tau \sim p(\tau)} [ R(\tau_{t:T}) \,|\, \tau_{s_{t}} = s_{t} ]
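As a concrete illustration, V^π(s_t) can be approximated by Monte Carlo sampling: roll out many trajectories from s_t under π and average their returns R(τ_{t:T}). The sketch below is not from the original text; the env.reset_to / env.step / policy interfaces are hypothetical placeholders.

```python
def mc_value_estimate(env, policy, s_t, gamma=0.99, n_episodes=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s_t): average discounted return over rollouts.

    Assumes a hypothetical env with reset_to(state) and step(action) -> (next_state, reward, done),
    and a policy(state) -> action callable; these interfaces are illustrative only.
    """
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset_to(s_t)          # start every rollout from the queried state s_t
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)              # a ~ pi(a | s)
            s, r, done = env.step(a)
            ret += discount * r        # accumulate r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_episodes          # empirical mean approximates E_{tau~p(tau)}[R(tau_{t:T})]
```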
V^{\pi}(s_{t}) = E_{\tau \sim p(\tau)} [ r(s_{t}) + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots ]
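The discounted sum inside the expectation can be computed directly from a recorded reward sequence; the helper below is a minimal sketch (the reward list and gamma value are made-up examples).

```python
def discounted_return(rewards, gamma=0.99):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one trajectory."""
    ret = 0.0
    # iterate backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# example with made-up rewards: 1 + 0.99*0 + 0.99^2*2 = 2.9602
print(discounted_return([1.0, 0.0, 2.0], gamma=0.99))
```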
Bellman equation for the V(s_t) function:
V^{\pi}(s_{t}) = E_{\tau \sim p(\tau)} [ r(s_{t}) + \gamma V^{\pi}(s_{t+1}) ]
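The Bellman equation turns value estimation into a fixed-point problem. A minimal tabular sketch of iterative policy evaluation is shown below, assuming (purely for illustration) that the per-policy transition matrix P and reward vector r are known; neither quantity is given in the original text.

```python
import numpy as np

def policy_evaluation(P, r, gamma=0.99, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman backup
    V(s) <- r(s) + gamma * sum_{s'} P[s, s'] * V(s') until convergence.

    P is the |S| x |S| state-transition matrix under the fixed policy pi,
    r is the |S| reward vector; both are assumed known for this illustration.
    """
    V = np.zeros(len(r))
    while True:
        V_new = r + gamma * P @ V      # one Bellman backup for every state at once
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# tiny 2-state example with made-up numbers
P = np.array([[0.9, 0.1],
              [0.8, 0.2]])
r = np.array([1.0, 0.0])
print(policy_evaluation(P, r))
```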
It is defined as the expected return the agent can obtain when the environment is in state s_t and the agent, controlled by policy π, executes action a_t:
Q^{\pi}(s_{t}, a_{t}) = E_{\tau \sim p(\tau)} [ R(\tau_{t:T}) \,|\, \tau_{a_{t}} = a_{t}, \tau_{s_{t}} = s_{t} ]
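Analogously to V^π, Q^π(s_t, a_t) can be estimated by Monte Carlo rollouts in which the first action is pinned to a_t and all later actions are sampled from π. The sketch reuses the same hypothetical env/policy interfaces as the earlier V estimate.

```python
def mc_q_estimate(env, policy, s_t, a_t, gamma=0.99, n_episodes=1000, horizon=200):
    """Monte Carlo estimate of Q^pi(s_t, a_t): like the V estimate, but the first
    action is fixed to a_t; subsequent actions follow pi. Interfaces are illustrative."""
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset_to(s_t)
        a = a_t                          # condition the rollout on the queried action
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)                # later actions are sampled from pi
        total += ret
    return total / n_episodes
```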
Q^{\pi}(s_{t}, a_{t}) = E_{\tau \sim p(\tau)} [ r(s_{t}, a_{t}) + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots ]
Q^{\pi}(s_{t}, a_{t}) = E_{\tau \sim p(\tau)} [ r(s_{t}, a_{t}) + \gamma V^{\pi}(s_{t+1}) ]
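This last relation means that, in the tabular case, Q^π can be built directly from V^π once the model is known: Q(s,a) = r(s,a) + γ Σ_{s'} P(s'|s,a) V(s'). The sketch below assumes a known transition tensor P_sa and reward table r_sa (both introduced here only for illustration, not in the original text); V could come from the policy_evaluation sketch above.

```python
import numpy as np

def q_from_v(P_sa, r_sa, V, gamma=0.99):
    """Build Q^pi from V^pi via Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s').

    P_sa has shape (|S|, |A|, |S|) and r_sa has shape (|S|, |A|); both are assumed
    known for this tabular illustration.
    """
    return r_sa + gamma * P_sa @ V       # matmul sums over s'; result has shape (|S|, |A|)
```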
There are two random variables here: s_t and a_t.