As mentioned in earlier chapters, on-policy algorithms have relatively low sample efficiency, so we usually prefer off-policy algorithms. However, although DDPG is an off-policy algorithm, its training is quite unstable: it converges poorly, is sensitive to hyperparameters, and has trouble adapting to different, complex environments. In 2018, a more stable off-policy algorithm, Soft Actor-Critic (SAC), was proposed. SAC's predecessor is Soft Q-learning; both belong to the family of maximum-entropy reinforcement learning. Soft Q-learning does not maintain an explicit policy function; instead it uses a Boltzmann distribution over the Q function, which is cumbersome to sample from in continuous action spaces. SAC therefore introduces an Actor network to represent the policy, solving this problem. Among model-free reinforcement learning algorithms, SAC is currently very efficient: it learns a stochastic policy and achieves leading results in many standard environments.
Entropy measures the randomness of a random variable. Concretely, if X is a random variable with probability density function p, its entropy H is defined as
$$H(X) = \mathbb{E}_{x \sim p}\left[-\log p(x)\right]$$
In reinforcement learning, we can use $H(\pi(\cdot|s))$ to measure how random the policy $\pi$ is at state $s$.
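As a quick illustration (not from the original text), the sketch below computes the entropy of a small categorical policy in PyTorch; the probabilities are made-up numbers.

```python
import torch

# Hypothetical action probabilities of a policy pi(.|s) over 3 discrete actions
probs = torch.tensor([0.7, 0.2, 0.1])

# H(pi(.|s)) = E_{a~pi}[-log pi(a|s)]
entropy = -(probs * probs.log()).sum()
print(entropy)  # ~0.802; a uniform distribution over 3 actions would give log(3) ~ 1.099

# torch.distributions offers the same computation
print(torch.distributions.Categorical(probs).entropy())
```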
The idea of maximum-entropy reinforcement learning is to maximize not only the cumulative reward but also the randomness of the policy. To this end, an entropy regularization term is added to the reinforcement learning objective, which becomes
$$\pi^* = \arg\max_\pi \mathbb{E}_\pi\left[\sum_t r(s_t, a_t) + \alpha H\left(\pi(\cdot|s_t)\right)\right]$$
where $\alpha$ is a regularization coefficient that controls the relative importance of the entropy term.
Entropy regularization increases the amount of exploration: the larger $\alpha$ is, the more exploratory the policy, which helps speed up subsequent policy learning and reduces the chance of getting stuck in a poor local optimum. The difference between conventional reinforcement learning and maximum-entropy reinforcement learning is shown in Figure 14-1.
In the maximum-entropy framework, since the objective has changed, several related definitions change accordingly. First, consider the Soft Bellman equation:
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V(s_{t+1})\right]$$
where the state value function is written as
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)\right] = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t)\right] + \alpha H\left(\pi(\cdot|s_t)\right)$$
Based on this Soft Bellman equation, soft policy evaluation converges to the soft $Q$ function of policy $\pi$ when the state and action spaces are finite. The policy can then be improved according to the following soft policy improvement rule:
$$\pi_{\mathrm{new}} = \arg\min_{\pi'} D_{KL}\left(\pi'(\cdot|s)\,\Big\|\, \frac{\exp\left(\frac{1}{\alpha} Q^{\pi_{\mathrm{old}}}(s, \cdot)\right)}{Z^{\pi_{\mathrm{old}}}(s, \cdot)}\right)$$
By repeatedly alternating soft policy evaluation and soft policy improvement, the policy eventually converges to the optimal policy of the maximum-entropy objective. However, this soft policy iteration only applies to the tabular setting, i.e., when the state and action spaces are finite. In continuous spaces, we need to approximate this iteration with a parameterized Q function and a parameterized policy $\pi$.
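Before moving to function approximation, the following sketch illustrates tabular soft policy iteration on a tiny randomly generated MDP. It is only an illustration under assumed quantities (the transition tensor `P`, reward `r`, and temperature `alpha` are all made up), not code from the book.

```python
import numpy as np

# A tiny random MDP (hypothetical, for illustration only)
rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 4, 2, 0.9, 0.5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
r = rng.normal(size=(nS, nA))                  # reward r(s, a)

pi = np.full((nS, nA), 1.0 / nA)               # start from the uniform policy
for _ in range(100):                           # alternate evaluation and improvement
    Q = np.zeros((nS, nA))
    for _ in range(200):                       # soft policy evaluation
        V = (pi * (Q - alpha * np.log(pi))).sum(axis=1)   # soft state value
        Q = r + gamma * P @ V                             # soft Bellman backup
    # Soft policy improvement: pi(a|s) proportional to exp(Q(s, a) / alpha)
    logits = Q / alpha
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print(pi)  # a stochastic policy biased toward higher soft Q values
```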
The SAC algorithm models two action value functions $Q$ (with parameters $\omega_1$ and $\omega_2$) and one policy function $\pi$ (with parameters $\theta$). Following the idea of Double DQN, SAC maintains two Q networks and always uses the smaller of the two Q values, which mitigates overestimation of Q values. The loss function of either Q function is:
$$L_Q(\omega) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim R}\left[\frac{1}{2}\Big(Q_\omega(s_t, a_t) - \big(r_t + \gamma V_{\omega^-}(s_{t+1})\big)\Big)^2\right]$$
$$= \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim R,\ a_{t+1} \sim \pi_\theta(\cdot|s_{t+1})}\left[\frac{1}{2}\Big(Q_\omega(s_t, a_t) - \big(r_t + \gamma\big(\min_{j=1,2} Q_{\omega_j^-}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1}|s_{t+1})\big)\big)\Big)^2\right]$$
Here $R$ is the replay buffer of data collected by past policies, since SAC is an off-policy algorithm. To make training more stable, target $Q$ networks $Q_{\omega^-}$ are used; there are likewise two target networks, one for each Q network. The target Q networks in SAC are updated the same way as in DDPG.
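The bootstrapped target inside this loss can be computed in a few lines. The sketch below is a hedged illustration assuming `actor`, `target_critic_1`, and `target_critic_2` networks shaped like the ones defined later in this chapter; it mirrors what the `calc_target` method in the implementation below does.

```python
import torch

def soft_td_target(rewards, next_states, dones, actor, target_critic_1,
                   target_critic_2, alpha, gamma):
    # Sample a_{t+1} ~ pi_theta(.|s_{t+1}) and get its log-probability
    next_actions, log_prob = actor(next_states)
    entropy = -log_prob
    # Clipped double-Q: element-wise minimum of the two target critics
    q1 = target_critic_1(next_states, next_actions)
    q2 = target_critic_2(next_states, next_actions)
    next_value = torch.min(q1, q2) + alpha * entropy
    # Bootstrapped target; (1 - dones) zeroes the bootstrap at terminal states
    return rewards + gamma * next_value * (1 - dones)
```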
The loss function of the policy $\pi$ is derived from the KL divergence; after simplification it becomes:
$$L_\pi(\theta) = \mathbb{E}_{s_t \sim R,\ a_t \sim \pi_\theta}\left[\alpha \log \pi_\theta(a_t|s_t) - Q_\omega(s_t, a_t)\right]$$
Minimizing this loss can be understood as maximizing the value function $V$, since $V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)\right]$.
For environments with continuous action spaces, SAC's policy outputs the mean and standard deviation of a Gaussian distribution, but sampling an action from that Gaussian is not differentiable. We therefore need the reparameterization trick: first sample from a standard Gaussian $\mathcal{N}$, then multiply the sample by the standard deviation and add the mean. This is equivalent to sampling from the policy's Gaussian, and it is differentiable with respect to the policy parameters. We write this as $a_t = f_\theta(\epsilon_t; s_t)$, where $\epsilon_t$ is a noise random variable. Taking both Q functions into account, the policy loss is rewritten as:
$$L_\pi(\theta) = \mathbb{E}_{s_t \sim R,\ \epsilon_t \sim \mathcal{N}}\left[\alpha \log \pi_\theta\left(f_\theta(\epsilon_t; s_t)\,|\,s_t\right) - \min_{j=1,2} Q_{\omega_j}\left(s_t, f_\theta(\epsilon_t; s_t)\right)\right]$$
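The sketch below illustrates how such a reparameterized, tanh-squashed Gaussian action and its log-probability can be computed; `mu`, `std`, and `action_bound` are assumed to come from a policy network, much like the `PolicyNetContinuous` class defined later in this chapter.

```python
import torch
from torch.distributions import Normal

def reparameterized_action(mu, std, action_bound):
    dist = Normal(mu, std)
    u = dist.rsample()              # u = mu + std * eps, eps ~ N(0, 1); gradients flow through
    log_prob = dist.log_prob(u)     # log-density of u under the Gaussian
    action = torch.tanh(u)          # squash to (-1, 1)
    # Change-of-variables correction for the tanh squashing
    log_prob = log_prob - torch.log(1 - action.pow(2) + 1e-7)
    return action * action_bound, log_prob
```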
Automatically adjusting the entropy regularization term
In SAC, choosing the entropy regularization coefficient is very important. Different states call for different amounts of entropy: in a state where the optimal action is uncertain, the entropy should be larger, while in a state where the optimal action is fairly clear, it can be smaller. To adjust the entropy coefficient automatically, SAC rewrites the reinforcement learning objective as a constrained optimization problem:
$$\max_\pi \mathbb{E}_\pi\left[\sum_t r(s_t, a_t)\right] \quad \mathrm{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[-\log \pi_t(a_t|s_t)\right] \ge \mathcal{H}_0$$
That is, maximize the expected return while constraining the average entropy to be at least $\mathcal{H}_0$. After simplification using some mathematical tricks, the loss function for $\alpha$ is obtained:
$$L(\alpha) = \mathbb{E}_{s_t \sim R,\ a_t \sim \pi(\cdot|s_t)}\left[-\alpha \log \pi_\theta(a_t|s_t) - \alpha \mathcal{H}_0\right]$$
In other words, when the policy's entropy falls below the target value $\mathcal{H}_0$, minimizing $L(\alpha)$ increases $\alpha$, which in turn raises the weight of the entropy term when minimizing the policy loss $L_\pi(\theta)$ above; when the policy's entropy exceeds $\mathcal{H}_0$, minimizing $L(\alpha)$ decreases $\alpha$, so that policy training focuses more on improving the value.
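A minimal sketch of this automatic temperature adjustment is shown below (the `target_entropy` value and learning rate are assumptions). As in the implementation later in this chapter, the logarithm of $\alpha$ is optimized rather than $\alpha$ itself, which keeps $\alpha$ positive and makes training more stable.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)          # alpha = exp(log_alpha) = 1.0 initially
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -1.0                                    # assumed target, e.g. -dim(action space)

def update_alpha(log_prob):
    # L(alpha) = E[-alpha * log pi(a|s) - alpha * H_0]; log_prob is treated as a constant here
    alpha_loss = torch.mean((-log_prob - target_entropy).detach() * log_alpha.exp())
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```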
This completes the overall idea behind SAC. Its concrete procedure is as follows:
Initialize the Critic networks $Q_{\omega_1}(s,a)$, $Q_{\omega_2}(s,a)$ and the Actor network $\pi_\theta(s)$ with random parameters $\omega_1$, $\omega_2$ and $\theta$
Copy the parameters $\omega_1^- \leftarrow \omega_1$, $\omega_2^- \leftarrow \omega_2$ to initialize the target networks $Q_{\omega_1^-}$ and $Q_{\omega_2^-}$
Initialize the replay buffer $R$
for episode $e = 1 \rightarrow E$ do
Get the initial state $s_1$ of the environment
for time step $t = 1 \rightarrow T$ do
Select action $a_t = \pi_\theta(s_t)$ according to the current policy
Execute action $a_t$, receive reward $r_t$, and observe the next state $s_{t+1}$
Store $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $R$
for training step $k = 1 \rightarrow K$ do
Sample $N$ tuples $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1,\dots,N}$ from $R$
For each tuple, compute the target value with the target networks: $y_i = r_i + \gamma\left(\min_{j=1,2} Q_{\omega_j^-}(s_{i+1}, a_{i+1}) - \alpha \log \pi_\theta(a_{i+1}|s_{i+1})\right)$, where $a_{i+1} \sim \pi_\theta(\cdot|s_{i+1})$
Update both Critic networks: for $j = 1, 2$, minimize the loss $L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - Q_{\omega_j}(s_i, a_i)\right)^2$
Sample actions $\tilde{a}_i$ with the reparameterization trick, then update the current Actor network with the loss
$$L_\pi(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(\alpha \log \pi_\theta(\tilde{a}_i|s_i) - \min_{j=1,2} Q_{\omega_j}(s_i, \tilde{a}_i)\right)$$
Update the entropy regularization coefficient $\alpha$
Update the target networks: $\omega_1^- \leftarrow \tau\omega_1 + (1-\tau)\omega_1^-$, $\omega_2^- \leftarrow \tau\omega_2 + (1-\tau)\omega_2^-$
end for
end for
end for
```python
import random
import gym
import numpy as np
from tqdm import tqdm
import torch
import torch.nn.functional as F
from torch.distributions import Normal
import matplotlib.pyplot as plt
import rl_utils
```
Next we define the policy network and the value network. Since the environment has continuous actions, the policy network outputs the mean and standard deviation of a Gaussian distribution to represent the action distribution, while the value network takes the concatenation of the state and the action as input and outputs a scalar action value.
```python
class PolicyNetContinuous(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound):
        super(PolicyNetContinuous, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc_mu = torch.nn.Linear(hidden_dim, action_dim)
        self.fc_std = torch.nn.Linear(hidden_dim, action_dim)
        self.action_bound = action_bound

    def forward(self, x):
        x = F.relu(self.fc1(x))
        mu = self.fc_mu(x)
        std = F.softplus(self.fc_std(x))
        dist = Normal(mu, std)
        # rsample() is reparameterized sampling: scale a unit-Gaussian sample by std and add mu
        normal_sample = dist.rsample()
        log_prob = dist.log_prob(normal_sample)  # log-probability of the sample under the Gaussian
        action = torch.tanh(normal_sample)
        # Log-probability density under the tanh-squashed Gaussian
        log_prob = log_prob - torch.log(1 - action.pow(2) + 1e-7)
        action = action * self.action_bound
        return action, log_prob


class QValueNetContinuous(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNetContinuous, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim + action_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc_out = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, a):
        cat = torch.cat([x, a], dim=1)
        x = F.relu(self.fc1(cat))
        x = F.relu(self.fc2(x))
        return self.fc_out(x)


class SACContinuous:
    ''' SAC algorithm for continuous actions '''
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound,
                 actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma,
                 device):
        self.actor = PolicyNetContinuous(state_dim, hidden_dim, action_dim,
                                         action_bound).to(device)  # policy network
        self.critic_1 = QValueNetContinuous(state_dim, hidden_dim,
                                            action_dim).to(device)  # first Q network
        self.critic_2 = QValueNetContinuous(state_dim, hidden_dim,
                                            action_dim).to(device)  # second Q network
        self.target_critic_1 = QValueNetContinuous(state_dim, hidden_dim,
                                                   action_dim).to(device)  # first target Q network
        self.target_critic_2 = QValueNetContinuous(state_dim, hidden_dim,
                                                   action_dim).to(device)  # second target Q network
        # Initialize the target Q networks with the same parameters as the Q networks
        self.target_critic_1.load_state_dict(self.critic_1.state_dict())
        self.target_critic_2.load_state_dict(self.critic_2.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(),
                                                lr=actor_lr)
        self.critic_1_optimizer = torch.optim.Adam(self.critic_1.parameters(),
                                                   lr=critic_lr)
        self.critic_2_optimizer = torch.optim.Adam(self.critic_2.parameters(),
                                                   lr=critic_lr)
        # Optimizing the log of alpha makes training more stable
        self.log_alpha = torch.tensor(np.log(0.01), dtype=torch.float)  # alpha starts at 0.01
        self.log_alpha.requires_grad = True  # allow gradients w.r.t. alpha
        self.log_alpha_optimizer = torch.optim.Adam([self.log_alpha],
                                                    lr=alpha_lr)
        self.target_entropy = target_entropy  # target entropy
        self.gamma = gamma
        self.tau = tau
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        action = self.actor(state)[0]
        return [action.item()]

    def calc_target(self, rewards, next_states, dones):  # compute the target Q value
        next_actions, log_prob = self.actor(next_states)
        entropy = -log_prob
        q1_value = self.target_critic_1(next_states, next_actions)
        q2_value = self.target_critic_2(next_states, next_actions)
        next_value = torch.min(q1_value,
                               q2_value) + self.log_alpha.exp() * entropy
        td_target = rewards + self.gamma * next_value * (1 - dones)
        return td_target

    def soft_update(self, net, target_net):  # soft update, as in DDPG
        for param_target, param in zip(target_net.parameters(),
                                       net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) +
                                    param.data * self.tau)

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)
        # As in earlier chapters, reshape the Pendulum reward to ease training
        rewards = (rewards + 8.0) / 8.0

        # Update both Q networks with the same target Q value
        td_target = self.calc_target(rewards, next_states, dones)
        critic_1_loss = torch.mean(
            F.mse_loss(self.critic_1(states, actions), td_target.detach()))
        critic_2_loss = torch.mean(
            F.mse_loss(self.critic_2(states, actions), td_target.detach()))
        self.critic_1_optimizer.zero_grad()
        critic_1_loss.backward()
        self.critic_1_optimizer.step()
        self.critic_2_optimizer.zero_grad()
        critic_2_loss.backward()
        self.critic_2_optimizer.step()

        # Update the policy network
        new_actions, log_prob = self.actor(states)
        entropy = -log_prob
        q1_value = self.critic_1(states, new_actions)
        q2_value = self.critic_2(states, new_actions)
        actor_loss = torch.mean(-self.log_alpha.exp() * entropy -
                                torch.min(q1_value, q2_value))
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update alpha
        alpha_loss = torch.mean(
            (entropy - self.target_entropy).detach() * self.log_alpha.exp())
        self.log_alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.log_alpha_optimizer.step()

        self.soft_update(self.critic_1, self.target_critic_1)
        self.soft_update(self.critic_2, self.target_critic_2)


env_name = 'Pendulum-v0'
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high[0]  # maximum action magnitude
random.seed(0)
np.random.seed(0)
env.seed(0)
torch.manual_seed(0)

actor_lr = 3e-4
critic_lr = 3e-3
alpha_lr = 3e-4
num_episodes = 100
hidden_dim = 128
gamma = 0.99
tau = 0.005  # soft update coefficient
buffer_size = 100000
minimal_size = 1000
batch_size = 64
target_entropy = -env.action_space.shape[0]
device = torch.device("cuda") if torch.cuda.is_available() else torch.device(
    "cpu")

replay_buffer = rl_utils.ReplayBuffer(buffer_size)
agent = SACContinuous(state_dim, hidden_dim, action_dim, action_bound,
                      actor_lr, critic_lr, alpha_lr, target_entropy, tau,
                      gamma, device)

return_list = rl_utils.train_off_policy_agent(env, agent, num_episodes,
                                              replay_buffer, minimal_size,
                                              batch_size)

episodes_list = list(range(len(return_list)))
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('SAC on {}'.format(env_name))
plt.show()

mv_return = rl_utils.moving_average(return_list, 9)
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('SAC on {}'.format(env_name))
plt.show()
```
Iteration 0: 0%| | 0/10 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:27: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.)
Iteration 0: 100%|██████████| 10/10 [00:09<00:00, 1.03it/s, episode=10, return=-1534.655]
Iteration 1: 100%|██████████| 10/10 [00:18<00:00, 1.83s/it, episode=20, return=-1085.715]
Iteration 2: 100%|██████████| 10/10 [00:15<00:00, 1.60s/it, episode=30, return=-364.507]
Iteration 3: 100%|██████████| 10/10 [00:13<00:00, 1.37s/it, episode=40, return=-222.485]
Iteration 4: 100%|██████████| 10/10 [00:13<00:00, 1.36s/it, episode=50, return=-157.978]
Iteration 5: 100%|██████████| 10/10 [00:13<00:00, 1.37s/it, episode=60, return=-166.056]
Iteration 6: 100%|██████████| 10/10 [00:13<00:00, 1.38s/it, episode=70, return=-143.147]
Iteration 7: 100%|██████████| 10/10 [00:13<00:00, 1.37s/it, episode=80, return=-127.939]
Iteration 8: 100%|██████████| 10/10 [00:14<00:00, 1.42s/it, episode=90, return=-180.905]
Iteration 9: 100%|██████████| 10/10 [00:14<00:00, 1.41s/it, episode=100, return=-171.265]
**Summary:** The main difference between this algorithm and the previous ones lies in the policy network. SAC adopts maximum-entropy reinforcement learning: it maximizes the cumulative reward while also keeping the policy stochastic, and to keep the weight of the entropy term reasonable it introduces the coefficient $\alpha$ and updates $\alpha$ automatically.
The difference between the CartPole environment and the Pendulum environment is that CartPole has discrete actions while Pendulum has continuous actions.
So how do the networks change when we adapt the continuous-action algorithm to a discrete-action setting?
```python
import random
import gym
import numpy as np
from tqdm import tqdm
import torch
import torch.nn.functional as F
from torch.distributions import Normal
import matplotlib.pyplot as plt
import rl_utils


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)


class QValueNet(torch.nn.Module):
    ''' Q network with a single hidden layer '''
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```
This policy network outputs a discrete action distribution, so when learning the value network we no longer need to sample the next action $a_{t+1}$; the value of the next state is computed directly as an expectation over the action probabilities. Likewise, the loss for $\alpha$ no longer requires sampling an action.
```python
class SAC:
    ''' SAC algorithm for discrete actions '''
    def __init__(self, state_dim, hidden_dim, action_dim, actor_lr, critic_lr,
                 alpha_lr, target_entropy, tau, gamma, device):
        # policy network
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        # first Q network
        self.critic_1 = QValueNet(state_dim, hidden_dim, action_dim).to(device)
        # second Q network
        self.critic_2 = QValueNet(state_dim, hidden_dim, action_dim).to(device)
        self.target_critic_1 = QValueNet(state_dim, hidden_dim,
                                         action_dim).to(device)  # first target Q network
        self.target_critic_2 = QValueNet(state_dim, hidden_dim,
                                         action_dim).to(device)  # second target Q network
        # Initialize the target Q networks with the same parameters as the Q networks
        self.target_critic_1.load_state_dict(self.critic_1.state_dict())
        self.target_critic_2.load_state_dict(self.critic_2.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(),
                                                lr=actor_lr)
        self.critic_1_optimizer = torch.optim.Adam(self.critic_1.parameters(),
                                                   lr=critic_lr)
        self.critic_2_optimizer = torch.optim.Adam(self.critic_2.parameters(),
                                                   lr=critic_lr)
        # Optimizing the log of alpha makes training more stable
        self.log_alpha = torch.tensor(np.log(0.01), dtype=torch.float)
        self.log_alpha.requires_grad = True  # allow gradients w.r.t. alpha
        self.log_alpha_optimizer = torch.optim.Adam([self.log_alpha],
                                                    lr=alpha_lr)
        self.target_entropy = target_entropy  # target entropy
        self.gamma = gamma
        self.tau = tau
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    # Compute the target Q value directly as an expectation over the policy's
    # output probabilities, without sampling a next action
    def calc_target(self, rewards, next_states, dones):
        next_probs = self.actor(next_states)
        next_log_probs = torch.log(next_probs + 1e-8)
        entropy = -torch.sum(next_probs * next_log_probs, dim=1, keepdim=True)
        q1_value = self.target_critic_1(next_states)
        q2_value = self.target_critic_2(next_states)
        min_qvalue = torch.sum(next_probs * torch.min(q1_value, q2_value),
                               dim=1,
                               keepdim=True)
        next_value = min_qvalue + self.log_alpha.exp() * entropy
        td_target = rewards + self.gamma * next_value * (1 - dones)
        return td_target

    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(),
                                       net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) +
                                    param.data * self.tau)

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(
            self.device)  # actions are no longer floats
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)

        # Update both Q networks
        td_target = self.calc_target(rewards, next_states, dones)
        critic_1_q_values = self.critic_1(states).gather(1, actions)
        critic_1_loss = torch.mean(
            F.mse_loss(critic_1_q_values, td_target.detach()))
        critic_2_q_values = self.critic_2(states).gather(1, actions)
        critic_2_loss = torch.mean(
            F.mse_loss(critic_2_q_values, td_target.detach()))
        self.critic_1_optimizer.zero_grad()
        critic_1_loss.backward()
        self.critic_1_optimizer.step()
        self.critic_2_optimizer.zero_grad()
        critic_2_loss.backward()
        self.critic_2_optimizer.step()

        # Update the policy network
        probs = self.actor(states)
        log_probs = torch.log(probs + 1e-8)
        # Compute the entropy directly from the probabilities
        entropy = -torch.sum(probs * log_probs, dim=1, keepdim=True)
        q1_value = self.critic_1(states)
        q2_value = self.critic_2(states)
        min_qvalue = torch.sum(probs * torch.min(q1_value, q2_value),
                               dim=1,
                               keepdim=True)  # expectation over the action probabilities
        actor_loss = torch.mean(-self.log_alpha.exp() * entropy - min_qvalue)
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update alpha
        alpha_loss = torch.mean(
            (entropy - self.target_entropy).detach() * self.log_alpha.exp())
        self.log_alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.log_alpha_optimizer.step()

        self.soft_update(self.critic_1, self.target_critic_1)
        self.soft_update(self.critic_2, self.target_critic_2)


actor_lr = 1e-3
critic_lr = 1e-2
alpha_lr = 1e-2
num_episodes = 200
hidden_dim = 128
gamma = 0.98
tau = 0.005  # soft update coefficient
buffer_size = 10000
minimal_size = 500
batch_size = 64
target_entropy = -1
device = torch.device("cuda") if torch.cuda.is_available() else torch.device(
    "cpu")

env_name = 'CartPole-v0'
env = gym.make(env_name)
random.seed(0)
np.random.seed(0)
env.seed(0)
torch.manual_seed(0)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = SAC(state_dim, hidden_dim, action_dim, actor_lr, critic_lr, alpha_lr,
            target_entropy, tau, gamma, device)

return_list = rl_utils.train_off_policy_agent(env, agent, num_episodes,
                                              replay_buffer, minimal_size,
                                              batch_size)

episodes_list = list(range(len(return_list)))
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('SAC on {}'.format(env_name))
plt.show()

mv_return = rl_utils.moving_average(return_list, 9)
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('SAC on {}'.format(env_name))
plt.show()
```
**Summary:** Because the actions are discrete, there are only finitely many of them and the probability of each action is directly available, so the entropy can be computed exactly as an expectation over those probabilities. The main differences between the discrete and continuous settings lie in the policy network, the value network, and how the entropy is computed.
Iteration 0: 100%|██████████| 20/20 [00:00<00:00, 148.74it/s, episode=20, return=19.700]
Iteration 1: 100%|██████████| 20/20 [00:00<00:00, 28.35it/s, episode=40, return=10.600]
Iteration 2: 100%|██████████| 20/20 [00:00<00:00, 24.96it/s, episode=60, return=10.000]
Iteration 3: 100%|██████████| 20/20 [00:00<00:00, 24.87it/s, episode=80, return=9.800]
Iteration 4: 100%|██████████| 20/20 [00:00<00:00, 26.33it/s, episode=100, return=9.100]
Iteration 5: 100%|██████████| 20/20 [00:00<00:00, 26.30it/s, episode=120, return=9.500]
Iteration 6: 100%|██████████| 20/20 [00:09<00:00, 2.19it/s, episode=140, return=178.400]
Iteration 7: 100%|██████████| 20/20 [00:15<00:00, 1.30it/s, episode=160, return=200.000]
Iteration 8: 100%|██████████| 20/20 [00:15<00:00, 1.30it/s, episode=180, return=200.000]
Iteration 9: 100%|██████████| 20/20 [00:15<00:00, 1.29it/s, episode=200, return=197.600]
## 14.6 Summary

This chapter first explained maximum-entropy reinforcement learning, which balances exploration and exploitation by controlling the entropy of the actions the policy takes, helping readers deepen their understanding of that trade-off. It then introduced the SAC algorithm, analyzing the principles behind it and its concrete procedure, and finally implemented SAC on the continuous Pendulum environment and the discrete CartPole environment. With its solid theoretical foundation and strong empirical performance, SAC has become one of the most popular deep reinforcement learning algorithms.