Actor-Critic is a hybrid algorithm:
(1) The Actor's predecessor is Policy Gradients, which lets it pick suitable actions effortlessly in a continuous action space, something Q-learning cannot do;
(2) The Critic's predecessor is Q-learning or another value-based method, which can update at every single step, whereas vanilla Policy Gradients update only once per episode, which lowers learning efficiency.
The Actor chooses actions according to a probability distribution; the Critic scores the action the Actor took; the Actor then adjusts its action probabilities based on the Critic's score.
The Actor wants to maximize the expected reward. In the Actor-Critic algorithm, we use "how much better than usual" (the TD error) as that reward signal.
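In formulas (matching the gradient comments in the training loop below), the Critic's TD error and the direction the Actor follows are:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

\nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\delta_t\big]

A positive \delta_t ("better than usual") pushes the probability of the chosen action up; a negative \delta_t pushes it down.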
1. Initialize parameters (Actor)
self.sess = sess
self.s = tf.placeholder(tf.float32, [1, n_features], "state")
self.a = tf.placeholder(tf.int32, None, "act")
self.td_error = tf.placeholder(tf.float32, None, "td_error") # TD_error
2. Network layers
with tf.variable_scope('Actor'):
    l1 = tf.layers.dense(
        inputs=self.s,
        units=20,  # number of hidden units
        activation=tf.nn.relu,
        kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
        bias_initializer=tf.constant_initializer(0.1),  # biases
        name='l1'
    )
    self.acts_prob = tf.layers.dense(  # output the action probabilities
        inputs=l1,
        units=n_actions,  # output units
        activation=tf.nn.softmax,  # get action probabilities
        kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
        bias_initializer=tf.constant_initializer(0.1),  # biases
        name='acts_prob'
    )
3. Learning update
with tf.variable_scope('exp_v'):
    log_prob = tf.log(self.acts_prob[0, self.a])  # log probability of the chosen action
    self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # log probability * TD-error direction
with tf.variable_scope('train'):
    self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)

def learn(self, s, a, td):
    s = s[np.newaxis, :]
    feed_dict = {self.s: s, self.a: a, self.td_error: td}
    _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
    return exp_v
s and a are used to produce the direction of gradient ascent,
while td comes from the Critic and tells the Actor whether that direction is right.
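As a toy numeric illustration of that sign logic (the numbers below are made up and are not part of the original code): when td > 0, gradient ascent on log_prob * td raises the chosen action's probability; when td < 0, it lowers it.

import numpy as np

probs = np.array([0.2, 0.5, 0.3])   # hypothetical softmax output of the Actor
a = 1                               # index of the action that was taken
td = 1.5                            # Critic says "better than usual"

exp_v = np.log(probs[a]) * td       # the quantity the Actor maximizes
grad_wrt_prob_a = td / probs[a]     # d(exp_v)/d(probs[a]): positive here, so probs[a] is pushed up;
                                    # with td < 0 it would be negative and probs[a] would be pushed down
print(exp_v, grad_wrt_prob_a)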
4. Choose an action based on the predicted probabilities
# same procedure as in Policy Gradients
def choose_action(self, s):
    s = s[np.newaxis, :]
    probs = self.sess.run(self.acts_prob, {self.s: s})  # get probabilities of all actions
    return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())  # return an int
1. Initialize parameters (Critic)
self.sess = sess
self.s = tf.placeholder(tf.float32, [1, n_features], "state")
self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next")
self.r = tf.placeholder(tf.float32, None, 'r')
2. Network layers
with tf.variable_scope('Critic'):
    l1 = tf.layers.dense(
        inputs=self.s,
        units=20,  # number of hidden units
        activation=tf.nn.relu,
        # some argue this has to be linear to guarantee convergence of the actor,
        # but a linear approximator seems to hardly learn the correct Q
        kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
        bias_initializer=tf.constant_initializer(0.1),  # biases
        name='l1'
    )
    # predict the value v of the state
    self.v = tf.layers.dense(
        inputs=l1,
        units=1,  # output units
        activation=None,
        kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
        bias_initializer=tf.constant_initializer(0.1),  # biases
        name='V'
    )
3. Learning update
The Critic's update is simple: just like Q-learning, it minimizes the error between the target value ("reality") and the current estimate, i.e., the TD error.
with tf.variable_scope('squared_TD_error'):
    self.td_error = self.r + (GAMMA * self.v_) - self.v
    self.loss = tf.square(self.td_error)  # TD_error = (r + gamma * V_next) - V_eval
with tf.variable_scope('train'):
    self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)

def learn(self, s, r, s_):
    s, s_ = s[np.newaxis, :], s_[np.newaxis, :]
    # compute the value v_ of the next state s_
    v_ = self.sess.run(self.v, {self.s: s_})
    td_error, _ = self.sess.run([self.td_error, self.train_op],
                                {self.s: s, self.v_: v_, self.r: r})
    return td_error
What is learned here is the state value V(s), not the action value Q(s, a).
By computing TD_error = (r + gamma * v_) - v, the Critic judges whether this step's action produced a better-than-usual result.
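Equivalently (standard notation, consistent with the squared-loss code above), the Critic performs gradient descent on the squared TD error with respect to its value-network weights w:

L(w) = \big(r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\big)^2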
Parameter settings
OUTPUT_GRAPH = False
MAX_EPISODE = 3000
DISPLAY_REWARD_THRESHOLD = 200 # renders environment if total episode reward is greater than this threshold
MAX_EP_STEPS = 1000 # maximum time step in one episode
RENDER = False # rendering wastes time
GAMMA = 0.9 # reward discount in TD error
LR_A = 0.001 # learning rate for actor
LR_C = 0.01 # learning rate for critic
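The snippets below use env, N_F, and N_A without defining them. Assuming the standard CartPole-v0 environment from OpenAI Gym (an assumption; the environment setup is not shown in the excerpt, though it is consistent with the -20 terminal penalty and the 200-reward render threshold), a minimal setup would be:

import gym

env = gym.make('CartPole-v0')
env = env.unwrapped                    # drop the built-in step limit so MAX_EP_STEPS applies

N_F = env.observation_space.shape[0]   # number of state features (4 for CartPole)
N_A = env.action_space.n               # number of discrete actions (2 for CartPole)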
Build the networks
sess = tf.Session()
actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)
# the Critic needs to learn faster than the Actor
critic = Critic(sess, n_features=N_F, lr=LR_C)
sess.run(tf.global_variables_initializer())
Training loop
for i_episode in range(MAX_EPISODE):
    s = env.reset()
    t = 0
    track_r = []  # all rewards collected in this episode
    while True:
        if RENDER:
            env.render()

        a = actor.choose_action(s)
        s_, r, done, info = env.step(a)
        if done:
            r = -20  # penalty when the episode ends

        track_r.append(r)

        # the Critic learns: gradient = grad[r + gamma * V(s_) - V(s)]
        td_error = critic.learn(s, r, s_)
        # the Actor learns: true_gradient = grad[logPi(s, a) * td_error]
        actor.learn(s, a, td_error)

        s = s_
        t += 1

        if done or t >= MAX_EP_STEPS:
            ep_rs_sum = sum(track_r)
            if 'running_reward' not in globals():
                running_reward = ep_rs_sum
            else:
                running_reward = running_reward * 0.95 + ep_rs_sum * 0.05
            if running_reward > DISPLAY_REWARD_THRESHOLD:
                RENDER = True  # start rendering once the running reward is high enough
            print("episode:", i_episode, "  reward:", int(running_reward))
            break
A3C builds on AC by running the AC algorithm asynchronously in multiple threads: each worker computes an increment to the network parameters and applies it asynchronously to the global model, and at the next iteration it uses the global model's latest parameters as the starting point for its next round of training.
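A rough toy sketch of that asynchronous push/pull pattern (a plain quadratic loss stands in for the actor-critic gradients; none of the names below come from the original code):

import threading
import numpy as np

# Each worker pulls the latest global parameters, computes a local update
# (here the gradient of a toy quadratic loss; a real worker would run the
# actor-critic updates from the previous sections), and pushes the increment
# back to the shared global model.

target = np.arange(5, dtype=np.float64)   # toy "optimal" parameters
global_params = np.zeros(5)               # shared global model parameters
lock = threading.Lock()                   # serialize access to the global model

def worker(n_updates, lr=0.05):
    for _ in range(n_updates):
        with lock:
            local_params = global_params.copy()   # pull the latest global parameters
        grad = local_params - target              # local "gradient" (stand-in for AC gradients)
        with lock:
            global_params[:] -= lr * grad         # push the increment asynchronously

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(global_params)   # close to `target` after the asynchronous updates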