
Ray: Asynchronous Advantage Actor Critic (A3C), Batch L-BFGS, and Policy Gradient Methods (PPO)


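This article introduces three algorithms that can be run with Ray: A3C, L-BFGS, and PPO. For A3C, it walks through the core code and shows how to visualize the results with TensorBoard. L-BFGS is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient way; here the focus is on how it runs serially and in parallel. Finally, the article shows how to run the policy gradient algorithm PPO.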

1 Asynchronous Advantage Actor Critic (A3C)

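A3C is a state-of-the-art reinforcement learning algorithm. In this example, we use Ray to implement A3C, based on the OpenAI Universe starter agent. The commands below install the dependencies and then start training, where N is the number of workers.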



pip install tensorflow
pip install six
pip install gym[atari]
pip install opencv-python-headless
pip install scipy


rllib train --env=Pong-ram-v4 --run=A3C --config='{"num_workers": N}'


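In our A3C implementation, each worker is a Ray actor that continuously simulates the environment. The driver creates a task that runs some steps of the simulator using the latest model, computes a gradient update, and returns the update to the driver. Whenever a task finishes, the driver applies the gradient update to the model and launches a new task with the latest model.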


1.2 Worker Code Walkthrough

Each worker is a Ray actor that simulates the environment.

import numpy as np
import ray

@ray.remote
class Runner(object):
    """Actor object to start running simulation on workers.
        Gradient computation is also executed on this object."""
    def __init__(self, env_name, actor_id):
        # starts simulation environment, policy, and thread.
        # Thread will continuously interact with the simulation environment
        self.env = env = create_env(env_name)
        self.id = actor_id
        self.policy = LSTMPolicy()
        self.runner = RunnerThread(env, self.policy, 20)

    def start(self):
        # starts the simulation thread
        self.runner.start_runner()

    def pull_batch_from_queue(self):
        # Implementation details removed - gets partial rollout from queue
        return rollout

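    def compute_gradient(self, params):
        # Load the latest weights from the driver before computing the gradient.
        self.policy.set_weights(params)
        rollout = self.pull_batch_from_queue()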
        batch = process_rollout(rollout, gamma=0.99, lambda_=1.0)
        gradient = self.policy.compute_gradients(batch)
        info = {"id": self.id,
                "size": len(batch.a)}
        return gradient, info
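The call process_rollout(rollout, gamma=0.99, lambda_=1.0) in the worker turns a partial rollout into training targets. Its body is not shown in this article, so the following is only a minimal sketch of what such a step typically computes, assuming it produces discounted returns and generalized advantage estimates. The function names and the bootstrap_value argument are hypothetical and are not taken from the example code.

```python
import numpy as np


def discount(x, gamma):
    """Discounted cumulative sum: out[t] = x[t] + gamma * out[t + 1]."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        out[t] = running
    return out


def returns_and_advantages(rewards, values, bootstrap_value,
                           gamma=0.99, lambda_=1.0):
    """Compute discounted returns and GAE advantages for one partial rollout.

    bootstrap_value is the value estimate of the state that follows the
    last step (0.0 if the episode ended there).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.append(np.asarray(values, dtype=np.float64), bootstrap_value)
    # Returns: discounted rewards, bootstrapped from the final value estimate.
    returns = discount(np.append(rewards, bootstrap_value), gamma)[:-1]
    # One-step TD errors, then their discounted sum with factor gamma * lambda_.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = discount(deltas, gamma * lambda_)
    return returns, advantages


returns, advantages = returns_and_advantages(
    rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7], bootstrap_value=0.0)
```

With lambda_=1.0, as in the worker code, the advantages reduce to the discounted returns minus the value estimates.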

1.3 Driver Code Walkthrough


import numpy as np
import ray

def train(num_workers, env_name="PongDeterministic-v4"):
    # Setup a copy of the environment
    # Instantiate a copy of the policy - mainly used as a placeholder
    env = create_env(env_name, None, None)
    policy = LSTMPolicy(env.observation_space.shape, env.action_space.n, 0)
    obs = 0

    # Start simulations on actors
    agents = [Runner.remote(env_name, i) for i in range(num_workers)]

    # Start gradient calculation tasks on each actor
    parameters = policy.get_weights()
    gradient_list = [agent.compute_gradient.remote(parameters) for agent in agents]

    while True: # Replace with your termination condition
        # wait for some gradient to be computed - unblock as soon as the earliest arrives
        done_id, gradient_list = ray.wait(gradient_list)

        # get the results of the task from the object store
        gradient, info = ray.get(done_id)[0]
        obs += info["size"]

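        # apply update, get the weights from the model, start a new task on the same actor object
        policy.apply_gradients(gradient)
        parameters = policy.get_weights()
        gradient_list.extend([agents[info["id"]].compute_gradient.remote(parameters)])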
    return policy
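The asynchronous pattern used by the driver (wait for whichever gradient arrives first, apply it, and hand the same worker a new task with the latest weights) can be tried without Ray or a simulator. The sketch below is an illustration only: threads from concurrent.futures stand in for Ray actors, cf.wait with FIRST_COMPLETED plays the role of ray.wait, and the "model" is a single number trained on a toy quadratic loss. All names in it are hypothetical.

```python
import concurrent.futures as cf
import random
import time


def compute_gradient(worker_id, weight):
    """Toy worker: gradient of 0.5 * (weight - 3)^2 after a random delay."""
    time.sleep(random.uniform(0.0, 0.002))
    return weight - 3.0, {"id": worker_id}


def train(num_workers=4, num_rounds=200, lr=0.1):
    weight = 0.0
    with cf.ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Start one gradient task per worker with the initial weight.
        pending = {pool.submit(compute_gradient, i, weight)
                   for i in range(num_workers)}
        for _ in range(num_rounds):
            # Unblock as soon as the earliest task finishes.
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for future in done:
                gradient, info = future.result()
                # Apply the (possibly stale) gradient, then relaunch the
                # same worker with the latest weight.
                weight -= lr * gradient
                pending.add(pool.submit(compute_gradient, info["id"], weight))
    return weight


final_weight = train()
```

Gradients may be computed from slightly outdated weights, which is the trade-off asynchronous methods accept in exchange for never leaving a worker idle.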
1.4 Visualization

With PongDeterministic-v4 on an Amazon EC2 m4.16xlarge instance, we were able to train the agent in about 15 minutes using 16 workers. With 8 workers, training takes about 25 minutes.

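You can visualize performance by running tensorboard --logdir [directory] in a separate screen, where [directory] defaults to ~/ray_results/. If you are running multiple experiments, be sure to change the directory in which TensorFlow saves its progress (it can be found in a3c.py).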

2 Batch L-BFGS

This section shows how to use L-BFGS. The full example code is in the appendix at the end of the article.

pip install tensorflow
pip install scipy


python ray/examples/lbfgs/driver.py

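Optimization is at the heart of many machine learning algorithms. Much of machine learning involves specifying a loss function and finding the parameters that minimize the loss. If we can compute the gradient of the loss function, we can apply a variety of gradient-based optimization algorithms. L-BFGS is one such algorithm. It is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient way.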
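To make "approximating the inverse Hessian from gradient information" more concrete, here is a minimal sketch of the standard L-BFGS two-loop recursion. It never forms a matrix; it only uses a short history of parameter steps and gradient changes. This is an illustration of the textbook recursion, not the code that scipy runs internally, and the function name and the small quadratic test problem are hypothetical.

```python
import numpy as np


def lbfgs_direction(grad, s_list, y_list):
    """Approximate H^{-1} @ grad with the L-BFGS two-loop recursion.

    s_list holds recent parameter steps s_k = theta_{k+1} - theta_k and
    y_list the matching gradient changes y_k = g_{k+1} - g_k, oldest first.
    """
    q = np.array(grad, dtype=np.float64)
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: walk the history from the newest pair to the oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial inverse-Hessian guess: a scaled identity matrix.
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    r = gamma * q
    # Second loop: walk the history from the oldest pair to the newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += s * (alpha - beta)
    return r


# Toy quadratic loss 0.5 * x^T A x, for which gradient changes are y = A s.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rng = np.random.default_rng(0)
s_list = [rng.normal(size=2) for _ in range(3)]
y_list = [A @ s for s in s_list]
direction = lbfgs_direction(y_list[-1], s_list, y_list)
```

The approximation satisfies the secant condition for the newest pair: applying it to the latest gradient change returns the latest step, so direction equals s_list[-1] here.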

2.1 Serial Version


from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
batch_size = 100
num_batches = mnist.train.num_examples // batch_size
batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]


def loss(theta, xs, ys):
    # compute the loss on a batch of data
    return loss

def grad(theta, xs, ys):
    # compute the gradient on a batch of data
    return grad

def full_loss(theta):
    # compute the loss on the full data set
    return sum([loss(theta, xs, ys) for (xs, ys) in batches])

def full_grad(theta):
    # compute the gradient on the full data set
    return sum([grad(theta, xs, ys) for (xs, ys) in batches])


theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)
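The serial snippets above leave the bodies of loss and grad open. As a rough, self-contained illustration of the same structure, the sketch below fills them in with a least-squares loss on synthetic data (an assumption made only so the example runs without MNIST or TensorFlow) and passes the batch-summed loss and gradient to scipy.optimize.fmin_l_bfgs_b.

```python
import numpy as np
import scipy.optimize

rng = np.random.default_rng(0)
dim = 5
true_theta = rng.normal(size=dim)

# Synthetic regression batches standing in for the MNIST batches.
batches = []
for _ in range(10):
    xs = rng.normal(size=(20, dim))
    batches.append((xs, xs @ true_theta))


def loss(theta, xs, ys):
    # Squared error on one batch of data.
    residual = xs @ theta - ys
    return 0.5 * float(residual @ residual)


def grad(theta, xs, ys):
    # Gradient of the squared error on one batch of data.
    return xs.T @ (xs @ theta - ys)


def full_loss(theta):
    # Loss on the full data set: sum over batches.
    return sum(loss(theta, xs, ys) for (xs, ys) in batches)


def full_grad(theta):
    # Gradient on the full data set: sum over batches.
    return sum(grad(theta, xs, ys) for (xs, ys) in batches)


theta_init = 1e-2 * rng.normal(size=dim)
theta_opt, final_loss, info = scipy.optimize.fmin_l_bfgs_b(
    full_loss, theta_init, fprime=full_grad)
```

fmin_l_bfgs_b returns the estimated minimizer, the loss value at that point, and a dictionary with convergence information.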

2.2 Distributed Version



batch_ids = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]



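@ray.remote
class Network(object):
    def __init__(self, xs, ys):
        # Initialize the network and keep this actor's batch of data.
        self.xs = xs
        self.ys = ys

    def loss(self, theta):
        # Compute the loss on this actor's batch (details omitted).
        raise NotImplementedError

    def grad(self, theta):
        # Compute the gradient on this actor's batch (details omitted).
        raise NotImplementedError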


def full_loss(theta):
    theta_id = ray.put(theta)
    loss_ids = [actor.loss.remote(theta_id) for actor in actors]
    return sum(ray.get(loss_ids))

def full_grad(theta):
    theta_id = ray.put(theta)
    grad_ids = [actor.grad.remote(theta_id) for actor in actors]
    return sum(ray.get(grad_ids)).astype("float64") # This conversion is necessary for use with fmin_l_bfgs_b.

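Note that before passing theta to the remote methods, we turn it into a remote object with theta_id = ray.put(theta). If we had written

[actor.loss.remote(theta) for actor in actors]

instead of

theta_id = ray.put(theta)
[actor.loss.remote(theta_id) for actor in actors]

then a copy of theta would be serialized into every task. Since theta may hold the parameters of a large model, it is more efficient to put it in the object store once and pass it to the remote methods by object ID.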
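To get a feel for why passing a large theta by value is costly, the following rough illustration uses only Python's pickle module. It does not reproduce how Ray serializes objects; the object_store dictionary and the "theta-0" key are hypothetical stand-ins for ray.put and an object ID. It simply compares the bytes needed when every task carries its own copy of theta with the bytes needed when theta is stored once and each task carries a short reference.

```python
import pickle

theta = [0.0] * 100_000   # stand-in for a large parameter vector
num_actors = 8

# By value: every task payload contains a full copy of theta.
by_value_bytes = sum(
    len(pickle.dumps(("loss", theta))) for _ in range(num_actors))

# By reference: theta is stored once, tasks only carry a small key.
object_store = {"theta-0": theta}   # hypothetical stand-in for ray.put
stored_once = len(pickle.dumps(object_store["theta-0"]))
by_reference_bytes = stored_once + sum(
    len(pickle.dumps(("loss", "theta-0"))) for _ in range(num_actors))

print(by_value_bytes, by_reference_bytes)
```

The by-value total grows with the number of actors, while the by-reference total stays close to the size of a single copy.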




theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)




pip install gym[atari]
pip install tensorflow


rllib train --env=Pong-ram-v4 --run=PPO

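This trains an agent on the Pong-ram-v4 Atari environment. You can also try passing in the Pong-v0 or CartPole-v0 environment. If you want to use a different environment, you will need to change a few lines in example.py.

To inspect training progress, start TensorBoard: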


tensorboard --logdir=~/ray_results

Many of the TensorBoard metrics are also printed to the console, but you may find it easier to visualize and compare runs using the TensorBoard UI.
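This article only runs PPO through the rllib command line. For readers who want a sense of what the algorithm optimizes, here is a minimal NumPy sketch of the clipped surrogate objective from the PPO paper. It is an illustration under simplifying assumptions (no value loss, entropy bonus, or KL penalty) and is not RLlib's implementation; the function name and the clip_param argument are chosen here for illustration.

```python
import numpy as np


def ppo_clip_objective(new_logp, old_logp, advantages, clip_param=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    advantages = np.asarray(advantages, dtype=np.float64)
    # Probability ratio between the new and the old policy.
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # Taking the minimum removes the incentive to move the ratio far from 1.
    return float(np.mean(np.minimum(unclipped, clipped)))


# An action that became 1.5x more likely with a positive advantage:
# the gain is capped at (1 + clip_param) * advantage.
objective = ppo_clip_objective(
    new_logp=[np.log(1.5)], old_logp=[0.0], advantages=[1.0])
```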

3 Appendix

1. L-BFGS example code:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import os
import scipy.optimize

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

import ray
import ray.experimental.tf_utils

class LinearModel(object):
    """Simple class for a one layer neural network.

    Note that this code does not initialize the network weights. Instead
    weights are set via self.variables.set_weights.

    Example:
        net = LinearModel([10, 10])
        weights = [np.random.normal(size=[10, 10]),
                   np.random.normal(size=[10])]
        variable_names = [v.name for v in net.variables]
        net.variables.set_weights(dict(zip(variable_names, weights)))

    Attributes:
        x (tf.placeholder): Input vector.
        w (tf.Variable): Weight matrix.
        b (tf.Variable): Bias vector.
        y_ (tf.placeholder): Input result vector.
        cross_entropy (tf.Operation): Final layer of network.
        cross_entropy_grads (tf.Operation): Gradient computation.
        sess (tf.Session): Session used for training.
        variables (TensorFlowVariables): Extracted variables and methods to
            manipulate them.
    """

    def __init__(self, shape):
        """Creates a LinearModel object."""
        x = tf.placeholder(tf.float32, [None, shape[0]])
        w = tf.Variable(tf.zeros(shape))
        b = tf.Variable(tf.zeros(shape[1]))
        self.x = x
        self.w = w
        self.b = b
        y = tf.nn.softmax(tf.matmul(x, w) + b)
        y_ = tf.placeholder(tf.float32, [None, shape[1]])
        self.y_ = y_
        cross_entropy = tf.reduce_mean(
            -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
        self.cross_entropy = cross_entropy
        self.cross_entropy_grads = tf.gradients(cross_entropy, [w, b])
        self.sess = tf.Session()
        # In order to get and set the weights, we pass in the loss function to
        # Ray's TensorFlowVariables to automatically create methods to modify
        # the weights.
        self.variables = ray.experimental.tf_utils.TensorFlowVariables(
            cross_entropy, self.sess)

    def loss(self, xs, ys):
        """Computes the loss of the network."""
        return float(
            self.sess.run(
                self.cross_entropy, feed_dict={
                    self.x: xs,
                    self.y_: ys
                }))

    def grad(self, xs, ys):
        """Computes the gradients of the network."""
        return self.sess.run(
            self.cross_entropy_grads, feed_dict={
                self.x: xs,
                self.y_: ys
            })

@ray.remote
class NetActor(object):
    def __init__(self, xs, ys):
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
        with tf.device("/cpu:0"):
            self.net = LinearModel([784, 10])
            self.xs = xs
            self.ys = ys

    # Compute the loss on a batch of data.
    def loss(self, theta):
        net = self.net
        net.variables.set_flat(theta)
        return net.loss(self.xs, self.ys)

    # Compute the gradient of the loss on a batch of data.
    def grad(self, theta):
        net = self.net
        net.variables.set_flat(theta)
        gradients = net.grad(self.xs, self.ys)
        return np.concatenate([g.flatten() for g in gradients])

    def get_flat_size(self):
        return self.net.variables.get_flat_size()

# Compute the loss on the entire dataset.
def full_loss(theta):
    theta_id = ray.put(theta)
    loss_ids = [actor.loss.remote(theta_id) for actor in actors]
    return sum(ray.get(loss_ids))

# Compute the gradient of the loss on the entire dataset.
def full_grad(theta):
    theta_id = ray.put(theta)
    grad_ids = [actor.grad.remote(theta_id) for actor in actors]
    # The float64 conversion is necessary for use with fmin_l_bfgs_b.
    return sum(ray.get(grad_ids)).astype("float64")

if __name__ == "__main__":
    ray.init()

    # From the perspective of scipy.optimize.fmin_l_bfgs_b, full_loss is simply
    # a function which takes some parameters theta, and computes a loss.
    # Similarly, full_grad is a function which takes some parameters theta, and
    # computes the gradient of the loss. Internally, these functions use Ray to
    # distribute the computation of the loss and the gradient over the data
    # that is represented by the remote object IDs x_batches and y_batches and
    # which is potentially distributed over a cluster. However, these details
    # are hidden from scipy.optimize.fmin_l_bfgs_b, which simply uses it to run
    # the L-BFGS algorithm.

    # Load the mnist data and turn the data into remote objects.
    print("Downloading the MNIST dataset. This may take a minute.")
    mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
    num_batches = 10
    batch_size = mnist.train.num_examples // num_batches
    batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]
    print("Putting MNIST in the object store.")
    actors = [NetActor.remote(xs, ys) for (xs, ys) in batches]
    # Initialize the weights for the network to small random values.
    dim = ray.get(actors[0].get_flat_size.remote())
    theta_init = 1e-2 * np.random.normal(size=dim)

    # Use L-BFGS to minimize the loss function.
    print("Running L-BFGS.")
    result = scipy.optimize.fmin_l_bfgs_b(
        full_loss, theta_init, maxiter=10, fprime=full_grad, disp=True)
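scipy.optimize.fmin_l_bfgs_b works on a single flat parameter vector, which is why the example code concatenates the flattened gradients and relies on get_flat_size and set_flat. The sketch below shows the flatten and unflatten idea in plain NumPy. It assumes that the flat vector is laid out as the weight matrix followed by the bias vector; it is an illustration and not the implementation of Ray's TensorFlowVariables. The helper names are hypothetical.

```python
import numpy as np


def flatten(weights):
    """Concatenate a list of arrays into one flat vector."""
    return np.concatenate([w.flatten() for w in weights])


def unflatten(theta, shapes):
    """Split a flat vector back into arrays with the given shapes."""
    arrays, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(theta[start:start + size].reshape(shape))
        start += size
    return arrays


# Shapes of the linear model in the example: a 784x10 weight matrix
# and a bias vector of length 10, so 7850 parameters in total.
shapes = [(784, 10), (10,)]
theta = np.arange(7850, dtype=np.float64)
w, b = unflatten(theta, shapes)
```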
