RLlib is an industry-grade reinforcement learning (RL) library built on top of Ray. It provides high scalability and a unified API for a wide range of industrial and research applications.
The conda environment for Ray RLlib can be created in Anaconda as follows:
- conda create -n RayRLlib python=3.7
- conda activate RayRLlib
- conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
- pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple
- pip install tensorflow-probability -i https://pypi.tuna.tsinghua.edu.cn/simple
- pip install ipykernel -i https://pypi.tuna.tsinghua.edu.cn/simple
- pip install pyarrow
- pip install gputil
- pip install "ray[rllib]" -i https://pypi.tuna.tsinghua.edu.cn/simple
Select the RayRLlib environment created above as the Python interpreter, and import gymnasium together with the RLlib PPO configuration class. The PPO algorithm is used with a custom gym environment.
```python
import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig
```
A custom gym environment named SimpleCorridor is defined. In this environment the agent must learn to move right along the corridor to reach the exit. S marks the start, G marks the goal, and the corridor length is configurable. The available actions are 0 (left) and 1 (right). The observation is a single float giving the index of the current position. Every step yields a reward of -0.1, except when the goal is reached (+1.0).
The original English description:

```
Corridor in which an agent must learn to move right to reach the exit.

---------------------
| S | 1 | 2 | 3 | G |   S=start; G=goal; corridor_length=5
---------------------

Possible actions to choose from are: 0=left; 1=right
Observations are floats indicating the current field index, e.g. 0.0 for
starting position, 1.0 for the field next to the starting position, etc..
Rewards are -0.1 for all steps, except when reaching the goal (+1.0).
```
The class is defined as follows:
```python
# Define your problem using python and Farama-Foundation's gymnasium API:

class SimpleCorridor(gym.Env):
    """Corridor in which an agent must learn to move right to reach the exit."""

    def __init__(self, config):
        # Initialize the environment: end position, current position, the action
        # space (two discrete actions: left and right) and the observation space.
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = gym.spaces.Discrete(2)  # left and right
        self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))

    def reset(self, *, seed=None, options=None):
        """Resets the episode.

        Returns:
            Initial observation of the new episode and an info dict.
        """
        # Reset the current position to 0 and return the initial observation.
        self.cur_pos = 0
        return [self.cur_pos], {}

    def step(self, action):
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated-flag, truncated-flag, info-dict (empty).
        """
        # Update the agent's position according to the action and current position.
        # Walk left.
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        # Walk right.
        elif action == 1:
            self.cur_pos += 1
        # Set `terminated` flag when end of corridor (goal) reached.
        terminated = self.cur_pos >= self.end_pos
        truncated = False
        # +1.0 when the goal is reached, otherwise -0.1.
        reward = 1.0 if terminated else -0.1
        return [self.cur_pos], reward, terminated, truncated, {}
```
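Before handing the environment to RLlib, it can be exercised by hand. The sketch below (not part of the original post; the corridor length of 5 is chosen only for illustration) rolls out one episode with a fixed "always walk right" policy to confirm the reward and termination logic:

```python
# Manually roll out one episode with the "always walk right" policy.
env = SimpleCorridor({"corridor_length": 5})
obs, info = env.reset()
terminated = truncated = False
total = 0.0
while not terminated and not truncated:
    obs, reward, terminated, truncated, info = env.step(1)  # 1 = right
    total += reward
# 4 intermediate steps at -0.1 plus +1.0 at the goal -> return 0.6.
print(f"final position: {obs[0]}, return: {total:.1f}")
```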
The following code creates a PPOConfig object with Ray RLlib and uses the SimpleCorridor environment. The environment config sets the corridor length to 28. Environment rollouts are parallelized by setting num_rollout_workers to 10, and the PPO algorithm object is then built from the config.
```python
config = (
    PPOConfig().environment(
        # Env class to use (here: our gym.Env sub-class from above).
        env=SimpleCorridor,
        # Config dict to be passed to our custom env's constructor.
        # Use corridor with 28 fields (including S and G).
        env_config={"corridor_length": 28},
    )
    # Parallelize environment rollouts.
    .rollouts(num_rollout_workers=10)
)
# Construct the actual (PPO) algorithm object from the config.
algo = config.build()

# Train the PPO algorithm for 20 iterations and print the mean episode
# reward of each iteration.
for i in range(20):
    results = algo.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")
```
The code above trains the agent with 10 parallel rollout workers for 20 iterations. The mean episode reward reported during training is shown below.
```
(RolloutWorker pid=334231) /home/yaoyao/anaconda3/envs/RayRLlib/lib/python3.7/site-packages/gymnasium/spaces/box.py:227: UserWarning: WARN: Casting input x to numpy array.
...
Iter: 0; avg. reward=-24.700000000000117
Iter: 1; avg. reward=-29.840909090909282
...
Iter: 18; avg. reward=-1.7286713286713296
Iter: 19; avg. reward=-1.7269503546099298
```
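Not covered in the original snippet: if the trained policy should be reused later, it can be persisted before running inference. The sketch below uses `algo.save()`; the exact return value differs between Ray versions (a checkpoint path string in older 2.x releases, a result object in newer ones), so treat the details as an assumption to verify against the installed version.

```python
# Persist the trained PPO policy so it can be restored without retraining.
checkpoint = algo.save()
print("Checkpoint saved at:", checkpoint)

# Later (possibly in another process) the algorithm can be restored, e.g.:
# from ray.rllib.algorithms.algorithm import Algorithm
# restored_algo = Algorithm.from_checkpoint(checkpoint)
```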
After training, a complete episode is played in the corridor environment: starting from the initial observation, the algorithm computes an action, the action is applied to the environment to obtain a new observation, reward, terminated flag and truncated flag, and the rewards are accumulated; the total reward is printed when the loop finishes.
```python
# After training, run inference with the trained algorithm in a new
# corridor environment (length 10).
env = SimpleCorridor({"corridor_length": 10})
# First reset the environment and obtain the initial observation.
obs, info = env.reset()
terminated = truncated = False
total_reward = 0.0
# Play one episode.
while not terminated and not truncated:
    # Compute a single action, given the current observation
    # from the environment.
    action = algo.compute_single_action(obs)
    # Apply the computed action in the environment.
    obs, reward, terminated, truncated, info = env.step(action)
    # Sum up rewards for reporting purposes.
    total_reward += reward
# Report the result.
print(f"Played 1 episode; total-reward={total_reward}")
```
Validating the trained model in this environment yields a final episode reward of +0.1, a clear improvement over the initial value of roughly -24:
```
Played 1 episode; total-reward=0.10000000000000009
```
Since the corridor length here is 10, the maximum return the agent can collect under the optimal policy (always walking right) is 9 × (-0.1) + 1.0 = +0.1: nine intermediate steps at -0.1 each, plus +1.0 for the final step that reaches the goal. The measured return matches this value, which shows that the agent has learned the optimal policy with PPO.
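The arithmetic behind that maximum can be checked directly (a trivial sketch; 10 is the corridor length used in the inference environment above):

```python
# Optimal policy: always walk right. The agent needs `corridor_length` steps;
# all but the last earn -0.1, and the final step that reaches the goal earns +1.0.
corridor_length = 10
optimal_return = (corridor_length - 1) * (-0.1) + 1.0
print(round(optimal_return, 2))  # 0.1 (floating-point accumulation explains the long decimal above)
```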