
Playing "Super Mario Bros" with the PPO Algorithm on Huawei Cloud ModelArts【华为云至简致远】

【Abstract】We use the PPO algorithm to play "Super Mario Bros". For most levels, the agent learns to clear the stage within about 1,500 episodes. PPO comes in two main variants, PPO-Penalty and PPO-Clip (PPO2); this walkthrough uses PPO-Clip, the variant mainly used by OpenAI, running on Huawei Cloud ModelArts.

1. Introduction

We use the PPO algorithm to play "Super Mario Bros". As things currently stand, for the vast majority of levels the agent learns to clear the stage within 1,500 episodes.

2. The basic structure of the PPO algorithm

The PPO algorithm has two main variants: PPO-Penalty and PPO-Clip (PPO2). Here we discuss PPO-Clip, the variant mainly used by OpenAI. The main characteristics of PPO are:
PPO is an on-policy algorithm.
PPO works with both discrete and continuous action spaces.

Loss function. The essence of PPO-Clip is a probability ratio that measures how far the new policy has drifted from the old one, with the hyperparameter $\epsilon$ limiting the size of the policy update:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]$$

Policy update:

$$\theta_{k+1} = \arg\max_{\theta}\ \mathbb{E}_t\!\left[L^{\mathrm{CLIP}}(\theta)\right]$$

carried out in practice by several epochs of minibatch stochastic gradient ascent.

Exploration. PPO uses a stochastic exploration strategy: actions are sampled from the policy's output distribution.

Advantage function. The advantage $\hat{A}(s, a)$ measures how much better taking action $a$ in state $s$ is than the average action; if it is greater than 0, the current action is better than average, otherwise it is worse.
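To make the clipping concrete, here is a tiny standalone PyTorch sketch (the ratio and advantage values are made up for illustration); it mirrors the actor-loss line used later in section 3.6:

import torch

# Toy probability ratios r_t = pi_new / pi_old and advantage estimates
ratio = torch.tensor([0.5, 0.9, 1.0, 1.3, 2.0])
advantages = torch.tensor([1.0, -1.0, 0.5, 1.0, 1.0])
epsilon = 0.2  # same clip coefficient as opt['epsilon'] below

# Clipped surrogate objective: take the pessimistic (minimum) of the unclipped
# and clipped terms, then negate it to obtain a loss to minimize.
actor_loss = -torch.mean(torch.min(ratio * advantages,
                                   torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages))
print(actor_loss)  # clipping caps how much ratios far from 1.0 can contribute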

The main flow of the algorithm is roughly as follows: use the current policy to collect a batch of trajectories from the environment, estimate advantages with GAE, then run several epochs of minibatch updates on the clipped surrogate loss, and repeat. The implementation in section 3.6 follows exactly this loop.

3. Hands-on practice

First, open the Huawei Cloud notebook instance for playing "Super Mario Bros" with the PPO algorithm. Log in with your Huawei Cloud account and subscribe to the instance; only then can you click Run in ModelArts to open the JupyterLab page.

After the page opens there is a short wait. About 30 seconds later a dialog pops up asking you to choose a runtime flavor; the free flavor is sufficient, so select it and click to switch the specification.

Wait for the specification switch and then for initialization to finish. Once initialization completes, everything is ready.

3.1 Program initialization

Install the basic dependencies:

!pip install -U pip
!pip install gym==0.19.0
!pip install tqdm==4.48.0
!pip install nes-py==8.1.0
!pip install gym-super-mario-bros==7.3.2
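Only these packages are pinned here; everything else imported below (torch, cv2, matplotlib, moxing, etc.) is already available in the ModelArts notebook image. If you want to confirm the installed versions afterwards, a quick sanity-check cell of my own (not part of the original notebook) would be:

import gym, tqdm, torch, cv2
print("gym", gym.__version__)      # expect 0.19.0
print("tqdm", tqdm.__version__)    # expect 4.48.0
print("torch", torch.__version__)
print("opencv", cv2.__version__)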

3.2 Import the required libraries

import os
import shutil
import subprocess as sp
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.multiprocessing as _mp
from torch.distributions import Categorical
import torch.multiprocessing as mp
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym.spaces import Box
from gym import Wrapper
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT, COMPLEX_MOVEMENT, RIGHT_ONLY
import cv2
import matplotlib.pyplot as plt
from IPython import display
import moxing as mox

3.3 Initialize the training parameters

opt = {
    "world": 1,               # world to play: 1, 2, 3, 4, 5, 6, 7, 8
    "stage": 1,               # stage to play: 1, 2, 3, 4
    "action_type": "simple",  # action set: "simple", "right_only", "complex"
    'lr': 1e-4,               # suggested learning rates: 1e-3, 1e-4, 1e-5, 7e-5
    'gamma': 0.9,             # reward discount factor
    'tau': 1.0,               # GAE parameter
    'beta': 0.01,             # entropy coefficient
    'epsilon': 0.2,           # PPO clip coefficient
    'batch_size': 16,         # batch size for experience replay
    'max_episode': 10,        # maximum number of training episodes
    'num_epochs': 10,         # number of update epochs per batch of experience
    "num_local_steps": 512,   # maximum steps per episode
    "num_processes": 8,       # number of training processes, usually the number of CPU cores
    "save_interval": 5,       # save the model every {} episodes
    "log_path": "./log",      # where to write logs
    "saved_path": "./model",  # where to save trained models
    "pretrain_model": True,   # whether to load the pretrained model; it is only provided for level 1-1, other levels train from scratch
    "episode": 5              # checkpoint index used by infer(); updated by evaluation() when the flag is reached
}
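If you want to try a different level, only the two level keys need to change. For example (a hypothetical override; without the 1-1 pretrained weights, training starts from scratch, which the introduction suggests can take on the order of 1,500 episodes):

# Hypothetical: switch to World 1, Stage 2 and train from scratch
opt.update({
    "world": 1,
    "stage": 2,
    "pretrain_model": False,   # the provided pretrained weights only cover 1-1
    "max_episode": 1500,       # rough budget mentioned in the introduction
})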

3.4 Create the environment

# create the environment
def create_train_env(world, stage, actions, output_path=None):
    # create the base Super Mario Bros environment
    env = gym_super_mario_bros.make("SuperMarioBros-{}-{}-v0".format(world, stage))
    env = JoypadSpace(env, actions)
    # customize the environment (reward shaping and frame skipping)
    env = CustomReward(env, world, stage, monitor=None)
    env = CustomSkipFrame(env)
    return env


# modify the raw environment to get better training results
class CustomReward(Wrapper):
    def __init__(self, env=None, world=None, stage=None, monitor=None):
        super(CustomReward, self).__init__(env)
        self.observation_space = Box(low=0, high=255, shape=(1, 84, 84))
        self.curr_score = 0
        self.current_x = 40
        self.world = world
        self.stage = stage
        if monitor:
            self.monitor = monitor
        else:
            self.monitor = None

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        if self.monitor:
            self.monitor.record(state)
        state = process_frame(state)
        # reward shaping: reward score increases, and add a bonus/penalty when the level ends
        reward += (info["score"] - self.curr_score) / 40.
        self.curr_score = info["score"]
        if done:
            if info["flag_get"]:
                reward += 50
            else:
                reward -= 50
        # extra termination rules for the maze levels 7-4 and 4-4: wrong paths are punished and end the episode
        if self.world == 7 and self.stage == 4:
            if (506 <= info["x_pos"] <= 832 and info["y_pos"] > 127) or (
                    832 < info["x_pos"] <= 1064 and info["y_pos"] < 80) or (
                    1113 < info["x_pos"] <= 1464 and info["y_pos"] < 191) or (
                    1579 < info["x_pos"] <= 1943 and info["y_pos"] < 191) or (
                    1946 < info["x_pos"] <= 1964 and info["y_pos"] >= 191) or (
                    1984 < info["x_pos"] <= 2060 and (info["y_pos"] >= 191 or info["y_pos"] < 127)) or (
                    2114 < info["x_pos"] < 2440 and info["y_pos"] < 191) or info["x_pos"] < self.current_x - 500:
                reward -= 50
                done = True
        if self.world == 4 and self.stage == 4:
            if (info["x_pos"] <= 1500 and info["y_pos"] < 127) or (
                    1588 <= info["x_pos"] < 2380 and info["y_pos"] >= 127):
                reward = -50
                done = True
        self.current_x = info["x_pos"]
        return state, reward / 10., done, info

    def reset(self):
        self.curr_score = 0
        self.current_x = 40
        return process_frame(self.env.reset())


# run several environments in parallel, one per process, communicating over pipes
class MultipleEnvironments:
    def __init__(self, world, stage, action_type, num_envs, output_path=None):
        self.agent_conns, self.env_conns = zip(*[mp.Pipe() for _ in range(num_envs)])
        if action_type == "right_only":
            actions = RIGHT_ONLY
        elif action_type == "simple":
            actions = SIMPLE_MOVEMENT
        else:
            actions = COMPLEX_MOVEMENT
        self.envs = [create_train_env(world, stage, actions, output_path=output_path) for _ in range(num_envs)]
        self.num_states = self.envs[0].observation_space.shape[0]
        self.num_actions = len(actions)
        for index in range(num_envs):
            process = mp.Process(target=self.run, args=(index,))
            process.start()
            self.env_conns[index].close()

    def run(self, index):
        self.agent_conns[index].close()
        while True:
            request, action = self.env_conns[index].recv()
            if request == "step":
                self.env_conns[index].send(self.envs[index].step(action.item()))
            elif request == "reset":
                self.env_conns[index].send(self.envs[index].reset())
            else:
                raise NotImplementedError


# convert a raw frame to an 84x84 grayscale image scaled to [0, 1]
def process_frame(frame):
    if frame is not None:
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, (84, 84))[None, :, :] / 255.
        return frame
    else:
        return np.zeros((1, 84, 84))


# repeat each action for `skip` frames and stack the recent frames as the observation
class CustomSkipFrame(Wrapper):
    def __init__(self, env, skip=4):
        super(CustomSkipFrame, self).__init__(env)
        self.observation_space = Box(low=0, high=255, shape=(skip, 84, 84))
        self.skip = skip
        self.states = np.zeros((skip, 84, 84), dtype=np.float32)

    def step(self, action):
        total_reward = 0
        last_states = []
        for i in range(self.skip):
            state, reward, done, info = self.env.step(action)
            total_reward += reward
            if i >= self.skip / 2:
                last_states.append(state)
            if done:
                self.reset()
                return self.states[None, :, :, :].astype(np.float32), total_reward, done, info
        max_state = np.max(np.concatenate(last_states, 0), 0)
        self.states[:-1] = self.states[1:]
        self.states[-1] = max_state
        return self.states[None, :, :, :].astype(np.float32), total_reward, done, info

    def reset(self):
        state = self.env.reset()
        self.states = np.concatenate([state for _ in range(self.skip)], 0)
        return self.states[None, :, :, :].astype(np.float32)
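Before training, a quick smoke test (my own check, not in the original notebook) confirms that the wrappers produce the stacked grayscale observation the network expects:

# Build a single 1-1 environment and inspect the preprocessed observation:
# CustomSkipFrame stacks 4 frames of 84x84 grayscale, batched as (1, 4, 84, 84).
env = create_train_env(1, 1, SIMPLE_MOVEMENT)
obs = env.reset()
print(obs.shape, obs.dtype)            # (1, 4, 84, 84) float32
print(env.observation_space.shape)     # (4, 84, 84)
print(len(SIMPLE_MOVEMENT))            # 7 discrete actions
env.close()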

3.5 Define the neural network

class Net(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.linear = nn.Linear(32 * 6 * 6, 512)
        self.critic_linear = nn.Linear(512, 1)
        self.actor_linear = nn.Linear(512, num_actions)
        self._initialize_weights()

    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.orthogonal_(module.weight, nn.init.calculate_gain('relu'))
                nn.init.constant_(module.bias, 0)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.linear(x.view(x.size(0), -1))
        return self.actor_linear(x), self.critic_linear(x)
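The four stride-2 convolutions shrink the 84×84 input to 6×6 (84 → 42 → 21 → 11 → 6), which is where the 32 * 6 * 6 input size of the first linear layer comes from. A small shape check of my own (illustrative only):

# Feed a dummy batch of one stacked observation through the network:
# the actor head returns one logit per action, the critic head a single value.
net = Net(num_inputs=4, num_actions=len(SIMPLE_MOVEMENT))
logits, value = net(torch.zeros(1, 4, 84, 84))
print(logits.shape, value.shape)   # torch.Size([1, 7]) torch.Size([1, 1])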

3.6 Define the PPO algorithm

def evaluation(opt, global_model, num_states, num_actions, curr_episode):
    print('start evaluation!')
    torch.manual_seed(123)
    if opt['action_type'] == "right_only":
        actions = RIGHT_ONLY
    elif opt['action_type'] == "simple":
        actions = SIMPLE_MOVEMENT
    else:
        actions = COMPLEX_MOVEMENT
    env = create_train_env(opt['world'], opt['stage'], actions)
    local_model = Net(num_states, num_actions)
    if torch.cuda.is_available():
        local_model.cuda()
    local_model.eval()
    state = torch.from_numpy(env.reset())
    if torch.cuda.is_available():
        state = state.cuda()
    plt.figure(figsize=(10, 10))
    img = plt.imshow(env.render(mode='rgb_array'))
    done = False
    local_model.load_state_dict(global_model.state_dict())  # load the network parameters
    while not done:
        if torch.cuda.is_available():
            state = state.cuda()
        logits, value = local_model(state)
        policy = F.softmax(logits, dim=1)
        action = torch.argmax(policy).item()
        state, reward, done, info = env.step(action)
        state = torch.from_numpy(state)
        img.set_data(env.render(mode='rgb_array'))  # just update the rendered frame
        display.display(plt.gcf())
        display.clear_output(wait=True)
        if info["flag_get"]:
            print("flag reached in episode {}!".format(curr_episode))
            torch.save(local_model.state_dict(),
                       "{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'], opt['world'], opt['stage'], curr_episode))
            opt.update({'episode': curr_episode})
            env.close()
            return True
    return False


def train(opt):
    # use CUDA if it is available
    if torch.cuda.is_available():
        torch.cuda.manual_seed(123)
    else:
        torch.manual_seed(123)
    if os.path.isdir(opt['log_path']):
        shutil.rmtree(opt['log_path'])
    os.makedirs(opt['log_path'])
    if not os.path.isdir(opt['saved_path']):
        os.makedirs(opt['saved_path'])
    mp = _mp.get_context("spawn")
    # create the parallel environments
    envs = MultipleEnvironments(opt['world'], opt['stage'], opt['action_type'], opt['num_processes'])
    # create the model
    model = Net(envs.num_states, envs.num_actions)
    if opt['pretrain_model']:
        print('Loading the pretrained model')
        if not os.path.exists("ppo_super_mario_bros_1_1_0"):
            mox.file.copy_parallel(
                "obs://modelarts-labs-bj4/course/modelarts/zjc_team/reinforcement_learning/ppo_mario/ppo_super_mario_bros_1_1_0",
                "ppo_super_mario_bros_1_1_0")
        if torch.cuda.is_available():
            model.load_state_dict(torch.load("ppo_super_mario_bros_1_1_0"))
            model.cuda()
        else:
            model.load_state_dict(torch.load("ppo_super_mario_bros_1_1_0", torch.device('cpu')))
    else:
        if torch.cuda.is_available():
            model.cuda()
    model.share_memory()
    optimizer = torch.optim.Adam(model.parameters(), lr=opt['lr'])
    # reset all environments
    [agent_conn.send(("reset", None)) for agent_conn in envs.agent_conns]
    # receive the initial states
    curr_states = [agent_conn.recv() for agent_conn in envs.agent_conns]
    curr_states = torch.from_numpy(np.concatenate(curr_states, 0))
    if torch.cuda.is_available():
        curr_states = curr_states.cuda()
    curr_episode = 0
    # train until the maximum number of episodes is reached
    while curr_episode < opt['max_episode']:
        if curr_episode % opt['save_interval'] == 0 and curr_episode > 0:
            torch.save(model.state_dict(),
                       "{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'], opt['world'], opt['stage'], curr_episode))
        curr_episode += 1
        old_log_policies = []
        actions = []
        values = []
        states = []
        rewards = []
        dones = []
        # collect up to num_local_steps transitions per episode
        for _ in range(opt['num_local_steps']):
            states.append(curr_states)
            logits, value = model(curr_states)
            values.append(value.squeeze())
            policy = F.softmax(logits, dim=1)
            old_m = Categorical(policy)
            action = old_m.sample()
            actions.append(action)
            old_log_policy = old_m.log_prob(action)
            old_log_policies.append(old_log_policy)
            # send the sampled actions to the environments
            if torch.cuda.is_available():
                [agent_conn.send(("step", act)) for agent_conn, act in zip(envs.agent_conns, action.cpu())]
            else:
                [agent_conn.send(("step", act)) for agent_conn, act in zip(envs.agent_conns, action)]
            state, reward, done, info = zip(*[agent_conn.recv() for agent_conn in envs.agent_conns])
            state = torch.from_numpy(np.concatenate(state, 0))
            if torch.cuda.is_available():
                state = state.cuda()
                reward = torch.cuda.FloatTensor(reward)
                done = torch.cuda.FloatTensor(done)
            else:
                reward = torch.FloatTensor(reward)
                done = torch.FloatTensor(done)
            rewards.append(reward)
            dones.append(done)
            curr_states = state
        _, next_value, = model(curr_states)
        next_value = next_value.squeeze()
        old_log_policies = torch.cat(old_log_policies).detach()
        actions = torch.cat(actions)
        values = torch.cat(values).detach()
        states = torch.cat(states)
        gae = 0
        R = []
        # GAE (generalized advantage estimation) computation
        for value, reward, done in list(zip(values, rewards, dones))[::-1]:
            gae = gae * opt['gamma'] * opt['tau']
            gae = gae + reward + opt['gamma'] * next_value.detach() * (1 - done) - value.detach()
            next_value = value
            R.append(gae + value)
        R = R[::-1]
        R = torch.cat(R).detach()
        advantages = R - values
        # policy update
        for i in range(opt['num_epochs']):
            indice = torch.randperm(opt['num_local_steps'] * opt['num_processes'])
            for j in range(opt['batch_size']):
                batch_indices = indice[
                    int(j * (opt['num_local_steps'] * opt['num_processes'] / opt['batch_size'])): int((j + 1) * (
                        opt['num_local_steps'] * opt['num_processes'] / opt['batch_size']))]
                logits, value = model(states[batch_indices])
                new_policy = F.softmax(logits, dim=1)
                new_m = Categorical(new_policy)
                new_log_policy = new_m.log_prob(actions[batch_indices])
                ratio = torch.exp(new_log_policy - old_log_policies[batch_indices])
                actor_loss = -torch.mean(torch.min(ratio * advantages[batch_indices],
                                                   torch.clamp(ratio, 1.0 - opt['epsilon'], 1.0 + opt['epsilon']) *
                                                   advantages[batch_indices]))
                critic_loss = F.smooth_l1_loss(R[batch_indices], value.squeeze())
                entropy_loss = torch.mean(new_m.entropy())
                # the total loss has three parts: actor loss, critic loss, and action entropy loss
                total_loss = actor_loss + critic_loss - opt['beta'] * entropy_loss
                optimizer.zero_grad()
                total_loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
                optimizer.step()
        print("Episode: {}. Total loss: {}".format(curr_episode, total_loss))
        finish = False
        for i in range(opt["num_processes"]):
            if info[i]["flag_get"]:
                finish = evaluation(opt, model, envs.num_states, envs.num_actions, curr_episode)
                if finish:
                    break
        if finish:
            break

3.7 Train the model

Training for 10 episodes takes about 5 minutes.

train(opt)

This step takes some time, so be patient while the model trains. My run finished in about 2.6 minutes, which is fairly quick, as shown in the figure:
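When training finishes you can check which checkpoints were written (a small check of my own, not in the original notebook); train() saves every save_interval episodes and evaluation() saves an extra copy whenever the flag is reached:

import os
# Checkpoint files are named ppo_super_mario_bros_<world>_<stage>_<episode>
print(sorted(os.listdir(opt['saved_path'])))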

3.8 Use the model to play the game

Define the inference function:

def infer(opt):
    if torch.cuda.is_available():
        torch.cuda.manual_seed(123)
    else:
        torch.manual_seed(123)
    if opt['action_type'] == "right_only":
        actions = RIGHT_ONLY
    elif opt['action_type'] == "simple":
        actions = SIMPLE_MOVEMENT
    else:
        actions = COMPLEX_MOVEMENT
    env = create_train_env(opt['world'], opt['stage'], actions)
    model = Net(env.observation_space.shape[0], len(actions))
    if torch.cuda.is_available():
        model.load_state_dict(torch.load("{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'], opt['world'], opt['stage'], opt['episode'])))
        model.cuda()
    else:
        model.load_state_dict(torch.load("{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'], opt['world'], opt['stage'], opt['episode']),
                                         map_location=torch.device('cpu')))
    model.eval()
    state = torch.from_numpy(env.reset())
    plt.figure(figsize=(10, 10))
    img = plt.imshow(env.render(mode='rgb_array'))
    while True:
        if torch.cuda.is_available():
            state = state.cuda()
        logits, value = model(state)
        policy = F.softmax(logits, dim=1)
        action = torch.argmax(policy).item()
        state, reward, done, info = env.step(action)
        state = torch.from_numpy(state)
        img.set_data(env.render(mode='rgb_array'))  # just update the rendered frame
        display.display(plt.gcf())
        display.clear_output(wait=True)
        if info["flag_get"]:
            print("World {} stage {} completed".format(opt['world'], opt['stage']))
            break
        if done and info["flag_get"] is False:
            print('Game Failed')
            break

Run:

infer(opt)
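infer() loads the checkpoint recorded in opt['episode'], which evaluation() updates when the agent first reaches the flag. To replay a different checkpoint, point opt['episode'] at one that actually exists under opt['saved_path'], for example (hypothetical):

# Hypothetical: replay the checkpoint saved after episode 5, if it exists
opt.update({'episode': 5})
infer(opt)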

4. Results

The【华为云至简致远】essay contest with prizes is in full swing: https://bbs.huaweicloud.com/blogs/352809
