Reinforcement Learning | Reinforcement Learning Fundamentals (Illustrated)

Reinforcement learning is a branch of machine learning concerned with taking appropriate actions in a given situation to maximize reward. It is used by a wide range of software and machines to find the best behavior or path to take in a particular situation. Reinforcement learning differs from supervised learning: in supervised learning the training data comes with an answer key, so the model is trained on the correct answers, whereas in reinforcement learning there is no answer key and the reinforcement agent decides for itself how to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.

Reinforcement learning (RL) is a science of decision-making. It is about learning the optimal behavior in an environment so as to obtain the maximum reward. In RL, the data is accumulated by the learning system itself through trial and error; it is not part of the input, as it would be in supervised or unsupervised machine learning.

Reinforcement learning uses algorithms that learn from outcomes and decide which action to take next. After each action, the algorithm receives feedback that helps it determine whether the choice it made was correct, neutral, or incorrect. This makes it a good technique for automated systems that have to make many small decisions without human guidance.

Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and error. It performs actions with the aim of maximizing reward; in other words, it learns by doing in order to reach the best outcome.
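To make this trial-and-error loop concrete, here is a minimal sketch of an agent interacting with a toy environment. Everything in it (the CorridorEnv class, its reset/step methods, and the random policy) is a hypothetical illustration of the feedback cycle described above, not the Q-learning implementation used later in this post.

import random

# Hypothetical toy environment: a short corridor the agent must walk to the end.
class CorridorEnv:
    def __init__(self, length=5):
        self.length = length

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (step left) or +1 (step right)
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length   # reaching the end terminates the episode
        reward = 1.0 if done else 0.0         # feedback from the environment
        return self.position, reward, done

env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for _ in range(100):                          # trial-and-error interaction loop
    action = random.choice([-1, 1])           # no answer key: the agent simply tries actions
    state, reward, done = env.step(action)    # the environment answers with feedback
    total_reward += reward                    # accumulated outcomes are the agent's experience
    if done:
        state = env.reset()
print("Total reward collected:", total_reward)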

1. Theoretical Foundations

The basic concepts of reinforcement learning (RL) start from the Markov decision process (MDP). An MDP is a process whose state transitions obey the Markov property: the next state depends only on the current state and action, not on the earlier history.
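Concretely, the Q-learning update used in the implementation below can be written as

Q(state, action) = R(state, action) + gamma * max(Q(next state, all actions))

where R is the immediate-reward matrix and gamma is the discount factor (0.75 in Step 4). The update function in Step 4 is a direct transcription of this rule in its simplified form with no separate learning rate; and because taking an action in this graph setup means moving to a neighboring node, the action index also serves as the next state.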

2. Reinforcement Learning Algorithm: Python Implementation Using Q-Learning

Environment: JupyterLab
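The code relies on numpy, networkx, and matplotlib (which provides the pylab interface imported as pl). If they are not already installed in your JupyterLab environment, they can usually be added from a notebook cell, for example:

!pip install numpy networkx matplotlib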

Step 1: Import the required libraries

# Step 1: Import the required libraries
import numpy as np
import pylab as pl
import networkx as nx

Step 2: Define and visualize the graph

# Step 2: Define and visualize the graph
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2),
         (1, 3), (9, 10), (2, 4), (0, 6), (6, 7),
         (8, 9), (7, 8), (1, 7), (3, 9)]

goal = 10
G = nx.Graph()
G.add_edges_from(edges)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos)
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_labels(G, pos)
pl.show()

Note: the graph you get when you run this code may look different from the one shown here, because the networkx library generates a random layout from the given edges.

Step 3: Define the system rewards for the robot

# Step 3: Define the system rewards for the robot
MATRIX_SIZE = 11
M = np.matrix(np.ones(shape=(MATRIX_SIZE, MATRIX_SIZE)))
M *= -1

for point in edges:
    print(point)
    if point[1] == goal:
        M[point] = 100
    else:
        M[point] = 0
    if point[0] == goal:
        M[point[::-1]] = 100   # reverse of point
    else:
        M[point[::-1]] = 0

M[goal, goal] = 100            # add goal point round trip
print(M)

Step 4: Define some utility functions to be used in training

# Step 4: Define some utility functions to be used in training
Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
gamma = 0.75   # learning parameter (discount factor)
initial_state = 1

# Determines the available actions for a given state
def available_actions(state):
    current_state_row = M[state, ]
    available_action = np.where(current_state_row >= 0)[1]
    return available_action

available_action = available_actions(initial_state)

# Chooses one of the available actions at random
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action

action = sample_next_action(available_action)

# Updates the Q-matrix according to the path chosen
def update(current_state, action, gamma):
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    Q[current_state, action] = M[current_state, action] + gamma * max_value
    if np.max(Q) > 0:
        return np.sum(Q / np.max(Q) * 100)
    else:
        return 0

update(initial_state, action, gamma)

Step 5: Train and evaluate the robot using the Q-matrix

# Step 5: Train and evaluate the robot using the Q-matrix
scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

# print("Trained Q matrix:")
# print(Q / np.max(Q) * 100)
# You can uncomment the two lines above to view the trained Q-matrix

# Testing
current_state = 0
steps = [current_state]
while current_state != 10:
    next_step_index = np.where(Q[current_state, ] == np.max(Q[current_state, ]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size=1))
    else:
        next_step_index = int(next_step_index)
    steps.append(next_step_index)
    current_state = next_step_index

print("Most efficient path:")
print(steps)
pl.plot(scores)
pl.xlabel('No of iterations')
pl.ylabel('Reward gained')
pl.show()

Step 6: Define and visualize the new graph with environmental clues

# Step 6: Define and visualize the new graph with environmental clues
# Defining the locations of the police and the drug traces
police = [2, 4, 5]
drug_traces = [3, 8, 9]

G = nx.Graph()
G.add_edges_from(edges)
mapping = {0: '0 - Detective', 1: '1', 2: '2 - Police', 3: '3 - Drug traces',
           4: '4 - Police', 5: '5 - Police', 6: '6', 7: '7', 8: '8 - Drug traces',
           9: '9 - Drug traces', 10: '10 - Drug racket location'}

H = nx.relabel_nodes(G, mapping)
pos = nx.spring_layout(H)
nx.draw_networkx_nodes(H, pos)
nx.draw_networkx_edges(H, pos)
nx.draw_networkx_labels(H, pos)
pl.show()

This figure may look slightly different from the previous one, but they are in fact the same graph; the difference comes from the networkx library placing the nodes randomly.

Step 7: Define some utility functions for the training process

# Step 7: Define some utility functions for the training process
Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
env_police = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
env_drugs = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
initial_state = 1

# Same as above
def available_actions(state):
    current_state_row = M[state, ]
    av_action = np.where(current_state_row >= 0)[1]
    return av_action

# Same as above
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action

# Exploring the environment
def collect_environmental_data(action):
    found = []
    if action in police:
        found.append('p')
    if action in drug_traces:
        found.append('d')
    return found

available_action = available_actions(initial_state)
action = sample_next_action(available_action)

# Same as above, but also records the environmental clues encountered
def update(current_state, action, gamma):
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    Q[current_state, action] = M[current_state, action] + gamma * max_value

    environment = collect_environmental_data(action)
    if 'p' in environment:
        env_police[current_state, action] += 1
    if 'd' in environment:
        env_drugs[current_state, action] += 1

    if np.max(Q) > 0:
        return np.sum(Q / np.max(Q) * 100)
    else:
        return 0

update(initial_state, action, gamma)

# Determines the available actions according to the environment
def available_actions_with_env_help(state):
    current_state_row = M[state, ]
    av_action = np.where(current_state_row >= 0)[1]
    # if there are multiple routes, dis-favor anything negative
    # (env_matrix_snap is built from env_police and env_drugs; see the note
    # after the Step 8 code)
    env_pos_row = env_matrix_snap[state, av_action]
    if np.sum(env_pos_row < 0):
        # can we remove the negative directions from av_action?
        temp_av_action = av_action[np.array(env_pos_row)[0] >= 0]
        if len(temp_av_action) > 0:
            av_action = temp_av_action
    return av_action

Step 8: Visualize the environmental matrices

# Step 8: Visualize the environmental matrices
scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)

# print environmental matrices
print('Police Found')
print(env_police)
print('')
print('Drug traces Found')
print(env_drugs)
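One gap in the listing: available_actions_with_env_help in Step 7 reads from a matrix called env_matrix_snap, but the original code never shows how it is built. A plausible construction, assuming transitions where police were found should come out negative (and therefore be dis-favored) while drug traces count positively, is the line below; this definition is an assumption, not part of the source article.

# Assumed construction of env_matrix_snap (not shown in the original article):
# police sightings contribute negatively, drug traces positively, so that
# available_actions_with_env_help can dis-favor the negative directions.
env_matrix_snap = env_drugs - env_police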

Step 9: Train and evaluate the model

scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions_with_env_help(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

pl.plot(scores)
pl.xlabel('Number of iterations')
pl.ylabel('Reward gained')
pl.show()

