1. Q-learning is a value-based reinforcement learning method. It aims to find a policy that maximizes the reward the agent receives from the environment over the long run.
2. In Q-learning, the agent maintains a Q-table that assigns a value Q(s, a) to every state-action pair (s, a), representing the expected future reward of taking action a in state s.
3. At each time step the agent selects an action, observes the environment's response (the new state and the reward), and then uses that information to update the Q-table.
The Q-value is updated with the following formula:
Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_a′ Q(s′, a′))
where α is the learning rate, γ is the discount factor, r is the immediate reward, and s′ is the state reached after taking action a.
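To make the update rule concrete, here is a minimal sketch of a single update on a Q-table stored as a dictionary; the states, actions, and values below are invented purely for illustration:

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

# Q-table as a dict mapping (state, action) -> value (illustrative entries only)
q_table = {((0, 0), (0, 1)): 0.0, ((0, 1), (0, 1)): 2.0, ((0, 1), (1, 0)): 5.0}

state, action = (0, 0), (0, 1)   # current state-action pair
next_state, reward = (0, 1), -1  # observed transition

# max over the actions available in the next state
best_next = max(q_table[(next_state, a)] for a in [(0, 1), (1, 0)])

# Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
q_table[(state, action)] = (1 - alpha) * q_table[(state, action)] + alpha * (reward + gamma * best_next)
print(f"{q_table[(state, action)]:.2f}")  # 0.35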
Suppose we have the following situation:
In a maze problem, the agent's goal is to find the shortest path from the start to the goal. We can train the agent with Q-learning; the concrete steps and definitions are as follows:
import numpy as np
import random

# Maze layout: 0 = free cell, 1 = wall, 2 = goal
maze = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 2]
]
start = (0, 0)
goal = (4, 4)

# Q-learning parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate
actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up

# Initialize the Q-table
states = [(i, j) for i in range(5) for j in range(5)]
q_table = {(state, action): 0 for state in states for action in actions}

# Train the agent
num_episodes = 1000
for episode in range(num_episodes):
    state = start
    done = False
    while not done:
        # Choose an action (epsilon-greedy)
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action_values = [q_table[(state, a)] for a in actions]
            action = actions[np.argmax(action_values)]

        # Take the action and observe the next state and reward
        next_state = (state[0] + action[0], state[1] + action[1])
        if next_state[0] < 0 or next_state[0] >= 5 or next_state[1] < 0 or next_state[1] >= 5:
            next_state = state
            reward = -100
        else:
            cell_value = maze[next_state[0]][next_state[1]]
            if cell_value == 1:
                reward = -100
                next_state = state
            elif cell_value == 2:
                reward = 50
                done = True
            else:
                reward = -1

        # Update the Q-table
        predict = q_table[(state, action)]
        target = reward + gamma * max(q_table[(next_state, a)] for a in actions)
        q_table[(state, action)] = predict + alpha * (target - predict)
        state = next_state

# Print the Q-table
for state in states:
    for action in actions:
        print(f"Q({state}, {action}) = {q_table[(state, action)]:.2f}")
In this code:
1. In each episode the agent starts from the start cell; at every state it selects an action, observes the next state and the reward, and updates the Q-table, until the goal is reached. When selecting an action, with probability epsilon a random action is chosen; otherwise the best action in the current state (the one with the highest Q-value) is chosen. This is the "epsilon-greedy" strategy, which keeps a balance between exploration and exploitation.
2. When updating the Q-table we apply the Q-value update formula given earlier.
3. After training, the values in the Q-table gradually converge, and the agent can then choose the optimal action in every state directly from the Q-table, as shown in the sketch below.
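As an illustration of how the trained Q-table drives behaviour, the sketch below walks greedily from the start to the goal by always taking the highest-valued action. The helper extract_path is not part of the original code; it assumes the q_table, maze, start, goal and actions defined above:

def extract_path(q_table, start, goal, actions, max_steps=50):
    # Follow the greedy (highest-Q) action from start until the goal is reached.
    state = start
    path = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        # Pick the action with the largest Q-value in the current state
        action = max(actions, key=lambda a: q_table[(state, a)])
        next_state = (state[0] + action[0], state[1] + action[1])
        # Stop if the greedy move leaves the grid or runs into a wall
        # (this only happens when the Q-table has not converged yet)
        if not (0 <= next_state[0] < 5 and 0 <= next_state[1] < 5):
            break
        if maze[next_state[0]][next_state[1]] == 1:
            break
        state = next_state
        path.append(state)
    return path

print(extract_path(q_table, start, goal, actions))
# With a well-converged Q-table this should print the shortest path:
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2), (3, 3), (3, 4), (4, 4)]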
Below is an extended version of the same training loop. It decays epsilon exponentially from 1.0 towards 0.1 across episodes and records the total reward and epsilon of each episode so they can be plotted:

import numpy as np
import random
import math
import matplotlib.pyplot as plt

# Maze layout: 0 = free cell, 1 = wall, 2 = goal
maze = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 2]
]
start = (0, 0)
goal = (4, 4)

# Q-learning parameters
alpha = 0.1
gamma = 0.9
initial_epsilon = 1.0
min_epsilon = 0.1
decay_rate = 0.01
actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up

# Initialize the Q-table
states = [(i, j) for i in range(5) for j in range(5)]
q_table = {(state, action): 0 for state in states for action in actions}

# Reward for landing in a given cell
def compute_reward(state):
    if state == goal:
        return 50
    elif maze[state[0]][state[1]] == 1:
        return -100
    else:
        return -1

# Exponentially decay epsilon from initial_epsilon towards min_epsilon
def get_epsilon(episode):
    return min_epsilon + (initial_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Store the total reward and epsilon of each episode
rewards = []
epsilons = []

# Train the agent
num_episodes = 1000
for episode in range(num_episodes):
    state = start
    done = False
    epsilon = get_epsilon(episode)
    total_reward = 0
    while not done:
        # Choose an action (epsilon-greedy)
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action_values = [q_table[(state, a)] for a in actions]
            action = actions[np.argmax(action_values)]

        # Take the action and observe the next state and reward
        next_state = (state[0] + action[0], state[1] + action[1])
        if next_state[0] < 0 or next_state[0] >= 5 or next_state[1] < 0 or next_state[1] >= 5:
            next_state = state
        reward = compute_reward(next_state)
        total_reward += reward
        # Bounce back when running into a wall so the agent cannot pass through it
        if maze[next_state[0]][next_state[1]] == 1:
            next_state = state

        # Update the Q-table
        predict = q_table[(state, action)]
        target = reward + gamma * max(q_table[(next_state, a)] for a in actions)
        q_table[(state, action)] = predict + alpha * (target - predict)
        state = next_state
        done = state == goal

    rewards.append(total_reward)
    epsilons.append(epsilon)

# Print the Q-table
for state in states:
    for action in actions:
        print(f"Q({state}, {action}) = {q_table[(state, action)]:.2f}")

# Plot the reward and epsilon curves
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.subplot(1, 2, 2)
plt.plot(epsilons)
plt.xlabel('Episode')
plt.ylabel('Epsilon')
plt.tight_layout()
plt.show()
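For intuition about the epsilon schedule, the values below are direct evaluations of the get_epsilon formula from the code above (they assume that code has been run): exploration starts fully random and settles close to min_epsilon well before training ends.

# min_epsilon + (initial_epsilon - min_epsilon) * exp(-decay_rate * episode)
print(round(get_epsilon(0), 3))    # 1.0   -> fully exploratory at the start
print(round(get_epsilon(100), 3))  # 0.431 -> 0.1 + 0.9 * e^(-1)
print(round(get_epsilon(500), 3))  # 0.106 -> already close to min_epsilon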