Q-learning is an important reinforcement learning algorithm for solving tasks modeled as Markov decision processes (MDPs). By learning a Q-value function and updating it with the Bellman equation, Q-learning gradually converges to an optimal policy over the course of training, maximizing the cumulative reward of the task.
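As a minimal sketch of that idea (illustrative only, not the implementation used later in this chapter), the tabular form of the Bellman update can be written in a few lines, where alpha is the learning rate and gamma the discount factor:

import numpy as np

# Minimal tabular Q-learning update (illustrative sketch):
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    target = reward + gamma * np.max(Q[next_state])       # Bellman target
    Q[state, action] += alpha * (target - Q[state, action])
    return Q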
The ε-greedy policy (epsilon-greedy policy) is a commonly used policy in reinforcement learning and is closely tied to the concept of exploration.
1. The ε-greedy policy
The ε-greedy policy is a rule for selecting actions in reinforcement learning. The agent has a parameter ε (epsilon), a positive number smaller than 1. At each time step, the agent picks a random action with probability ε in order to explore, and with probability 1-ε it picks the action with the highest estimated value (usually the Q-value) in order to exploit.
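The selection rule fits in a few lines of Python (a minimal sketch; Example 4-7 below defines essentially the same function):

import numpy as np

def epsilon_greedy(q_values, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best-known action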
2. Exploration and exploitation
In reinforcement learning, the agent must strike a balance between exploration and exploitation. Exploration means trying unknown actions or states in order to learn about the environment and find better policies; without exploration, the agent can get stuck in a local optimum.
Exploitation means acting according to the best policy known so far in order to maximize the cumulative reward; without exploitation, the agent cannot turn what it has already learned into reward.
3. The role of the ε-greedy policy
The ε-greedy policy trades exploration off against exploitation; adjusting the value of ε controls how much the agent explores.
The key is to gradually reduce ε during learning: the agent explores heavily early in training and shifts toward exploitation as training progresses, until ε settles at a small value. In this way the agent keeps refining its policy while still probing the environment for better ones.
In short, the ε-greedy policy lets a reinforcement learning agent balance exploration and exploitation, which is essential for learning and optimizing a policy effectively. The value of ε can be tuned to suit different tasks and different stages of training.
In deep learning, the ε-greedy policy and exploration are typically used to train reinforcement learning agents such as deep Q-networks (DQN). The example below uses Python and PyTorch to show how to apply the ε-greedy policy and exploration in a simple Q-learning task.
Example 4-7: Using Q-learning with an ε-greedy policy in a deep learning model (source path: daima\4\tan.py)
The file tan.py implements a simple Q-learning task: it builds a Q-network, implements the Q-learning algorithm, defines an environment and a policy, and finally trains and tests the agent. It serves as an introductory example for learning reinforcement learning and Q-learning. The code of tan.py is as follows.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# A simple Q-network
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# The ε-greedy policy
def epsilon_greedy_policy(q_values, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: pick a random action
    else:
        return np.argmax(q_values)  # exploit: pick the action with the highest Q-value

# The Q-learning algorithm
def q_learning(env, q_network, num_episodes, learning_rate, gamma, epsilon):
    optimizer = optim.Adam(q_network.parameters(), lr=learning_rate)
    criterion = nn.MSELoss()

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            # Convert the integer state into a one-hot vector
            state_one_hot = np.zeros(env.num_states)
            state_one_hot[state] = 1
            state_tensor = torch.FloatTensor([state_one_hot])

            # Choose an action for the current state
            q_values = q_network(state_tensor)
            action = epsilon_greedy_policy(q_values.detach().numpy()[0], epsilon)

            # Execute the action and observe the reward and the next state
            next_state, reward, done, _ = env.step(action)

            # Convert the integer next state into a one-hot vector
            next_state_one_hot = np.zeros(env.num_states)
            next_state_one_hot[next_state] = 1
            next_state_tensor = torch.FloatTensor([next_state_one_hot])

            # Compute the target Q-values (detached, so gradients only flow through q_values)
            target_q_values = q_values.clone().detach()
            if not done:
                with torch.no_grad():
                    target_q_values[0][action] = reward + gamma * torch.max(q_network(next_state_tensor))
            else:
                target_q_values[0][action] = reward

            # Compute the loss and update the Q-network
            loss = criterion(q_values, target_q_values)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            state = next_state

    return q_network

# Example environment: a simple Q-learning task
class SimpleEnvironment:
    def __init__(self):
        self.num_states = 4
        self.num_actions = 2
        # transitions[state, action] gives the next state; state 3 is terminal
        self.transitions = np.array([[1, 2], [2, 3], [3, 0], [3, 3]])
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        next_state = self.transitions[self.state, action]
        reward = 1 if next_state == 3 else 0  # reward only for reaching the terminal state
        done = (next_state == 3)
        self.state = next_state
        return next_state, reward, done, {}

# Create the environment and the Q-network
env = SimpleEnvironment()
q_network = QNetwork(env.num_states, env.num_actions)

# Train the Q-network
trained_q_network = q_learning(env, q_network, num_episodes=100, learning_rate=0.1, gamma=0.9, epsilon=0.1)

# Test the trained Q-network
state = env.reset()
done = False
while not done:
    # Convert the integer state into a one-hot vector
    state_one_hot = np.zeros(env.num_states)
    state_one_hot[state] = 1
    state_tensor = torch.FloatTensor([state_one_hot])

    q_values = trained_q_network(state_tensor)
    action = epsilon_greedy_policy(q_values.detach().numpy()[0], epsilon=0.0)  # greedy policy for testing
    next_state, reward, done, _ = env.step(action)
    print(f"State: {state}, Action: {action}, Reward: {reward}, Next State: {next_state}")
    state = next_state
The code above demonstrates how to train a neural network (a Q-network) with the Q-learning algorithm to solve a simple reinforcement learning task. Its main parts are as follows:
(1) Class QNetwork: defines a small neural network that estimates the Q-values of state-action pairs.
(2) Function epsilon_greedy_policy: implements the ε-greedy policy used to choose actions during training and testing.
(3) Function q_learning: implements the Q-learning algorithm used to train the Q-network. In every episode the agent chooses actions from the Q-values, observes the reward and the next state, and then updates the Q-values to improve its policy; the optimizer minimizes the mean squared error of the Q-values by gradient descent. The function takes the following parameters: env (the environment), q_network (the network to train), num_episodes (the number of training episodes), learning_rate (the optimizer's learning rate), gamma (the discount factor), and epsilon (the exploration probability).
(4) Class SimpleEnvironment: defines the small environment in which the agent performs the Q-learning task.
(5) Main program: creates a SimpleEnvironment and a QNetwork, trains the network, and then tests it.
Running the script prints the training process and the test results of Q-learning:
State: 0, Action: 1, Reward: 0, Next State: 1
State: 1, Action: 0, Reward: 1, Next State: 0
State: 0, Action: 0, Reward: 0, Next State: 1
State: 1, Action: 1, Reward: 1, Next State: 0
## some output omitted...
State: 3, Action: 1, Reward: 0, Next State: 2
State: 2, Action: 0, Reward: 0, Next State: 3
State: 3, Action: 0, Reward: 0, Next State: 2
State: 2, Action: 1, Reward: 0, Next State: 3
This output is produced while testing the trained Q-network; it shows the state the agent is in, the action it chooses, the reward it receives, and the next state as it moves through the environment. During training you can also log additional information, such as the cumulative reward of each episode, to gauge learning progress and check whether the agent has actually learned the task.
Ultimately, the goal of Q-learning is to train the Q-network into an optimized policy that earns the agent the largest possible cumulative reward in the environment. The information in the output above helps you see how the agent behaves and whether it has learned such a policy.
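As a minimal sketch of that kind of logging (it reuses the env from Example 4-7 and, purely for brevity, lets a random policy stand in for the learned one), the cumulative reward of every episode could be recorded like this:

import numpy as np

# Log the cumulative reward per episode (sketch; assumes the env defined in Example 4-7).
episode_rewards = []
for episode in range(100):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = np.random.randint(env.num_actions)   # placeholder for the learned policy
        state, reward, done, _ = env.step(action)
        total_reward += reward
    episode_rewards.append(total_reward)
    print(f"Episode {episode}, cumulative reward: {total_reward}")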
Varying and tuning the exploration strategy in Q-learning usually comes down to adjusting the ε value of the ε-greedy policy, since ε determines the probability of exploring when an action is chosen. Some common variants and refinements are listed below.
1. Exponentially decaying ε
A common approach is to shrink ε exponentially: start from an initial value and multiply it by a factor smaller than 1 after every episode (or, equivalently, apply an exponential decay function of the episode index), so that ε drops quickly toward zero. This yields strong exploration early in training and mostly exploitation later on. The following example shows how an exponentially decaying ε can be used in a deep learning setting, where a simple neural network is trained to fit a curve.
Example 4-8: Optimizing a neural network with an exponentially decaying ε policy (source path: daima\4\zhi.py)
The code of the example file zhi.py is as follows.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# A simple deep learning model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1, 1)  # a linear layer with one input and one output

    def forward(self, x):
        return self.fc1(x)

# Exponentially decaying ε schedule
def exponential_decay_epsilon(initial_epsilon, episode, decay_rate):
    return initial_epsilon * np.exp(-decay_rate * episode)

# Example data: a simple noisy line
X = torch.FloatTensor(np.linspace(0, 1, 100)).unsqueeze(1)
y = 2 * X + 1 + 0.2 * torch.randn(X.size())

# Create the model and the optimizer
model = SimpleModel()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Q-learning-style parameters
initial_epsilon = 1.0  # initial ε
decay_rate = 0.01      # decay rate of ε
num_episodes = 100     # total number of episodes

# Loss function
criterion = nn.MSELoss()

# Initialize the loss
loss = torch.tensor(0.0, requires_grad=True)

for episode in range(num_episodes):
    epsilon = exponential_decay_epsilon(initial_epsilon, episode, decay_rate)

    # Choose an "action" according to the ε-greedy policy
    if np.random.rand() < epsilon:
        # Random action: randomly perturb the model parameters
        model.fc1.weight.data += torch.randn_like(model.fc1.weight.data) * 0.1
    else:
        # Greedy action: update the parameters with a gradient descent step
        model.zero_grad()
        y_pred = model(X)
        loss = criterion(y_pred, y)
        loss.backward()
        optimizer.step()

    if episode % 10 == 0:
        print(f"Episode {episode}, Epsilon: {epsilon}, Loss: {loss.item()}")

# Print the learned model parameters
print("Learned Model Parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
In the code above, a simple linear model is used to fit a noisy line. The exponentially decaying ε policy decides whether to randomly perturb the model parameters or to update them with gradient descent based on the current parameters. The purpose of the example is to show how an exploration strategy (the ε-greedy policy) can change the behavior of an optimization loop in deep learning. After execution, the output is:
Episode 0, Epsilon: 1.0, Loss: 0.0
Episode 10, Epsilon: 0.9048374180359595, Loss: 0.0
Episode 20, Epsilon: 0.8187307530779818, Loss: 0.0
Episode 30, Epsilon: 0.7408182206817179, Loss: 0.0
Episode 40, Epsilon: 0.6703200460356393, Loss: 6.356284141540527
Episode 50, Epsilon: 0.6065306597126334, Loss: 5.90389347076416
Episode 60, Epsilon: 0.5488116360940265, Loss: 4.267116069793701
Episode 70, Epsilon: 0.49658530379140947, Loss: 2.5693044662475586
Episode 80, Epsilon: 0.44932896411722156, Loss: 2.223409414291382
Episode 90, Epsilon: 0.4065696597405991, Loss: 1.5685646533966064
Learned Model Parameters:
fc1.weight tensor([[0.0810]])
fc1.bias tensor([0.8807])
2. Linearly decaying ε
Another option is to reduce ε gradually over the course of training so that exploration tapers off. At the start, ε is set to a relatively large value to encourage exploration; as training proceeds, ε shrinks and the agent relies more on the best policy known so far. The benefit is a smooth shift of emphasis from exploration toward exploitation. With a linear decay, ε is reduced by the same amount every episode until it reaches some threshold. The following example uses a linearly decaying ε.
Example 4-9: Optimizing a neural network with a linearly decaying ε policy (source path: daima\4\xian.py)
The code of the example file xian.py is as follows.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# A simple deep learning model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1, 1)  # a linear layer with one input and one output

    def forward(self, x):
        return self.fc1(x)

# Linearly decaying ε schedule
def linear_decay_epsilon(initial_epsilon, episode, total_episodes):
    epsilon = initial_epsilon - (initial_epsilon / total_episodes) * episode
    return max(epsilon, 0.0)  # make sure ε never drops below 0

# Example data: a simple noisy line
X = torch.FloatTensor(np.linspace(0, 1, 100)).unsqueeze(1)
y = 2 * X + 1 + 0.2 * torch.randn(X.size())

# Create the model and the optimizer
model = SimpleModel()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Q-learning-style parameters
initial_epsilon = 1.0  # initial ε
num_episodes = 100     # total number of episodes

# Loss function
criterion = nn.MSELoss()

# Initialize the loss
loss = torch.tensor(0.0, requires_grad=True)

for episode in range(num_episodes):
    epsilon = linear_decay_epsilon(initial_epsilon, episode, num_episodes)

    # Choose an "action" according to the ε-greedy policy
    if np.random.rand() < epsilon:
        # Random action: randomly perturb the model parameters
        model.fc1.weight.data += torch.randn_like(model.fc1.weight.data) * 0.1
    else:
        # Greedy action: update the parameters with a gradient descent step
        model.zero_grad()
        y_pred = model(X)
        loss = criterion(y_pred, y)
        loss.backward()
        optimizer.step()

    if episode % 10 == 0:
        print(f"Episode {episode}, Epsilon: {epsilon}, Loss: {loss.item()}")

# Print the learned model parameters
print("Learned Model Parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
In this example, the linearly decaying schedule linear_decay_epsilon is applied in every episode: ε shrinks linearly from its initial value until it reaches 0. The model therefore explores more in the early part of training and exploits more toward the end; you can adjust the initial ε and the total number of episodes to control how fast ε decays. After execution, the output is:
Episode 0, Epsilon: 1.0, Loss: 0.0
Episode 10, Epsilon: 0.9, Loss: 0.0
Episode 20, Epsilon: 0.8, Loss: 0.8105946183204651
Episode 30, Epsilon: 0.7, Loss: 0.8105946183204651
Episode 40, Epsilon: 0.6, Loss: 0.39483004808425903
Episode 50, Epsilon: 0.5, Loss: 0.49674656987190247
Episode 60, Epsilon: 0.4, Loss: 0.36505362391471863
Episode 70, Epsilon: 0.29999999999999993, Loss: 0.27561822533607483
Episode 80, Epsilon: 0.19999999999999996, Loss: 0.28438231348991394
Episode 90, Epsilon: 0.09999999999999998, Loss: 0.14754557609558105
Learned Model Parameters:
fc1.weight tensor([[1.5640]])
fc1.bias tensor([1.0217])
3. Adaptive ε
Some methods adapt ε to the agent's learning progress: if performance has not improved for a while, ε is increased to create more opportunities for exploration; if performance has been improving, ε is decreased so the agent exploits more.
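As a rough sketch of such a rule (the function name, window size, and step sizes below are illustrative assumptions, not taken from the examples above), recent episode rewards could be compared and ε nudged accordingly:

import numpy as np

# Hypothetical adaptive-ε rule: explore more when recent rewards stagnate,
# exploit more when they improve. All thresholds are illustrative.
def adapt_epsilon(epsilon, recent_rewards, window=10,
                  eps_min=0.01, eps_max=1.0, step=0.05):
    if len(recent_rewards) < 2 * window:
        return epsilon                                   # not enough history yet
    older = np.mean(recent_rewards[-2 * window:-window])
    newer = np.mean(recent_rewards[-window:])
    if newer <= older:
        return min(eps_max, epsilon + step)              # no improvement: explore more
    return max(eps_min, epsilon - step)                  # improvement: exploit more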
4. The UCB (Upper Confidence Bound) policy
UCB is an uncertainty-driven exploration method: it estimates an upper confidence bound on the value of each action and selects the action with the highest bound. This balances exploration and exploitation while explicitly taking uncertainty into account.
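As a minimal sketch of the idea (UCB1-style selection for a single state; the function name and the constant c are illustrative assumptions):

import numpy as np

# UCB1-style action selection for one state (illustrative sketch).
# q_values: current value estimates; counts: how often each action was tried; t: total steps so far.
def ucb_action(q_values, counts, t, c=2.0):
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))                    # try every action at least once
    bonus = c * np.sqrt(np.log(t) / counts)              # larger bonus for rarely tried actions
    return int(np.argmax(np.asarray(q_values) + bonus))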
5. Bayesian methods
Some methods use Bayesian inference to estimate uncertainty and then choose exploratory actions based on that uncertainty. They usually require more elaborate mathematical models but can provide more precise uncertainty estimates. The following Q-learning example uses a Bayesian method to estimate the uncertainty of the state transition probabilities.
Example 4-10: Estimating the uncertainty of state transition probabilities with a Bayesian method (source path: daima\4\beiye.py)
The code of the example file beiye.py is as follows.
import numpy as np
import pymc3 as pm

# A simple two-state, two-action Q-learning environment with unknown dynamics
num_states = 2
num_actions = 2
true_transitions = np.array([[0.9, 0.1], [0.2, 0.8]])  # true P(next_state | state)

# Collect experience: count the observed transitions while behaving randomly
counts = np.zeros((num_states, num_states))
state = 0
for _ in range(500):
    next_state = np.random.choice(num_states, p=true_transitions[state])
    counts[state, next_state] += 1
    state = next_state

# Use PyMC3 to define a Bayesian model of the transition probabilities
with pm.Model() as model:
    # Beta prior on P(next_state = 0 | state), one per state
    transition_probs = pm.Beta("transition_probs", alpha=2, beta=2, shape=num_states)
    # Likelihood of the observed transition counts
    pm.Binomial("obs", n=counts.sum(axis=1).astype(int), p=transition_probs,
                observed=counts[:, 0].astype(int))
    # Run MCMC inference
    trace = pm.sample(1000, tune=1000, progressbar=False)

# Print summary statistics of the posterior distribution (the uncertainty estimate)
print(pm.summary(trace))

# Plug the posterior-mean transition probabilities into a tabular Q-learning loop
p0 = trace["transition_probs"].mean(axis=0)       # posterior mean of P(next_state = 0 | state)
est_transitions = np.stack([p0, 1 - p0], axis=1)

Q = np.zeros((num_states, num_actions))
gamma = 0.9
alpha = 0.1
epsilon = 0.1
num_episodes = 1000

state = 0
for episode in range(num_episodes):
    # ε-greedy action selection (the actions do not change the toy dynamics here)
    if np.random.rand() < epsilon:
        action = np.random.randint(num_actions)
    else:
        action = int(np.argmax(Q[state]))
    next_state = np.random.choice(num_states, p=est_transitions[state])
    reward = 1.0 if next_state == 1 else 0.0      # simple reward for reaching state 1
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state

print("Learned Q-values:")
print(Q)
The code above uses PyMC3 to place a Bayesian prior on the state transition probabilities, runs MCMC inference to obtain their posterior distribution (and therefore an uncertainty estimate), and then uses the estimated probabilities inside Q-learning. Running the script prints the posterior summary table followed by the learned Q-values.
In real applications, the right exploration strategy depends on the specific problem and environment. It usually takes experimentation and tuning during training to find the most effective one; the goal is always to balance exploration and exploitation so as to maximize the long-term cumulative reward.
This section uses a simple example to analyze how the exploration strategy affects the performance of Q-learning: we take a small Q-learning environment and compare the effect of different exploration strategies.
Example 4-11: Comparing the performance of the greedy and ε-greedy policies in Q-learning (source path: daima\4\bi.py)
The code of the example file bi.py is as follows.
import numpy as np

class SimpleEnvironment:
    def __init__(self):
        self.num_states = 4
        self.num_actions = 2
        # transitions[state, action] gives the next state; state 3 is terminal
        self.transitions = np.array([[1, 2], [2, 3], [3, 0], [3, 3]])
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        next_state = self.transitions[self.state, action]
        reward = 1 if next_state == 3 else 0  # reward only for reaching the terminal state
        done = (next_state == 3)
        self.state = next_state
        return next_state, reward, done

# Greedy policy
def greedy_policy(q_values, epsilon):
    return np.argmax(q_values)

# ε-greedy policy
def epsilon_greedy_policy(q_values, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: pick a random action
    else:
        return np.argmax(q_values)  # exploit: pick the action with the highest Q-value

# Q-learning algorithm
def q_learning(env, num_episodes, learning_rate, gamma, epsilon, policy):
    Q = np.zeros((env.num_states, env.num_actions))

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            action = policy(Q[state], epsilon)
            next_state, reward, done = env.step(action)

            # Update the Q-value
            target = reward + gamma * np.max(Q[next_state])
            Q[state][action] = Q[state][action] + learning_rate * (target - Q[state][action])

            state = next_state

    return Q

# Create the environment
env = SimpleEnvironment()
num_episodes = 1000
learning_rate = 0.1
gamma = 0.9

# Compare the greedy policy with the ε-greedy policy
q_greedy = q_learning(env, num_episodes, learning_rate, gamma, epsilon=0.0, policy=greedy_policy)
q_epsilon_greedy = q_learning(env, num_episodes, learning_rate, gamma, epsilon=0.1, policy=epsilon_greedy_policy)

# Print the final Q-values
print("Greedy Q-values:")
print(q_greedy)
print("\nε-Greedy Q-values:")
print(q_epsilon_greedy)
In the code above, a simple Q-learning environment is created and the Q-learning algorithm is run with two different exploration strategies, comparing the performance of the greedy policy against the ε-greedy policy. After execution, the output is:
Greedy Q-values:
[[0. 0. ]
 [0. 0. ]
 [0. 1. ]
 [0. 0. ]]

ε-Greedy Q-values:
[[0. 0. ]
 [0. 0. ]
 [0. 0.9025 ]
 [0. 0. ]]
In this output, the Greedy Q-values section shows the Q-value function learned with the greedy policy, and the ε-Greedy Q-values section shows the one learned with the ε-greedy policy; the values are the estimated Q-values of each action in each state. The purely greedy policy tends to produce more conservative Q-values, while the ε-greedy policy explores more of the state-action space. You can adjust the parameters, increase the number of iterations, or try a more complex environment to observe how the strategies differ.
This section demonstrates, with a simple example, how to use the Q-learning algorithm to look for buy and sell points of a stock. Assume that the file stock_data.csv stores the stock's historical trading data; the code below relies only on a closing-price column named close.
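The original data file is not reproduced here; if you want to run the example without it, a synthetic file with the assumed layout (a date column plus the close column that the code actually reads) can be generated like this:

import numpy as np
import pandas as pd

# Create a small synthetic stock_data.csv (assumed layout; only 'close' is used by mg.py below)
num_days = 100
close = 10 + np.cumsum(np.random.randn(num_days) * 0.1)   # a random-walk closing price
pd.DataFrame({
    "date": pd.date_range("2021-01-04", periods=num_days, freq="B"),
    "close": np.round(close, 2),
}).to_csv("stock_data.csv", index=False)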
Example 4-12: Using Q-learning to find buy and sell points for a stock (source path: daima\4\mg.py)
The code of the example file mg.py is as follows.
import pandas as pd
import numpy as np

# Load the stock data from the CSV file
data = pd.read_csv('stock_data.csv')

# Parameters of the Q-learning algorithm
learning_rate = 0.1
discount_factor = 0.9
exploration_prob = 0.2
num_episodes = 100

# Initialize the Q-table: one row per state, one column per action
num_states = len(data)
num_actions = 2  # 0 = sell, 1 = hold
Q = np.zeros((num_states, num_actions))

# Choose an action with the ε-greedy policy
def choose_action(state):
    if np.random.rand() < exploration_prob:
        return np.random.choice(num_actions)
    else:
        return np.argmax(Q[state, :])

# Apply the Q-learning update rule
def update_Q(state, action, reward, next_state):
    best_next_action = np.argmax(Q[next_state, :])
    Q[state, action] += learning_rate * (reward + discount_factor * Q[next_state, best_next_action] - Q[state, action])

# Train the Q-learning agent
for episode in range(num_episodes):
    state = 0  # initial state
    total_reward = 0

    while state < num_states - 1:
        action = choose_action(state)

        # Simulate the action and compute the reward
        if action == 0:  # sell
            reward = -data.iloc[state]['close']
        else:  # hold
            reward = 0

        next_state = state + 1
        update_Q(state, action, reward, next_state)
        total_reward += reward
        state = next_state

    print(f"Episode {episode + 1}: Total Reward = {total_reward}")

# Use the trained Q-table to pick out buy and sell points
buy_points = []
sell_points = []

for state in range(num_states):
    action = choose_action(state)
    if action == 0:  # sell
        sell_points.append(data.iloc[state]['close'])
    else:  # hold
        buy_points.append(data.iloc[state]['close'])

print("买入点:", buy_points)
print("卖出点:", sell_points)
The implementation flow of the code above is as follows:
(1) Load the historical stock data from stock_data.csv and set the Q-learning hyperparameters: the learning rate, the discount factor, the exploration probability, and the number of episodes.
(2) Initialize a Q-table with one state per trading day and two actions (0 = sell, 1 = hold).
(3) Use the ε-greedy function choose_action to pick an action in each state, and update_Q to apply the Q-learning update rule.
(4) Train for num_episodes episodes, stepping through the trading days in order, accumulating the reward of each episode, and printing it.
(5) After training, walk through every state once more and, depending on the chosen action, collect the closing prices into the buy-point and sell-point lists.
After execution, the output is:
Episode 1: Total Reward = -6789.550000000008
Episode 2: Total Reward = -1250.8799999999999
Episode 3: Total Reward = -668.9400000000002
### some output omitted
Episode 97: Total Reward = -609.9899999999999
Episode 98: Total Reward = -776.4999999999999
Episode 99: Total Reward = -600.7399999999999
Episode 100: Total Reward = -702.2499999999998
- 买入点: [9.73, 9.83, 10.04, 10.1, 10.24, 10.72, 10.79, 10.63, 10.86, 11.05, 11.06, 11.18, 11.27, 10.9, 10.49, 10.29, 10.81, 10.81, 11.2, 10.92, 11.3, 11.12, 11.48, 11.71, 11.24, 11.69, 11.61, 11.85, 12.02, 12.12, 11.86, 11.91, 11.94, 11.87, 11.73, 11.57, 11.68, 11.98, 11.73, 11.55, 11.63, 12.0, 11.95, 12.1, 11.81, 11.63, 12.15, 12.02, 12.2, 12.48, 12.33, 12.55, 12.41, 12.54, 12.52, 12.65, 13.13, 13.19, 14.65, 15.87, 14.99, 15.14, 14.8, 15.21, 15.38, 14.58, 14.3, 14.14, 15.0, 15.22, 15.5, 14.6, 14.51, 13.95, 14.24, 13.36, 13.3, 13.44, 13.14, 13.36, 13.67, 14.39, 14.39, 14.03, 14.06, 15.11, 14.9, 15.49, 15.98, 15.45, 15.04, 15.97, 15.09, 17.7, 15.98, 16.86, 15.89, 16.42, 16.17, 15.5, 15.87, 16.9, 15.84, 16.74, 18.6, 18.97, 20.08, 19.2, 16.85, 16.46, 16.89, 17.98, 16.78, 15.25, 15.0, 14.07, 13.82, 13.27, 12.2, 11.51, 12.17, 11.8, 10.98, 11.57, 11.06, 11.36, 11.66, 11.53, 11.31, 10.79, 10.76, 11.72, 11.19, 11.65, 11.88, 11.01, 10.01, 10.84, 10.18, 9.69, 9.37, 9.7, 8.82, 8.02, 7.29, 7.27, 7.14, 7.19, 7.0, 7.11, 7.04, 7.04, 6.95, 6.8, 6.85, 6.72, 6.7, 6.67, 6.8, 6.75, 6.73, 6.78, 6.8, 6.74, 6.54, 6.48, 6.41, 6.54, 6.53, 6.58, 6.47, 6.48, 6.55, 6.59, 6.86, 6.97, 6.95, 6.98, 7.07, 7.15, 7.18, 7.14, 7.01, 6.89, 6.86, 6.72, 6.86, 6.84, 6.89, 6.98, 7.06, 7.26, 7.4, 7.22, 7.27, 7.21, 7.06, 7.1, 7.09, 7.21, 7.09, 7.06, 6.99, 7.13, 7.12, 7.08, 6.77, 7.18, 7.05, 6.78, 6.83, 6.94, 6.94, 6.91, 6.97, 7.01, 6.82, 6.7, 6.59, 6.38, 6.34, 6.5, 6.55, 6.63, 6.77, 6.67, 6.99, 7.06, 7.09, 7.13, 7.61, 7.5, 7.5, 7.39, 7.53, 7.51, 7.38, 7.32, 7.22, 7.3, 7.24, 7.45, 7.51, 7.49, 7.56, 7.62, 7.6, 7.65, 7.76, 7.8, 7.66, 7.73, 7.76, 7.57, 7.23, 7.12, 7.05, 7.43, 7.39, 7.48, 7.38, 7.4, 7.37, 7.47, 7.47, 7.41, 7.33, 7.87, 7.92, 7.96, 8.13, 8.11, 8.26, 8.42, 8.44, 8.52, 8.49, 8.48, 8.31, 8.4, 8.32, 8.15, 8.35, 8.38, 8.37, 8.51, 8.36, 8.3, 8.32, 8.4, 8.31, 8.43, 8.47, 8.37, 8.28, 8.28, 8.13, 8.16, 7.99, 8.38, 8.3, 8.2, 8.33, 8.14, 8.11, 8.24, 8.21, 8.22, 8.19, 8.17, 8.19, 8.19, 7.87, 7.93, 7.78, 8.0, 8.52, 8.73, 8.98, 9.12, 9.15, 9.16, 9.31, 9.28, 9.41, 9.22, 9.65, 9.7, 10.0, 9.98, 10.03, 9.89, 9.99, 9.94, 9.97, 10.12, 10.23, 10.06, 10.18, 10.05, 9.79, 9.61, 10.2, 10.51, 10.63, 10.63, 10.54, 10.53, 10.75, 10.71, 10.49, 10.46, 10.48, 10.44, 10.86, 10.79, 11.01, 10.8, 10.66, 10.77, 10.67, 10.61, 10.75, 10.86, 10.83, 10.75, 10.58, 10.44, 10.45, 10.94, 11.49, 12.05, 12.29, 12.57, 12.62, 12.55, 12.54, 12.29, 12.33, 12.44, 12.65, 12.7, 12.78, 12.84, 12.83, 12.72, 12.5, 12.51, 12.47, 12.38, 12.43, 12.28, 12.39, 12.02, 11.9, 12.04, 12.23, 12.16, 12.23, 12.4, 12.21, 12.32, 12.22, 12.21, 12.21, 12.26, 12.09, 12.13, 11.96, 11.89, 11.94, 12.04, 11.96, 11.96, 12.0, 12.01, 11.93, 12.22, 12.37, 12.33, 12.41, 12.43, 12.4, 12.06, 12.26, 12.34, 12.27, 12.33, 12.04, 11.91, 12.18, 11.86, 11.82, 11.8, 11.8, 11.99, 11.77, 11.88, 11.93, 11.88, 11.81, 11.98, 12.01, 11.88, 11.8, 11.96, 11.97, 11.83, 12.12, 12.09, 12.35, 12.55, 12.44, 12.49, 12.54, 12.38, 12.47, 12.74, 12.19, 12.18, 12.07, 12.08, 11.9, 11.66, 11.78, 11.96, 12.27, 12.11, 11.99, 11.83, 12.02, 12.04, 12.06, 12.15, 12.24, 12.16, 12.34, 12.36, 12.18, 12.09, 12.38, 12.26, 12.3, 11.93, 11.78, 12.16, 12.25, 12.56, 12.87, 12.97, 12.74, 12.96, 12.97, 12.88, 13.05, 13.07, 13.52, 12.61, 12.63, 12.36, 12.05, 11.82, 12.39, 12.29, 12.01, 12.2, 12.04, 12.12, 12.07, 12.4, 12.46, 12.7, 13.09, 11.9, 12.0, 11.85, 11.85, 11.93, 12.13, 12.44, 12.25, 12.17, 12.44, 12.24, 12.28, 12.19, 12.02, 12.19, 11.95, 12.0, 11.95, 12.05, 12.17, 12.68, 12.49, 12.46, 12.61, 12.89, 12.71, 13.09, 12.93, 13.05, 13.18, 13.97, 
14.1, 13.95, 13.84, 13.82, 13.57, 13.61, 13.84, 14.01, 14.17, 14.27, 14.02, 13.97, 13.92, 14.06, 13.86, 13.89, 13.92, 13.87, 13.93, 13.75, 13.97, 13.98, 14.07, 14.29, 14.81, 15.32, 15.74, 15.6, 15.88, 16.1, 15.6, 15.96, 15.82, 16.32, 16.68, 16.29, 16.27, 16.01, 15.79, 15.5, 15.64, 15.65, 16.23, 16.39, 16.27, 17.21, 17.8, 17.05, 17.37, 17.6, 16.71, 16.66, 17.32, 17.59, 17.85, 17.47, 17.75, 17.51, 15.93, 16.4, 15.43]
- 卖出点: [10.37, 11.07, 11.2, 10.88, 12.16, 14.71, 15.19, 16.77, 16.58, 17.45, 11.34, 11.2, 6.8, 6.87, 6.86, 7.13, 7.53, 7.54, 7.48, 7.26, 7.32, 7.29, 7.56, 7.5, 7.92, 8.11, 8.45, 8.2, 8.13, 8.23, 9.99, 10.85, 12.53, 12.18, 11.97, 11.95, 12.5, 12.04, 11.94, 12.42, 12.21, 12.21, 12.05, 12.5, 12.14, 12.94, 14.27, 13.99, 14.02, 14.45, 14.33, 15.64, 15.67, 17.66, 17.28, 15.73]