Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis
Nature 518, 529–533 (2015)
The theory of reinforcement learning provides a normative account1, deeply rooted in psychological2 and neuroscientific3 perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems4,5, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms3. While reinforcement learning agents have achieved some successes in a variety of domains6,7,8, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks9,10,11 to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games12. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Main
We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks—a central goal of general artificial intelligence13 that has eluded previous efforts8,14,15. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network16 known as deep neural networks. Notably, recent advances in deep neural networks9,10,11, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network17, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields—inspired by Hubel and Wiesel’s seminal work on feedforward processing in early visual cortex18—thereby exploiting the local spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.
We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

Q*(s, a) = max_π E[ r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s, a_t = a, π ],
which is the maximum sum of rewards rt discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s), after making an observation (s) and taking an action (a) (see Methods)19.
Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function20. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + γ max_{a′} Q(s′, a′). We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay21,22,23 that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration24, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; θ_i) using the deep convolutional neural network shown in Fig. 1, in which θ_i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent’s experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time-step t in a data set D_t = {e_1, …, e_t}. During learning, we apply Q-learning updates, on samples (or minibatches) of experience (s, a, r, s′) ∼ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function:

L_i(θ_i) = E_{(s,a,r,s′) ∼ U(D)} [ ( r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i) )² ],
in which γ is the discount factor determining the agent’s horizon, θ_i are the parameters of the Q-network at iteration i and θ_i^− are the network parameters used to compute the target at iteration i. The target network parameters θ_i^− are only updated with the Q-network parameters (θ_i) every C steps and are held fixed between individual updates (see Methods).
Figure 1: Schematic illustration of the convolutional neural network.
The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).
To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout—taking high-dimensional data (210 × 160 colour video at 60 Hz) as input—to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner—illustrated by the temporal evolution of two indices of learning (the agent’s average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).
Figure 2: Training curves tracking the agent’s average score and average predicted action-value.
a, Each point is the average score achieved per episode after the agent is run with ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.
We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available12,15. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3, Supplementary Discussion and Extended Data Table 2). In additional simulations (see Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent—the replay memory, separate target Q-network and deep convolutional network architecture—by disabling them and demonstrating the detrimental effects on performance.
Figure 3: Comparison of the DQN agent with the best reinforcement learning methods15 in the literature.
The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score − random play score)/(human score − random play score). It can be seen that DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.
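The normalization used in Fig. 3 is simple arithmetic on the raw scores; a minimal Python sketch is given below (the function name and the example values are illustrative, not results from the paper):

```python
def normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """Human-normalized score as in Fig. 3: 0% = random play, 100% = professional human tester."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Illustrative numbers only (not taken from the paper):
print(normalized_score(agent_score=400.0, random_score=150.0, human_score=300.0))  # ~166.7%
```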
We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders (see Supplementary Video 1 for a demonstration of the performance of DQN), by using a technique developed for the visualization of high-dimensional data called ‘t-SNE’25 (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own—in simulations where we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.
Figure 4: Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.
The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm25 on the last hidden layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right and top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.
It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro). Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for illustration of development of DQN’s performance over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN (for example, Montezuma’s Revenge).
In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work24,26, our approach incorporates ‘end-to-end’ reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex27,28. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm21,22,23 involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods21,22 (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia22. In the future, it will be important to explore the potential use of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay29, and relates to the notion of ‘prioritized sweeping’30 in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.
Methods
Preprocessing
Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from Algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
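As a concrete illustration of the preprocessing map φ described above, the sketch below follows the same steps: per-pixel maximum over the frame being encoded and the previous frame, extraction of the luminance channel, rescaling to 84 × 84, and stacking the m = 4 most recent preprocessed frames. The use of NumPy and OpenCV, and the function names, are assumptions for illustration, not the authors’ implementation.

```python
import numpy as np
import cv2

def preprocess_frame(frame: np.ndarray, previous_frame: np.ndarray) -> np.ndarray:
    """Encode a single Atari frame: max over two raw frames, luminance, resize to 84 x 84."""
    # Per-pixel maximum over the current and previous 210 x 160 RGB frames removes the
    # flicker caused by sprites that are drawn only on alternating frames.
    merged = np.maximum(frame, previous_frame)
    # Extract the Y (luminance) channel from the RGB frame.
    luminance = cv2.cvtColor(merged, cv2.COLOR_RGB2GRAY)
    # Rescale to 84 x 84.
    return cv2.resize(luminance, (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(recent_frames: list) -> np.ndarray:
    """Stack the m = 4 most recent preprocessed frames into the 84 x 84 x 4 network input."""
    return np.stack(recent_frames[-4:], axis=-1)
```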
Code availability
The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.
Model architecture
There are several possible ways of parameterizing Q using a neural network. Because Q maps history–action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches24,26. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.
The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity31,32. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
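The architecture above maps directly onto a few lines of a modern deep-learning framework. The sketch below uses PyTorch purely for illustration (it is not the authors’ implementation); the layer sizes follow the text, and the flattened size 64 × 7 × 7 follows from applying the stated filter sizes and strides to an 84 × 84 input.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network mapping an 84 x 84 x 4 state to one Q-value per valid action."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84 x 84 -> 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20 x 20 -> 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9 x 9   -> 7 x 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # fully connected, 512 rectifier units
        )
        self.q_values = nn.Linear(512, num_actions)                 # one linear output per valid action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: a batch of stacked, preprocessed frames with shape (N, 4, 84, 84)
        return self.q_values(self.features(state))
```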
Training details
We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods12,15. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.
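The reward clipping described above is a one-line transformation; a minimal sketch (NumPy is used only for convenience):

```python
import numpy as np

def clip_reward(reward: float) -> float:
    """Map every positive reward to 1, every negative reward to -1, and leave 0 unchanged."""
    return float(np.sign(reward))
```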
In these experiments, we used the RMSProp (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf ) algorithm with minibatches of size 32. The behaviour policy during training was ε-greedy with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of 1 million most recent frames.
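The linear annealing of ε can be written as a small schedule function; a minimal sketch using the values stated above (1.0 to 0.1 over the first million frames, fixed thereafter):

```python
def epsilon_at(frame: int, start: float = 1.0, end: float = 0.1,
               anneal_frames: int = 1_000_000) -> float:
    """Linearly anneal epsilon from `start` to `end` over `anneal_frames` frames, then hold it fixed."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)
```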
Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique15. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games.
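A minimal sketch of this frame-skipping scheme with k = 4 is given below; `emulator.step(action)` is a hypothetical interface returning the per-frame reward, and summing the reward over skipped frames is an implementation detail not spelled out in the text.

```python
def act_with_frame_skip(emulator, action: int, k: int = 4) -> float:
    """Repeat the selected action for k emulator frames and accumulate the reward."""
    total_reward = 0.0
    for _ in range(k):
        total_reward += emulator.step(action)  # hypothetical emulator interface
    return total_reward
```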
The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost. These parameters were then held fixed across all other games. The values and descriptions of all hyperparameters are provided in Extended Data Table 1.
Our experimental setup amounts to using the following minimal prior knowledge: that the input data consisted of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), number of actions, although not their correspondences (for example, specification of the up ‘button’) and the life count.
Evaluation procedure
The trained agents were evaluated by playing each game 30 times for up to 5 min each time with different initial random conditions (‘no-op’; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation. The random agent served as a baseline comparison and chose a random action at 10 Hz which is every sixth frame, repeating its last action on intervening frames. 10 Hz is about the fastest that a human player can select the ‘fire’ button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of the games. We did also assess the performance of a random agent that selected an action at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin.
The professional human tester used the same emulator engine as the agents, and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human player and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5 min each, following around 2 h of practice playing each game.
Algorithm
We consider tasks in which an agent interacts with an environment, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, …, K}. The action is passed to the emulator and modifies its internal state and the game score. In general the environment may be stochastic. The emulator’s internal state is not observed by the agent; instead the agent observes an image x_t ∈ R^d from the emulator, which is a vector of pixel values representing the current screen. In addition it receives a reward r_t representing the change in game score. Note that in general the game score may depend on the whole previous sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.
Because the agent only observes the current screen, the task is partially observed33 and many emulator states are perceptually aliased (that is, it is impossible to fully understand the current situation from only the current screen x_t). Therefore, sequences of actions and observations, s_t = x_1, a_1, x_2, …, a_{t−1}, x_t, are input to the algorithm, which then learns game strategies depending upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence s_t as the state representation at time t.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. We make the standard assumption that future rewards are discounted by a factor of γ per time-step (γ was set to 0.99 throughout), and define the future discounted return at time t as R_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}, in which T is the time-step at which the game terminates. We define the optimal action-value function Q*(s, a) = max_π E[ R_t | s_t = s, a_t = a, π ] as the maximum expected return achievable by following any policy, after seeing some sequence s and then taking some action a, in which π is a policy mapping sequences to actions (or distributions over actions).
The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q*(s′, a′) of the sequence s′ at the next time-step was known for all possible actions a′, then the optimal strategy is to select the action a′ maximizing the expected value of r + γ Q*(s′, a′):

Q*(s, a) = E_{s′} [ r + γ max_{a′} Q*(s′, a′) | s, a ].
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, Q_{i+1}(s, a) = E[ r + γ max_{a′} Q_i(s′, a′) | s, a ]. Such value iteration algorithms converge to the optimal action-value function, Q_i → Q* as i → ∞. In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalization. Instead, it is common to use a function approximator to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a). In the reinforcement learning community this is typically a linear function approximator, but sometimes a nonlinear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights θ as a Q-network. A Q-network can be trained by adjusting the parameters θ_i at iteration i to reduce the mean-squared error in the Bellman equation, where the optimal target values r + γ max_{a′} Q*(s′, a′) are substituted with approximate target values y = r + γ max_{a′} Q(s′, a′; θ_i^−), using parameters θ_i^− from some previous iteration. This leads to a sequence of loss functions L_i(θ_i) that changes at each iteration i,

L_i(θ_i) = E_{s,a,r} [ ( E_{s′}[y | s, a] − Q(s, a; θ_i) )² ] = E_{s,a,r,s′} [ ( y − Q(s, a; θ_i) )² ] + E_{s,a,r} [ Var_{s′}[y] ].
Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. At each stage of optimization, we hold the parameters from the previous iteration θ_i^− fixed when optimizing the ith loss function L_i(θ_i), resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters θ_i that we are currently optimizing, and may therefore be ignored. Differentiating the loss function with respect to the weights we arrive at the following gradient:

∇_{θ_i} L_i(θ_i) = E_{s,a,r,s′} [ ( r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i) ) ∇_{θ_i} Q(s, a; θ_i) ].
Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimize the loss function by stochastic gradient descent. The familiar Q-learning algorithm19 can be recovered in this framework by updating the weights after every time step, replacing the expectations using single samples, and setting θ_i^− = θ_{i−1}.
Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator, without explicitly estimating the reward and transition dynamics P(r, s′ | s, a). It is also off-policy: it learns about the greedy policy a = argmax_{a′} Q(s, a′; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability 1 − ε and selects a random action with probability ε.
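The ε-greedy behaviour policy is straightforward to write down; a minimal PyTorch sketch, assuming a Q-network that returns one Q-value per action as in Fig. 1 (the function name is illustrative):

```python
import random
import torch

def epsilon_greedy_action(q_network, state: torch.Tensor, num_actions: int, epsilon: float) -> int:
    """With probability epsilon pick a random action; otherwise pick the greedy action argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())
```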
Training algorithm for deep Q-networks
The full algorithm for training deep Q-networks is presented in Algorithm 1. The agent selects and executes actions according to an ε-greedy policy based on Q. Because using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed length representation of histories produced by the function φ described above. The algorithm modifies standard online Q-learning in two ways to make it suitable for training large neural networks without diverging.
First, we use a technique known as experience replay23 in which we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data set D_t = {e_1, …, e_t}, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, (s, a, r, s′) ∼ U(D), drawn at random from the pool of stored samples. This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically20. By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
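A minimal replay-memory sketch following this description (fixed capacity N, uniform sampling); the class and method names are illustrative, not the authors’ implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the most recent N transitions (s, a, r, s', done) and samples minibatches uniformly."""
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are overwritten automatically

    def store(self, state, action, reward, next_state, done) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        return random.sample(self.buffer, batch_size)

    def __len__(self) -> int:
        return len(self.buffer)
```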
In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates. This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping30.
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
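The periodic target-network update amounts to cloning the online network and then copying its parameters every C steps; a minimal PyTorch sketch with illustrative names (the value of C used in the paper is given in Extended Data Table 1; 10,000 here is only a placeholder):

```python
import copy
import torch.nn as nn

def make_target_network(q_network: nn.Module) -> nn.Module:
    """Clone the online Q-network to obtain the target network Q^ with parameters theta^- = theta."""
    return copy.deepcopy(q_network)

def maybe_sync_target(step: int, q_network: nn.Module, target_network: nn.Module, C: int = 10_000) -> None:
    """Every C updates, reset theta^- to the current online parameters theta; otherwise leave it fixed."""
    if step % C == 0:
        target_network.load_state_dict(q_network.state_dict())
```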
We also found it helpful to clip the error term from the update, r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i), to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.
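One common way to implement this error clipping is a Huber-style loss with δ = 1, which is quadratic for errors inside [−1, 1] and linear outside, so its gradient is exactly the clipped error term; the PyTorch sketch below is illustrative (equivalent up to a constant factor to the squared loss written above), not the authors’ code:

```python
import torch

def clipped_td_loss(q_value: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Quadratic for |error| <= 1, linear beyond: the gradient is the error term clipped to [-1, 1]."""
    error = target - q_value
    quadratic = torch.clamp(error.abs(), max=1.0)   # the part of |error| inside [0, 1]
    linear = error.abs() - quadratic                # the part of |error| beyond 1
    return (0.5 * quadratic ** 2 + linear).mean()
```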
Algorithm 1: deep Q-learning with experience replay
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ^− = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j if the episode terminates at step j + 1, and y_j = r_j + γ max_{a′} Q̂(φ_{j+1}, a′; θ^−) otherwise
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End For
End For
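Putting Algorithm 1 together, the heavily simplified Python sketch below ties the pieces into one training loop. It assumes the QNetwork, ReplayMemory, clip_reward, epsilon_at, epsilon_greedy_action, make_target_network and maybe_sync_target helpers sketched earlier in this section, plus a hypothetical emulator interface that returns preprocessed states as 4 × 84 × 84 tensors; the hyperparameter values are placeholders, not a faithful reproduction of the original training setup.

```python
import torch
import torch.nn.functional as F

def train_dqn(emulator, num_actions: int, total_frames: int = 50_000_000,
              gamma: float = 0.99, batch_size: int = 32, C: int = 10_000):
    """Deep Q-learning with experience replay (Algorithm 1), as an illustrative sketch."""
    q_network = QNetwork(num_actions)
    target_network = make_target_network(q_network)          # theta^- = theta
    memory = ReplayMemory()
    optimizer = torch.optim.RMSprop(q_network.parameters(), lr=2.5e-4)

    state = emulator.reset()                                  # hypothetical emulator interface
    for frame in range(total_frames):
        # epsilon-greedy action selection with the annealed epsilon
        action = epsilon_greedy_action(q_network, state, num_actions, epsilon_at(frame))
        next_state, reward, done = emulator.step(action)      # hypothetical emulator interface
        memory.store(state, action, clip_reward(reward), next_state, done)
        state = emulator.reset() if done else next_state

        if len(memory) >= batch_size:
            states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))
            states, next_states = torch.stack(states), torch.stack(next_states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards, dtype=torch.float32)
            dones = torch.tensor(dones, dtype=torch.float32)

            # y_j = r_j for terminal transitions, r_j + gamma * max_a' Q^(s', a'; theta^-) otherwise
            with torch.no_grad():
                next_q = target_network(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q

            # gradient descent step on (y_j - Q(s_j, a_j; theta))^2
            q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_values, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # every C steps (counted in frames here for simplicity) reset Q^ = Q
            maybe_sync_target(frame, q_network, target_network, C)
    return q_network
```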
References
1. Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998)
2. Thorndike, E. L. Animal Intelligence: Experimental Studies (Macmillan, 1911)
3. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)
4. Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 994–1000 (2005)
5. Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980)
6. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995)
7. Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009)
8. Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008)
9. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009)
10. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012)
11. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006)
12. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)
13. Legg, S. & Hutter, M. Universal intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007)
14. Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005)
15. Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell. 864–871 (2012)
16. McClelland, J. L., Rumelhart, D. E. & the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986)
17. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
18. Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat’s striate cortex. J. Physiol. 165, 559–568 (1963)
19. Watkins, C. J. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992)
20. Tsitsiklis, J. & Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997)
21. McClelland, J. L., McNaughton, B. L. & O’Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995)
22. O’Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010)
23. Lin, L.-J. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document (1993)
24. Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 3720, 317–328 (Springer, 2005)
25. Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
26. Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw. 1–8 (2010)
27. Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009)
28. Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002)
29. Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012)
30. Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993)
31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009)
32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010)
33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1994)
Acknowledgements
We thank G. Hinton, P. Dayan and M. Bowling for discussions, A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of the DeepMind team for their support, ideas and encouragement.
Author information
Author notes
Volodymyr Mnih, Koray Kavukcuoglu and David Silver: These authors contributed equally to this work.
Authors and Affiliations
Google DeepMind, 5 New Street Square, London EC4A 3TW, UK,
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis
Contributions
V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.
Corresponding authors
Correspondence to Koray Kavukcuoglu or Demis Hassabis.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Extended data figures and tables
Extended Data Figure 1 Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders.
The plot was generated by running the t-SNE algorithm25 on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).
Extended Data Figure 2 Visualization of learned value functions on two games, Breakout and Pong.
a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ∼17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ∼21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks clearing many of them by itself. b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’ action stays high while the value of the ‘down’ action falls to −0.9. This reflects the fact that pressing ‘down’ would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1. Note, the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, not shown during the game). With permission from Atari Interactive, Inc.
Extended Data Table 1 List of hyperparameters and their values
Extended Data Table 2 Comparison of games scores obtained by DQN agents with methods from the literature12,15 and a professional human games tester
Extended Data Table 3 The effects of replay and separating the target Q-network
Extended Data Table 4 Comparison of DQN performance with linear function approximator
Supplementary information
Supplementary Information
This file contains a Supplementary Discussion. (PDF 110 kb)
Performance of DQN in the Game Space Invaders
This video shows the performance of the DQN agent while playing the game of Space Invaders. The DQN agent successfully clears the enemy ships on the screen while the enemy ships move down and sideways with gradually increasing speed. (MOV 5106 kb)
Demonstration of Learning Progress in the Game Breakout
This video shows the improvement in the performance of DQN over training (i.e. after 100, 200, 400 and 600 episodes). After 600 episodes DQN finds and exploits the optimal strategy in this game, which is to make a tunnel around the side, and then allow the ball to hit blocks by bouncing behind the wall. Note: the score is displayed at the top left of the screen (maximum for clearing one screen is 448 points), number of lives remaining is shown in the middle (starting with 5 lives), and the “1” on the top right indicates this is a 1-player game. (MOV 1500 kb)
About this article
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
Received: 10 July 2014
Accepted: 16 January 2015
Published: 25 February 2015
Issue date: 26 February 2015
DOI: https://doi.org/10.1038/nature14236