deep Q-network (DQN)

  • DQN can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning.
  • For complex problems with continuous states, the Q-function cannot be expressed as a lookup table, requiring a continuous approximation. DQN uses a deep convolutional neural network to approximate the optimal action-value function.

Q-network

Notation

  • We parameterize an approximate value function $Q(s,a;\theta_i)$ using the deep convolutional neural network, in which $\theta_i$ are the parameters of the Q-network at iteration $i$.

Model architecture

  • Input: the state representation, preprocessed by a function $\phi$
    (In the Atari 2600 games used in the paper, the state information available to the agent at each time step is a $210\times160$ color game image. To reduce computation and memory consumption, these pixels are first preprocessed (the preprocessing method is described in detail in the paper), producing a stack of four $84\times84$ grayscale images; the input to the Q-network is therefore $84\times84\times4$.)
  • Output: a separate output unit for each possible action. The outputs correspond to the predicted Q-values of the individual actions for the input state.
    (There are 18 possible actions in Atari 2600 in total, so the output size is 18.)
  • The exact model architecture is as follows.
    [Figure: DQN model architecture, three convolutional layers followed by two fully connected layers, with one output unit per action]

This is essentially a classic convolutional neural network: three convolutional layers followed by two fully connected layers. Note that there is no pooling layer. If you think about the setting here, you will see why: pooling would introduce translation invariance, so the network would have little sense of where objects are located in the image. That property is useful for classification problems such as ImageNet, but here the position of the ball in the game is potentially a key factor in determining reward, and we certainly do not want to lose that information!
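As a concrete illustration, here is a minimal PyTorch sketch of a network with this shape (three convolutional layers plus two fully connected layers), assuming an $84\times84\times4$ input and one output unit per action; the layer sizes follow the Nature DQN paper, and the class name `QNetwork` is only for this example.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, .; theta): one forward pass returns a Q-value per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),                   # one output unit per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84), frames already preprocessed by phi
        return self.head(self.features(x))
```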


Loss function

  • A Q-value can be any real number, so this is really a regression task: we can optimize a simple squared error and update the network parameters by backpropagation (a minimal sketch of this loss computation follows below):
    $L_i(\theta_i)=\mathbb{E}_{(s,a,r,s')\sim U(D)}\Big[\big(r+\gamma\max_{a'}Q(s',a';\theta_i^-)-Q(s,a;\theta_i)\big)^2\Big]$
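A sketch of this squared-error loss in PyTorch, assuming `q_net` and `target_net` are two instances of the hypothetical `QNetwork` above and the batch tensors come from a replay buffer (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta_i) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: r + gamma * max_a' Q(s', a'; theta_i^-), with no gradient through the target net
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)
```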

Advantage

  • The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.
    (The network takes $s$ as input and outputs $Q(s,a)$ for every $a\in\mathcal A$, rather than taking $s$ and $a$ as input and outputting a single $Q(s,a)$. A single forward pass therefore yields the Q-values of all actions in state $s$, and we then simply pick the action with the largest Q-value; see the sketch below.)
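Reusing the hypothetical `QNetwork` from the architecture sketch, greedy action selection needs only one forward pass:

```python
import torch

@torch.no_grad()
def greedy_action(q_net, state: torch.Tensor) -> int:
    # state: (4, 84, 84) preprocessed observation; add a batch dimension
    q_values = q_net(state.unsqueeze(0))       # shape (1, n_actions), one forward pass
    return int(q_values.argmax(dim=1).item())  # action with the largest Q-value
```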

Challenges

Now that the network architecture has been introduced, what remains is how to generate training data and the corresponding target values for training the network; this is what the Experience Replay and Fixed Q-Targets mechanisms described below are for. Before introducing them, it is worth looking at why these two techniques are needed.

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function. This instability has several causes:

  • the correlations present in the sequence of observations, and the fact that small updates to Q may significantly change the policy and therefore change the data distribution
    • The sequence obtained from interacting with the environment is correlated. Machine-learning models trained by maximum likelihood rest on an important assumption: the training samples are independent and identically distributed. Once that assumption breaks, model performance degrades badly; the correlations above break exactly this i.i.d. assumption, so the learned value-function model may fluctuate heavily.
  • the correlations between the action-values (Q) and the target values $r+\gamma\max_{a'}Q(s',a')$
    • Suppose we have only one network to predict Q-values; then the target values of our training samples are also generated by that same network. In other words, the target Q-values used during training are strongly correlated with the current Q-values, which is bad for training the neural network.

We address these instabilities with a novel variant of Q-learning, which uses two key ideas.

Experience replay

  • First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution.

Fixed Q-Targets

  • Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.

Experience replay

  • We store the agent’s experiences at each time-step, $e_t=(s_t,a_t,r_t,s_{t+1})$, in a data set $D_t=\{e_1,\ldots,e_t\}$, pooled over many episodes into a replay memory.
  • During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, $(s, a, r, s')\sim U(D)$, drawn at random from the pool of stored samples.
    • In other words, every time the agent interacts with the environment ($s,a\rightarrow r,s'$), the tuple $(s, a, r, s')$ is stored in the replay memory. When training data for the Q-network is generated later, a minibatch is drawn at random from this memory for gradient descent, which effectively removes the similarity between consecutive samples.

  • In practice, our algorithm only stores the last $N$ experience tuples in the replay memory, and samples uniformly at random from $D$ when performing updates.
    • This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size $N$. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory.
    • A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping.
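A minimal replay-memory sketch under these assumptions: a fixed-size buffer holding the last $N$ transitions, sampled uniformly at random (the class and method names are illustrative, and states are assumed to be stored as tensors):

```python
import random
from collections import deque

import torch

class ReplayMemory:
    """Stores the last N transitions and samples uniformly at random."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are overwritten automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)
```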

Advantages

  • First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
    • (In fact, with the experience-replay trick we can train the Q-network not only on the agent’s own experience $(s, a, r, s')$ but also directly on experience generated by any human player.)
  • Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. (This addresses the correlation problem mentioned above.)
  • Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on.
    • For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch.
    • It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically.
    • By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.

Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.

Target network (Fixed Q-Targets)

  • The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets $y_j$ in the Q-learning update.
  • More precisely, every $C$ updates we clone the network $Q$ to obtain a target network $\hat Q$ and use $\hat Q$ for generating the Q-learning targets $y_j$ for the following $C$ updates to $Q$.
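A sketch of this periodic cloning in PyTorch, reusing the hypothetical `q_net` / `target_net` pair from the earlier sketches (the value of `C` is illustrative):

```python
# Clone the online network into the target network every C gradient updates.
C = 10_000  # illustrative value; in practice a fixed number of steps/updates

def maybe_sync_target(step: int, q_net, target_net):
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())  # Q_hat <- Q
```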

Advantage

  • This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases $Q(s_t,a_t)$ often also increases $Q(s_{t+1},a)$ for all $a$ and hence also increases the target $y_j$, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to $Q$ is made and the time the update affects the targets $y_j$, making divergence or oscillations much more unlikely.

Training algorithm for deep Q-networks

[Algorithm: deep Q-learning with experience replay (pseudocode figure from the Nature DQN paper)]
Notation

  • $x_t\in\mathbb R^d$: The image observed by the agent
  • $\theta_i$: parameters of the Q-network at iteration $i$
  • $\theta_i^-$: the network parameters used to compute the target at iteration $i$
    • The target network parameters $\theta_i^-$ are only updated with the Q-network parameters ($\theta_i$) every $C$ steps and are held fixed between individual updates.
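Putting the pieces together, here is a condensed sketch of the training loop (not the paper's exact pseudocode): an epsilon-greedy behaviour policy, uniform experience replay, and periodic target-network updates. `env` is assumed to follow a Gym-style `reset`/`step` interface, `preprocess` stands in for the preprocessing function $\phi$, and `greedy_action`, `dqn_loss`, and the replay memory are the hypothetical helpers sketched earlier.

```python
import random
import torch

def train_dqn(env, q_net, target_net, memory, optimizer, n_actions,
              num_steps=1_000_000, batch_size=32, gamma=0.99,
              epsilon=0.1, target_update=10_000, learn_start=50_000):
    state = preprocess(env.reset())  # phi: raw frames -> 4x84x84 tensor (assumed helper)
    for step in range(num_steps):
        # Epsilon-greedy behaviour policy
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = greedy_action(q_net, state)

        obs, reward, done, _ = env.step(action)
        next_state = preprocess(obs)
        memory.push(state, action, reward, next_state, float(done))
        state = preprocess(env.reset()) if done else next_state

        # Minibatch Q-learning update from the replay memory
        if step >= learn_start and len(memory) >= batch_size:
            batch = memory.sample(batch_size)
            loss = dqn_loss(q_net, target_net, *batch, gamma=gamma)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Fixed Q-targets: clone the online network every C steps
        if step % target_update == 0:
            target_net.load_state_dict(q_net.state_dict())
```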

Double DQN


  • Q-learning is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.
    • Overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error, which indicates that overestimations may be very common.
    • It is an open question whether, if the overestimations do occur, this negatively affects performance in practice.
      • Overoptimistic value estimates are not necessarily a problem in and of themselves.
        • If all values would be uniformly higher then the relative action preferences are preserved and we would not expect the resulting policy to be any worse.
        • Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique.
      • If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. The value estimates can further deteriorate if we bootstrap off of action values that are already overoptimistic, since this causes overestimations to propagate throughout our estimates, thus propagating the wrong relative information about which states are more valuable than others, directly affecting the quality of the learned policies.
  • The DQN algorithm also suffers from substantial overestimations under certain conditions.

Double Q-learning

  • The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates.
  • To prevent this, we can decouple the selection from the evaluation.

Original Double Q-learning algorithm

  • In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights, $\theta$ and $\theta'$.
  • For each update, one set of weights is used to determine the greedy policy and the other to determine its value.
    • For a clear comparison, we can first untangle the selection and evaluation in Q-learning and rewrite its target as
      $Y_t^{Q} \equiv R_{t+1}+\gamma\,Q\big(S_{t+1},\operatorname{argmax}_a Q(S_{t+1},a;\theta_t);\theta_t\big)$
    • The Double Q-learning error can then be written as
      $Y_t^{\text{DoubleQ}} \equiv R_{t+1}+\gamma\,Q\big(S_{t+1},\operatorname{argmax}_a Q(S_{t+1},a;\theta_t);\theta_t'\big)$
    • Notice that the selection of the action, in the $\operatorname{argmax}$, is still due to the online weights $\theta_t$. This means that, as in Q-learning, we are still estimating the value of the greedy policy according to the current values, as defined by $\theta_t$. However, we use the second set of weights $\theta_t'$ to fairly evaluate the value of this policy. This second set of weights can be updated symmetrically by switching the roles of $\theta$ and $\theta'$.

Overoptimism due to estimation errors

  • Typically, the overoptimism increases with the number of actions as shown in Figure 1.
    [Figure 1: overoptimism as a function of the number of actions]

Double DQN

  • The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection (current network) and action evaluation (target network).
    • Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the evaluation of the current greedy policy, without having to introduce additional networks.
    • Its update is the same as for DQN, but replacing the target $Y^{\text{DQN}}$ with
      $Y_t^{\text{DoubleDQN}} \equiv R_{t+1}+\gamma\,Q\big(S_{t+1},\operatorname{argmax}_a Q(S_{t+1},a;\theta_t);\theta_t^-\big)$
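A minimal sketch of this target computation, reusing the hypothetical `q_net` / `target_net` pair from the DQN sketches: the online network selects the action and the target network evaluates it.

```python
import torch

@torch.no_grad()
def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Action selection with the online network (theta_t)
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    # Action evaluation with the target network (theta_t^-)
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```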

Empirical results

  • Double DQN improves over DQN both in terms of value accuracy and in terms of policy quality
    [Figure: empirical comparison of DQN and Double DQN in terms of value accuracy and policy quality]

Dueling DQN

Advantage function

  • We define an important quantity, the advantage function, relating the value and Q functions:
    $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$
    • Note that
      $\mathbb E_{a\sim\pi(s)}\big[A^\pi(s,a)\big]=0$

  • Intuitively, the value function $V$ measures how good it is to be in a particular state $s$.
  • The advantage function obtains a relative measure of the importance of each action.

The Dueling Network Architecture

  • The key insight is that for many states, it is unnecessary to estimate the value of each action choice. (particularly when actions in a state do not affect the environment in any relevant way)
  • The Dueling Network explicitly separates the representation of state values and (state-dependent) action advantages. Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state.
    [Figure: the dueling network architecture with separate value and advantage streams, compared with a standard single-stream Q-network]

Aggregation layer

(1) Problem

  • Let us consider the dueling network shown in Figure 1, where we make one stream of fully-connected layers output a scalar $V(s;\theta,\beta)$, and the other stream output an $|\mathcal A|$-dimensional vector $A(s,a;\theta,\alpha)$.
    • Here, $\theta$ denotes the parameters of the convolutional layers, while $\alpha$ and $\beta$ are the parameters of the two streams of fully-connected layers.
  • Using the definition of advantage, we might be tempted to construct the aggregating module as follows:
    $Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s,a;\theta,\alpha) \qquad (7)$
    • However, we need to keep in mind that $Q(s,a;\theta,\alpha,\beta)$ is only a parameterized estimate of the true Q-function. Moreover, it would be wrong to conclude that $V(s;\theta,\beta)$ is a good estimator of the state-value function, or likewise that $A(s,a;\theta,\alpha)$ provides a reasonable estimate of the advantage function.
    • Equation (7) is unidentifiable in the sense that given $Q$ we cannot recover $V$ and $A$ uniquely. This lack of identifiability is mirrored by poor practical performance when this equation is used directly.

(2) Solution

  • From the expressions for the advantage, $Q^\pi(s,a) = V^\pi(s) + A^\pi(s,a)$, and the state value, $V^\pi(s) = \mathbb E_{a\sim\pi(s)}[Q^\pi(s,a)]$, it follows that $\mathbb E_{a\sim\pi(s)}[A^\pi(s,a)] = 0$. Moreover, for a deterministic policy, $a^* = \operatorname{argmax}_{a'\in\mathcal A} Q(s,a')$, it follows that $Q(s,a^*) = V(s)$ and hence $A(s,a^*) = 0$.
  • To address the issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action. That is, we let the last module of the network implement the forward mapping
    $Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \max_{a'\in\mathcal A} A(s,a';\theta,\alpha)\Big) \qquad (8)$
    • Note that $Q(s,a^*;\theta,\alpha,\beta) = V(s;\theta,\beta)$. Hence, the stream $V(s;\theta,\beta)$ provides an estimate of the value function, while the other stream produces an estimate of the advantage function.
  • An alternative module replaces the max operator with an average:
    $Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \frac{1}{|\mathcal A|}\sum_{a'} A(s,a';\theta,\alpha)\Big) \qquad (9)$
    • On the one hand this loses the original semantics of $V$ and $A$ because they are now off-target by a constant, but on the other hand it increases the stability of the optimization: with (9) the advantages only need to change as fast as the mean, instead of having to compensate for any change to the optimal action’s advantage in (8).
    • This formula can also be read as follows: when the advantage estimates of the different actions are nearly identical, the final $Q$-value is determined mostly by the state-value stream; the (mean-subtracted) advantages are then added to this state value to obtain the final $Q$-value.
  • We also experimented with a softmax version of equation (8), but found it to deliver similar results to the simpler module of equation (9). Hence, all the experiments reported in this paper use the module of equation (9).
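As an illustration, here is a minimal sketch of a dueling head using the mean-subtracted aggregation of equation (9); `feature_dim` is the size of the flattened convolutional features (3136 for the hypothetical `QNetwork` above), and the class name is illustrative.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Separate value and advantage streams, combined as in equation (9)."""

    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value_stream = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage_stream = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        value = self.value_stream(features)          # V(s; theta, beta), shape (batch, 1)
        advantage = self.advantage_stream(features)  # A(s, a; theta, alpha), shape (batch, |A|)
        # Q = V + (A - mean_a A): identifiable and stable, per equation (9)
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```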

Discussion

  • The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently. With every update of the $Q$ values in the dueling architecture, the value stream $V$ is updated; this contrasts with the updates in a single-stream architecture, where only the value for one of the actions is updated and the values for all other actions remain untouched. This more frequent updating of the value stream allocates more resources to $V$ and thus allows a better approximation of the state values, which in turn need to be accurate for temporal-difference-based methods like Q-learning to work.
    • This phenomenon is reflected in the experiments, where the advantage of the dueling architecture over single-stream $Q$ networks grows when the number of actions is large.
  • Furthermore, the differences between Q-values for a given state are often very small relative to the magnitude of $Q$. This difference in scale means that a small amount of noise in the updates can reorder the actions and thus make the nearly greedy policy switch abruptly. The dueling architecture, with its separate advantage stream, is robust to such effects.
    • The dueling architecture can also identify the correct action more quickly during policy evaluation as redundant or similar actions are added to the learning problem.

Prioritized Replay


  • The key idea was to increase the replay probability of experience tuples that have a high expected learning progress (as measured via the proxy of absolute TD-error).
    • This led to both faster learning and to better final policy quality across most games of the Atari benchmark suite, as compared to uniform experience replay.
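A simplified sketch of proportional prioritization, with sampling probabilities proportional to $|\delta|^\alpha$ for TD error $\delta$; this illustrative version uses a plain list instead of the sum-tree of the original paper and omits the importance-sampling correction.

```python
import numpy as np

class PrioritizedReplay:
    """Samples transitions with probability proportional to |TD error|^alpha."""

    def __init__(self, capacity: int, alpha: float = 0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def push(self, transition, td_error: float = 1.0):
        if len(self.buffer) >= self.capacity:          # overwrite the oldest transition
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(abs(td_error) + 1e-6)   # small constant keeps probabilities nonzero

    def sample(self, batch_size: int):
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        # Refresh priorities with the TD errors from the latest update
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```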