Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction (2nd Edition).
Here we summarize some basic notions and formulations that appear in most reinforcement learning problems. This note does not include a detailed explanation of each notion; refer to the reference above if you want a deeper insight.
Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those future rewards. MDPs are a mathematically idealized form of the reinforcement learning problem.
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations (states) to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$. At each step $t$ the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, partly as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The interaction thus gives rise to a trajectory $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$
In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action:
$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}.$$
The function $p$ defines the dynamics of the MDP.
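For concreteness only (this representation is not from the book), a finite MDP's dynamics can be stored as an explicit table. The sketch below assumes a hypothetical structure `P[s][a]` listing `(probability, next_state, reward)` triples and simply checks that each transition distribution sums to one.

```python
# A minimal sketch of a two-state finite MDP, assuming a hypothetical
# representation P[s][a] = list of (probability, next_state, reward) triples.
P = {
    0: {  # state 0
        "stay": [(1.0, 0, 0.0)],
        "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    },
    1: {  # state 1
        "stay": [(1.0, 1, 2.0)],
        "move": [(1.0, 0, 0.0)],
    },
}

# Sanity check: p(s', r | s, a) must sum to 1 over (s', r) for every (s, a).
for s, actions in P.items():
    for a, transitions in actions.items():
        total = sum(prob for prob, _, _ in transitions)
        assert abs(total - 1.0) < 1e-9, f"p(.|{s},{a}) sums to {total}"
print("All transition distributions sum to 1.")
```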
This is the reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal, called the reward.
The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished.
We seek to maximize the expected return, where the return, denoted $G_t$, is defined as the sum of the rewards received after time step $t$: $G_t \doteq R_{t+1} + R_{t+2} + \dots + R_T$, where $T$ is a final time step. This formulation makes sense when the agent-environment interaction breaks naturally into subsequences, called episodes; such tasks are called episodic tasks.
On the other hand, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks. For them we introduce an additional concept, discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$
where $\gamma$, $0 \le \gamma \le 1$, is called the discount rate.
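As a quick numerical illustration (not from the book), the discounted return of a finite reward sequence can be computed by accumulating backwards; `gamma` and the sample rewards below are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backwards
        g = r + gamma * g
    return g

# Example: three rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```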
Almost all reinforcement learning algorithms involve estimating value functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.
The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$. We call $v_\pi$ the state-value function for policy $\pi$.
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following $\pi$: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. We call $q_\pi$ the action-value function for policy $\pi$.
The value functions can be estimated from experience.
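One way to make "estimated from experience" concrete is first-visit Monte Carlo averaging of returns. The sketch below is an illustration under assumptions, not the book's pseudocode: it assumes episodes are already collected as lists of (state, reward-received-after-leaving-that-state) pairs.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate v_pi(s) by averaging the return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # Record the time of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Compute the return from every time step by sweeping backwards.
        returns_at = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            g = r + gamma * g
            returns_at[t] = g
        # Average the return that followed each state's first visit.
        for s, t in first_visit.items():
            returns_sum[s] += returns_at[t]
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```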
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. There is always at least one policy that is better than or equal to all other policies; this is an optimal policy, denoted by $\pi_*$. All optimal policies share the same optimal state-value function, $v_*(s) \doteq \max_\pi v_\pi(s)$, and the same optimal action-value function, $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$.
Because $v_*$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, this condition can be written in a special form without reference to any specific policy: the Bellman optimality equation.
The Bellman optimality equation for $v_*$ is
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_*(s')\bigr],$$
and the Bellman optimality equation for $q_*$ is
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr].$$
Assume that the environment is a finite MDP.
First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$; this is called policy evaluation. Iterative policy evaluation turns the Bellman equation for $v_\pi$ into an update rule,
$$v_{k+1}(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr],$$
and the sequence $\{v_k\}$ converges to $v_\pi$ as $k \to \infty$.
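A compact sketch of iterative policy evaluation, assuming the hypothetical dynamics table `P[s][a] = [(prob, next_state, reward), ...]` introduced above and a stochastic policy `pi[s][a]` giving action probabilities. It uses in-place (sweep-by-sweep) updates, which also converge.

```python
def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman expectation update until values change by < theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma v(s')]
            v_new = sum(
                pi[s][a] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V
```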
Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether we should change the policy to deterministically choose an action $a \neq \pi(s)$. One way to answer this is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$; the value of this way of behaving is $q_\pi(s, a)$.
The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater, then selecting $a$ every time $s$ is encountered, and following $\pi$ otherwise, is a better policy overall; this is a special case of the policy improvement theorem.
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
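A sketch of the greedy improvement step under the same assumed representation: for each state, pick the action with the highest one-step lookahead value computed from the current value estimates.

```python
def greedy_improvement(P, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in P:
        # One-step lookahead: q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma V(s')]
        q = {a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a]) for a in P[s]}
        policy[s] = max(q, key=q.get)
    return policy
```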
Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions.
This way of finding an optimal policy is called policy iteration.
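Combining the two sketches above gives the policy-iteration loop; it stops when the greedy policy no longer changes. Here `policy_evaluation` and `greedy_improvement` are the hypothetical helpers sketched earlier, not library functions.

```python
def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Start from an arbitrary deterministic policy (first listed action per state).
    policy = {s: next(iter(P[s])) for s in P}
    while True:
        # Express the deterministic policy as action probabilities for evaluation.
        pi = {s: {a: (1.0 if a == policy[s] else 0.0) for a in P[s]} for s in P}
        V = policy_evaluation(P, pi, gamma)
        new_policy = greedy_improvement(P, V, gamma)
        if new_policy == policy:  # policy stable => optimal for this finite MDP
            return new_policy, V
        policy = new_policy
```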
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:
$$v_{k+1}(s) \doteq \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr].$$
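A sketch of value iteration under the same assumed dynamics table: each sweep applies the max-over-actions Bellman optimality update, and a greedy policy is extracted once the values converge.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Apply v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma v(s')] until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the converged value estimates.
    policy = {
        s: max(P[s], key=lambda a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a]))
        for s in P
    }
    return V, policy
```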
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.
DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient.
Here we give a proof of the convergence of the policy evaluation process. The proof is based on the contraction mapping principle and fixed-point theory, but we do not discuss the mathematical background.
To prove that an iterative sequence converges, it suffices to show that the corresponding mapping is a contraction mapping (by the Banach fixed-point theorem).
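For concreteness, here is the standard argument in outline (standard material, summarized rather than quoted from any particular source). Define the Bellman expectation operator
$$(T_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v(s')\bigr].$$
Then for any two value functions $u$ and $v$,
$$\| T_\pi u - T_\pi v \|_\infty
  = \max_s \Bigl| \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\gamma\,\bigl[u(s') - v(s')\bigr] \Bigr|
  \le \gamma \,\| u - v \|_\infty ,$$
since the action and transition probabilities sum to one. With $\gamma < 1$, $T_\pi$ is a contraction in the sup norm, so by the Banach fixed-point theorem the iterates $v_{k+1} = T_\pi v_k$ of iterative policy evaluation converge to the unique fixed point, which is $v_\pi$.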