
Dynamic Programming to Artificial Intelligence: Q-Learning

What is the Markov decision process?


A failure is not always a mistake, it may simply be the best one can do under the circumstances. The real mistake is to stop trying. — B. F. Skinner

Reinforcement learning models are beating human players in games around the world. Huge international companies are investing millions in reinforcement learning. Reinforcement learning in today’s world is so powerful because it requires neither data nor labels. It could be a technique that leads to general artificial intelligence.

Supervised and Unsupervised Learning

As a summary, in supervised learning, a model learns to map inputs to outputs using predefined, labeled data. In unsupervised learning, a model learns to cluster and group unlabeled data.

Reinforcement Learning

In reinforcement learning, however, the model receives no data set or guidance; instead, it learns through trial and error.

Reinforcement learning is an area of machine learning defined by how a model (called an agent in reinforcement learning) behaves in an environment in order to maximize a given reward. A close real-world example is a wild animal trying to find food in its ecosystem. In this example, the animal is the agent, the ecosystem is the environment, and the food is the reward.

Reinforcement learning is frequently used in the domain of game playing, where there is no immediate way to label how “good” an action was, since we would need to consider all future outcomes.

Markov Decision Processes

The Markov Decision Process is the most fundamental concept of reinforcement learning. There are a few components in an MDP that interact with each other:

  • Agent — the model
  • Environment — the overall situation
  • State — the situation at a specific time
  • Action — how the agent acts
  • Reward — feedback from the environment

MDP Notation

[Image: MDP agent-environment diagram, from Sutton, R. S. and Barto, A. G., Introduction to Reinforcement Learning]

To repeat what was previously discussed in more mathematically formal terms, some notation must be defined.

  • t represents the current time step
  • S is the set of all possible states, with S_t being the state at time t
  • A is the set of all possible actions, with A_t being the action performed at time t
  • R is the set of all possible rewards, with R_t being the reward received after performing A_(t-1)
  • T is the last time step (the episode ends when a terminal condition is reached or t exceeds a limit)

The process can be written as:

  1. The agent receives a state S_t
  2. The agent performs an action A_t based on S_t
  3. The agent receives a reward R_(t+1)
  4. The environment transitions into a new state S_(t+1)
  5. The cycle repeats for t+1 (see the code sketch just below this list)

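To make this loop concrete, here is a minimal, self-contained sketch (not from the original article) that uses a made-up one-dimensional "corridor" environment; the agent simply acts at random, since learning only comes in later with q-learning.

import random

# Toy environment (hypothetical): states are cells 0..4 of a corridor, actions
# are -1 (move left) and +1 (move right), and a reward of 1 is given only when
# the agent reaches cell 4, which ends the episode.
def env_step(state, action):
    new_state = max(0, min(4, state + action))
    reward = 1 if new_state == 4 else 0
    return new_state, reward

state = 2                                        # 1. the agent receives a state S_t
for t in range(20):
    action = random.choice([-1, +1])             # 2. the agent performs an action A_t
    new_state, reward = env_step(state, action)  # 3.-4. it receives R_(t+1) and S_(t+1)
    state = new_state                            # 5. the cycle repeats for t+1
    if state == 4:                               # the episode ends (time step T)
        break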

Expected Discounted Return (Making Long-Term Decisions)

We discussed that in order for an agent to play a game well, it would need to take future rewards into consideration. This can be described as:

G(t) = R_(t+1) + R_(t+2) + … + R_T, where G(t) is the sum of the rewards the agent expects after time t.

However, if T is infinite, then in order to make G(t) converge to a finite number, we define the discount rate γ to be a number smaller than 1, and define:

G(t) = R_(t+1) + γR_(t+2) + γ²R_(t+3) + …

This can also be written as:

G(t) = R_(t+1) + γG(t+1)

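As an illustration (the reward values and the discount rate below are made up), the discounted return can be computed either as the direct sum or with the recursion above:

def discounted_return(rewards, gamma):
    # Direct sum: R_(t+1) + γ R_(t+2) + γ² R_(t+3) + ...
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    # Recursion: G(t) = R_(t+1) + γ G(t+1)
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

rewards = [1, 0, 2, 3]                            # hypothetical rewards R_(t+1), R_(t+2), ...
print(discounted_return(rewards, 0.9))            # ≈ 4.807
print(discounted_return_recursive(rewards, 0.9))  # same value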

Value and Quality (Q-Learning is Quality-Learning)

A policy describes how an agent will act in any state it finds itself in; an agent is said to follow a policy. Value and quality functions describe how "good" it is for an agent to be in a state, or to be in a state and perform a particular action.

Specifically, the value function v_p(s) is equal to the expected discounted return when starting in state s and following a policy p. The quality function q_p(s, a) is equal to the expected discounted return when starting in state s, performing action a, and then following policy p.

v_p(s) = E[G(t) | S_t = s]

q_p(s, a) = E[G(t) | S_t = s, A_t = a]

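As a rough sketch of what "expected discounted return while following a policy" means in code, v_p(s) can be estimated by averaging the discounted return over many episodes; the callables policy(state), step(state, action), and is_terminal(state) are hypothetical placeholders for some environment:

def estimate_value(start_state, policy, step, is_terminal, gamma=0.9, episodes=1000):
    # Monte Carlo estimate of v_p(s): average the discounted return over many
    # episodes that start in start_state and follow the policy p throughout.
    total = 0.0
    for _ in range(episodes):
        state, discount, ret = start_state, 1.0, 0.0
        while not is_terminal(state):
            action = policy(state)               # follow policy p
            state, reward = step(state, action)  # environment transition
            ret += discount * reward             # accumulate the discounted reward
            discount *= gamma
        total += ret
    return total / episodes

Estimating q_p(s, a) works the same way, except that the first action is fixed to a before the policy takes over.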

A policy is better than or equal to another policy if it has a greater or equal expected discounted return for every state. The optimal value and quality functions, v* and q*, are the ones obtained by following the best possible policy.

Bellman Equation for Q*

The Bellman Equation is another extremely important concept: it turns q-learning into dynamic programming combined with a gradient-descent-like idea.

It states that, when following the best policy, the q-value of a state-action pair, q*(s, a), is equal to the reward received for performing a in state s, plus the maximum expected discounted return achievable from the next state, multiplied by the discount rate.

q*(s_t, a_t) = R_(t+1) + γ max_a q*(s_(t+1), a)

The quality of the best action is equal to the reward plus the quality of the best action at the next time step, times the discount rate.

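For example, with hypothetical numbers: if R_(t+1) = 1, γ = 0.9, and the best q-value available at the next state is 2, then q*(s_t, a_t) = 1 + 0.9 × 2 = 2.8.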

Once we find q*, we can obtain the best policy by choosing, in each state, the action with the highest q-value.

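As a small sketch (the q-values below are made up), extracting the best policy from a q-table is just a per-state argmax:

import numpy as np

Q_star = np.array([[0.1, 0.5],    # hypothetical q*-values for 3 states and 2 actions
                   [0.7, 0.2],
                   [0.0, 0.9]])

def greedy_policy(Q):
    # For each state (row), pick the action (column) with the highest q-value
    return np.argmax(Q, axis=1)

print(greedy_policy(Q_star))      # [1 0 1]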

Q-Learning

Q-learning is a technique which attempts to maximize the expected reward over all time steps by finding the best q function. In other words, the objective of q-learning is the same as the objective of dynamic programming, but with the discount rate.

In q-learning, a table of all possible state-action pairs is created, and the algorithm iteratively updates the values in this table using the Bellman equation until the optimal q-values are found.

We define a learning rate, a number between 0 and 1 that describes how much of the old q-value we overwrite with the new estimate at each iteration.

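For example, with a hypothetical old q-value of 2.0, a new estimate (reward plus discounted best next q-value) of 3.0, and a learning rate of 0.1, the updated entry is 2.0 × (1 - 0.1) + 3.0 × 0.1 = 2.1.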

The process can be described with the following pseudocode:

import numpy as np

# state, step, and game_over are assumed to be provided by the environment
Q = np.zeros((state_size, action_size))   # one row per state, one column per action

for i in range(max_t):
    # Pick the action with the highest q-value for the current state
    action = np.argmax(Q[state, :])
    new_state, reward = step(action)
    # Bellman update: blend the old q-value with the new estimate
    Q[state, action] = Q[state, action] * (1 - learning_rate) + \
        (reward + gamma * np.max(Q[new_state, :])) * learning_rate
    state = new_state
    if game_over(state):
        break

Exploration and Exploitation

In the beginning, we do not know anything about our environment, so we want to prioritize exploring and gathering information, even if it means we do not get as much reward as possible.

Later, we want to increase our high score and prioritize finding ways to get more rewards by exploiting the q-table.

To do this, we can create a variable epsilon, controlled by hyperparameters, that determines when to explore and when to exploit. Specifically, when a randomly generated number is higher than epsilon, we exploit; otherwise, we explore.

The new code is as follows:

import numpy as np
import random

# state, possible_actions, time_step, and game_over are assumed to be provided
# by the environment
Q = np.zeros((state_size, action_size))
epsilon = 1   # explore all the time at first

for _ in range(batches):
    for i in range(max_t):
        if random.uniform(0, 1) > epsilon:
            # Exploit: take the best known action for this state
            action = np.argmax(Q[state, :])
        else:
            # Explore: take a random action
            action = random.choice(possible_actions(state))
        new_state, reward = time_step(action)
        # Bellman update: blend the old q-value with the new estimate
        Q[state, action] = Q[state, action] * (1 - learning_rate) + \
            (reward + gamma * np.max(Q[new_state, :])) * learning_rate
        state = new_state
        if game_over(state):
            break
    # Explore less as the q-table becomes more reliable
    epsilon *= epsilon_decay_rate
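
Starting with epsilon = 1 and decaying it after every batch means the agent acts almost entirely at random at first and gradually shifts toward exploiting the learned q-values; how quickly that shift happens is controlled by the epsilon_decay_rate hyperparameter.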

Summary

  • Reinforcement learning focuses on a situation where an agent receives no data set, and learns from the actions it takes and the rewards it receives from the environment.
  • The Markov Decision Process is a control process that models the decision making of an agent placed in an environment.
  • The Bellman Equation describes a property of the best policy that turns the problem into a modified form of dynamic programming.
  • The agent prioritizes exploring in the beginning, but eventually transitions to exploiting the q-table.

Translated from: https://medium.com/swlh/dynamic-programming-to-artificial-intelligence-q-learning-51a189fc0441
