C. Artificial Intelligence: Reinforcement Learning Frontier Techniques

Challenges

  • Exploitation vs. Exploration (see the ε-greedy sketch below)
  • Sample Efficiency
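
To make the first challenge concrete, here is a minimal ε-greedy action-selection sketch: with probability ε the agent explores a random action, otherwise it exploits its current value estimates. The `q_values` array and the decay schedule are illustrative assumptions, not from the original notes.

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniformly random action
    return int(np.argmax(q_values))              # exploit: current greedy action

# Usage: decay epsilon over time to shift from exploration to exploitation.
rng = np.random.default_rng(0)
q = np.zeros(4)                                  # value estimates for 4 actions
for step in range(1000):
    eps = max(0.05, 1.0 - step / 500)            # linear decay with a floor
    action = epsilon_greedy(q, eps, rng)
    # ... act in the environment, observe the reward, and update q[action] ...
```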

Model-based RL

  • Overview
    • Build a model of the real environment
    • The model network feeds back into the policy network
  • Use cases
    • Board games
  • Characteristics
    • Pros
      • Better planning based on the environment model
    • Cons
      • Hard to reproduce the real environment perfectly
  • Algorithms
    • AlphaGo
      • Training
        • Pre-train the policy network using Supervised Learning
        • Self-play and improve the policy network using Policy Gradient
        • Train the value network with state-result pairs (collected during self-play)
      • Inference using MCTS (see the sketch after this list)
        • Expand a tree node according to the policy network
        • Evaluate states with the help of the value network
    • AlphaGo Zero
      • No pre-training
      • Self-play (with vs. without MCTS)
      • Network training (separately vs. jointly trained networks)
    • AlphaZero
    • MuZero
      • Needs to encode the board position itself (embeddings?)
    • Dream to Control
      • Use case
        • Environments that cannot be fully modeled
      • Idea
        • Alternate between modeling the environment and training in the model, refining both over time
      • Details
        • Learn dynamics using representation learning
          • Representation
          • Transition
          • Reward
        • Learn behavior with imagined trajectories
          • Action
          • Value
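
As a rough illustration of the AlphaGo-style search described above (the policy network guides node expansion, the value network scores leaf states), here is a minimal MCTS sketch. The `Node` class, the `policy_net`/`value_net` interfaces, and the PUCT constant are illustrative assumptions, not DeepMind's actual implementation.

```python
import math

C_PUCT = 1.5  # exploration constant (assumed value)

class Node:
    def __init__(self, prior: float):
        self.prior = prior        # P(s, a): prior from the policy network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def value(self) -> float:    # Q(s, a): mean backed-up value
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node):
    """PUCT selection: trade off the value estimate against a prior-weighted bonus."""
    total = sum(child.visit_count for child in node.children.values())
    def puct(child: Node) -> float:
        bonus = C_PUCT * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.value() + bonus
    return max(node.children.items(), key=lambda kv: puct(kv[1]))

def expand(node: Node, state, policy_net):
    """Expand a leaf using the policy network's action priors."""
    for action, prior in policy_net(state):  # assumed to yield (action, probability)
        node.children[action] = Node(prior)

def evaluate(state, value_net) -> float:
    """Score a leaf with the value network instead of a random rollout."""
    return value_net(state)                  # assumed to return a value in [-1, 1]
```

A full search would repeat select, expand, evaluate, and backup for a fixed simulation budget, then play the root action with the highest visit count.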

Large-scale RL projects

  • Solving a Rubik's Cube with a robot hand
    • Problem definition
      • Observation: cameras from multiple angles
      • State: observations converted into a state vector by a CNN
      • Action: specified in advance
      • Reward
    • Sim2Real Transfer
      • Train in a simulated environment rather than the real one
    • Automatic Domain Randomization
      • Motivated by gaps between the real environment and the simulation
        • Friction
        • Gravity
        • Smudges on the cube's surface
        • Etc.
      • Idea
        • Keep increasing the complexity of the simulated environment (see the sketch below)
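
A minimal sketch of the Automatic Domain Randomization loop: each simulator parameter is sampled from a range, and a range boundary is widened once the agent's success rate at that boundary passes a threshold. The parameter names, threshold, and step size below are illustrative assumptions.

```python
import random

# Randomization ranges for simulator parameters (illustrative values).
ranges = {
    "friction": [0.9, 1.1],
    "gravity":  [9.7, 9.9],
}
EXPAND_STEP = 0.05     # how far to push a boundary outward (assumed)
PERF_THRESHOLD = 0.8   # success rate required before widening (assumed)

def sample_env_params() -> dict:
    """Sample one simulated environment from the current ranges."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def maybe_widen(name: str, boundary: int, success_rate: float) -> None:
    """Widen one boundary (0 = lower, 1 = upper) once the agent masters it."""
    if success_rate >= PERF_THRESHOLD:
        ranges[name][boundary] += EXPAND_STEP if boundary == 1 else -EXPAND_STEP

# Loop sketch: evaluate the agent at a boundary value, then possibly widen it.
params = sample_env_params()
# ... run episodes in the simulator with `params` and measure success_rate ...
maybe_widen("friction", 1, success_rate=0.85)  # widens the friction upper bound
```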

Meta-RL

  • The agent needs access to its interaction history (see the recurrent-policy sketch below)
  • Meta-RL can be used to learn RL hyperparameters, loss functions, and exploration strategies.
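
One standard way to give the policy access to history (in the spirit of RL²-style meta-RL) is a recurrent policy that consumes the previous action and reward along with the current observation, so the hidden state summarizes the episode so far. A minimal PyTorch sketch; the layer sizes and input shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MetaRLPolicy(nn.Module):
    """Recurrent policy whose hidden state summarizes the (obs, action, reward) history."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Input = current observation + one-hot previous action + previous reward.
        self.rnn = nn.GRU(obs_dim + n_actions + 1, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, prev_action_onehot, prev_reward, h=None):
        x = torch.cat([obs, prev_action_onehot, prev_reward], dim=-1)
        out, h = self.rnn(x, h)
        return self.policy_head(out), h  # action logits + carried-over history state

# Usage with batch size 1 and sequence length 1 (shapes are illustrative):
policy = MetaRLPolicy(obs_dim=8, n_actions=4)
logits, h = policy(torch.zeros(1, 1, 8), torch.zeros(1, 1, 4), torch.zeros(1, 1, 1))
```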

Priors

  • Overview
    • To obtain effective and fast-adapting agents, the agent can rely upon previously distilled knowledge in the form of a prior distribution (see the objective sketch below).
  • Papers
    • Simultaneous learning of a goal-agnostic default policy
    • Learning a dense embedding space to represent a large set of expert behaviors
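
A common way to formalize "previously distilled knowledge as a prior" is a KL-regularized objective that penalizes the policy for deviating from a default policy π₀. The exact form below is a sketch under standard notation, not a formula quoted from these papers.

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t} \gamma^{t} \Big( r(s_t, a_t)
    - \alpha \, \mathrm{KL}\big( \pi(\cdot \mid s_t) \,\|\, \pi_0(\cdot \mid s_t) \big) \Big) \right]
```

The temperature α trades off reward maximization against staying close to the prior.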

Multi-agent RL

  • Definition
    • Multiple agents in the same environment, learning from and influencing one another
  • Challenges
    • Optimal policy is dependent on the other agents’ policies
    • Convergence to optimal behavior is not guaranteed
  • Task categories
    • Analysis of emergent behaviors
      • No explicit objective; observe what behaviors a group of agents ends up with
    • Learning communication
      • First teach agents how to communicate
    • Learning cooperation
      • First teach agents how to cooperate
    • Agents modeling agents
      • Agents learn to model one another
  • Algorithms
    • Social Influence as Intrinsic Motivation
      • A mechanism for achieving coordination in multi-agent RL by rewarding agents for having causal influence over other agents' actions.
        • Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded.
        • Influence is assessed using counterfactual reasoning.
      • Each agent's immediate reward is modified (see the sketch after this list):
        • environmental reward + causal influence reward
    • AlphaStar: a StarCraft II bot
      • It first learns from human game records, then the main agent (the top line of the training league) improves through self-play.
      • It also stores historical versions of itself from earlier in training and plays against them, to keep the policy from evolving in the wrong direction.
      • In addition, it keeps some past versions that managed to beat it, and uses those as self-play opponents as well.
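
To make the counterfactual influence reward above concrete, here is a minimal sketch: agent k's influence on agent j is the divergence between j's action distribution given k's actual action and the marginal over counterfactual actions k could have taken. The `policy_j` interface and the weight `alpha` are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL divergence between two discrete action distributions."""
    return float(np.sum(p * np.log(p / q)))

def influence_reward(policy_j, state, a_k, counterfactual_actions, alpha=0.1) -> float:
    """Counterfactual causal influence of agent k's action a_k on agent j.

    `policy_j(state, action_of_k)` is assumed to return agent j's action
    distribution given that j observes k taking that action.
    """
    p_actual = policy_j(state, a_k)
    # Marginal over the actions k could have taken instead (the counterfactuals).
    p_marginal = np.mean([policy_j(state, a) for a in counterfactual_actions], axis=0)
    # A large shift in j's behavior means a_k was influential.
    return alpha * kl_divergence(p_actual, p_marginal)

# Each agent's modified reward:
#   total_reward = environmental_reward + sum of influence_reward over other agents
```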