

MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning


Abstract

        Population-based multi-agent reinforcement learning (PB-MARL) refers to the series of methods nested with reinforcement learning (RL) algorithms, which produces a self-generated sequence of tasks arising from the coupled population dynamics. By leveraging auto-curricula to induce a population of distinct emergent strategies, PB-MARL has achieved impressive success in tackling multi-agent tasks. Despite the remarkable prior art of distributed RL frameworks, PB-MARL poses new challenges for parallelizing the training frameworks, due to the additional complexity of the multiple nested workloads among sampling, training and evaluation that involve heterogeneous policy interactions. To solve these problems, we present MALib, a scalable and efficient computing framework for PB-MARL. Our framework comprises three key components: (1) a centralized task dispatching model, which supports self-generated tasks and scalable training with heterogeneous policy combinations; (2) a programming architecture named Actor-Evaluator-Learner, which achieves high parallelism for both training and sampling and meets the evaluation requirements of auto-curriculum learning; (3) a higher-level abstraction of MARL training paradigms, which enables efficient code reuse and flexible deployment on different distributed computing paradigms. Experiments on a series of complex tasks such as multi-agent Atari games show that MALib achieves a throughput higher than 40K FPS on a single machine with 32 CPU cores; a 5× speedup over RLlib and at least a 3× speedup over OpenSpiel in multi-agent training tasks. MALib is publicly available at https://github.com/sjtu-marl/malib.

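        To make the Actor-Evaluator-Learner idea above more concrete, the sketch below shows one way sampling, learning and evaluation can be decoupled through shared queues and threads. It is a minimal illustration in plain Python, not MALib's actual API: every name in it (actor, learner, evaluator, the shared weights dictionary) is hypothetical.

```python
# Minimal sketch of an Actor-Evaluator-Learner style loop (illustrative only,
# NOT MALib's API): actors produce samples, a learner consumes them to update
# shared parameters, and an evaluator periodically scores the current policy.
import queue
import random
import threading
import time

sample_queue: queue.Queue = queue.Queue(maxsize=100)  # actor -> learner
weights = {"w": 0.0}                                  # toy "policy parameters"
lock = threading.Lock()
stop = threading.Event()

def actor() -> None:
    """Roll out the current policy and push (toy) episodes for the learner."""
    while not stop.is_set():
        with lock:
            w = weights["w"]
        episode = [w + random.gauss(0.0, 1.0) for _ in range(8)]
        try:
            sample_queue.put(episode, timeout=0.1)
        except queue.Full:
            continue

def learner() -> None:
    """Consume sampled episodes and apply a toy update to the shared weights."""
    while not stop.is_set():
        try:
            episode = sample_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        grad = sum(episode) / len(episode)  # stand-in for a real gradient
        with lock:
            weights["w"] += 0.01 * grad

def evaluator(rounds: int = 10) -> None:
    """Periodically score the policy; in PB-MARL this signal would feed the
    task dispatcher that decides which policies to train or add next."""
    for i in range(rounds):
        time.sleep(0.1)
        with lock:
            print(f"evaluation round {i}: current weight {weights['w']:.3f}")
    stop.set()  # tell the actor and the learner to shut down

threads = [threading.Thread(target=f) for f in (actor, learner, evaluator)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

        In the framework described above, these roles would be independently scalable workers coordinated by the centralized task dispatching model rather than threads in a single process, which is what allows sampling and training to scale separately.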

1 Introduction


        Training intelligent agents that can adapt to a diverse set of complex environments and agents has been a long-standing challenge. A feasible way to handle these tasks is multi-agent reinforcement learning (MARL) [2], which has shown great potential for solving multi-agent tasks such as real-time strategy games [45], traffic light control [47] and ride-hailing [50]. In particular, PB-MARL algorithms combine deep reinforcement learning (DRL) with dynamic population selection methodologies (e.g., game theory [9], evolutionary strategies [34]) to generate auto-curricula. In this way, PB-MARL continually generates advanced intelligence and has achieved impressive successes in non-trivial tasks such as Dota 2 [30], StarCraft II [44] and Leduc Poker [23].

        However, due to the intrinsic dynamics arising from multi-agent interaction and population evolution, these algorithms have an intricately nested structure and are extremely data-hungry, requiring a flexible and scalable training framework to ground their effectiveness.

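        As a rough, self-contained illustration of the population-based idea (and not the exact algorithm behind any of the systems cited above), the snippet below runs a PSRO-style loop on a toy zero-sum matrix game: each player keeps a population of strategies, the opponent's population defines the next self-generated training task (the auto-curriculum), and an approximate best response against a uniform mixture over that population is added back to the population. The uniform mixture stands in for a game-theoretic meta-solver, and the exact best response stands in for an RL training run.

```python
# Toy PSRO-style auto-curriculum on rock-paper-scissors (illustrative only).
# PAYOFF holds the row player's payoff; the column player gets the negative.
import random

PAYOFF = [
    [0, -1, 1],   # rock     vs rock / paper / scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]
ACTIONS = range(3)

def best_response(opponent_population: list) -> int:
    """Best action against a uniform mixture over the opponent's population.
    In a real PB-MARL loop this step would be an RL training run."""
    def value(a: int) -> float:
        return sum(PAYOFF[a][b] for b in opponent_population) / len(opponent_population)
    return max(ACTIONS, key=value)

# Each player starts from a single arbitrary strategy.
pop_row = [random.choice(ACTIONS)]
pop_col = [random.choice(ACTIONS)]

for it in range(6):
    # The current populations define the next self-generated training task.
    new_row = best_response(pop_col)
    new_col = best_response(pop_row)  # the game is symmetric, so reuse the helper
    pop_row.append(new_row)
    pop_col.append(new_col)
    print(f"iteration {it}: row population {pop_row}, column population {pop_col}")
```

        Because each new best response enlarges the population, every iteration changes the mixture the next learner must face; this continually self-generated task sequence is exactly what makes PB-MARL both powerful and computationally demanding, and hence in need of the scalable training framework discussed next.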

