RT-1 paper translation: https://blog.csdn.net/weixin_43334869/article/details/135850410
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
High-capacity models pretrained on broad web-scale datasets provide an effective and powerful platform for a wide range of downstream tasks: large language models can enable not only fluent text generation (Anil et al., 2023; Brohan et al., 2022; OpenAI, 2023) but emergent problem-solving (Cobbe et al., 2021; Lewkowycz et al., 2022; Polu et al., 2022) and creative generation of prose (Brown et al., 2020; OpenAI, 2023) and code (Chen et al., 2021), while vision-language models enable open-vocabulary visual recognition (Kirillov et al., 2023; Minderer et al., 2022; Radford et al., 2021) and can even make complex inferences about object-agent interactions in images (Alayrac et al., 2022; Chen et al., 2023a,b; Driess et al., 2023; Hao et al., 2022; Huang et al., 2023; Wang et al., 2022). Such semantic reasoning, problem solving, and visual interpretation capabilities would be tremendously useful for generalist robots that must perform a variety of tasks in real-world environments. However, it is unclear how robots should acquire such capabilities. While a brute force approach might entail collecting millions of robotic interaction trials, the most capable language and vision-language models are trained on billions of tokens and images from the web (Alayrac et al., 2022; Chen et al., 2023a,b; Huang et al., 2023) – an amount unlikely to be matched with robot data in the near future. On the other hand, directly applying such models to robotic tasks is also difficult: such models reason about semantics, labels, and textual prompts, whereas robots require grounded low-level actions, such as Cartesian end-effector commands. While a number of recent works have sought to incorporate language models (LLMs) and vision-language models (VLMs) into robotics (Ahn et al., 2022; Driess et al., 2023; Vemprala et al., 2023), such methods generally address only the “higher level” aspects of robotic planning, essentially taking the role of a state machine that interprets commands and parses them into individual primitives (such as picking and placing objects), which are then executed by separate low-level controllers that themselves do not benefit from the rich semantic knowledge of Internet-scale models during training. Therefore, in this paper we ask: can large pretrained vision-language models be integrated directly into low-level robotic control to boost generalization and enable emergent semantic reasoning?
To this end, we explore an approach that is both simple and surprisingly effective: we directly train vision-language models designed for open-vocabulary visual question answering and visual dialogue to output low-level robot actions, along with solving other Internet-scale vision-language tasks. Although such models are typically trained to produce natural language tokens, we can train them on robotic trajectories by tokenizing the actions into text tokens and creating “multimodal sentences” (Driess et al., 2023) that “respond” to robotic instructions paired with camera observations by producing corresponding actions. In this way, vision-language models can be directly trained to act as instruction following robotic policies. This simple approach is in contrast with prior alternatives for incorporating VLMs into robot policies (Shridhar et al., 2022a) or designing new vision-language-action architectures from scratch (Reed et al., 2022): instead, pre-existing vision-language models, with already-amortized significant compute investment, are trained without any new parameters to output text-encoded actions. We refer to this category of models as vision-language-action (VLA) models. We instantiate VLA models by building on the protocol proposed for RT-1 (Brohan et al., 2022), using a similar dataset, but expanding the model to use a large vision-language backbone. Hence we refer to our model as RT-2 (Robotics Transformer 2). We provide an overview in Figure 1.
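To make the idea of a “multimodal sentence” more concrete, the sketch below shows how one robot trajectory step could be formatted as a vision-language training example whose target text is the action written as tokens. This is only an illustration under stated assumptions: the prompt template, field names, and helper function are not the exact RT-2 data format.

```python
# Illustrative sketch (assumed format, not the authors' actual data pipeline):
# a trajectory step becomes a VQA-style example whose "answer" is the action string.

def make_multimodal_example(image_bytes: bytes, instruction: str, action_bins: list[int]) -> dict:
    """Pair a camera observation and instruction with a text target encoding the action.

    `action_bins` holds the 8 integers described later in the paper: a termination flag,
    three position-delta bins, three rotation-delta bins, and a gripper bin.
    """
    return {
        "image": image_bytes,                                              # raw camera observation
        "prompt": f"What action should the robot take to {instruction}?",  # assumed prompt template
        "target": " ".join(str(b) for b in action_bins),                   # action written as text tokens
    }

# Example usage with made-up bin values:
example = make_multimodal_example(b"...", "pick up the apple", [0, 130, 126, 128, 128, 140, 121, 230])
```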
Figure 1 | RT-2 overview: we represent robot actions as another language, which can be cast into text tokens and trained together with Internet-scale vision-language datasets. During inference, the text tokens are de-tokenized into robot actions, enabling closed-loop control. This allows us to leverage the backbone and pretraining of vision-language models in learning robotic policies, transferring some of their generalization, semantic understanding, and reasoning to robotic control. We demonstrate examples of RT-2 execution on the project website: robotics-transformer2.github.io.
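As a rough illustration of the closed-loop control depicted in Figure 1, the sketch below shows the inference-time loop: query the model with the current image and instruction, de-tokenize the predicted text back into an action, and execute it. The `vla_model`, `camera`, `robot`, and `detokenize_action` interfaces are hypothetical stand-ins for the real system components; only the loop structure follows the description in the caption.

```python
# A minimal sketch of closed-loop control with a VLA model (hypothetical interfaces).

def run_episode(vla_model, camera, robot, instruction: str, max_steps: int = 100) -> None:
    for _ in range(max_steps):
        image = camera.read()                                   # current camera observation
        token_string = vla_model.generate(image, instruction)   # text tokens predicted by the VLA
        terminate, action = detokenize_action(token_string)     # de-tokenize text into a robot action
        if terminate:                                           # policy signals task completion
            break
        robot.execute(action)                                   # low-level end-effector command
```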
We observe that robotic policies derived from such vision-language models exhibit a range of remarkable capabilities, combining the physical motions learned from the robot data with the ability to interpret images and text learned from web data into a single model. Besides the expected benefit of dramatically improving generalization to novel objects and semantically varied instructions, we observe a number of emergent capabilities. While the model’s physical skills are still limited to the distribution of skills seen in the robot data, the model acquires the ability to deploy those skills in new ways by interpreting images and language commands using knowledge gleaned from the web. Some example highlights are shown in Figure 2. The model is able to re-purpose pick and place skills learned from robot data to place objects near semantically indicated locations, such as specific numbers or icons, despite those cues not being present in the robot data. The model can also interpret relations between objects to determine which object to pick and where to place it, despite no such relations being provided in the robot demonstrations. Furthermore, if we augment the command with chain of thought prompting, the model is able to make even more complex semantic inferences, such as figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
Figure 2 | RT-2 is able to generalize to a variety of real-world situations that require reasoning, symbol understanding, and human recognition. We study these challenging scenarios in detail in Section 4.
Our main contribution is RT-2, a family of models derived from fine-tuning large vision-language models trained on web-scale data to directly act as generalizable and semantically aware robotic policies. Our experiments investigate models with up to 55B parameters trained on Internet data and instruction-annotated robotic trajectories from previous work (Brohan et al., 2022). Over the course of 6k robotic evaluations, we show that RT-2 enables significant improvements to generalization over objects, scenes, and instructions, and exhibits a breadth of emergent capabilities inherited from web-scale vision-language pretraining.
Vision-language models. There are several categories of Vision-Language Models (VLMs) (Gan et al., 2022), with perhaps the two most relevant being: (1) representation-learning models, e.g. CLIP (Radford et al., 2021), which learn common embeddings for both modalities, and (2) visual language models of the form {vision, text} → {text} which learn to take vision and language as input and provide free-form text. Both categories have been used to provide pretraining for a wide variety of downstream applications such as object classification (Radford et al., 2021), detection (Gu et al., 2021), and segmentation (Ghiasi et al., 2021). In this work, we focus on the latter category (Alayrac et al., 2022; Chen et al., 2023a,b; Driess et al., 2023; Hao et al., 2022; Li et al., 2023, 2019; Lu et al., 2019). These models are generally trained on many different tasks, such as image captioning, visual question answering (VQA), and general language tasks on multiple datasets at the same time. While prior works study VLMs for a wide range of problems and settings including in robotics, our focus is on how the capabilities of VLMs can be extended to robotics closed-loop control by endowing them with the ability to predict robot actions, thus leveraging the knowledge already present in VLMs to enable new levels of generalization.
Generalization in robot learning. Developing robotic controllers that can broadly succeed in a variety of scenarios is a long-standing goal in robotics research (Kaelbling, 2020; Smith and Coles, 1973). A promising approach for enabling generalization in robotic manipulation is by learning from large and diverse datasets (Dasari et al., 2019; Levine et al., 2018; Pinto and Gupta, 2016). By doing so, prior methods have demonstrated how robots can generalize to novel object instances (Finn and Levine, 2017; Levine et al., 2018; Mahler et al., 2017; Pinto and Gupta, 2016; Young et al., 2021), to tasks involving novel combinations of objects and skills (Dasari and Gupta, 2021; Finn et al., 2017; James et al., 2018; Jang et al., 2021; Yu et al., 2018), to new goals or language instructions (Jang et al., 2021; Jiang et al., 2022; Liu et al., 2022; Mees et al., 2022; Nair et al., 2022a; Pong et al., 2019), to tasks with novel semantic object categories (Shridhar et al., 2021; Stone et al., 2023), and to unseen environments (Cui et al., 2022; Du et al., 2023a; Hansen et al., 2020). Unlike most of these prior works, we aim to develop and study a single model that can generalize to unseen conditions along all of these axes. A key ingredient of our approach is to leverage pre-trained models that have been exposed to data that is much broader than the data seen by the robot.
Pre-training for robotic manipulation. Pre-training has a long history in robotic learning. Most works focus on pre-trained visual representations that can be used to initialize the encoder of the robot’s camera observations, either via supervised ImageNet classification (Shah and Kumar, 2021), data augmentation (Kostrikov et al., 2020; Laskin et al., 2020a,b; Pari et al., 2021) or objectives that are tailored towards robotic control (Karamcheti et al., 2023; Ma et al., 2022; Majumdar et al., 2023b; Nair et al., 2022b; Xiao et al., 2022b). Other works have incorporated pre-trained language models, often either as an instruction encoder (Brohan et al., 2022; Hill et al., 2020; Jang et al., 2021; Jiang et al., 2022; Lynch and Sermanet, 2020; Nair et al., 2022a; Shridhar et al., 2022b) or for high-level planning (Ahn et al., 2022; Driess et al., 2023; Huang et al., 2022; Mu et al., 2023; Singh et al., 2023; Wu et al., 2023). Rather than using pre-trained vision models or pre-trained language models, we specifically consider the use of pre-trained vision-language models (VLMs), which provide rich, grounded knowledge about the world. Prior works have studied the use of VLMs for robotics (Driess et al., 2023; Du et al., 2023b; Gadre et al., 2022; Karamcheti et al., 2023; Shah et al., 2023; Shridhar et al., 2021; Stone et al., 2023), and form part of the inspiration for this work. These prior approaches use VLMs for visual state representations (Karamcheti et al., 2023), for identifying objects (Gadre et al., 2022; Stone et al., 2023), for high-level planning (Driess et al., 2023), or for providing supervision or success detection (Du et al., 2023b; Ma et al., 2023; Sumers et al., 2023; Xiao et al., 2022a; Zhang et al., 2023). While CLIPort (Shridhar et al., 2021) and MOO (Stone et al., 2023) integrate pre-trained VLMs into end-to-end visuomotor manipulation policies, both incorporate significant structure into the policy that limits their applicability. Notably, our work does not rely on a restricted 2D action space and does not require a calibrated camera. Moreover, a critical distinction is that, unlike these works, we leverage VLMs that generate language, and the unified output space of our formulation enables model weights to be entirely shared across language and action tasks, without introducing action-only model layer components.
In this section, we present our model family and the design choices for enabling training VLMs to directly perform closed-loop robot control. First, we describe the general architecture of our models and how they can be derived from models that are commonly used for vision-language tasks. Then, we introduce the recipe and challenges of fine-tuning large VLMs that are pre-trained on web-scale data to directly output robot actions, becoming VLA models. Finally, we describe how to make these models practical for robot tasks, addressing challenges with model size and inference speed to enable real-time control.
The vision-language models (Chen et al., 2023a; Driess et al., 2023) that we build on in this work take as input one or more images and produce a sequence of tokens, which conventionally represents natural language text. Such models can perform a wide range of visual interpretation and reasoning tasks, from inferring the composition of an image to answering questions about individual objects and their relations to other objects (Alayrac et al., 2022; Chen et al., 2023a; Driess et al., 2023; Huang et al., 2023). Representing the knowledge necessary to perform such a wide range of tasks requires large models and web-scale datasets. In this work, we adapt two previously proposed VLMs to act as VLA models: PaLI-X (Chen et al., 2023a) and PaLM-E (Driess et al., 2023). We will refer to vision-language-action versions of these models as RT-2-PaLI-X and RT-2-PaLM-E. We leverage instantiations of these models that range in size from billions to tens of billions of parameters. We provide a detailed description of the architecture of these two models in Appendix D.
To enable vision-language models to control a robot, they must be trained to output actions. We take a direct approach to this problem, representing actions as tokens in the model’s output, which are treated in the same way as language tokens. We base our action encoding on the discretization proposed by Brohan et al. (2022) for the RT-1 model. The action space consists of 6-DoF positional and rotational displacement of the robot end-effector, as well as the level of extension of the robot gripper and a special discrete command for terminating the episode, which should be triggered by the policy to signal successful completion. The continuous dimensions (all dimensions except for the discrete termination command) are discretized into 256 bins uniformly. Thus, the robot action can be represented using ordinals of the discrete bins as 8 integer numbers. In order to use these discretized actions to fine-tune a vision-language model into a vision-language-action model, we need to associate tokens from the model’s existing tokenization with the discrete action bins. This requires reserving 256 tokens to serve as action tokens. Which tokens to choose depends on the particular tokenization used by each VLM, which we discuss later in this section. In order to define a target for VLM fine-tuning we convert the action vector into a single string by simply concatenating action tokens for each dimension with a space character:
“terminate Δpos_x Δpos_y Δpos_z Δrot_x Δrot_y Δrot_z gripper_extension”
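The sketch below illustrates this encoding: each continuous dimension is discretized into one of 256 uniform bins, the resulting 8 integers (termination flag, position-delta bins, rotation-delta bins, gripper bin) are joined with spaces, and the string can be inverted back into an approximate continuous action at inference time. The per-dimension normalization range used here is an assumption for illustration only; the real bounds depend on the robot setup.

```python
import numpy as np

# A minimal sketch of the action-string encoding described above. The assumed
# normalized range [-1, 1] per dimension stands in for the actual robot limits.
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0

def encode_action(terminate: int, delta_pos, delta_rot, gripper: float) -> str:
    """Discretize the 7 continuous dimensions into 256 uniform bins each and
    concatenate the resulting 8 integers with spaces."""
    continuous = np.concatenate([delta_pos, delta_rot, [gripper]])   # 7 continuous values
    scaled = (continuous - LOW) / (HIGH - LOW) * (NUM_BINS - 1)
    bins = np.clip(np.round(scaled), 0, NUM_BINS - 1).astype(int)
    return " ".join(str(x) for x in [terminate, *bins])

def decode_action(token_string: str):
    """Invert the encoding: map bin ordinals back to approximate continuous values."""
    ints = [int(t) for t in token_string.split()]
    terminate, bins = ints[0], np.asarray(ints[1:], dtype=float)
    continuous = bins / (NUM_BINS - 1) * (HIGH - LOW) + LOW
    return terminate, continuous[:3], continuous[3:6], continuous[6]

# Round trip on a made-up action:
s = encode_action(0, [0.02, -0.01, 0.0], [0.0, 0.1, -0.05], 0.8)
print(s)                  # "0 130 126 128 128 140 121 230"
print(decode_action(s))   # approximately recovers the original values
```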