NLP之T5:T5的简介(论文《Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer》)、安装和使用方法、案例应用之详细攻略

目录

相关论文

《Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer》翻译与解读

Abstract

3、Experiments实验

3.1 Baseline基线

3.1.1 Model模型

3.1.2 Training训练

3.1.3 Vocabulary词汇

3.1.4 Unsupervised Objective无监督目标

3.1.5 Baseline Performance基准性能

3.2 Architectures架构

3.2.1 Model Structures模型结构

3.2.2 Comparing Different Model Structures比较不同的模型结构

3.2.3 Objectives目标

3.5 Training Strategy训练策略

3.5.1 Fine-tuning Methods微调方法

3.5.2 Multi-task Learning多任务学习

Examples-proportional mixing样本比例混合

Temperature-scaled mixing温度缩放混合

Equal mixing均等混合

3.5.3 Combining Multi-Task Learning with Fine-Tuning多任务学习与微调相结合

4、Reflection反思

4.1 Takeaways要点

Text-to-text文本到文本

Architectures架构

Unsupervised objectives无监督目标

Data sets 数据集

Training strategies训练策略

Scaling扩展

Pushing the limits挑战极限,11B的参数,1万亿个token

4.2 Outlook前景

The inconvenience of large models 大模型带来的不便

More efficient knowledge extraction更高效的知识提取

Formalizing the similarity between tasks 形式化任务之间的相似性

Language-agnostic models语言无关模型

T5的简介

1、已发布的模型检查点

T5的安装和使用方法

1、数据集准备

2、使用任务

TfdsTask

C4

TextLineTask

直接使用TSV文件

3、安装

在GCP上设置TPU

请使用以下命令在Cloud VM中创建TPU设备。

4、训练

5、微调

6、评估

7、解码

8、导出

9、GPU使用

10、重现我们的实验

11、有用选项

T5的案例应用


相关论文

《Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer》翻译与解读

地址

论文地址:https://arxiv.org/abs/1910.10683

时间

2019年10月23日

作者

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Google

总结

该论文提出了一个新的通用文本到文本转换框架T5,将各种自然语言处理任务制定为文本到文本格式。

论文背景:近年来,大量工作提出了各种预训练目标和架构,并取得了出色的效果;但这些方法之间难以直接比较,研究方向也不够明晰。

解决方案

>> 将所有NLP任务均格式化为文本到文本的问题,通过任务前缀来区分不同任务。

>> 提出T5(Text-to-Text Transfer Transformer,文本到文本迁移Transformer)模型,这是一个通用的编码器-解码器结构的Transformer模型。

>> 对大规模网络爬取文本进行清洗,构建训练数据集C4(Colossal Clean Crawled Corpus)。

>> 对不同预训练目标、模型结构、数据集大小等多种因素进行对比实验。

>> 通过增大模型尺寸和训练数据规模,进一步推进现有的最优水平(SOTA)。

核心特点

>> 提供了一个统一的实验平台来研究迁移学习。

>> 进行了广泛实验,以发现迁移学习中重要的因素。

>> 通过规模化训练,在包括GLUE、SQuAD、CNN/DM等多项语言理解基准上取得SOTA效果。

论文优势

>> 给出一个通用且可复现的迁移学习框架。

>> 发现许多原创见解,如语料清洗规则、任务前缀设计等。

>> 发现规模化训练是提升模型质量的关键。

>> 公开了模型和代码,推动迁移学习研究。

Abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.1

Keywords: transfer learning, natural language processing, multi-task learning, attention-based models, deep learning

迁移学习指先在数据丰富的任务上对模型进行预训练、再在下游任务上进行微调的做法,已成为自然语言处理(NLP)中一种强大的技术。迁移学习的有效性催生了多样的方法、方法论和实践。在本文中,我们通过引入一个统一的框架,将所有基于文本的语言问题转换为文本到文本的格式,来探索NLP迁移学习技术的全貌。我们的系统研究在数十个语言理解任务上比较了预训练目标、架构、未标注数据集、迁移方法和其他因素。通过将我们探索得到的见解与规模以及我们新构建的“Colossal Clean Crawled Corpus(庞大干净爬取语料库,C4)”相结合,我们在涵盖摘要、问答、文本分类等多个基准上取得了最先进的结果。为了促进未来关于NLP迁移学习的研究,我们发布了我们的数据集、预训练模型和代码。

关键词:迁移学习,自然语言处理,多任务学习,基于注意力的模型,深度学习

3、Experiments实验

Recent advances in transfer learning for NLP have come from a wide variety of developments, such as new pre-training objectives, model architectures, unlabeled data sets, and more. In this section, we carry out an empirical survey of these techniques in hopes of teasing apart their contribution and significance. We then combine the insights gained to attain state-of-the-art in many of the tasks we consider. Since transfer learning for NLP is a rapidly growing area of research, it is not feasible for us to cover every possible technique or idea in our empirical study. For a broader literature review, we recommend a recent survey by Ruder et al. (2019).

We systematically study these contributions by taking a reasonable baseline (described in Section 3.1) and altering one aspect of the setup at a time. For example, in Section 3.3 we measure the performance of different unsupervised objectives while keeping the rest of our experimental pipeline fixed. This “coordinate ascent” approach might miss second-order effects (for example, some particular unsupervised objective may work best on a model larger than our baseline setting), but performing a combinatorial exploration of all of the factors in our study would be prohibitively expensive. In future work, we expect it could be fruitful to more thoroughly consider combinations of the approaches we study.

NLP迁移学习的最新进展来自各种各样的发展,例如新的预训练目标、模型架构、未标记数据集等等。在本节中,我们对这些技术进行了实证调查,希望能梳理出它们的贡献和意义。然后,我们结合所获得的见解,在我们考虑的许多任务中达到最先进的水平。由于NLP迁移学习是一个快速发展的研究领域,我们不可能在我们的实证研究中涵盖所有可能的技术或想法。对于更广泛的文献综述,我们推荐Ruder等人(2019)最近的一项调查。

我们系统地研究了这些贡献,采用了一个合理的基线(在3.1节中描述),并一次改变设置的一个方面。例如,在3.3节中,我们测量了不同无监督目标的性能,同时保持实验管道的其余部分固定。这种“坐标上升”方法可能会错过二阶效应(例如,一些特定的无监督目标可能在比我们的基线设置更大的模型上工作得最好),但是在我们的研究中对所有因素进行组合探索将是非常昂贵的。在未来的工作中,我们期望更彻底地考虑我们所研究的方法的组合会取得丰硕成果。

Our goal is to compare a variety of different approaches on a diverse set of tasks while keeping as many factors fixed as possible. In order to satisfy this aim, in some cases we do not exactly replicate existing approaches. For example, “encoder-only” models like BERT (Devlin et al., 2018) are designed to produce a single prediction per input token or a single prediction for an entire input sequence. This makes them applicable for classification or span prediction tasks but not for generative tasks like translation or abstractive summarization. As such, none of the model architectures we consider are identical to BERT or consist of an encoder-only structure. Instead, we test approaches that are similar in spirit—for example, we consider an analogous objective to BERT’s “masked language modeling” objective in Section 3.3 and we consider a model architecture that behaves similarly to BERT on text classification tasks in Section 3.2.

我们的目标是在保持尽可能多的因素不变的情况下,比较不同任务组的各种不同方法。为了实现这一目标,在某些情况下,我们并不完全复制现有的方法。例如,像BERT (Devlin等人,2018)这样的“仅编码器”模型旨在为每个输入令牌生成单个预测或为整个输入序列生成单个预测。这使得它们适用于分类或跨度预测任务,但不适用于生成任务,如翻译或抽象摘要。因此,我们考虑的模型体系结构中没有一个与BERT相同,也没有一个只包含编码器结构。相反,我们测试了精神上相似的方法——例如,我们在3.3节中考虑了一个与BERT的“掩模语言建模”目标类似的目标,我们在3.2节中考虑了一个在文本分类任务上与BERT行为相似的模型体系结构。

After outlining our baseline experimental setup in the following subsection, we undertake an empirical comparison of model architectures (Section 3.2), unsupervised objectives (Section 3.3), pre-training data sets (Section 3.4), transfer approaches (Section 3.5), and scaling (Section 3.6). At the culmination of this section, we combine insights from our study with scale to obtain state-of-the-art results in many tasks we consider (Section 3.7).

在下一小节中概述了我们的基线实验设置之后,我们对模型架构(第3.2节)、无监督目标(第3.3节)、预训练数据集(第3.4节)、迁移方法(第3.5节)和缩放(第3.6节)进行了实证比较。在本节的最后,我们将研究的见解与规模相结合,以在我们考虑的许多任务中获得最先进的结果(第3.7节)。

3.1 Baseline基线

Our goal for our baseline is to reflect typical, modern practice. We pre-train a standard Transformer (described in Section 2.1) using a simple denoising objective and then separately fine-tune on each of our downstream tasks. We describe the details of this experimental setup in the following subsections.

我们的基线旨在反映典型的现代实践。我们使用一个简单的去噪目标预训练一个标准Transformer(在2.1节中描述),然后分别在每个下游任务上进行微调。我们将在下面的小节中描述这个实验设置的细节。

3.1.1 Model模型

For our model, we use a standard encoder-decoder Transformer as proposed by Vaswani et al.(2017). While many modern approaches to transfer learning for NLP use a Transformer architecture consisting of only a single “stack” (e.g. for language modeling (Radford et al., 2018; Dong et al., 2019) or classification and span prediction (Devlin et al., 2018; Yang et al., 2019)), we found that using a standard encoder-decoder structure achieved good results on both generative and classification tasks. We explore the performance of different model architectures in Section 3.2.

Our baseline model is designed so that the encoder and decoder are each similar in size and configuration to a “BERTBASE” (Devlin et al., 2018) stack. Specifically, both the encoder and decoder consist of 12 blocks (each block comprising self-attention, optional encoder-decoder attention, and a feed-forward network). The feed-forward networks in each block consist of a dense layer with an output dimensionality of d_ff = 3072 followed by a ReLU nonlinearity and another dense layer. The “key” and “value” matrices of all attention mechanisms have an inner dimensionality of d_kv = 64 and all attention mechanisms have 12 heads. All other sub-layers and embeddings have a dimensionality of d_model = 768. In total, this results in a model with about 220 million parameters. This is roughly twice the number of parameters of BERTBASE since our baseline model contains two layer stacks instead of one. For regularization, we use a dropout probability of 0.1 everywhere dropout is applied in the model.

对于我们的模型,我们使用Vaswani等人(2017)提出的标准编码器-解码器Transformer。虽然许多用于NLP迁移学习的现代方法使用仅由单个“堆栈”组成的Transformer架构(例如用于语言建模(Radford等人,2018;Dong等人,2019)或分类和跨度预测(Devlin等人,2018;Yang等人,2019)),但我们发现使用标准编码器-解码器结构在生成和分类任务上都取得了很好的效果。我们将在第3.2节探讨不同模型架构的性能。

我们的基线模型的设计使得编码器和解码器各自在大小和配置上都与“BERTBASE”(Devlin等人,2018)的堆栈相似。具体来说,编码器和解码器都由12个块组成(每个块包括自注意力、可选的编码器-解码器注意力和一个前馈网络)。每个块中的前馈网络由一个输出维度为 d_ff = 3072 的全连接层、一个ReLU非线性和另一个全连接层组成。所有注意力机制的“键(key)”和“值(value)”矩阵的内部维度为 d_kv = 64,所有注意力机制都有12个头。所有其他子层和嵌入的维度为 d_model = 768。总的来说,这得到一个大约有2.2亿个参数的模型。由于我们的基线模型包含两个层堆栈而不是一个,这大约是BERTBASE参数数量的两倍。对于正则化,我们在模型中所有应用dropout的地方都使用0.1的dropout概率。

3.1.2 Training训练

As described in Section 2.4, all tasks are formulated as text-to-text tasks. This allows us to always train using standard maximum likelihood, i.e. using teacher forcing (Williams and Zipser, 1989) and a cross-entropy loss. For optimization, we use AdaFactor (Shazeer and Stern, 2018). At test time, we use greedy decoding (i.e. choosing the highest-probability logit at every timestep).

We pre-train each model for 2^19 = 524,288 steps on C4 before fine-tuning. We use a maximum sequence length of 512 and a batch size of 128 sequences. Whenever possible, we “pack” multiple sequences into each entry of the batch so that our batches contain roughly 2^16 = 65,536 tokens. In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens. This is considerably less than BERT (Devlin et al., 2018), which used roughly 137B tokens, or RoBERTa (Liu et al., 2019c), which used roughly 2.2T tokens. Using only 2^35 tokens results in a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. We consider the effect of pre-training for more steps in Sections 3.6 and 3.7. Note that 2^35 tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training.

如第2.4节所述,所有任务都被表述为文本到文本任务。这使我们能够始终使用标准的最大似然进行训练,即使用teacher forcing(Williams and Zipser, 1989)和交叉熵损失。在优化方面,我们使用AdaFactor(Shazeer and Stern, 2018)。在测试时,我们使用贪婪解码(即在每个时间步选择概率最高的logit)。

在微调之前,我们在C4上对每个模型进行 2^19 = 524,288 步的预训练。我们使用的最大序列长度为512,批量大小为128个序列。只要有可能,我们就将多个序列“打包”到批次的每个条目中,以便我们的批次包含大约 2^16 = 65,536 个token。总的来说,这个批大小和步数对应于在 2^35 ≈ 34B 个token上进行预训练。这比BERT(Devlin等人,2018)少得多,BERT使用了大约137B个token,而RoBERTa(Liu等人,2019c)使用了大约2.2T个token。仅使用 2^35 个token可以保持合理的计算预算,同时仍然提供足够的预训练以获得可接受的性能。我们将在第3.6节和3.7节中考虑更多预训练步数的影响。请注意,2^35 个token只覆盖了整个C4数据集的一小部分,因此我们在预训练期间从不重复任何数据。
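下面用一个简短的 Python 片段核算上文的预训练 token 总量(仅为示意性的算术草图,变量名为笔者假设,并非论文或官方代码):

```python
# 批大小 128、序列长度 512、预训练 2^19 步时所见到的 token 总量(示意计算)
batch_size, seq_len, steps = 128, 512, 2 ** 19

tokens_per_batch = batch_size * seq_len      # 2^16 = 65,536 个 token/批
total_tokens = tokens_per_batch * steps      # 2^35 = 34,359,738,368,约 34B

print(tokens_per_batch)   # 65536
print(total_tokens)       # 34359738368
```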

During pre-training, we use an “inverse square root” learning rate schedule: 1/√(max(n, k)) where n is the current training iteration and k is the number of warm-up steps (set to 10^4 in all of our experiments). This sets a constant learning rate of 0.01 for the first 10^4 steps, then exponentially decays the learning rate until pre-training is over. We also experimented with using a triangular learning rate (Howard and Ruder, 2018), which produced slightly better results but requires knowing the total number of training steps ahead of time. Since we will be varying the number of training steps in some of our experiments, we opt for the more generic inverse square root schedule.

Our models are fine-tuned for 2^18 = 262,144 steps on all tasks. This value was chosen as a trade-off between the high-resource tasks (i.e. those with large data sets), which benefit from additional fine-tuning, and low-resource tasks (smaller data sets), which overfit quickly. During fine-tuning, we continue using batches with 128 length-512 sequences (i.e. 2^16 tokens per batch). We use a constant learning rate of 0.001 when fine-tuning. We save a checkpoint every 5,000 steps and report results on the model checkpoint corresponding to the highest validation performance. For models fine-tuned on multiple tasks, we choose the best checkpoint for each task independently. For all of the experiments except those in Section 3.7, we report results in the validation set to avoid performing model selection on the test set.

在预训练期间,我们使用“平方根反比”学习率计划:1/√(max(n, k)),其中n是当前训练迭代步数,k是预热步数(在我们所有的实验中设置为 10^4)。这相当于前 10^4 步使用0.01的恒定学习率,之后学习率不断衰减,直到预训练结束。我们还尝试使用三角形学习率(Howard and Ruder, 2018),它产生了稍好的结果,但需要提前知道训练步数总数。由于我们将在一些实验中改变训练步数,我们选择更通用的平方根反比计划。

我们的模型在所有任务上微调 2^18 = 262,144 步。选择这个值是为了在高资源任务(即具有大数据集的任务)和低资源任务(较小的数据集)之间进行权衡:前者可以从额外的微调中受益,而后者会很快过拟合。在微调期间,我们继续使用包含128个长度为512的序列的批次(即每个批次 2^16 个token)。我们在微调时使用0.001的恒定学习率。我们每5,000步保存一个检查点,并报告验证性能最高的模型检查点上的结果。对于在多个任务上微调的模型,我们为每个任务独立选择最佳检查点。对于除3.7节以外的所有实验,我们都在验证集上报告结果,以避免在测试集上进行模型选择。
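下面给出上述“平方根反比”学习率计划的一个示意性 Python 实现(函数名与用法均为笔者假设,并非官方实现):

```python
import math

def inverse_sqrt_lr(step, warmup_steps=10_000):
    """平方根反比学习率:1 / sqrt(max(step, warmup_steps))。
    前 warmup_steps(论文中为 10^4)步恒为 1/sqrt(10^4) = 0.01,之后随步数衰减。"""
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(inverse_sqrt_lr(1))        # 预训练初期:0.01
print(inverse_sqrt_lr(500_000))  # 预训练后期:约 0.0014

FINETUNE_LR = 0.001  # 微调阶段则改用 0.001 的恒定学习率
```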

3.1.3 Vocabulary词汇

We use SentencePiece (Kudo and Richardson, 2018) to encode text as WordPiece tokens (Sennrich et al., 2015; Kudo, 2018). For all experiments, we use a vocabulary of 32,000 wordpieces. Since we ultimately fine-tune our model on English to German, French, and Romanian translation, we also require that our vocabulary covers these non-English languages. To address this, we classified pages from the Common Crawl scrape used in C4 as German, French, and Romanian. Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian. This vocabulary was shared across both the input and output of our model. Note that our vocabulary makes it so that our model can only process a predetermined, fixed set of languages.

我们使用SentencePiece(Kudo and Richardson, 2018)将文本编码为WordPiece token(Sennrich et al., 2015;Kudo, 2018)。在所有的实验中,我们使用包含32,000个wordpiece的词表。由于我们最终会在英译德、英译法和英译罗马尼亚语的翻译任务上微调模型,因此我们还要求词表涵盖这些非英语语言。为了解决这个问题,我们将C4所使用的Common Crawl抓取中的页面分类为德语、法语和罗马尼亚语。然后,我们在10份英语C4数据与德语、法语、罗马尼亚语数据各1份的混合语料上训练我们的SentencePiece模型。这个词表在我们模型的输入和输出之间共享。注意,这样的词表使得我们的模型只能处理预先确定的、固定的一组语言。
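下面是用 sentencepiece 库训练并加载一个 32,000 词表的示意性片段(语料路径、模型前缀等均为假设,具体参数写法请以所安装的 sentencepiece 版本为准):

```python
import sentencepiece as spm

# 假设 mixed_corpus.txt 中已按约 10:1:1:1 的比例混合了英语 C4 文本
# 与被分类为德语/法语/罗马尼亚语的 Common Crawl 文本(路径为假设)
spm.SentencePieceTrainer.train(
    input="mixed_corpus.txt",
    model_prefix="t5_spm",
    vocab_size=32000,        # 论文中使用 32,000 个 wordpiece
)

sp = spm.SentencePieceProcessor(model_file="t5_spm.model")
print(sp.encode("translate English to German: That is good.", out_type=str))
```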

3.1.4 Unsupervised Objective无监督目标

Leveraging unlabeled data to pre-train our model necessitates an objective that does not require labels but (loosely speaking) teaches the model generalizable knowledge that will be useful in downstream tasks. Preliminary work that applied the transfer learning paradigm of pre-training and fine-tuning all of the model’s parameters to NLP problems used a causal language modeling objective for pre-training (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018). However, it has recently been shown that “denoising” objectives (Devlin et al., 2018; Taylor, 1953) (also called “masked language modeling”) produce better performance and as a result they have quickly become standard. In a denoising objective, the model is trained to predict missing or otherwise corrupted tokens in the input. Inspired by BERT’s “masked language modeling” objective and the “word dropout” regularization technique (Bowman et al., 2015), we design an objective that randomly samples and then drops out 15% of tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens which are added to our vocabulary and do not correspond to any wordpiece. The target then corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence plus a final sentinel token to mark the end of the target sequence. Our choices to mask consecutive spans of tokens and only predict dropped-out tokens were made to reduce the computational cost of pre-training. We perform thorough investigation into pre-training objectives in Section 3.3. An example of the transformation resulting from applying this objective is shown in Figure 2. We empirically compare this objective to many other variants in Section 3.3.

利用未标注数据来预训练我们的模型,需要一个不依赖标签、但(宽泛地说)能教给模型可泛化知识、从而在下游任务中发挥作用的目标。将“预训练+微调全部模型参数”这一迁移学习范式应用于NLP问题的早期工作,使用因果语言建模目标进行预训练(Dai和Le, 2015;Peters et al., 2018;Radford et al., 2018;Howard and Ruder, 2018)。然而,最近的研究表明,“去噪”目标(Devlin et al., 2018;Taylor, 1953)(也称为“掩码语言建模”)能产生更好的性能,因此它们很快成为标准。在去噪目标中,训练模型来预测输入中缺失或被破坏的token。受BERT的“掩码语言建模”目标和“word dropout”正则化技术(Bowman et al., 2015)的启发,我们设计了一个目标:随机采样并丢弃输入序列中15%的token。每段连续被丢弃的token都被替换为单个哨兵token。每个哨兵token被分配一个在该序列内唯一的token ID。哨兵ID是添加到词表中的特殊token,不对应任何wordpiece。目标序列则对应所有被丢弃的token片段,片段之间由输入序列中使用的相同哨兵token分隔,并以一个最终的哨兵token标记目标序列的结束。我们选择掩蔽连续的token片段并只预测被丢弃的token,是为了降低预训练的计算成本。我们在3.3节中对预训练目标进行了深入研究。图2展示了应用此目标所产生的变换示例。我们在3.3节中将这个目标与许多其他变体进行了实证比较。
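下面用一个简化的 Python 片段示意这种“丢弃约15%的token、用哨兵token替换连续片段”的去噪目标(哨兵token写作 <extra_id_N> 参考了常见实现习惯,整体仅为示意草图,并非论文的确切实现):

```python
import random

def span_corruption(tokens, corruption_rate=0.15, seed=0):
    """示意:随机丢弃约 15% 的 token;每段连续被丢弃的片段在输入中
    用一个哨兵 token 替换,目标序列由这些片段依次拼接而成,
    用同样的哨兵 token 分隔,并以一个结束哨兵收尾。"""
    rng = random.Random(seed)
    drop = [rng.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if drop[i]:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and drop[i]:   # 连续片段共用一个哨兵
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")     # 结束哨兵
    return " ".join(inputs), " ".join(targets)

text = "Thank you for inviting me to your party last week .".split()
print(span_corruption(text))
```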

3.1.5 Baseline Performance基准性能

In this section, we present results using the baseline experimental procedure described above to get a sense of what kind of performance to expect on our suite of downstream tasks. Ideally, we would repeat every experiment in our study multiple times to get a confidence interval on our results. Unfortunately, this would be prohibitively expensive due to the large number of experiments we run. As a cheaper alternative, we train our baseline model 10 times from scratch (i.e. with different random initializations and data set shuffling) and assume that the variance over these runs of the base model also applies to each experimental variant. We don’t expect most of the changes we make to have a dramatic effect on the inter-run variance, so this should provide a reasonable indication of the significance of different changes. Separately, we also measure the performance of training our model for 2^18 steps (the same number we use for fine-tuning) on all downstream tasks without pre-training. This gives us an idea of how much pre-training benefits our model in the baseline setting.

When reporting results in the main text, we only report a subset of the scores across all the benchmarks to conserve space and ease interpretation. For GLUE and SuperGLUE, we report the average score across all subtasks (as stipulated by the official benchmarks) under the headings “GLUE” and “SGLUE”. For all translation tasks, we report the BLEU score (Papineni et al., 2002) as provided by SacreBLEU v1.3.0 (Post, 2018) with “exp” smoothing and “intl” tokenization. We refer to scores for WMT English to German, English to French, and English to Romanian as EnDe, EnFr, and EnRo, respectively. For CNN/Daily Mail, we find the performance of models on the ROUGE-1-F, ROUGE-2-F, and ROUGE-L-F metrics (Lin, 2004) to be highly correlated so we report the ROUGE-2-F score alone under the heading “CNNDM”. Similarly, for SQuAD we find the performance of the “exact match” and “F1” scores to be highly correlated so we report the “exact match” score alone. We provide every score achieved on every task for all experiments in Table 16, Appendix E.

在本节中,我们给出使用上述基线实验流程得到的结果,以了解在我们的下游任务集合上可以预期怎样的性能。理想情况下,我们会多次重复研究中的每个实验,以获得结果的置信区间。不幸的是,由于我们运行的实验数量巨大,这样做的代价过于高昂。作为一种更便宜的替代方案,我们从头开始训练基线模型10次(即使用不同的随机初始化和数据集洗牌),并假设基础模型这些运行之间的方差也适用于每个实验变体。我们不认为所做的大多数更改会对运行间方差产生显著影响,因此这应该可以合理地反映不同更改的显著性。另外,我们还测量了在没有预训练的情况下,对所有下游任务训练 2^18 步(与我们用于微调的步数相同)的模型性能。这让我们了解在基线设置中预训练给模型带来多大好处。

当在正文中报告结果时,我们只报告所有基准测试分数的一个子集,以节省空间并简化解释。对于GLUE和SuperGLUE,我们在“GLUE”和“SGLUE”标题下报告所有子任务的平均分数(根据官方基准规定)。对于所有翻译任务,我们报告由SacreBLEU v1.3.0 (Post, 2018)提供的BLEU分数(Papineni等人,2002),并使用“exp”平滑和“intl”标记化。我们将WMT英语到德语、英语到法语和英语到罗马尼亚语的分数分别称为EnDe、EnFr和EnRo。对于CNN/Daily Mail,我们发现模型在ROUGE-1-F、ROUGE-2-F和ROUGE-L-F指标上的表现(Lin, 2004)是高度相关的,所以我们在“CNNDM”标题下单独报告了ROUGE-2-F得分。类似地,对于SQuAD,我们发现“精确匹配”和“F1”分数的表现高度相关,所以我们单独报告“精确匹配”分数。我们在附录E的表16中给出了所有实验中每个任务的得分。
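作为参考,下面是用 sacrebleu 计算 BLEU 的示意性用法(示例句子为虚构;“exp”平滑在较新版本中为默认设置,参数名在不同版本间可能有差异,请以实际安装版本为准):

```python
import sacrebleu

hyps = ["Das ist gut ."]            # 系统输出(虚构示例)
refs = [["Das ist gut ."]]          # 参考译文(每组参考放在一个列表中)

bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="intl")
print(bleu.score)
```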

Our results tables are all formatted so that each row corresponds to a particular experimental configuration with columns giving the scores for each benchmark. We will include the mean performance of the baseline configuration in most tables. Wherever a baseline configuration appears, we will mark it with a ⋆ (as in the first row of Table 1). We also will boldface any score that is within two standard deviations of the maximum (best) in a given experiment.

Our baseline results are shown in Table 1. Overall, our results are comparable to existing models of similar size. For example, BERTBASE achieved an exact match score of 80.8 on SQuAD and an accuracy of 84.4 on MNLI-matched, whereas we achieve 80.88 and 84.24, respectively (see Table 16). Note that we cannot directly compare our baseline to BERTBASE because ours is an encoder-decoder model and was pre-trained for roughly 1⁄4 as many steps. Unsurprisingly, we find that pre-training provides significant gains across almost all benchmarks. The only exception is WMT English to French, which is a large enough data set that gains from pre-training tend to be marginal. We include this task in our experiments to test the behavior of transfer learning in the high-resource regime. Since we perform early stopping by selecting the best-performing checkpoint, the large disparity between our baseline and “no pre-training” emphasize how much pre-training improves performance on tasks with limited data. While we do not explicitly measure improvements in data efficiency in this paper, we emphasize that this is one of the primary benefits of the transfer learning paradigm.

我们的结果表均按如下格式组织:每一行对应一个特定的实验配置,各列给出每个基准的分数。我们将在大多数表中包含基线配置的平均性能。无论基线配置出现在哪里,我们都会用⋆标记它(如表1的第一行所示)。在给定的实验中,我们还会将与最大值(最佳)相差不超过两个标准差的任何分数加粗。

我们的基线结果如表1所示。总的来说,我们的结果与现有的类似规模的模型相当。例如,BERTBASE在SQuAD上的精确匹配得分为80.8,在MNLI-matched上的准确率为84.4,而我们分别达到80.88和84.24(见表16)。请注意,我们不能直接将我们的基线与BERTBASE进行比较,因为我们的是一个编码器-解码器模型,并且预训练的步数大约只有其1/4。不出所料,我们发现预训练在几乎所有基准测试中都带来显著的提升。唯一的例外是WMT英译法,这个数据集足够大,以至于预训练带来的收益往往微乎其微。我们将这一任务纳入实验,以测试高资源环境下迁移学习的行为。由于我们通过选择性能最好的检查点来执行早停,我们的基线和“无预训练”之间的巨大差距凸显了预训练对数据有限的任务能带来多大的性能提升。虽然我们在本文中没有明确衡量数据效率的提高,但我们强调这是迁移学习范式的主要好处之一。

As for inter-run variance, we find that for most tasks the standard deviation across runs is smaller than 1% of the task’s baseline score. Exceptions to this rule include CoLA, CB, and COPA, which are all low-resource tasks from the GLUE and SuperGLUE benchmarks. For example, on CB our baseline model had an average F1 score of 91.22 with a standard deviation of 3.237 (see Table 16), which may be partly due to the fact that CB’s validation set contains only 56 examples. Note that the GLUE and SuperGLUE scores are computed as the average of scores across the tasks comprising each benchmark. As a result, we caution that the high inter-run variance of CoLA, CB, and COPA can make it harder to compare models using the GLUE and SuperGLUE scores alone.

至于运行间方差,我们发现对于大多数任务,运行间的标准差小于任务基线得分的1%。这条规则的例外包括CoLA、CB和COPA,它们都是GLUE和SuperGLUE基准测试中的低资源任务。例如,在CB上,我们的基线模型的平均F1分数为91.22,标准差为3.237(见表16),这可能部分是由于CB的验证集只包含56个示例。请注意,GLUE和SuperGLUE的分数是作为包含每个基准的任务的分数的平均值计算的。因此,我们警告说,CoLA, CB和COPA的高运行间方差可能会使单独使用GLUE和SuperGLUE分数比较模型变得更加困难。

3.2 Architectures架构

While the Transformer was originally introduced with an encoder-decoder architecture, much modern work on transfer learning for NLP uses alternative architectures. In this section, we review and compare these architectural variants.

虽然Transformer最初是使用编码器-解码器架构引入的,但NLP迁移学习的许多现代工作使用了替代架构。在本节中,我们将回顾并比较这些体系结构变体。

3.2.1 Model Structures模型结构

A major distinguishing factor for different architectures is the “mask” used by different attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let y_i refer to the ith element of the output sequence and x_j refer to the jth entry of the input sequence. y_i is computed as sum_j w_{i,j} x_j, where w_{i,j} is the scalar weight produced by the self-attention mechanism as a function of x_i and x_j. The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output timestep. Diagrams of the masks we will consider are shown in Figure 3. For example, the causal mask (Figure 3, middle) sets any w_{i,j} to zero if j > i.

The first model structure we consider is an encoder-decoder Transformer, which consists of two layer stacks: The encoder, which is fed an input sequence, and the decoder, which produces a new output sequence. A schematic of this architectural variant is shown in the left panel of Figure 4.

不同架构的一个主要区别因素是模型中不同注意力机制所使用的“掩码”。回想一下,Transformer中的自注意力操作以一个序列作为输入,并输出一个相同长度的新序列。输出序列的每个条目都是通过计算输入序列各条目的加权平均得到的。具体来说,设 y_i 为输出序列的第i个元素,x_j 为输入序列的第j个条目,则 y_i = Σ_j w_{i,j} x_j,其中 w_{i,j} 是自注意力机制产生的标量权重,是 x_i 和 x_j 的函数。然后使用注意力掩码将某些权重置零,以约束在给定的输出时间步上可以关注输入的哪些条目。我们将考虑的掩码示意图如图3所示。例如,因果掩码(图3,中间)将所有 j > i 的 w_{i,j} 置为零。

我们考虑的第一个模型结构是编码器-解码器Transformer,它由两个层堆栈组成:编码器接收一个输入序列,解码器产生一个新的输出序列。这种架构变体的示意图如图4的左侧所示。
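下面用 numpy 写一个单头自注意力的示意性实现,直观展示 y_i = Σ_j w_{i,j} x_j 以及因果掩码如何把 j > i 的权重置零(仅为简化草图,用同一矩阵同时充当 query/key/value,并非完整的 Transformer 实现):

```python
import numpy as np

def masked_self_attention(x, causal=False):
    """y_i = sum_j w_{i,j} x_j;causal=True 时屏蔽 j > i 的位置。"""
    scores = x @ x.T / np.sqrt(x.shape[-1])        # 简化:x 同时充当 query 和 key
    if causal:
        mask = np.triu(np.ones_like(scores), k=1)  # j > i 的上三角位置
        scores = np.where(mask == 1, -1e9, scores) # 置为极小值,softmax 后约等于 0
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)               # 每行权重之和为 1
    return w @ x                                   # 输出是输入的加权平均

x = np.random.randn(5, 8)
print(masked_self_attention(x, causal=True).shape)  # (5, 8)
```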

The encoder uses a “fully-visible” attention mask. Fully-visible masking allows a self-attention mechanism to attend to any entry of the input when producing each entry of its output. We visualize this masking pattern in Figure 3, left. This form of masking is appropriate when attending over a “prefix”, i.e. some context provided to the model that is later used when making predictions. BERT (Devlin et al., 2018) also uses a fully-visible masking pattern and appends a special “classification” token to the input. BERT’s output at the timestep corresponding to the classification token is then used to make a prediction for classifying the input sequence.

编码器使用“完全可见”的注意力掩码。完全可见屏蔽允许自关注机制在产生其输出的每个条目时关注输入的任何条目。我们在图3(左)中可视化这个掩蔽模式。当处理“前缀”时,这种形式的屏蔽是合适的,即提供给模型的一些上下文,稍后在进行预测时使用。BERT (Devlin et al., 2018)也使用了完全可见的屏蔽模式,并在输入中附加了一个特殊的“分类”令牌。然后使用BERT在与分类令牌对应的时间步长的输出来对输入序列进行分类预测。

The self-attention operations in the Transformer’s decoder use a “causal” masking pattern. When producing the ith entry of the output sequence, causal masking prevents the model from attending to the jth entry of the input sequence for j > i. This is used during training so that the model can’t “see into the future” as it produces its output. An attention matrix for this masking pattern is shown in Figure 3, middle.

The decoder in an encoder-decoder Transformer is used to autoregressively produce an output sequence. That is, at each output timestep, a token is sampled from the model’s predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on. As such, a Transformer decoder (without an encoder) can be used as a language model (LM), i.e. a model trained solely for next-step prediction (Liu et al., 2018; Radford et al., 2018; Al-Rfou et al., 2019). This constitutes the second model structure we consider. A schematic of this architecture is shown in Figure 4, middle. In fact, early work on transfer learning for NLP used this architecture with a language modeling objective as a pre-training method (Radford et al., 2018).

Transformer解码器中的自注意力操作使用“因果”掩码模式。当产生输出序列的第i个条目时,因果掩码阻止模型关注输入序列中 j > i 的第j个条目。这在训练期间使用,以便模型在产生输出时无法“看到未来”。此掩码模式对应的注意力矩阵如图3中间所示。

编码器-解码器Transformer中的解码器用于自回归地产生输出序列。也就是说,在每个输出时间步,从模型的预测分布中采样一个token,并将该样本反馈到模型中,以产生下一个输出时间步的预测,依此类推。因此,Transformer解码器(没有编码器)可以用作语言模型(LM),即仅为下一步预测而训练的模型(Liu et al., 2018;Radford et al., 2018;Al-Rfou et al., 2019)。这构成了我们考虑的第二个模型结构。该架构的示意图如图4中间所示。事实上,NLP迁移学习的早期工作就使用这种架构配合语言建模目标作为预训练方法(Radford et al., 2018)。

Language models are typically used for compression or sequence generation (Graves, 2013). However, they can also be used in the text-to-text framework simply by concatenating the inputs and targets. As an example, consider the case of English to German translation: If we have a training datapoint with input sentence “That is good.” and target “Das ist gut.”, we would simply train the model on next-step prediction over the concatenated input sequence “translate English to German: That is good. target: Das ist gut.” If we wanted to obtain the model’s prediction for this example, the model would be fed the prefix “translate English to German: That is good. target:” and would be asked to generate the remainder of the sequence autoregressively. In this way, the model can predict an output sequence given an input, which satisfies the needs of text-to-text tasks. This approach was recently used to show that language models can learn to perform some text-to-text tasks without supervision (Radford et al., 2019).

语言模型通常用于压缩或序列生成(Graves, 2013)。但是,只需将输入和目标拼接起来,它们也可以用于文本到文本框架。举个例子,考虑英语到德语的翻译:如果我们有一个训练数据点,输入句子为“That is good.”、目标为“Das ist gut.”,我们只需在拼接后的序列“translate English to German: That is good. target: Das ist gut.”上训练模型做下一步预测。如果我们想获得模型对这个例子的预测,就给模型输入前缀“translate English to German: That is good. target:”,并要求它自回归地生成序列的剩余部分。这样,该模型就能在给定输入的情况下预测输出序列,从而满足文本到文本任务的需要。这种方法最近被用来证明语言模型可以在没有监督的情况下学会执行一些文本到文本的任务(Radford et al., 2019)。
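下面的小片段示意如何把翻译样本拼接成语言模型的训练序列与推理前缀(仅为字符串层面的示意):

```python
source = "That is good."
target = "Das ist gut."

# 训练时:对整个拼接序列做下一步预测
lm_training_text = f"translate English to German: {source} target: {target}"

# 推理时:只喂入前缀,让模型自回归地生成剩余部分
lm_inference_prefix = f"translate English to German: {source} target:"

print(lm_training_text)
print(lm_inference_prefix)
```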

A fundamental and frequently cited drawback of using a language model in the text-to-text setting is that causal masking forces the model’s representation of the ith entry of the input sequence to only depend on the entries up until i. To see why this is potentially disadvantageous, consider the text-to-text framework where the model is provided with a prefix/context before being asked to make predictions (e.g., the prefix is an English sentence and the model is asked to predict the German translation). With fully causal masking, the model’s representation of a prefix state can only depend on prior entries of the prefix. So, when predicting an entry of the output, the model will attend to a representation of the prefix that is unnecessarily limited. Similar arguments have been made against using a unidirectional recurrent neural network encoder in sequence-to-sequence models (Bahdanau et al., 2015).

在文本到文本设置中使用语言模型的一个基本且经常被提及的缺点是,因果掩码迫使模型对输入序列第i个条目的表示只依赖于第i个条目之前(含)的条目。要了解为什么这可能不利,请考虑文本到文本框架:模型在被要求做出预测之前会得到一个前缀/上下文(例如,前缀是一个英语句子,模型被要求预测其德语翻译)。在完全因果掩码下,模型对前缀中某个状态的表示只能依赖于该前缀的先前条目。因此,在预测输出的条目时,模型所关注的前缀表示受到了不必要的限制。类似的论点也被用来反对在序列到序列模型中使用单向循环神经网络编码器(Bahdanau et al., 2015)。

This issue can be avoided in a Transformer-based language model simply by changing the masking pattern. Instead of using a causal mask, we use fully-visible masking during the prefix portion of the sequence. This masking pattern and a schematic of the resulting “prefix LM” (the third model structure we consider) are illustrated in the rightmost panels of Figures 3 and 4, respectively. In the English to German translation example mentioned above, fully-visible masking would be applied to the prefix “translate English to German: That is good. target:” and causal masking would be used during training for predicting the target “Das ist gut.” Using a prefix LM in the text-to-text framework was originally proposed by Liu et al. (2018). More recently, Dong et al. (2019) showed that this architecture is effective on a wide variety of text-to-text tasks. This architecture is similar to an encoder-decoder model with parameters shared across the encoder and decoder and with the encoder-decoder attention replaced with full attention across the input and target sequence.

We note that when following our text-to-text framework, the prefix LM architecture closely resembles BERT (Devlin et al., 2018) for classification tasks. To see why, consider an example from the MNLI benchmark where the premise is “I hate pigeons.”, the hypothesis is “My feelings towards pigeons are filled with animosity.” and the correct label is “entailment”. To feed this example into a language model, we would transform it into the sequence “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target: entailment”. In this case, the fully-visible prefix would correspond to the entire input sequence up to the word “target:”, which can be seen as being analogous to the “classification” token used in BERT. So, our model would have full visibility over the entire input, and then would be tasked with making a classification by outputting the word “entailment”. It is easy for the model to learn to output one of the valid class labels given the task prefix (“mnli” in this case). As such, the main difference between a prefix LM and the BERT architecture is that the classifier is simply integrated into the output layer of the Transformer decoder in the prefix LM.

在基于Transformer的语言模型中,只需更改掩码模式就可以避免这个问题。我们不使用因果掩码,而是在序列的前缀部分使用完全可见的掩码。这种掩码模式以及由此产生的“前缀LM”(我们考虑的第三种模型结构)的示意图分别见图3和图4最右侧的面板。在上面提到的英译德例子中,完全可见掩码将应用于前缀“translate English to German: That is good. target:”,而因果掩码将在训练期间用于预测目标“Das ist gut.”。在文本到文本框架中使用前缀LM最初由Liu等人(2018)提出。最近,Dong等人(2019)表明,这种架构在各种文本到文本任务中都是有效的。该架构类似于一个在编码器和解码器之间共享参数的编码器-解码器模型,并且其中的编码器-解码器注意力被替换为跨输入与目标序列的完全注意力。

我们注意到,当遵循我们的文本到文本框架时,前缀LM架构在分类任务上非常类似于BERT(Devlin et al., 2018)。要了解原因,请考虑MNLI基准中的一个例子:前提是“I hate pigeons.”,假设是“My feelings towards pigeons are filled with animosity.”,正确的标签是“entailment”(蕴涵)。为了将这个示例输入语言模型,我们将其转换为序列“mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target: entailment”。在这种情况下,完全可见的前缀将对应直到“target:”一词为止的整个输入序列,这可以被视为类似于BERT中使用的“分类”token。因此,我们的模型将对整个输入具有完全的可见性,然后通过输出单词“entailment”来完成分类。给定任务前缀(本例中为“mnli”),模型很容易学会输出某个有效的类别标签。因此,前缀LM和BERT架构之间的主要区别在于,前缀LM中的分类器只是简单地集成到Transformer解码器的输出层中。
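同样的道理,MNLI 这类分类任务也可以按统一的文本到文本格式来组织,下面是一个字符串层面的示意:

```python
premise = "I hate pigeons."
hypothesis = "My feelings towards pigeons are filled with animosity."

# 模型输入:任务前缀 + 前提 + 假设;模型目标:直接输出类别词
model_input = f"mnli premise: {premise} hypothesis: {hypothesis}"
model_target = "entailment"

print(model_input, "->", model_target)
```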

3.2.2 Comparing Different Model Structures比较不同的模型结构

In the interest of experimentally comparing these architectural variants, we would like each model we consider to be equivalent in some meaningful way. We might say that two models are equivalent if they either have the same number of parameters or they require roughly the same amount of computation to process a given (input-sequence, target-sequence) pair. Unfortunately, it is not possible to compare an encoder-decoder model to a language model architecture (comprising a single Transformer stack) according to both of these criteria at the same time. To see why, first note an encoder-decoder model with L layers in the encoder and L layers in the decoder has approximately the same number of parameters as a language model with 2L layers. However, the same L + L encoder-decoder model will have approximately the same computational cost as a language model with only L layers. This is a consequence of the fact that the L layers in the language model must be applied to both the input and output sequence, while the encoder is only applied to the input sequence and the decoder is only applied to the output sequence. Note that these equivalences are approximate—there are some extra parameters in the decoder due to the encoder-decoder attention and there are also some computational costs in the attention layers that are quadratic in the sequence lengths. In practice, however, we observed nearly identical step times for L-layer language models versus L + L-layer encoder-decoder models, suggesting a roughly equivalent computational cost. Further, for the model sizes we consider, the number of parameters in the encoder-decoder attention layers is about 10% of the total parameter count, so we make the simplifying assumption that an L + L-layer encoder-decoder model has the same number of parameters as an 2L-layer language model.

为了通过实验比较这些架构变体,我们希望所考虑的每个模型在某种有意义的层面上是等效的。如果两个模型具有相同数量的参数,或者它们处理给定的(输入序列、目标序列)对需要大致相同的计算量,我们可以说它们是等效的。不幸的是,无法同时按照这两个标准将编码器-解码器模型与语言模型架构(由单个Transformer堆栈组成)进行比较。要了解原因,首先注意编码器中有L层、解码器中有L层的编码器-解码器模型,与具有2L层的语言模型的参数数量大致相同。然而,同样的L + L编码器-解码器模型的计算成本,却与只有L层的语言模型大致相同。这是因为语言模型中的L层必须同时应用于输入和输出序列,而编码器只应用于输入序列、解码器只应用于输出序列。请注意,这些等价关系是近似的:由于编码器-解码器注意力,解码器中有一些额外的参数;注意力层中也有一些与序列长度成平方关系的计算成本。然而在实践中,我们观察到L层语言模型与L + L层编码器-解码器模型的每步耗时几乎相同,表明两者的计算成本大致相当。此外,对于我们考虑的模型规模,编码器-解码器注意力层中的参数数量约占总参数量的10%,因此我们做出简化假设:L + L层的编码器-解码器模型与2L层的语言模型具有相同数量的参数。

To provide a reasonable means of comparison, we consider multiple configurations for our encoder-decoder model. We will refer to the number of layers and parameters in a BERTBASE-sized layer stack as L and P , respectively. We will use M to refer to the number of FLOPs required for an L + L-layer encoder-decoder model or L-layer decoder-only model to process a given input-target pair. In total, we will compare:

>> An encoder-decoder model with L layers in the encoder and L layers in the decoder. This model has 2P parameters and a computation cost of M FLOPs.

>> An equivalent model, but with parameters shared across the encoder and decoder, resulting in P parameters and an M-FLOP computational cost.

>> An encoder-decoder model with L/2 layers each in the encoder and decoder, giving P parameters and an M/2-FLOP cost.

>> A decoder-only language model with L layers and P parameters and a resulting computational cost of M FLOPs.

>> A decoder-only prefix LM with the same architecture (and thus the same number of parameters and computational cost), but with fully-visible self-attention over the input.

为了提供一种合理的比较方法,我们考虑了编码器-解码器模型的多种配置。我们将bertbase大小的层堆栈中的层数和参数分别称为L和P。我们将使用M来表示L + L层编码器-解码器模型或仅L层解码器模型处理给定输入-目标对所需的FLOPs数。总的来说,我们将比较:

>> 一个编码器和解码器各有L层的编码器-解码器模型。该模型有2P个参数,计算成本为M FLOPs。

>> 等效的模型,但在编码器和解码器之间共享参数,得到P个参数和M FLOPs的计算成本。

>> 一个编码器和解码器各有L/2层的编码器-解码器模型,得到P个参数和M/2 FLOPs的成本。

>> 一个具有L层和P个参数的仅解码器语言模型,计算成本为M FLOPs。

>> 一个具有相同架构(因而具有相同数量的参数和计算成本)的仅解码器前缀LM,但对输入使用完全可见的自注意力。

3.2.3 Objectives目标

As an unsupervised objective, we will consider both a basic language modeling objective as well as our baseline denoising objective described in Section 3.1.4. We include the language modeling objective due to its historic use as a pre-training objective (Dai and Le, 2015; Ramachandran et al., 2016; Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018) as well as its natural fit for the language model architectures we consider. For models that ingest a prefix before making predictions (the encoder-decoder model and prefix LM), we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions. For the standard language model, we train the model to predict the entire span from beginning to end. Our unsupervised denoising objective is designed for text-to-text models; to adapt it for use with a language model we concatenate the inputs and targets as described in Section 3.2.1.

作为一个无监督的目标,我们将考虑一个基本的语言建模目标以及3.1.4节中描述的基线去噪目标。我们将语言建模目标纳入其中,是因为它曾被用作预训练目标(Dai和Le, 2015;Ramachandran et al., 2016;Howard and Ruder, 2018;Radford et al., 2018;Peters等人,2018),以及它自然适合我们考虑的语言模型架构。对于在做出预测之前摄取前缀的模型(编码器-解码器模型和前缀LM),我们从未标记的数据集中采样一段文本,并选择一个随机点将其分成前缀和目标部分。对于标准语言模型,我们训练模型从头到尾预测整个跨度。我们的无监督去噪目标是为文本到文本模型设计的;为了使其适应语言模型的使用,我们将输入和目标连接起来,如第3.2.1节所述。
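对于“先读前缀再预测”的模型,上述语言建模目标可以理解为把一段无标注文本在随机位置切成前缀和目标两部分,下面是一个示意性片段(函数名为笔者假设):

```python
import random

def split_prefix_target(tokens, seed=0):
    """从无标注文本中取一段 token,在随机位置切分为(前缀, 目标),
    供编码器-解码器模型或前缀 LM 做语言建模式预训练;
    标准语言模型则直接从头到尾预测整个片段。"""
    rng = random.Random(seed)
    split = rng.randint(1, len(tokens) - 1)   # 保证两部分都非空
    return tokens[:split], tokens[split:]

prefix, target = split_prefix_target("the quick brown fox jumps over the lazy dog".split())
print(prefix, "->", target)
```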

3.5 Training Strategy训练策略

So far we have considered the setting where all parameters of a model are pre-trained on an unsupervised task before being fine-tuned on individual supervised tasks. While this approach is straightforward, various alternative methods for training the model on down-stream/supervised tasks have been proposed. In this section, we compare different schemes for fine-tuning the model in addition to the approach of training the model simultaneously on multiple tasks.

到目前为止,我们已经考虑了这样一种设置,即在对单个监督任务进行微调之前,在无监督任务上对模型的所有参数进行预训练。虽然这种方法很简单,但已经提出了各种用于在下游/监督任务上训练模型的替代方法。在本节中,除了在多个任务上同时训练模型的方法外,我们还比较了微调模型的不同方案。

3.5.1 Fine-tuning Methods微调方法

It has been argued that fine-tuning all of the model’s parameters can lead to suboptimal results, particularly on low-resource tasks (Peters et al., 2019). Early results on transfer learning for text classification tasks advocated fine-tuning only the parameters of a small classifier that was fed sentence embeddings produced by a fixed pre-trained model (Subramanian et al., 2018; Kiros et al., 2015; Logeswaran and Lee, 2018; Hill et al., 2016; Conneau et al., 2017). This approach is less applicable to our encoder-decoder model because the entire decoder must be trained to output the target sequences for a given task. Instead, we focus on two alternative fine-tuning approaches that update only a subset of the parameters of our encoder-decoder model.

The first, “adapter layers” (Houlsby et al., 2019; Bapna et al., 2019), is motivated by the goal of keeping most of the original model fixed while fine-tuning. Adapter layers are additional dense-ReLU-dense blocks that are added after each of the preexisting feed-forward networks in each block of the Transformer. These new feed-forward networks are designed so that their output dimensionality matches their input. This allows them to be inserted into the network with no additional changes to the structure or parameters. When fine-tuning, only the adapter layer and layer normalization parameters are updated. The main hyperparameter of this approach is the inner dimensionality d of the feed-forward network, which changes the number of new parameters added to the model. We experiment with various values for d.

有人认为,微调模型的所有参数可能导致次优结果,特别是在低资源任务上(Peters et al., 2019)。文本分类任务迁移学习的早期结果主张只微调一个小分类器的参数,该分类器以固定的预训练模型产生的句子嵌入作为输入(Subramanian等人,2018;Kiros等人,2015;Logeswaran and Lee, 2018;Hill et al., 2016;Conneau et al., 2017)。这种方法不太适用于我们的编码器-解码器模型,因为整个解码器必须经过训练才能输出给定任务的目标序列。因此,我们关注两种只更新编码器-解码器模型一部分参数的替代微调方法。

第一种是“适配器层”(Houlsby等人,2019;Bapna等人,2019),其动机是在微调时保持大部分原始模型固定不变。适配器层是额外的dense-ReLU-dense块,它们被添加到Transformer每个块中原有的前馈网络之后。这些新的前馈网络被设计成输出维度与输入维度相同,因而可以直接插入网络,而不需要对结构或参数做额外更改。微调时,只更新适配器层和层归一化的参数。该方法的主要超参数是前馈网络的内部维度d,它决定了添加到模型中的新参数数量。我们对不同的d值进行了实验。
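下面用 tf.keras 给出适配器层的一个示意性实现(其中的残差连接是 Houlsby 等人做法中常见的设计,原文此处并未明确说明,属于笔者假设;类名与调用方式也均为示意):

```python
import tensorflow as tf

class AdapterLayer(tf.keras.layers.Layer):
    """dense -> ReLU -> dense,输出维度与输入一致,可直接插入原有前馈网络之后。
    d 为适配器内部维度(论文中尝试了 32/128/512/2048)。"""
    def __init__(self, d_model, d):
        super().__init__()
        self.down = tf.keras.layers.Dense(d, activation="relu")
        self.up = tf.keras.layers.Dense(d_model)

    def call(self, x):
        return x + self.up(self.down(x))   # 残差连接(假设),初始近似恒等映射

adapter = AdapterLayer(d_model=768, d=32)
y = adapter(tf.random.normal([2, 512, 768]))
print(y.shape)   # (2, 512, 768);微调时只更新适配器与层归一化参数
```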

The second alternative fine-tuning method we consider is “gradual unfreezing” (Howard and Ruder, 2018). In gradual unfreezing, more and more of the model’s parameters are fine-tuned over time. Gradual unfreezing was originally applied to a language model architecture consisting of a single stack of layers. In this setting, at the start of fine-tuning only the parameters of the final layer are updated, then after training for a certain number of updates the parameters of the second-to-last layer are also included, and so on until the entire network’s parameters are being fine-tuned. To adapt this approach to our encoder-decoder model, we gradually unfreeze layers in the encoder and decoder in parallel, starting from the top in both cases. Since the parameters of our input embedding matrix and output classification matrix are shared, we update them throughout fine-tuning. Recall that our baseline model consists of 12 layers each in the encoder and decoder and is fine-tuned for 2^18 steps. As such, we subdivide the fine-tuning process into 12 episodes of 2^18/12 steps each and train from layers 12 − n to 12 in the nth episode. We note that Howard and Ruder (2018) suggested fine-tuning an additional layer after each epoch of training. However, since our supervised data sets vary so much in size and since some of our downstream tasks are actually mixtures of many tasks (GLUE and SuperGLUE), we instead adopt the simpler strategy of fine-tuning an additional layer after every 2^18/12 steps.

我们考虑的第二种替代微调方法是“逐渐解冻”(Howard and Ruder, 2018)。在逐渐解冻中,随着训练的进行,越来越多的模型参数被纳入微调。逐渐解冻最初应用于由单个层堆栈组成的语言模型架构。在该设置中,微调开始时只更新最后一层的参数,训练一定次数的更新后再加入倒数第二层的参数,依此类推,直到整个网络的参数都在被微调。为了使这种方法适应我们的编码器-解码器模型,我们并行地逐渐解冻编码器和解码器中的层,两者都从顶层开始。由于输入嵌入矩阵和输出分类矩阵的参数是共享的,我们在整个微调过程中都更新它们。回想一下,我们的基线模型的编码器和解码器各由12层组成,并且总共微调 2^18 步。因此,我们将微调过程细分为12个阶段,每个阶段 2^18/12 步,并在第n个阶段训练第12−n层到第12层。我们注意到,Howard和Ruder(2018)建议在每个训练epoch之后再解冻一层。然而,由于我们的监督数据集在规模上差异很大,并且我们的一些下游任务实际上是许多任务的混合(GLUE和SuperGLUE),我们改为采用更简单的策略:每 2^18/12 步之后再解冻一层。
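“逐渐解冻”的解冻进度可以用一个很小的函数来示意(层编号与阶段划分按上文描述,函数本身为笔者假设的草图):

```python
def layers_to_unfreeze(step, total_steps=2 ** 18, num_layers=12):
    """把微调过程均分为 num_layers 个阶段;第 n 个阶段解冻自顶向下的前 n 层
    (编码器与解码器同步解冻;共享的嵌入矩阵始终更新,此处未体现)。
    返回参与更新的层下标(0 表示最底层,num_layers-1 表示最顶层)。"""
    steps_per_stage = total_steps // num_layers
    n = min(step // steps_per_stage + 1, num_layers)
    return list(range(num_layers - n, num_layers))

print(layers_to_unfreeze(0))             # 只解冻最顶层:[11]
print(layers_to_unfreeze(2 ** 18 - 1))   # 最后阶段:全部 12 层
```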

A comparison of the performance of these fine-tuning approaches is shown in Table 10. For adapter layers, we report the performance using an inner dimensionality d of 32, 128, 512, 2048. Pursuant with past results (Houlsby et al., 2019; Bapna et al., 2019) we find that lower-resource tasks like SQuAD work well with a small value of d whereas higher resource tasks require a large dimensionality to achieve reasonable performance. This suggests that adapter layers could be a promising technique for fine-tuning on fewer parameters as long as the dimensionality is scaled appropriately to the task size. Note that in our case we treat GLUE and SuperGLUE each as a single “task” by concatenating their constituent data sets, so although they comprise some low-resource data sets the combined data set is large enough that it necessitates a large value of d. We found that gradual unfreezing caused a minor degradation in performance across all tasks, though it did provide some speedup during fine-tuning. Better results may be attainable by more carefully tuning the unfreezing schedule.

这些微调方法的性能比较如表10所示。对于适配器层,我们使用内部维度d为32、128、512、2048来报告性能。根据以往的研究结果(Houlsby et al., 2019;Bapna et al., 2019)我们发现,像SQuAD这样的低资源任务在d值较小的情况下工作得很好,而高资源任务需要大维度才能实现合理的性能。这表明,适配器层可能是一种很有前途的技术,可以对更少的参数进行微调,只要维度适当地缩放到任务大小。请注意,在我们的案例中,我们通过连接它们的组成数据集将GLUE和SuperGLUE各自视为单个“任务”,因此,尽管它们包含一些低资源数据集,但组合的数据集足够大,因此需要较大的d值。我们发现,逐渐解冻会导致所有任务的性能略有下降,尽管它确实在微调期间提供了一些加速。通过更仔细地调整解冻时间表,可以获得更好的结果。

3.5.2 Multi-task Learning多任务学习

So far, we have been pre-training our model on a single unsupervised learning task before fine-tuning it individually on each downstream task. An alternative approach, called “multi-task learning” (Ruder, 2017; Caruana, 1997), is to train the model on multiple tasks at a time. This approach typically has the goal of training a single model that can simultaneously perform many tasks at once, i.e. the model and most of its parameters are shared across all tasks. We relax this goal somewhat and instead investigate methods for training on multiple tasks at once in order to eventually produce separate parameter settings that perform well on each individual task. For example, we might train a single model on many tasks, but when reporting performance we are allowed to select a different checkpoint for each task. This loosens the multi-task learning framework and puts it on more even footing compared to the pre-train-then-fine-tune approach we have considered so far. We also note that in our unified text-to-text framework, “multi-task learning” simply corresponds to mixing data sets together. It follows that we can still train on unlabeled data when using multi-task learning by treating the unsupervised task as one of the tasks being mixed together. In contrast, most applications of multi-task learning to NLP add task-specific classification networks or use different loss functions for each task (Liu et al., 2019b).

到目前为止,我们都是先在单个无监督学习任务上预训练模型,然后在每个下游任务上分别对其进行微调。另一种方法称为“多任务学习”(Ruder, 2017;Caruana, 1997),即同时在多个任务上训练模型。这种方法通常的目标是训练一个能够同时执行许多任务的单一模型,即模型及其大部分参数在所有任务中共享。我们稍微放宽了这个目标,转而研究同时在多个任务上训练、但最终为每个单独任务产生表现良好的独立参数设置的方法。例如,我们可能在许多任务上训练单个模型,但在报告性能时,允许为每个任务选择不同的检查点。这放宽了多任务学习的框架,使其与我们目前所考虑的“先预训练再微调”方法处于更公平的比较基础上。我们还注意到,在我们统一的文本到文本框架中,“多任务学习”仅仅对应于将数据集混合在一起。因此,当使用多任务学习时,我们仍然可以通过将无监督任务视为被混合的任务之一来利用未标注数据进行训练。相比之下,NLP中多任务学习的大多数应用会添加特定于任务的分类网络,或者对每个任务使用不同的损失函数(Liu et al., 2019b)。

As pointed out by Arivazhagan et al. (2019), an extremely important factor in multi-task learning is how much data from each task the model should be trained on. Our goal is to not under- or over-train the model—that is, we want the model to see enough data from a given task that it can perform the task well, but not to see so much data that it memorizes the training set. How exactly to set the proportion of data coming from each task can depend on various factors including data set sizes, the “difficulty” of learning the task (i.e. how much data the model must see before being able to perform the task effectively), regularization, etc. An additional issue is the potential for “task interference” or “negative transfer”, where achieving good performance on one task can hinder performance on another. Given these concerns, we begin by exploring various strategies for setting the proportion of data coming from each task. A similar exploration was performed by Wang et al. (2019a).

正如Arivazhagan等人(2019)指出的那样,多任务学习中一个极其重要的因素是模型应该在每个任务中训练多少数据。我们的目标是不让模型训练不足或过度,也就是说,我们希望模型从给定任务中看到足够的数据,以便它能很好地执行任务,但不要看到太多的数据,以至于它记住了训练集。如何准确地设置来自每个任务的数据比例取决于各种因素,包括数据集大小、学习任务的“难度”(即模型在能够有效执行任务之前必须看到多少数据)、正则化等。另一个问题是潜在的“任务干扰”或“负迁移”,即在一项任务上取得良好表现可能会妨碍另一项任务的表现。考虑到这些问题,我们首先探索用于设置来自每个任务的数据比例的各种策略。Wang等人(2019a)也进行了类似的探索。

Examples-proportional mixing样本比例混合

Examples-proportional mixing A major factor in how quickly a model will overfit to a given task is the task’s data set size. As such, a natural way to set the mixing proportions is to sample in proportion to the size of each task’s data set. This is equivalent to concatenating the data sets for all tasks and randomly sampling examples from the combined data set. Note, however, that we are including our unsupervised denoising task, which uses a data set that is orders of magnitude larger than every other task’s. It follows that if we simply sample in proportion to each data set’s size, the vast majority of the data the model sees will be unlabeled, and it will undertrain on all of the supervised tasks. Even without the unsupervised task, some tasks (e.g. WMT English to French) are so large that they would similarly crowd out most of the batches. To get around this issue, we set an artificial “limit” on the data set sizes before computing the proportions. Specifically, if the number of examples in each of our N task’s data sets is e_n, n ∈ {1, . . . , N}, then we set the probability of sampling an example from the mth task during training to r_m = min(e_m, K) / Σ_n min(e_n, K), where K is the artificial data set size limit.

决定模型在给定任务上过拟合速度的一个主要因素是该任务数据集的大小。因此,设置混合比例的一种自然方法是按每个任务数据集的大小按比例采样。这等价于把所有任务的数据集拼接起来,再从合并后的数据集中随机抽样。然而请注意,我们包含了无监督去噪任务,它使用的数据集比其他每个任务的数据集都大几个数量级。因此,如果我们简单地按每个数据集的大小进行抽样,那么模型看到的绝大多数数据将是未标注的,它将在所有监督任务上训练不足。即使没有无监督任务,一些任务(例如WMT英译法)也大到同样会挤占大多数批次。为了解决这个问题,我们在计算比例之前人为设置数据集大小的“上限”。具体来说,如果我们的N个任务各自的数据集样本数为 e_n(n ∈ {1, …, N}),则我们将训练时从第m个任务中抽样的概率设为 r_m = min(e_m, K) / Σ_n min(e_n, K),其中K是人为设置的数据集大小上限。
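样本比例混合的抽样概率 r_m = min(e_m, K)/Σ_n min(e_n, K) 可以直接按定义实现,下面是一个示意性片段(任务规模为虚构示例):

```python
def examples_proportional_rates(sizes, K):
    """r_m = min(e_m, K) / sum_n min(e_n, K):
    sizes 为各任务样本数 e_n,K 为人为设置的数据集大小上限。"""
    clipped = [min(e, K) for e in sizes]
    total = sum(clipped)
    return [c / total for c in clipped]

# 虚构的三个任务:一个极大(如无监督去噪)、一个中等、一个很小
sizes = [10_000_000_000, 400_000, 2_500]
print(examples_proportional_rates(sizes, K=2 ** 19))
```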

Temperature-scaled mixing温度缩放混合

Temperature-scaled mixing An alternative way of mitigating the huge disparity between data set sizes is to adjust the “temperature” of the mixing rates. This approach was used by multilingual BERT to ensure that the model was sufficiently trained on low-resource languages. To implement temperature scaling with temperature T, we raise each task’s mixing rate r_m to the power of 1/T and renormalize the rates so that they sum to 1. When T = 1, this approach is equivalent to examples-proportional mixing and as T increases the proportions become closer to equal mixing. We retain the data set size limit K (applied to obtain r_m before temperature scaling) but set it to a large value of K = 2^21. We use a large value of K because increasing the temperature will decrease the mixing rate of the largest data sets.

缓解数据集规模之间巨大差异的另一种方法是调整混合比例的“温度”。多语言BERT曾使用这种方法来确保模型在低资源语言上得到充分训练。为了实现温度为T的缩放,我们将每个任务的混合比例 r_m 取 1/T 次方,然后重新归一化使其总和为1。当T = 1时,这种方法等价于样本比例混合;随着T的增大,各比例越来越接近均等混合。我们保留数据集大小上限K(在温度缩放之前用于计算 r_m),但将其设置为较大的值 K = 2^21。我们使用较大的K值,是因为升高温度会降低最大数据集的混合比例。
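温度缩放混合在上面的基础上再取 1/T 次方并重新归一化,下面是对应的示意性实现(任务规模同样为虚构示例):

```python
def temperature_scaled_rates(sizes, K=2 ** 21, T=2.0):
    """先按 r_m = min(e_m, K)/sum_n min(e_n, K) 计算比例,
    再取 1/T 次方并重新归一化;T=1 退化为样本比例混合,T 越大越接近均等混合。"""
    clipped = [min(e, K) for e in sizes]
    total = sum(clipped)
    scaled = [(c / total) ** (1.0 / T) for c in clipped]
    z = sum(scaled)
    return [s / z for s in scaled]

sizes = [10_000_000_000, 400_000, 2_500]
for T in (1.0, 2.0, 8.0):
    print(T, [round(r, 4) for r in temperature_scaled_rates(sizes, T=T)])
```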

Equal mixing均等混合

Equal mixing In this case, we sample examples from each task with equal probability. Specifically, each example in each batch is sampled uniformly at random from one of the data sets we train on. This is most likely a suboptimal strategy, as the model will overfit quickly on low-resource tasks and underfit on high-resource tasks. We mainly include it as a point of reference of what might go wrong when the proportions are set suboptimally.

在这种情况下,我们以相等的概率从每个任务中采样。具体来说,每个批次中的每个样本都是从我们训练的某个数据集中均匀随机采样的。这很可能是一个次优策略,因为模型会在低资源任务上快速过拟合,而在高资源任务上欠拟合。我们把它主要作为一个参考点,用来说明当混合比例设置得不理想时可能出现的问题。

To compare these mixing strategies on equal footing with our baseline pre-train-then-fine-tune results, we train multi-task models for the same total number of steps: 2^19 + 2^18 = 786,432. The results are shown in Table 11.

为了在同等条件下将这些混合策略与我们的基线“先预训练再微调”结果进行比较,我们以相同的总步数训练多任务模型:2^19 + 2^18 = 786,432。结果如表11所示。

In general, we find that multi-task training underperforms pre-training followed by fine-tuning on most tasks. The “equal” mixing strategy in particular results in dramatically degraded performance, which may be because the low-resource tasks have overfit, the high-resource tasks have not seen enough data, or the model has not seen enough unlabeled data to learn general-purpose language capabilities. For examples-proportional mixing, we find that for most tasks there is a “sweet spot” for K where the model obtains the best performance, and larger or smaller values of K tend to result in worse performance. The exception (for the range of K values we considered) was WMT English to French translation, which is such a high-resource task that it always benefits from a higher mixing proportion. Finally, we note that temperature-scaled mixing also provides a means of obtaining reasonable performance from most tasks, with T = 2 performing the best in most cases. The finding that a multi-task model is outperformed by separate models trained on each individual task has previously been observed e.g. by Arivazhagan et al. (2019) and McCann et al. (2018), though it has been shown that the multi-task setup can confer benefits across very similar tasks (Liu et al., 2019b; Ratner et al., 2018). In the following section, we explore ways to close the gap between multi-task training and the pre-train-then-fine-tune approach.

总的来说,我们发现多任务训练在大多数任务上的表现不如“先预训练再微调”。尤其是“均等”混合策略会导致性能急剧下降,这可能是因为低资源任务过拟合、高资源任务没有看到足够的数据,或者模型没有看到足够的未标注数据来学习通用的语言能力。对于样本比例混合,我们发现对大多数任务而言,K存在一个使模型获得最佳性能的“最佳点”,K取更大或更小的值往往会导致性能变差。唯一的例外(在我们考虑的K值范围内)是WMT英译法翻译,这是一项资源极其丰富的任务,总是从更高的混合比例中受益。最后,我们注意到,温度缩放混合也提供了一种在大多数任务上获得合理性能的手段,其中T = 2在大多数情况下表现最佳。“多任务模型不如在每个单独任务上分别训练的模型”这一发现此前已被观察到,例如Arivazhagan等人(2019)和McCann等人(2018),不过也有研究表明多任务设置可以在非常相似的任务之间带来收益(Liu等人,2019b;Ratner等人,2018)。在下一节中,我们将探讨如何缩小多任务训练与“先预训练再微调”方法之间的差距。

3.5.3 Combining Multi-Task Learning with Fine-Tuning多任务学习与微调相结合

Recall that we are studying a relaxed version of multi-task learning where we train a single model on a mixture of tasks but are allowed to evaluate performance using different parameter settings (checkpoints) for the model. We can extend this approach by considering the case where the model is pre-trained on all tasks at once but is then fine-tuned on the individual supervised tasks. This is the method used by the “MT-DNN” (Liu et al., 2015, 2019b), which achieved state-of-the-art performance on GLUE and other benchmarks when it was introduced. We consider three variants of this approach: In the first, we simply pre-train the model on an examples-proportional mixture with an artificial data set size limit of K = 2^19 before fine-tuning it on each individual downstream task. This helps us measure whether including the supervised tasks alongside the unsupervised objective during pre-training gives the model some beneficial early exposure to the downstream tasks. We might also hope that mixing in many sources of supervision could help the pre-trained model obtain a more general set of “skills” (loosely speaking) before it is adapted to an individual task. To measure this directly, we consider a second variant where we pre-train the model on the same examples-proportional mixture (with K = 2^19) except that we omit one of the downstream tasks from this pre-training mixture. Then, we fine-tune the model on the task that was left out during pre-training. We repeat this for each of the downstream tasks we consider. We call this approach “leave-one-out” multi-task training. This simulates the real-world setting where a pre-trained model is fine-tuned on a task it had not seen during pre-training. Note that multi-task pre-training provides a diverse mixture of supervised tasks. Since other fields (e.g. computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014)) use a supervised data set for pre-training, we were interested to see whether omitting the unsupervised task from the multi-task pre-training mixture still produced good results. For our third variant we therefore pre-train on an examples-proportional mixture of all of the supervised tasks we consider with K = 2^19. In all of these variants, we follow our standard procedure of pre-training for 2^19 steps before fine-tuning for 2^18 steps.

回想一下,我们正在研究多任务学习的一个宽松版本:我们在任务混合上训练单个模型,但允许使用模型的不同参数设置(检查点)来评估性能。我们可以通过考虑这样一种情况来扩展这种方法:模型先在所有任务上同时进行预训练,然后在各个监督任务上分别微调。这是“MT-DNN”(Liu et al., 2015, 2019b)所使用的方法,它在提出时在GLUE和其他基准上取得了最先进的性能。我们考虑这种方法的三种变体:在第一种中,我们先在人为数据集大小上限为 K = 2^19 的样本比例混合上预训练模型,然后在每个单独的下游任务上微调。这有助于我们衡量:在预训练期间将监督任务与无监督目标一起纳入,是否会让模型对下游任务获得一些有益的提前接触。我们也希望混合多种监督来源能帮助预训练模型在适应单个任务之前获得更通用的一组“技能”(宽泛地说)。为了直接衡量这一点,我们考虑第二种变体:在相同的样本比例混合(K = 2^19)上预训练模型,但从这个预训练混合中去掉某一个下游任务,然后在这个被排除的任务上微调模型。我们对所考虑的每个下游任务都重复这一过程。我们称这种方法为“留一法(leave-one-out)”多任务训练。这模拟了现实世界中的情形:预训练模型在预训练期间没有见过的任务上进行微调。注意,多任务预训练提供了多样的监督任务混合。由于其他领域(例如计算机视觉(Oquab et al., 2014;Jia et al., 2014;Huh et al., 2016;Yosinski et al., 2014))使用监督数据集进行预训练,我们也想知道把无监督任务从多任务预训练混合中去掉是否仍能取得良好结果。因此,在第三种变体中,我们在所考虑的全部监督任务的样本比例混合(K = 2^19)上进行预训练。在所有这些变体中,我们都遵循标准流程:先预训练 2^19 步,再微调 2^18 步。

We compare the results of these approaches in Table 12. For comparison, we also include results for our baseline (pre-train then fine-tune) and for standard multi-task learning (without fine-tuning) on an examples-proportional mixture with K = 2^19. We find that fine-tuning after multi-task pre-training results in comparable performance to our baseline. This suggests that using fine-tuning after multi-task learning can help mitigate some of the trade-offs between different mixing rates described in Section 3.5.2. Interestingly, the performance of “leave-one-out” training was only slightly worse, suggesting that a model that was trained on a variety of tasks can still adapt to new tasks (i.e. multi-task pre-training might not result in a dramatic task interference). Finally, supervised multi-task pre-training performed significantly worse in every case except for the translation tasks. This could suggest that the translation tasks benefit less from (English) pre-training, whereas unsupervised pre-training is an important factor in the other tasks.

我们在表12中比较了这些方法的结果。为了便于对比,我们还给出了基线方法(先预训练再微调)以及在 K = 2^19 的样本比例混合上进行标准多任务学习(不做微调)的结果。我们发现,多任务预训练后再微调的结果与基线相当。这表明,在多任务学习之后再进行微调,可以帮助缓解3.5.2节中描述的不同混合比例之间的一些权衡。有趣的是,"留一法"训练的表现只是略差一点,这说明在多种任务上训练过的模型仍然可以适应新任务(即多任务预训练可能不会造成严重的任务干扰)。最后,除翻译任务外,纯监督的多任务预训练在所有情况下的表现都明显更差。这可能表明翻译任务从(英语)预训练中获益较少,而无监督预训练在其他任务中则是一个重要因素。

4、Reflection反思

Having completed our systematic study, we wrap up by first recapping some of our most significant findings. Our results provide some high-level perspective on which avenues of research might be more or less promising. To conclude, we outline some topics we think might provide effective approaches for further progressing the field.

在完成了我们的系统研究之后,我们首先回顾了一些最重要的发现。我们的结果提供了一些高层次的观点,哪些研究途径可能或多或少有希望。最后,我们概述了一些我们认为可能为进一步发展该领域提供有效途径的主题。

4.1 Takeaways要点

Text-to-text文本到文本

Our text-to-text framework provides a simple way to train a single model on a wide variety of text tasks using the same loss function and decoding procedure. We showed how this approach can be successfully applied to generative tasks like abstractive summarization, classification tasks like natural language inference, and even regression tasks like STS-B. In spite of its simplicity, we found the text-to-text framework obtained comparable performance to task-specific architectures and ultimately produced state-of-the-art results when combined with scale.

我们的文本到文本框架提供了一种简单的方法,可以使用相同的损失函数和解码过程在各种文本任务上训练单个模型。我们展示了这种方法如何成功地应用于生成任务,如抽象摘要,分类任务,如自然语言推理,甚至回归任务,如STS-B。尽管它很简单,但我们发现文本到文本框架获得了与特定于任务的架构相当的性能,并且在与规模相结合时最终产生了最先进的结果
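
下面用几组(输入, 目标)字符串直观展示这种统一格式(示例沿用论文图1的风格,具体前缀与目标字符串以论文和代码库为准):

# 示意性示例:不同任务被统一为「输入文本 -> 目标文本」,并用任务前缀加以区分
text_to_text_examples = [
    # 翻译(生成任务)
    ("translate English to German: That is good.", "Das ist gut."),
    # 摘要(生成任务)
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm in attala county."),
    # 分类任务(CoLA,目标是类别标签的文字形式)
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # 回归任务(STS-B,相似度分数被离散化后当作字符串生成)
    ("stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field.", "3.8"),
]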

Architectures架构

While some work on transfer learning for NLP has considered architectural variants of the Transformer, we found the original encoder-decoder form worked best in our text-to-text framework. Though an encoder-decoder model uses twice as many parameters as “encoder-only” (e.g. BERT) or “decoder-only” (language model) architectures, it has a similar computational cost. We also showed that sharing the parameters in the encoder and decoder did not result in a substantial performance drop while halving the total parameter count.

虽然NLP迁移学习的一些工作考虑了Transformer的架构变体,但我们发现原始的编码器-解码器形式在我们的文本到文本框架中效果最好。尽管编码器-解码器模型使用的参数是“仅编码器”(例如BERT)或“仅解码器”(语言模型)架构的两倍,但它具有相似的计算成本。我们还表明,在将总参数计数减半的情况下,共享编码器和解码器中的参数并不会导致性能大幅下降

Unsupervised objectives无监督目标

Overall, we found that most “denoising” objectives, which train the model to reconstruct randomly corrupted text, performed similarly in the text-to-text setup. As a result, we suggest using objectives that produce short target sequences so that unsupervised pre-training is more computationally efficient.

总的来说,我们发现大多数"去噪"目标(即训练模型重建被随机破坏的文本)在文本到文本设置中的表现相似。因此,我们建议使用能产生较短目标序列的目标,使无监督预训练在计算上更高效。
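
为说明为什么这类目标的目标序列较短,下面给出一个示意性的小例子(非官方实现;真实训练中被破坏的位置是随机采样的,这里为演示而固定;哨兵符名称沿用T5词表中的 <extra_id_*> 约定):

# 示意性代码:span corruption 去噪目标的输入/目标构造
def span_corrupt(tokens, spans):
    """spans 为若干个左闭右开区间,表示被破坏的连续片段。"""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)              # 输入中整个片段被替换为一个哨兵符
        targets.append(sentinel)             # 目标中只保留哨兵符和被删去的片段
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # 结束哨兵
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
print(span_corrupt(tokens, [(2, 4), (8, 9)]))
# 输入:Thank you <extra_id_0> me to your party <extra_id_1> week .
# 目标:<extra_id_0> for inviting <extra_id_1> last <extra_id_2>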

Data sets 数据集

We introduced the “Colossal Clean Crawled Corpus” (C4), which comprises heuristically-cleaned text from the Common Crawl web dump. When comparing C4 to data sets that use additional filtering, we found that training on in-domain unlabeled data could boost performance in a few downstream tasks. However, constraining to a single domain typically results in a smaller data set. We separately showed that performance can degrade when an unlabeled data set is small enough that it is repeated many times over the course of pre-training. This motivates the use of a large and diverse data set like C4 for generic language understanding tasks.

我们介绍了"Colossal Clean Crawled Corpus"(C4),它包含从Common Crawl网页转储中经启发式清洗得到的文本。在将C4与使用额外过滤的数据集进行比较时,我们发现在领域内的未标记数据上进行训练可以提升部分下游任务的性能。然而,局限于单一领域通常会导致数据集变小。我们还单独表明,当未标记数据集小到在预训练过程中被重复使用多次时,性能可能会下降。这促使我们在通用语言理解任务中使用像C4这样大规模且多样化的数据集。

Training strategies训练策略

We found that the basic approach of updating all of a pre-trained model’s parameters during fine-tuning outperformed methods that are designed to update fewer parameters, although updating all parameters is most expensive. We also experimented with various approaches for training the model on multiple tasks at once, which in our text-to-text setting simply corresponds to mixing examples from different data sets when constructing batches. The primary concern in multi-task learning is setting the proportion of each task to train on. We ultimately did not find a strategy for setting mixing proportions that matched the performance of the basic approach of unsupervised pre-training followed by supervised fine-tuning. However, we found that fine-tuning after pre-training on a mixture of tasks produced comparable performance to unsupervised pre-training.

我们发现,在微调期间更新预训练模型全部参数的基本做法,优于那些旨在只更新较少参数的方法,尽管更新全部参数的代价最高。我们还试验了同时在多个任务上训练模型的各种方法;在我们的文本到文本设置中,这只相当于在构建批次时混合来自不同数据集的样本。多任务学习的主要问题是设定每个任务的训练比例。我们最终没有找到一种设定混合比例的策略,能够达到"先无监督预训练、再监督微调"这一基本方法的性能。不过,我们发现先在多任务混合上预训练、再微调,可以取得与无监督预训练相当的性能。
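
在文本到文本框架下,"多任务训练"在实现上只是按混合比例从各任务的数据集中抽样来组成批次。下面是一个示意性的小片段(非官方实现,任务流与比例均为假设):

# 示意性代码:按混合比例从多个任务的数据流中抽样组成一个批次
import random

def mixed_batch(task_streams, rates, batch_size):
    """task_streams: {任务名: 样本迭代器}; rates: {任务名: 采样比例}。"""
    names = list(task_streams)
    weights = [rates[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = random.choices(names, weights=weights, k=1)[0]
        batch.append(next(task_streams[name]))  # 每个样本都是 (输入文本, 目标文本) 对
    return batch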

Scaling扩展

We compared various strategies for taking advantage of additional compute, including training the model on more data, training a larger model, and using an ensemble of models. We found each approach conferred a significant boost in performance, though training a smaller model on more data was often outperformed by training a larger model for fewer steps. We also showed an ensemble of models can provide substantially better results than a single model, which provides an orthogonal means of leveraging additional computation. Ensembling models that were fine-tuned from the same base pre-trained model performed worse than pre-training and fine-tuning all models completely separately, though fine-tune-only ensembling still substantially outperformed a single model.

我们比较了利用额外计算量的多种策略,包括在更多数据上训练模型、训练更大的模型,以及使用模型集成。我们发现每种方法都能带来显著的性能提升,不过在更多数据上训练较小模型的效果,通常不如用更少的步数训练更大的模型。我们还表明,模型集成可以取得明显优于单个模型的结果,这提供了一种与上述方式正交的利用额外计算量的手段。由同一个基础预训练模型微调得到的模型所组成的集成,其表现不如完全独立地预训练并微调每个模型的集成,但这种"仅微调"的集成仍然大幅优于单个模型。

Pushing the limits挑战极限11B的参数,1万亿个token

We combined our above insights and trained substantially larger models (up to 11 billion parameters) to achieve state-of-the-art results across many of the benchmarks we considered. For unsupervised training, we extracted text from our C4 data set and applied a denoising objective that corrupts contiguous spans of tokens. We pre-trained on a multi-task mixture before fine-tuning on individual tasks. Overall, our models were trained on over 1 trillion tokens. In the interest of facilitating the replication, extension, and application of our results, we release our code, the C4 data set, and pre-trained model weights for each T5 variant.1

我们将上述见解结合起来,训练了规模大得多的模型(最多110亿参数),从而在我们考虑的许多基准上取得了最先进的结果。对于无监督训练,我们从C4数据集中提取文本,并应用一种会破坏连续token跨度的去噪目标。我们先在多任务混合上进行预训练,然后再在各个任务上微调。总体而言,我们的模型在超过1万亿个token上进行了训练。为了便于复制、扩展和应用我们的结果,我们发布了代码、C4数据集以及每个T5变体的预训练模型权重。

4.2 Outlook前景

The inconvenience of large models 大模型带来的不便

An unsurprising but important result from our study is that larger models tend to perform better. The fact that the hardware used for running these models is continually getting cheaper and more powerful suggests that scaling up may continue to be a promising way to achieve better performance (Sutton, 2019). However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning (Konečny` et al., 2015, 2016). Relatedly, one beneficial use of transfer learning is the possibility of attaining good performance on low-resource tasks. Low-resource tasks often occur (by definition) in settings where one lacks the assets to label more data. It follows that low-resource applications often also have limited access to computational resources which can incur additional costs. As a result, we advocate for research on methods that achieve stronger performance with cheaper models so that transfer learning can be applied where it will have the most impact. Some current work along these lines include distillation (Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2019), parameter sharing (Lan et al., 2019), and conditional computation (Shazeer et al., 2017).

从我们的研究中得出的一个不令人惊讶但重要的结果是,更大的模型往往表现得更好。用于运行这些模型的硬件不断变得更便宜、更强大,这一事实表明,扩大规模可能仍然是实现更好性能的一种有希望的方式(Sutton, 2019)。然而,在某些应用程序和场景中,使用更小或更便宜的模型是有帮助的,例如在执行客户端推理或联邦学习时(Konečný et al., 2015, 2016)。与此相关,迁移学习的一个有益用途是在低资源任务上获得良好表现的可能性。根据定义,低资源任务通常发生在缺乏资源来标记更多数据的环境中。因此,低资源应用程序通常对计算资源的访问也有限,这可能会产生额外的成本。因此,我们提倡研究用更便宜的模型实现更强性能的方法,以便迁移学习可以应用于最具影响力的地方。目前沿着这些方向进行的一些工作包括蒸馏(Hinton等人,2015;Sanh等人,2019;Jiao et al., 2019)、参数共享(Lan et al., 2019)和条件计算(Shazeer et al., 2017)。

More efficient knowledge extraction更高效的知识提取

Recall that one of the goals of pre-training is (loosely speaking) to provide the model with general-purpose “knowledge” that improves its performance on downstream tasks. The method we use in this work, which is currently common practice, is to train the model to denoise corrupted spans of text. We suspect that this simplistic technique may not be a very efficient way to teach the model general-purpose knowledge. More concretely, it would be useful to be able to attain good fine-tuning performance without needing to train our models on 1 trillion tokens of text first. Some concurrent work along these lines improves efficiency by pre-training a model to distinguish between real and machine-generated text (Clark et al., 2020).

回想一下,预训练的目标之一是(粗略地说)为模型提供通用的“知识”,以提高其在下游任务上的性能。我们在这项工作中使用的方法是训练模型去噪损坏的文本跨度,这是目前常见的做法。我们怀疑这种简单的技术可能不是教授模型通用知识的非常有效的方法。更具体地说,如果能够获得良好的微调性能,而不需要首先在1万亿个文本标记上训练我们的模型,这将是有用的。沿着这些思路进行的一些并发工作通过预训练模型来区分真实文本和机器生成的文本来提高效率(Clark et al., 2020)。

Formalizing the similarity between tasks 形式化任务之间的相似性

We observed that pre-training on unlabeled in-domain data can improve performance on downstream tasks (Section 3.4). This finding mostly relies on basic observations like the fact that SQuAD was created using data from Wikipedia. It would be useful to formulate a more rigorous notion of the “similarity” between the pre-training and downstream tasks, so that we could make more principled choices about what source of unlabeled data to use. There is some early empirical work along these lines in the field of computer vision (Huh et al., 2016; Kornblith et al., 2018; He et al., 2018). A better notion of the relatedness of tasks could also help choose supervised pre-training tasks, which has been shown to be helpful for the GLUE benchmark (Phang et al., 2018).

我们观察到,在未标记的领域内数据上进行预训练可以提高下游任务的性能(第3.4节)。这一发现主要依赖于一些基本的观察,比如SQuAD是使用维基百科的数据创建的。如果能在预训练任务与下游任务之间形成一个更严格的"相似性"概念,将很有帮助,这样我们就能在选择未标记数据来源时做出更有原则的决定。在计算机视觉领域已有一些类似的早期实证工作(Huh et al., 2016;Kornblith et al., 2018;He et al., 2018)。更好地刻画任务之间的相关性也有助于选择有监督的预训练任务,而这已被证明对GLUE基准有帮助(Phang et al., 2018)。

Language-agnostic models语言无关模型

We were disappointed to find that English-only pre-training did not achieve state-of-the-art results on the translation tasks we studied. We also are interested in avoiding the logistical difficulty of needing to specify which languages a vocabulary can encode ahead of time. To address these issues, we are interested in further investigating language-agnostic models, i.e. models that can perform a given NLP task with good performance regardless of the text’s language. This is an especially pertinent issue given that English is not the native language for the majority of the world’s population.

我们很失望地发现,仅使用英语的预训练并没有在我们研究的翻译任务上达到最先进的效果。我们也希望避免一个实际操作上的麻烦,即需要事先指定词汇表能够编码哪些语言。为了解决这些问题,我们有兴趣进一步研究与语言无关的模型,即无论文本使用何种语言都能以良好性能完成给定NLP任务的模型。考虑到英语并不是世界上大多数人的母语,这是一个尤为切题的问题。

The motivation for this paper was the flurry of recent work on transfer learning for NLP. Before we began this work, these advances had already enabled breakthroughs in settings where learning-based methods had not yet been shown to be effective. We are happy to be able to continue this trend, for example by nearly matching human-level performance on the SuperGLUE benchmark, a task specifically designed to be difficult for modern transfer-learning pipelines. Our results stem from the combination of a straightforward and unified text-to-text framework, our new C4 data set, and insights from our systematic study. Additionally, we provided an empirical overview of the field and a perspective on where it stands. We are excited to see continued work using transfer learning towards the goal of general language understanding.

这篇论文的动机来自近来NLP迁移学习领域的大量工作。在我们开始这项工作之前,这些进展已经在基于学习的方法尚未被证明有效的场景中带来了突破。我们很高兴能够延续这一趋势,例如在SuperGLUE基准上取得接近人类水平的表现,而该基准正是专门设计得让现代迁移学习流程难以应对的任务。我们的结果源于三方面的结合:简单而统一的文本到文本框架、新的C4数据集,以及系统研究得到的见解。此外,我们还对该领域做了实证性的概述,并给出了对其现状的看法。我们期待看到人们继续利用迁移学习朝着通用语言理解的目标前进。

T5的简介

T5: 文本到文本的传输Transformer。截至2022年7月,我们建议使用T5X。T5X是T5(以及更多内容)在JAX和Flax中的新改进实现。在Tensorflow上使用MeshTF的T5不再得到积极开发。如果您对T5不熟悉,我们建议从T5X开始。

t5库主要用于复现论文《Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer》中实验的代码。在论文中,我们演示了如何使用在大规模文本语料上预训练的文本到文本Transformer,在多个自然语言处理任务上取得最先进的结果。

此存储库中的大部分代码用于加载、预处理、混合和评估数据集。它还提供了一种微调发布的预训练模型的方式。

t5库可以通过提供用于在文本到文本任务混合中训练和微调(可能是巨大的)模型的有用模块,用于未来模型开发。

GitHub地址:https://github.com/google-research/text-to-text-transfer-transformer

1、已发布的模型检查点

我们已发布论文中描述的以下预训练模型检查点:T5-Small、T5-Base、T5-Large、T5-3B 和 T5-11B。其他实验性预训练模型检查点的列表可在代码仓库中查看。

T5的安装和使用方法

尝试T5的最简单方法是使用我们Colab教程中的免费TPU。

地址https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/main/notebooks/t5-trivia.ipynb

以下是使用我们的代码库从命令行预训练、微调、评估和解码模型的示例。您可以使用这些说明重现我们的结果,使用您自己的数据和/或超参数微调我们发布的检查点之一,或者从头开始预训练一个模型。

1、数据集准备

您可以使用新的或预先存在的任务,也可以从预处理的TSV文件加载示例。

2、使用任务

根据您的数据源(见上文),您需要适当准备您的数据。

如果使用普通任务,只需确保由您的dataset_fn加载的任何文件对TPU可访问(即位于GCS存储桶中),然后您就可以开始了!

TfdsTask

我们大多数预定义的任务使用TensorFlow数据集(TFDS)作为它们的数据源。当您使用TfdsTask运行我们的训练二进制文件(请参见下文的说明)时,数据集将在首次使用时自动下载和准备。准备完成后,数据集将被缓存在本地存储中,以避免未来运行中的这种开销。如果在云中工作,建议您设置--t5_tfds_data_dir标志,指向持久存储位置,例如GCS存储桶。这是在TPU上训练时的要求。
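
作为参考,注册一个 TfdsTask 的大致写法如下(示意性代码:任务名、TFDS 数据集、预处理函数与指标函数均为占位示例,具体参数名与函数签名以所用版本的 t5 库为准):

# 示意性代码:注册一个基于 TFDS 的任务(参数仅作演示,请以实际库版本为准)
import t5
import tensorflow as tf

def mrpc_preprocessor(ds):
    """假设的文本预处理函数:把 TFDS 样本映射为 {"inputs": ..., "targets": ...}。"""
    def to_text(ex):
        return {
            "inputs": tf.strings.join(
                ["mrpc sentence1: ", ex["sentence1"], " sentence2: ", ex["sentence2"]]),
            "targets": tf.strings.as_string(ex["label"]),
        }
    return ds.map(to_text)

t5.data.TaskRegistry.add(
    "my_mrpc_task",                          # 假设的任务名
    t5.data.TfdsTask,
    tfds_name="glue/mrpc:1.0.0",             # 假设的 TFDS 数据集名及版本号
    text_preprocessor=[mrpc_preprocessor],
    metric_fns=[t5.evaluation.metrics.accuracy],
)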

C4

我们为无监督预训练创建的C4数据集可在TensorFlow数据集中获得,但下载原始Common Crawl抓取(约7TB)以及准备数据(约335个CPU天)需要大量带宽和计算资源。我们建议您充分利用TFDS中的Apache Beam支持,它可以实现对数据集的分布式预处理,并可在Google Cloud Dataflow上运行。使用500个工作者,作业应在约16小时内完成。

在适当定义了MY_PROJECT、MY_BUCKET和MY_REGION后,您可以使用以下命令在GCP的Dataflow上构建数据集:

pip install tfds-nightly[c4]

echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt

python -m tensorflow_datasets.scripts.download_and_prepare \

  --datasets=c4/en \

  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \

  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION"

阅读更多关于TFDS Beam指南的内容。

TextLineTask

当您的数据源是一个文本文件(或文件),每行一个示例时,TextLineTask很有用。然后,您可以使用文本预处理器将每行转换为输入和目标的字典。

确保您的文件对TPU可访问(即在GCS存储桶中),然后您就可以开始了!

直接使用TSV文件

与定义新任务不同,您可以直接使用TSV文件(或文件)作为数据集,其中每一行的格式为<input>\t<target>。

但是,有一些注意事项:

没有办法定义文本预处理器,因此TSV文件需要以已经预处理好的格式包含您的数据。

目前也没有办法在直接使用TSV文件时为评估设置token预处理器、后处理函数或指标函数。

如果您需要其中的任何功能,则必须定义新任务、TfdsTask或TextLineTask。

与上述情况类似,您的TSV文件(们)必须对TPU可访问(即在GCS存储桶中)。
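
下面是一个写出这种 TSV 文件的最小示例(示意性代码,文件路径与内容均为假设):

# 示意性代码:按 <input>\t<target> 的格式生成训练用 TSV 文件
examples = [
    ("translate English to French: How are you?", "Comment allez-vous ?"),
    ("translate English to French: Thank you.", "Merci."),
]
with open("train.tsv", "w", encoding="utf-8") as f:
    for inp, tgt in examples:
        f.write(f"{inp}\t{tgt}\n")  # 每行一个样本,输入与目标之间以制表符分隔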

3、安装

要安装T5包,只需运行:

pip install t5[gcp]

在GCP上设置TPU

您首先需要在Google Cloud上启动一个虚拟机(VM)。有关启动VM的详细信息,请参阅Google Cloud文档。

为了在Cloud TPU上运行训练或评估,您必须根据您的项目、区域和GCS存储桶适当设置以下变量。有关更多详细信息,请参阅Cloud TPU入门指南。

export PROJECT=your_project_name

export ZONE=your_project_zone

export BUCKET=gs://yourbucket/

export TPU_NAME=t5-tpu

export TPU_SIZE=v3-8

export DATA_DIR="${BUCKET}/your_data_dir"

export MODEL_DIR="${BUCKET}/your_model_dir"

请使用以下命令在Cloud VM中创建TPU设备。

ctpu up --name=$TPU_NAME --project=$PROJECT --zone=$ZONE --tpu-size=$TPU_SIZE \

        --tpu-only --noconf

4、训练

训练一个模型

在下面的命令中,我们从头开始在GLUE Benchmark MRPC任务上训练一个模型。您可以更改MIXTURE_NAME gin参数以使用我们包中提供的任何任务或混合。

t5_mesh_transformer  \

  --tpu="${TPU_NAME}" \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${MODEL_DIR}" \

  --t5_tfds_data_dir="${DATA_DIR}" \

  --gin_file="dataset.gin" \

  --gin_file="models/bi_v1.gin" \

  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \

  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \

  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'"

获取任务和混合的完整列表

可以通过运行以下命令获取任务和混合的完整列表:

python -c "import t5; print(t5.data.MixtureRegistry.names())"

您还可以在新文件中定义其他任务和混合,并使用--module_import标志导入它。

或者,您可以使用TSV文件进行训练,其中每一行的格式为<input>\t<target>(见上文)。

5、微调

为了微调我们预训练的模型之一,您需要将预训练模型的操作配置传递给训练脚本。操作配置应作为gin_file标志传递。它指定了模型架构和其他超参数。此外,您需要指定要微调的混合。例如,要在glue_mrpc_v002混合上微调T5-small模型,请运行:

t5_mesh_transformer  \

  --tpu="${TPU_NAME}" \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${MODEL_DIR}" \

  --t5_tfds_data_dir="${DATA_DIR}" \

  --gin_file="dataset.gin" \

  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \

  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \

  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \

  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"

操作配置中包含了正确的预训练检查点路径。

您还可以在新文件中定义其他任务和混合,并使用--module_import标志导入它。

或者,您可以使用TSV文件进行微调,其中每一行的格式为<input>\t<target>(见上文)。例如,您可以尝试WMT '19 News Commentary 14训练集的其中一个配对翻译数据集(例如,英语-法语)。使用TSV文件时,您将MIXTURE_NAME标志替换为:

--gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn"

--gin_param="tsv_dataset_fn.filename = 'gs:/path/to/tsv'"

要使用我们在论文中所用的相同超参数进行微调(恒定学习率0.001),您可以传入T5包中包含的这个gin文件:

--gin_file="learning_rate_schedules/constant_0_001.gin"

预训练模型的操作配置在默认设置下实际上不限制训练步数。如果您想训练特定的步数,则需要把它传入。由于预训练模型已经训练了1000000步,您应当指定的是预训练加微调之后的总步数。例如,如果您想再微调10000步,应该传入

--gin_param="run.train_steps = 1010000"

您还可以为微调使用不同的批量大小。我们根据批次中的总令牌数设置批次大小。默认情况下,批次使用512的序列长度。要设置批次中的令牌数,您应该设置

--gin_param = "tokens_per_batch=1048576"

6、评估

为了在T5框架中评估模型,您需要使用eval.gin文件,指定模型目录、解码方法以及要评估的检查点步数。因此,要在GLUE MRPC任务上使用所有检查点的beam search进行评估,请使用以下命令:

t5_mesh_transformer \

  --tpu="${TPU_NAME}" \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${MODEL_DIR}" \

  --gin_file="${MODEL_DIR}/operative_config.gin" \

  --t5_tfds_data_dir=${DATA_DIR} \

  --gin_file="eval.gin" \

  --gin_file="beam_search.gin" \

  --gin_param="run.dataset_split = 'validation'" \

  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \

  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \

  --gin_param="eval_checkpoint_step = 'all'"

要评估特定检查点,只需将eval_checkpoint_step参数设置为适当的检查点。

--gin_param="eval_checkpoint_step = 100000"

您还可以在上述命令中使用greedy_decode.gin或sample_decode.gin,而不是beam_search.gin。

7、解码

为了从T5框架中的模型生成预测,您需要指定模型目录、解码方法以及要用于解码的检查点步数。假设您在/path/to/inputs.txt中存储了输入序列的文本文件,那么示例命令如下:

t5_mesh_transformer \

  --tpu="${TPU_NAME}" \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${MODEL_DIR}" \

  --gin_file="${MODEL_DIR}/operative_config.gin" \

  --gin_file="infer.gin" \

  --gin_file="sample_decode.gin" \

  --gin_param="input_filename = '/path/to/inputs.txt'"\

  --gin_param="output_filename = '/tmp/outputs.txt'"\

  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'"\

  --gin_param="infer_checkpoint_step = 'all'"

要使用特定检查点进行预测,只需将infer_checkpoint_step参数设置为适当的检查点。

--gin_param="infer_checkpoint_step = 100000"

您还可以在上述命令中使用beam_search.gin或greedy_decode.gin,而不是sample_decode.gin。

8、导出

您可能还想导出一个SavedModel,这对于部署训练好的模型(例如使用ML Engine或Docker镜像提供服务)非常有用。

t5_mesh_transformer \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${MODEL_DIR}" \

  --use_model_api \

  --mode="export_predict" \

  --export_dir="/path/to/export/dir"

上述命令导出模型目录中的最新检查点。要导出特定检查点,请添加以下标志:

--checkpoint_mode="specific" \

--checkpoint_steps=1000000

t5-deploy笔记本演示了如何导出SavedModel并将其打包到Docker镜像中以提供服务。

9、GPU使用

如果您想要使用GPU而不是TPU,可以通过删除TPU特定标志(--tpu,--tpu_zone,--gcp_project)并根据所需设置设置mesh_shape和mesh_devices的gin参数来修改上述命令。

例如,如果您的计算机可以访问6个GPU,并且您想要进行3路模型并行和2路数据并行,上面的微调命令将变为:

t5_mesh_transformer \

  --model_dir="${MODEL_DIR}" \

  --t5_tfds_data_dir="${DATA_DIR}" \

  --gin_file="dataset.gin" \

  --gin_param="utils.run.mesh_shape = 'model:3,batch:2'" \

  --gin_param="utils.run.mesh_devices = ['gpu:0','gpu:1','gpu:2','gpu:3','gpu:4','gpu:5']" \

  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \

  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"

对于单个GPU,命令如下:

t5_mesh_transformer \

  --model_dir="${MODEL_DIR}" \

  --t5_tfds_data_dir="${DATA_DIR}" \

  --gin_file="dataset.gin" \

  --gin_param="utils.run.mesh_shape = 'model:1,batch:1'" \

  --gin_param="utils.run.mesh_devices = ['gpu:0']" \

  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \

  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"

10、重现我们的实验

我们在gs://t5-data/experiments中为论文中的所有实验提供了operative configs。实验文件夹有不同的子目录,对应于我们论文中的不同部分。例如,gs://t5-data/experiments/objectives包含第3.3节("无监督目标")的实验。objectives文件夹的每个子目录都包含某个特定实验的operative configs(粗略地说,一个“实验”是我们论文中某个表格中的一行)。

假设您想要重现“前缀语言建模”目标的结果(表4中的第一行)。该实验的operative configs位于gs://t5-data/experiments/objectives/obj-prefix_lm。在基本目录中,有一个用于预训练模型的operative config(gs://t5-data/experiments/objectives/obj-prefix_lm/operative_config.gin)。然后,对于我们考虑的每个下游微调混合物,都有各自的operative config的子目录,每个子目录都有自己的operative config(例如gs://t5-data/experiments/objectives/obj-prefix_lm/cnn_dailymail_v002/operative_config.gin)。要运行此实验,请首先使用预训练的operative config对模型进行预训练:

export PRETRAIN_MODEL_DIR="${BUCKET}/obj-prefix_lm"

t5_mesh_transformer \

  --tpu="${TPU_NAME}" \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${PRETRAIN_MODEL_DIR}" \

  --gin_file="gs://t5-data/experiments/objectives/obj-prefix_lm/operative_config.gin" \

  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \

  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'"

然后,您可以对CNN/Daily Mail进行微调预训练模型,如下所示:

export FINETUNE_MODEL_DIR="${BUCKET}/obj-prefix_lm/cnn_dailymail_v002"

t5_mesh_transformer \

  --tpu="${TPU_NAME}" \

  --gcp_project="${PROJECT}" \

  --tpu_zone="${ZONE}" \

  --model_dir="${FINETUNE_MODEL_DIR}" \

  --gin_file="gs://t5-data/experiments/objectives/obj-prefix_lm/cnn_dailymail_v002/operative_config.gin" \

  --gin_param="init_checkpoint = '${PRETRAIN_MODEL_DIR}/model.ckpt-524288'" \

  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \

  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'"

11、有用选项

某些训练变体需要同时设置多个标志。对于以下每个变体,将标志组添加到./third_party/py/t5/google/scripts/run_finetune.sh。

确定性训练

--train_gin_param="mesh_train_dataset_fn.seed=${SEED}"

--train_gin_param="utils.run.skip_seen_data = True"

语言模型

--objective="lm"

--train_gin_param="utils.run.model_type = "lm""

T5的案例应用

更新中……
