2.1 Language Representation Learning

Non-contextual Embeddings (static word embeddings)

Contextual Embeddings (dynamic word embeddings)

2.2 Neural Contextual Encoders

2.2.1 Sequence Models: CNN and RNN

2.2.2 Non-Sequence Models: Recursive NN, TreeLSTM, GCN, and Fully-Connected Self-Attention

2.2.3 Analysis

2.3 Why Pre-training? Three Main Advantages

2.4 A Brief History of PTMs for NLP

2.4.1 First-Generation PTMs: Pre-trained Word Embeddings

2.4.2 Second-Generation PTMs: Pre-trained Contextual Encoders

3 Overview of PTMs

3.1 Pre-training Tasks

3.1.1 Language Modeling (LM)

3.1.2 Masked Language Modeling (MLM)

3.1.3 Permuted Language Modeling (PLM)

3.1.4 Denoising Autoencoder (DAE)

3.1.5 Contrastive Learning (CTL)

3.1.6 Others

3.2 Taxonomy of PTMs

3.3 Model Analysis

3.3.1 Non-Contextual Embeddings

Figure 3: Taxonomy of PTMs with Representative Examples

Table 2: List of Representative PTMs and Their Architectures

3.3.2 Contextual Embeddings

4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

4.2 Multilingual and Language-Specific PTMs

4.2.1 Multilingual PTMs

4.2.2 Language-Specific PTMs

4.3 Multi-Modal PTMs

4.3.1 Video-Text PTMs

4.3.2 Image-Text PTMs

4.3.3 Audio-Text PTMs

4.4 Domain-Specific and Task-Specific PTMs

4.5 Model Compression

4.5.1 Model Pruning: removing less important parameters

4.5.2 Quantization: representing parameters with fewer bits

4.5.3 Parameter Sharing: sharing parameters across similar units

4.5.4 Knowledge Distillation: training a smaller student model

Supplementary note on key terms: hard targets vs. soft targets

4.5.5 Module Replacing: substituting more compact modules

4.5.6 Early Exit

5 Adapting PTMs to Downstream Tasks

5.1 Transfer Learning

5.2 How to Transfer?

5.2.1 Choosing appropriate pre-training task, model architecture and corpus

5.2.2 Choosing appropriate layers

5.2.3 To tune or not to tune?

5.3 Fine-Tuning Strategies

5.3.1 Prompt-based Tuning

6 Resources of PTMs

7 Applications

7.1 General Evaluation Benchmark

7.2 Question Answering / MRC

7.3 Sentiment Analysis

7.4 Named Entity Recognition

7.5 Machine Translation

7.6 Summarization

7.7 Adversarial Attacks and Defenses

8 Future Directions

(5) Interpretability and reliability of PTMs: the Transformer architecture is hard to interpret and vulnerable to adversarial attacks (countered with adversarial defenses)

9 Conclusion



Paper: "Pre-trained Models for Natural Language Processing: A Survey" (Translation and Commentary)


Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang






Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.


Deep Learning, Neural Network, Natural Language Processing, Pre-trained Model, Distributed Representation, Word Embedding, Self-Supervised Learning, Language Modelling



With the development of deep learning, various neural networks have been widely used to solve natural language processing (NLP) tasks, such as convolutional neural networks (CNNs) [1–3], recurrent neural networks (RNNs) [4, 5], graph-based neural networks (GNNs) [6–8], and attention mechanisms [9, 10]. One advantage of these neural models is their ability to alleviate the feature engineering problem.

Non-neural NLP methods usually rely heavily on discrete handcrafted features, while neural methods usually use low-dimensional, dense vectors (a.k.a. distributed representations) to implicitly represent the syntactic or semantic features of the language. These representations are learned in specific NLP tasks. Therefore, neural methods make it easy for people to develop various NLP systems.

Despite the success of neural models for NLP tasks, the performance improvement may be less significant than in the computer vision (CV) field. The main reason is that current datasets for most supervised NLP tasks are rather small (except machine translation). Deep neural networks usually have a large number of parameters, which makes them overfit on such small training data and generalize poorly in practice. Therefore, the early neural models for many NLP tasks were relatively shallow and usually consisted of only 1–3 neural layers.



Recently, substantial work has shown that pre-trained models (PTMs) trained on large corpora can learn universal language representations, which are beneficial for downstream NLP tasks and avoid training a new model from scratch. With the development of computational power, the emergence of deep models (e.g., the Transformer [10]), and the constant enhancement of training techniques, the architecture of PTMs has advanced from shallow to deep.

The first-generation PTMs aim to learn good word embeddings. Since these models themselves are no longer needed by downstream tasks, they are usually very shallow for computational efficiency, such as Skip-Gram [11] and GloVe [12]. Although these pre-trained embeddings can capture the semantic meanings of words, they are context-free and fail to capture higher-level concepts in context, such as polysemy disambiguation, syntactic structures, semantic roles, and anaphora.

The second-generation PTMs focus on learning contextual word embeddings, such as CoVe [13], ELMo [14], OpenAI GPT [15], and BERT [16]. Downstream tasks still need these learned encoders to represent words in context. Besides, various pre-training tasks have been proposed to learn PTMs for different purposes.



The contributions of this survey can be summarized as follows:

(1) Comprehensive review. We provide a comprehensive review of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaptation approaches, and applications.

(2) New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) representation type; 2) model architecture; 3) type of pre-training task; 4) extensions for specific types of scenarios.

(3) Abundant resources. We collect abundant resources on PTMs, including open-source implementations of PTMs, visualization tools, corpora, and paper lists.

(4) Future directions. We discuss and analyze the limitations of existing PTMs. Also, we suggest possible future research directions.






The rest of the survey is organized as follows. Section 2 outlines the background concepts and commonly used notations of PTMs.

Section 3 gives a brief overview of PTMs and clarifies the categorization of PTMs.

Section 4 provides extensions of PTMs.

Section 5 discusses how to transfer the knowledge of PTMs to downstream tasks.

Section 6 gives the related resources on PTMs.

Section 7 presents a collection of applications across various NLP tasks.

Section 8 discusses the current challenges and suggests future directions.

Section 9 summarizes the paper.










2.1 Language Representation Learning

As suggested by Bengio et al. [17], a good representation should express general-purpose priors that are not task-specific but would likely be useful for a learning machine to solve AI tasks. When it comes to language, a good representation should capture the implicit linguistic rules and common-sense knowledge hidden in text data, such as lexical meanings, syntactic structures, semantic roles, and even pragmatics.

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. No single dimension of the vector has a corresponding sense on its own, while the vector as a whole represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word changes dynamically according to the context it appears in.



Non-contextual Embeddings (static word embeddings)

The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) $x$ in a vocabulary $\mathcal{V}$, we map it to a vector $\mathbf{e}_x \in \mathbb{R}^{D_e}$ with a lookup table $\mathbf{E} \in \mathbb{R}^{D_e \times |\mathcal{V}|}$, where $D_e$ is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.

There are two main limitations to this kind of embedding. The first issue is that the embeddings are static: the embedding for a word is always the same regardless of its context. Therefore, these non-contextual embeddings fail to model polysemous words. The second issue is the out-of-vocabulary problem. To tackle this problem, character-level word representations or sub-word representations are widely used in many NLP tasks, such as CharCNN [18], FastText [19], and Byte-Pair Encoding (BPE) [20].
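
The lookup table and its two limitations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vocabulary, the `<unk>` fallback token, the random initialization, and the helper names `build_lookup_table` and `embed` are hypothetical and not taken from the survey.

```python
import random

def build_lookup_table(vocab, dim, seed=0):
    """Dict-based stand-in for the lookup table E: one D_e-dimensional vector per token."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tok in vocab}

def embed(tokens, table, unk="<unk>"):
    """Map each token to its static vector; unseen tokens fall back to the <unk> vector."""
    return [table.get(tok, table[unk]) for tok in tokens]

vocab = ["<unk>", "the", "bank", "river", "money"]
E = build_lookup_table(vocab, dim=4)

river_sentence = embed(["the", "river", "bank"], E)
money_sentence = embed(["the", "money", "bank"], E)
oov = embed(["zebra"], E)
```

In this sketch "bank" receives the identical vector in both sentences (the static-embedding limitation), and the unseen word "zebra" collapses onto `<unk>` (the out-of-vocabulary limitation), which is the gap that character-level and sub-word representations are meant to close.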

Contextual Embeddings (dynamic word embeddings)

To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts. Given a text $x_1, x_2, \cdots, x_T$ where each token $x_t \in \mathcal{V}$ is a word or sub-word, the contextual representation of $x_t$ depends on the whole text:

$$[\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_T] = f_{\mathrm{enc}}(x_1, x_2, \cdots, x_T),$$

where $f_{\mathrm{enc}}(\cdot)$ is the neural encoder described in Section 2.2, and $\mathbf{h}_t$ is called the contextual embedding or dynamical embedding of token $x_t$ because of the contextual information it includes.


Figure 1: Generic Neural Architecture for NLP


 Figure 2: Neural Contextual Encoders

2.2 Neural Contextual Encoders

Most of the neural contextual encoders can be classified into two categories: sequence models and non-sequence models. Figure 2 illustrates three representative architectures.


2.2.1 Sequence Models: CNN and RNN

Sequence models usually capture the local context of a word in sequential order.

Convolutional Models

Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors with convolution operations [2].

Recurrent Models

Recurrent models capture the contextual representations of words with short memory, such as LSTMs [21] and GRUs [22]. In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but their performance is often affected by the long-term dependency problem.




2.2.2 Non-Sequence Models: Recursive NN, TreeLSTM, GCN, and Fully-Connected Self-Attention

Non-sequence models learn the contextual representation with a pre-defined tree or graph structure between words, such as the syntactic structure or semantic relation. Some popular non-sequence models include Recursive NN [6], TreeLSTM [7, 23], and GCN [24].

Although a linguistically aware graph structure can provide useful inductive bias, building a good graph structure is itself a challenging problem. Besides, the structure depends heavily on expert knowledge or external NLP tools, such as a dependency parser.



Fully-Connected Self-Attention Model

In practice, a more straightforward way is to use a fully-connected graph to model the relation between every pair of words and let the model learn the structure by itself. Usually, the connection weights are dynamically computed by the self-attention mechanism, which implicitly indicates the connection between words. A successful instance of the fully-connected self-attention model is the Transformer [10, 25], which also needs other supplementary modules, such as positional embeddings, layer normalization, residual connections, and position-wise feed-forward network (FFN) layers.
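
To make the "dynamically computed connection weights" concrete, here is a minimal single-head sketch of scaled dot-product self-attention. It is a simplification under stated assumptions, not the Transformer itself: the query, key, and value projections are taken to be the identity (a real model learns them), there is one head, and none of the supplementary modules are included. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Score the current word against every word: a fully-connected graph.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all input vectors.
        outputs.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word vectors
H, A = self_attention(X)
```

Each row of `A` holds the connection weights from one word to all words, so the graph structure is recomputed for every input instead of being fixed in advance by a parser.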


2.2.3 Analysis

Sequence models learn the contextual representation of a word with a locality bias and struggle to capture long-range interactions between words. Nevertheless, sequence models are usually easy to train and get good results on various NLP tasks.

In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency between every two words in a sequence, which makes it more powerful and better suited to modeling the long-range dependencies of language. However, due to its heavy structure and weaker model bias, the Transformer usually requires a large training corpus and is prone to overfitting on small or modestly sized datasets [15, 26].

Currently, the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.



2.3 Why Pre-training? Three Main Advantages

With the development of deep learning, the number of model parameters has increased rapidly. A much larger dataset is needed to fully train the model parameters and prevent overfitting. However, building large-scale labeled datasets is a great challenge for most NLP tasks due to the extremely expensive annotation costs, especially for syntax-related and semantics-related tasks.

In contrast, large-scale unlabeled corpora are relatively easy to construct. To leverage the huge amount of unlabeled text, we can first learn a good representation from it and then use this representation for other tasks. Recent studies have demonstrated significant performance gains on many NLP tasks with the help of representations extracted from PTMs trained on large unannotated corpora.



The advantages of pre-training can be summarized as follows:

1. Pre-training on a huge text corpus can learn universal language representations and help with the downstream tasks.

2. Pre-training provides a better model initialization, which usually leads to better generalization performance and speeds up convergence on the target task.

3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data [27].





2.4 A Brief History of PTMs for NLP

Pre-training has always been an effective strategy to learn the parameters of deep neural networks, which are then fine-tuned on downstream tasks. As early as 2006, the breakthrough of deep learning came with greedy layer-wise unsupervised pre-training followed by supervised fine-tuning [28].

In CV, it has been common practice to pre-train models on the huge ImageNet corpus and then fine-tune them further on smaller data for different tasks. This is much better than random initialization because the model learns general image features, which can then be used in various vision tasks.

In NLP, PTMs trained on large corpora have also proved beneficial for downstream NLP tasks, from shallow word embeddings to deep neural models.




2.4.1 First-Generation PTMs: Pre-trained Word Embeddings

Representing words as dense vectors has a long history [29].

The "modern" word embedding was introduced in the pioneering work on the neural network language model (NNLM) [30]. Collobert et al. [31] showed that word embeddings pre-trained on unlabelled data could significantly improve many NLP tasks. To address the computational complexity, they learned word embeddings with a pairwise ranking task instead of language modeling. Their work is the first attempt to obtain generic word embeddings useful for other tasks from unlabeled data.

Mikolov et al. [11] showed that there is no need for deep neural networks to build good word embeddings. They proposed two shallow architectures: the Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models. Despite their simplicity, they can still learn high-quality word embeddings that capture the latent syntactic and semantic similarities among words. Word2vec is one of the most popular implementations of these models and makes pre-trained word embeddings accessible for different tasks in NLP. Besides, GloVe [12] is also a widely used model for obtaining pre-trained word embeddings, which are computed from global word-word co-occurrence statistics over a large corpus.

Although pre-trained word embeddings have been shown to be effective in NLP tasks, they are context-independent and mostly trained by shallow models. When used on a downstream task, the rest of the model still needs to be learned from scratch.

During the same time period, many researchers also tried to learn embeddings of paragraphs, sentences, or documents, such as paragraph vector [32], Skip-thought vectors [33], and Context2Vec [34]. Unlike their modern successors, these sentence embedding models try to encode input sentences into a fixed-dimensional vector representation, rather than a contextual representation for each token.


2.4.2 Second-Generation PTMs: Pre-trained Contextual Encoders

Since most NLP tasks are beyond word-level, it is natural to pre-train the neural encoders on sentence-level or higher. The output vectors of neural encoders are also called contextual word embeddings since they represent the word semantics depending on its context.


Dai and Le [35] proposed the first successful instance of PTM for NLP. They initialized LSTMs with a language model (LM) or a sequence autoencoder, and found that pre-training can improve the training and generalization of LSTMs in many text classification tasks.

Liu et al. [5] pre-trained a shared LSTM encoder with LM and fine-tuned it under the multi-task learning (MTL) framework. They found that pre-training and fine-tuning can further improve the performance of MTL for several text classification tasks.

Ramachandran et al. [36] found that Seq2Seq models can be significantly improved by unsupervised pre-training. The weights of both encoder and decoder are initialized with the pre-trained weights of two language models and then fine-tuned with labeled data.

Besides pre-training the contextual encoder with LM, McCann et al. [13] pre-trained a deep LSTM encoder from an attentional sequence-to-sequence model with machine translation (MT). The context vectors (CoVe) output by the pre-trained encoder can improve the performance of a wide variety of common NLP tasks.


Since these precursor PTMs, the modern PTMs are usually trained with larger scale corpora, more powerful or deeper architectures (e.g., Transformer), and new pre-training tasks.

Peters et al. [14] pre-trained a 2-layer LSTM encoder with a bidirectional language model (BiLM), consisting of a forward LM and a backward LM. The contextual representations output by the pre-trained BiLM, ELMo (Embeddings from Language Models), are shown to bring large improvements on a broad range of NLP tasks.

Akbik et al. [37] captured word meaning with contextual string embeddings pre-trained with a character-level LM. However, these two PTMs are usually used as feature extractors to produce the contextual word embeddings, which are fed into the main model for downstream tasks. Their parameters are fixed, and the remaining parameters of the main model are still trained from scratch.

ULMFiT (Universal Language Model Fine-tuning) [38] attempted to fine-tune a pre-trained LM for text classification (TC) and achieved state-of-the-art results on six widely used TC datasets. ULMFiT consists of 3 phases:

1) pre-training the LM on general-domain data;

2) fine-tuning the LM on target data;

3) fine-tuning on the target task.

ULMFiT also investigates some effective fine-tuning strategies, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing.







More recently, very deep PTMs have shown their powerful ability in learning universal language representations, e.g., OpenAI GPT (Generative Pre-training) [15] and BERT (Bidirectional Encoder Representation from Transformer) [16].

Besides LM, an increasing number of self-supervised tasks (see Section 3.1) have been proposed to make PTMs capture more knowledge from large-scale text corpora.

Since ULMFiT and BERT, fine-tuning has become the mainstream approach to adapt PTMs for downstream tasks.


3 Overview of PTMs

The major differences between PTMs lie in their contextual encoders, pre-training tasks, and purposes. We have briefly introduced the architectures of contextual encoders in Section 2.2. In this section, we focus on the description of pre-training tasks and give a taxonomy of PTMs.


3.1 Pre-training Tasks

The pre-training tasks are crucial for learning the universal representation of language. Usually, these pre-training tasks should be challenging and have substantial training data. In this section, we summarize the pre-training tasks into three categories: supervised learning, unsupervised learning, and self-supervised learning.

1. Supervised learning (SL) is to learn a function that maps an input to an output based on training data consisting of input-output pairs.

2. Unsupervised learning (UL) is to find some intrinsic knowledge from unlabeled data, such as clusters, densities, and latent representations.

3. Self-supervised learning (SSL) is a blend of supervised learning and unsupervised learning. The learning paradigm of SSL is entirely the same as supervised learning, but the labels of the training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the remaining words.


In CV, many PTMs are trained on large supervised training sets like ImageNet. However, in NLP, the datasets of most supervised tasks are not large enough to train a good PTM. The only exception is machine translation (MT). A large-scale MT dataset, WMT 2017, consists of more than 7 million sentence pairs. Besides, MT is one of the most challenging tasks in NLP, and an encoder pre-trained on MT can benefit a variety of downstream NLP tasks. As a successful PTM, CoVe [13] is an encoder pre-trained on the MT task that improves a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD).

In this section, we introduce some widely used pre-training tasks in existing PTMs. We can regard these tasks as self-supervised learning. Table 1 also summarizes their loss functions.


3.1.1 Language Modeling (LM)

The most common unsupervised task in NLP is probabilistic language modeling (LM), which is a classic probabilistic density estimation problem. Although LM is a general concept, in practice it often refers specifically to auto-regressive LM or unidirectional LM.

Given a text sequence $x_{1:T} = [x_1, x_2, \cdots, x_T]$, its joint probability $p(x_{1:T})$ can be decomposed as

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{0:t-1}),$$

where $x_0$ is a special token indicating the beginning of the sequence.

The conditional probability $p(x_t \mid x_{0:t-1})$ can be modeled by a probability distribution over the vocabulary given the linguistic context $x_{0:t-1}$. The context $x_{0:t-1}$ is modeled by a neural encoder $f_{\mathrm{enc}}(\cdot)$, and the conditional probability is

$$p(x_t \mid x_{0:t-1}) = g_{\mathrm{LM}}\big(f_{\mathrm{enc}}(x_{0:t-1})\big),$$

where $g_{\mathrm{LM}}(\cdot)$ is the prediction layer.

Given a huge corpus, we can train the entire network with maximum likelihood estimation (MLE).
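
The chain-rule factorization and MLE training can be illustrated with a deliberately tiny stand-in. The sketch below assumes a count-based bigram model in place of the neural encoder, so the context is truncated to the single previous token; the `<s>` start symbol plays the role of $x_0$. The corpus and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_bigram_lm(corpus, bos="<s>"):
    """MLE for a bigram model: p(x_t | x_{t-1}) = count(x_{t-1}, x_t) / count(x_{t-1})."""
    pair_counts = defaultdict(Counter)
    for sent in corpus:
        prev = bos
        for tok in sent:
            pair_counts[prev][tok] += 1
            prev = tok
    return {prev: {tok: c / sum(nxt.values()) for tok, c in nxt.items()}
            for prev, nxt in pair_counts.items()}

def log_prob(sent, lm, bos="<s>"):
    """log p(x_{1:T}) as a sum of conditional log-probabilities (chain rule).

    Raises KeyError for bigrams never seen in training; no smoothing is applied.
    """
    total, prev = 0.0, bos
    for tok in sent:
        total += math.log(lm[prev][tok])
        prev = tok
    return total

corpus = [["a", "b"], ["a", "c"]]
lm = train_bigram_lm(corpus)
joint = math.exp(log_prob(["a", "b"], lm))  # p(a | <s>) * p(b | a)
```

A neural LM replaces the count table with an encoder plus a prediction layer and maximizes the same log-likelihood by gradient descent instead of counting.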

A drawback of unidirectional LM is that the representation of each token encodes only the leftward context tokens and itself. However, better contextual representations of text should encode contextual information from both directions. An improved solution is bidirectional LM (BiLM), which consists of two unidirectional LMs: a forward left-to-right LM and a backward right-to-left LM. For BiLM, Baevski et al. [39] proposed a two-tower model in which the forward tower operates the left-to-right LM and the backward tower operates the right-to-left LM.


 Table 1: Loss Functions of Pre-training Tasks

3.1.2 Masked Language Modeling (MLM)

Masked language modeling (MLM) was first proposed by Taylor [40], who referred to it as a Cloze task. Devlin et al. [16] adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. Loosely speaking, MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the rest of the tokens. However, this pre-training method creates a mismatch between the pre-training phase and the fine-tuning phase because the mask token does not appear during fine-tuning. Empirically, to deal with this issue, Devlin et al. [16] replaced a selected token with the special [MASK] token 80% of the time, with a random token 10% of the time, and with the original token 10% of the time.
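
The 80%/10%/10% replacement rule can be sketched as a small corruption function. The 15% selection rate used as the default for `mask_prob` is an assumption (it is the rate commonly associated with BERT, but it is not stated in the text above), and the function name, toy vocabulary, and fixed seed are hypothetical choices for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)      # position not selected: left untouched
            labels.append(None)
            continue
        labels.append(tok)             # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: special mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(tok)                # 10%: original token kept
    return corrupted, labels

tokens = "the cat sat on the mat".split() * 50
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "hat"])
```

Because some selected positions keep their original or a random token, the encoder cannot rely on seeing [MASK] to know where a prediction is needed, which softens the mismatch with fine-tuning.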

Sequence-to-Sequence MLM (Seq2Seq MLM)

MLM is usually solved as a classification problem. We feed the masked sequences to a neural encoder whose output vectors are further fed into a softmax classifier to predict the masked token.

Alternatively, we can use an encoder-decoder (a.k.a. sequence-to-sequence) architecture for MLM, in which the encoder is fed a masked sequence and the decoder sequentially produces the masked tokens in an auto-regressive fashion. We refer to this kind of MLM as sequence-to-sequence MLM (Seq2Seq MLM), which is used in MASS [41] and T5 [42]. Seq2Seq MLM can benefit Seq2Seq-style downstream tasks, such as question answering, summarization, and machine translation.

Enhanced Masked Language Modeling (E-MLM)

Concurrently, multiple studies have proposed different enhanced versions of MLM to further improve on BERT. Instead of static masking, RoBERTa [43] improves BERT by dynamic masking.

UniLM [44, 45] extends the task of mask prediction to three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM [46] performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT [47] replaces MLM with Random Contiguous Words Masking and the Span Boundary Objective (SBO) to integrate structure information into pre-training, which requires the system to predict masked spans based on span boundaries. Besides, StructBERT [48] introduces the Span Order Recovery task to further incorporate language structures.

Another way to enrich MLM is to incorporate external knowledge (see Section 4.1).



3.1.3 Permuted Language Modeling (PLM)

Despite the wide use of the MLM task in pre-training, Yang et al. [49] claimed that some special tokens used in the pre-training of MLM, like [MASK], are absent when the model is applied to downstream tasks, leading to a gap between pre-training and fine-tuning. To overcome this issue, Permuted Language Modeling (PLM) [49] was proposed as a pre-training objective to replace MLM. In short, PLM is a language modeling task on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as targets, and the model is trained to predict these targets, depending on the rest of the tokens and the natural positions of the targets. Note that this permutation does not affect the natural positions of the sequence and only defines the order of token predictions. In practice, only the last few tokens in the permuted sequence are predicted, due to slow convergence. A special two-stream self-attention is also introduced for target-aware representations.
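
The following sketch shows how a sampled permutation defines prediction order without moving any token. It only computes, for each target position, which natural positions may be attended to; it does not implement two-stream self-attention. The function name and its parameters are hypothetical.

```python
import random

def plm_targets(seq_len, num_targets, seed=0):
    """Sample a factorization order and the visible positions for each target.

    num_targets must be at least 1 and at most seq_len.
    """
    rng = random.Random(seed)
    order = list(range(seq_len))
    rng.shuffle(order)                # random permutation of natural positions
    targets = order[-num_targets:]    # only the last few positions are predicted
    visible = {}
    for pos in targets:
        rank = order.index(pos)
        # A target may see only positions that come earlier in the permutation.
        visible[pos] = sorted(order[:rank])
    return order, targets, visible

order, targets, visible = plm_targets(seq_len=6, num_targets=2)
```

The tokens stay at their natural positions; the permutation is realized purely through which positions each target is allowed to see, which in a Transformer would become an attention mask.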

3.1.4 Denoising Autoencoder (DAE)降噪自动编码器

Denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input. Specific to language, a sequence-to-sequence model, such as the standard Transformer, is used to reconstruct the original text. There are several ways to corrupt text [50]: 

(1) Token Masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.

(2) Token Deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of missing inputs.

(3) Text Infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.

(4) Sentence Permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.

(5) Document Rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start position of the document.
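The five corruption schemes can be sketched on whitespace tokens as follows. This is a simplified illustration rather than the implementation from [50]: the function names are hypothetical, and text infilling takes the span start and length as arguments instead of sampling the length from a Poisson distribution.

```python
import random

MASK = "[MASK]"

def token_masking(tokens, p, rng):
    # Replace each token with [MASK] with probability p.
    return [MASK if rng.random() < p else t for t in tokens]

def token_deletion(tokens, p, rng):
    # Drop each token with probability p; no placeholder marks the gap.
    return [t for t in tokens if rng.random() >= p]

def text_infilling(tokens, start, length):
    # Replace one whole span with a single [MASK], hiding how many tokens it had.
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(text, rng):
    # Split on full stops and shuffle the sentences.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return " ".join(sentences)

def document_rotation(tokens, rng):
    # Rotate so the document starts at a uniformly chosen token.
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
print(token_masking(tokens, 0.3, rng))
print(token_deletion(tokens, 0.3, rng))
print(text_infilling(tokens, start=1, length=3))
print(sentence_permutation("A b. C d. E f.", rng))
print(document_rotation(tokens, rng))
```

In every case the training target is the original, uncorrupted sequence.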



3.1.5 Contrastive Learning (CTL)

Contrastive learning [51] assumes that some observed pairs of text are more semantically similar than randomly sampled text. A score function $s(x, y)$ for a text pair $(x, y)$ is learned to minimize the objective function:

$$\mathcal{L}_{\text{CTL}} = \mathbb{E}_{x, y^{+}, y^{-}}\left[-\log \frac{\exp\big(s(x, y^{+})\big)}{\exp\big(s(x, y^{+})\big) + \exp\big(s(x, y^{-})\big)}\right],$$

where $(x, y^{+})$ is a similar pair and $y^{-}$ is presumably dissimilar to $x$. $y^{+}$ and $y^{-}$ are typically called the positive and negative sample. The score function $s(x, y)$ is often computed by a learnable neural encoder in two ways:

$s(x, y) = f_{\text{enc}}(x)^{\top} f_{\text{enc}}(y)$ or

$s(x, y) = f_{\text{enc}}(x \oplus y)$.
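As a rough numerical illustration, the sketch below evaluates a contrastive loss for one positive and one negative sample, assuming the common softmax form of the objective and the dot-product scorer. The vectors stand in for encoder outputs and are made-up values; `ctl_loss` is a hypothetical name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ctl_loss(x, y_pos, y_neg):
    """-log( exp(s+) / (exp(s+) + exp(s-)) ) with s(x, y) = x . y."""
    s_pos, s_neg = dot(x, y_pos), dot(x, y_neg)
    # Algebraically identical to the expression above: log(1 + exp(s- - s+)).
    return math.log1p(math.exp(s_neg - s_pos))

x = [1.0, 0.0]
print(ctl_loss(x, y_pos=[2.0, 0.0], y_neg=[-2.0, 0.0]))  # small: positive scored higher
print(ctl_loss(x, y_pos=[0.0, 1.0], y_neg=[0.0, -1.0]))  # log(2): scores are tied
```

The loss shrinks as the positive pair is scored further above the negative one, which is the "learning by comparison" behavior.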


The idea behind CTL is “learning by comparison”. Compared to LM, CTL usually has less computational complexity and is therefore a desirable alternative training criterion for PTMs.

Collobert et al. [31] proposed a pairwise ranking task to distinguish real and fake phrases. The model needs to predict a higher score for a legal phrase than for an incorrect phrase obtained by replacing its central word with a random word.

Mnih and Kavukcuoglu [52] trained word embeddings efficiently with Noise-Contrastive Estimation (NCE) [53], which trains a binary classifier to distinguish real and fake samples. The idea of NCE is also used in the well-known word2vec embedding [11].

We briefly describe some recently proposed CTL tasks in the following paragraphs.





Deep InfoMax (DIM) 

Deep InfoMax (DIM) [54] was originally proposed for images. It improves the quality of the representation by maximizing the mutual information between an image representation and local regions of the image.

Kong et al. [55] applied DIM to language representation learning. The global representation of a sequence $x$ is defined to be the hidden state of the first token (assumed to be a special start-of-sentence symbol) output by the contextual encoder $f_{\text{enc}}(x)$. The objective of DIM is to assign a higher score to $f_{\text{enc}}(x_{i:j})^{\top} f_{\text{enc}}(\hat{x}_{i:j})$ than to $f_{\text{enc}}(\tilde{x}_{i:j})^{\top} f_{\text{enc}}(\hat{x}_{i:j})$, where $x_{i:j}$ denotes an n-gram span from $i$ to $j$ in $x$, $\hat{x}_{i:j}$ denotes a sentence masked at positions $i$ to $j$, and $\tilde{x}_{i:j}$ denotes a randomly sampled negative n-gram from the corpus.


Replaced Token Detection (RTD)

Replaced Token Detection (RTD) is similar to NCE but predicts whether a token has been replaced, given its surrounding context.

CBOW with negative sampling (CBOW-NS) [11] can be viewed as a simple version of RTD, in which the negative samples are randomly sampled from the vocabulary with a simple proposal distribution.

ELECTRA [56] improves RTD by utilizing a generator to replace some tokens of a sequence. A generator G and a discriminator D are trained following a two-stage procedure:

(1) Train only the generator with the MLM task for n1 steps;

(2) Initialize the weights of the discriminator with the weights of the generator. Then train the discriminator with a discriminative task for n2 steps, keeping G frozen. Here the discriminative task is to judge whether each input token has been replaced by G or not. The generator is discarded after pre-training, and only the discriminator is fine-tuned on downstream tasks.

RTD is also an alternative solution to the mismatch problem, in which the network sees [MASK] during pre-training but not when it is fine-tuned on downstream tasks.

Similarly, WKLM [57] replaces words at the entity level instead of the token level. Concretely, WKLM replaces entity mentions with names of other entities of the same type and trains the model to distinguish whether the entity has been replaced.
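A minimal sketch of how RTD training examples could be assembled is shown below. It assumes the simple setting in which replacements are drawn uniformly from a small vocabulary (closer to CBOW-NS than to ELECTRA's learned generator); `make_rtd_example` and the toy vocabulary are hypothetical.

```python
import random

def make_rtd_example(tokens, vocab, replace_prob, rng):
    """Return (corrupted_tokens, labels), where label 1 means "replaced"."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = rng.choice(vocab)  # stand-in for a generator's sample
            corrupted.append(new_tok)
            # If the sample happens to equal the original token,
            # the position is still labeled as original.
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

rng = random.Random(0)
original = "the chef cooked the meal".split()
corrupted, labels = make_rtd_example(original, ["the", "ate", "meal", "dog"], 0.5, rng)
print(corrupted, labels)
```

The discriminator receives a label for every position, not only for the corrupted ones, and never sees a [MASK] token in its input.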









Next Sentence Prediction (NSP)

Punctuation marks are the natural separators of text data, so it is reasonable to construct pre-training methods that utilize them. Next Sentence Prediction (NSP) [16] is a good example of this. As its name suggests, NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus. Specifically, when choosing the sentence pair for each pre-training example, 50% of the time the second sentence is the actual next sentence of the first one, and 50% of the time it is a random sentence from the corpus. In this way, the model can learn the relationship between two input sentences, which benefits downstream tasks that are sensitive to this information, such as Question Answering and Natural Language Inference.

However, the necessity of the NSP task has been questioned by subsequent work [47, 49, 43, 63]. Yang et al. [49] found the impact of the NSP task unreliable, while Joshi et al. [47] found that single-sentence training without the NSP loss is superior to sentence-pair training with the NSP loss. Moreover, Liu et al. [43] conducted a further analysis of the NSP task, which shows that when training with blocks of text from a single document, removing the NSP loss matches or slightly improves performance on downstream tasks.
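The 50/50 pair construction used by NSP can be sketched as follows. The helper `make_nsp_pairs` is hypothetical, and it assumes that negatives are drawn from other documents so that a "random" sentence is never the true next one.

```python
import random

def make_nsp_pairs(documents, rng):
    """documents: list of documents, each a list of sentences (at least two documents)."""
    pairs = []
    for doc in documents:
        # Sentences from every other document serve as negative candidates.
        others = [s for d in documents if d is not doc for s in d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], 1))            # actual next sentence
            else:
                pairs.append((doc[i], rng.choice(others), 0))    # random sentence
    return pairs

docs = [["a1", "a2", "a3"], ["b1", "b2"]]
for first, second, label in make_nsp_pairs(docs, random.Random(0)):
    print(first, second, label)
```

Because negatives come from a different document, they usually differ in topic as well as in continuity.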






Sentence Order Prediction (SOP)

To better model inter-sentence coherence, ALBERT [63] replaces the NSP loss with a sentence order prediction (SOP) loss. As conjectured in Lan et al. [63], NSP conflates topic prediction and coherence prediction in a single task. Thus, the model can make predictions by relying merely on the easier task, topic prediction. Different from NSP, SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments with their order swapped as negative examples. As a result, ALBERT consistently outperforms BERT on various downstream tasks.

StructBERT [48] and BERTje [88] also take SOP as their self-supervised learning task.
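The SOP construction is easy to sketch: both the positive and the negative example come from the same two consecutive segments, so topic cues cannot separate them. `make_sop_pairs` is a hypothetical helper, and emitting both orders for every pair is a simplification for illustration.

```python
def make_sop_pairs(segments):
    """segments: consecutive text segments from a single document."""
    pairs = []
    for a, b in zip(segments, segments[1:]):
        pairs.append((a, b, 1))  # original order: positive example
        pairs.append((b, a, 0))  # swapped order: negative example
    return pairs

print(make_sop_pairs(["s1", "s2", "s3"]))
```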




3.1.6 Others

Apart from the above tasks, there are many other auxiliary pre-training tasks designed to incorporate factual knowledge (see Section 4.1), improve cross-lingual tasks (see Section 4.2), support multi-modal applications (see Section 4.3), or address other specific tasks (see Section 4.4).


3.2 Taxonomy of PTMs

To clarify the relations of existing PTMs for NLP, we build the taxonomy of PTMs, which categorizes existing PTMs from four different perspectives:

1. Representation Type: According to the representation used for downstream tasks, we can divide PTMs into non-contextual and contextual models.

2. Architectures: The backbone network used by PTMs, including LSTM, Transformer encoder, Transformer decoder, and the full Transformer architecture. “Transformer” means the standard encoder-decoder architecture. “Transformer encoder” and “Transformer decoder” mean the encoder and decoder parts of the standard Transformer architecture, respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending to their future (right) positions.

3. Pre-Training Task Types: The type of pre-training tasks used by PTMs. We have discussed them in Section 3.1.

4. Extensions: PTMs designed for various scenarios, including knowledge-enriched PTMs, multilingual or language-specific PTMs, multi-modal PTMs, domain-specific PTMs and compressed PTMs. We introduce these extensions in detail in Section 4.
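The triangular mask that distinguishes the Transformer decoder from the encoder can be illustrated with a small sketch. It uses plain Python lists and uniform attention scores purely for readability; the function names are illustrative assumptions.

```python
import math

def causal_mask(n):
    # Lower-triangular mask: position i may attend to positions j <= i only.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    # Row-wise softmax in which disallowed (future) positions get weight 0.
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform raw attention scores
for row in masked_softmax(scores, causal_mask(3)):
    print(row)
```

With uniform scores, the first token attends only to itself, the second splits its attention over two positions, and so on. An encoder would apply the same softmax without the mask.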







Figure 3 shows the taxonomy as well as some corresponding representative PTMs. In addition, Table 2 distinguishes some representative PTMs in more detail.


3.3 Model Analysis

Due to the great success of PTMs, it is important to understand what kinds of knowledge are captured by them, and how to induce knowledge from them. There is a wide range of literature analyzing linguistic knowledge and world knowledge stored in pre-trained non-contextual and contextual embeddings.


Two types of knowledge captured by PTMs: linguistic knowledge (2 kinds) and world knowledge (4 kinds).

3.3.1 Non-Contextual Embeddings

Static word embeddings were the first to be probed for various kinds of knowledge. Mikolov et al. [117] found that word representations learned by neural network language models are able to capture linguistic regularities in language, and that the relationship between words can be characterized by a relation-specific vector offset. Further analogy experiments [11] demonstrated that word vectors produced by the skip-gram model can capture both syntactic and semantic word relationships, such as vec(“China”) − vec(“Beijing”) ≈ vec(“Japan”) − vec(“Tokyo”). In addition, they found a compositionality property of word vectors: for example, vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”). Inspired by these works, Rubinstein et al. [118] found that distributional word representations are good at predicting taxonomic properties (e.g., dog is an animal) but fail to learn attributive properties (e.g., swan is white). Similarly, Gupta et al. [119] showed that word2vec embeddings implicitly encode referential attributes of entities. The distributed word vectors, along with a simple supervised model, can learn to predict numeric and binary attributes of entities with a reasonable degree of accuracy.
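The vector-offset regularity can be illustrated with a toy example. The 2-D vectors below are made-up values chosen so that the country-to-capital offset is constant; they are not real word2vec embeddings, and `analogy` is a hypothetical helper.

```python
import math

# Hypothetical 2-D "embeddings": every capital sits exactly 2 units below its country.
vec = {
    "China": [1.0, 3.0], "Beijing": [1.0, 1.0],
    "Japan": [4.0, 3.0], "Tokyo": [4.0, 1.0],
    "France": [7.0, 3.0], "Paris": [7.0, 1.0],
}

def analogy(a, b, c):
    """Solve "a is to b as c is to ?" by nearest neighbor to vec[b] - vec[a] + vec[c]."""
    target = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vec[w], target))

print(analogy("China", "Beijing", "Japan"))
print(analogy("China", "Beijing", "France"))
```

With real embeddings the offsets are only approximately equal, which is why the nearest neighbor of the target vector is used rather than an exact match.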


Figure 3: Taxonomy of PTMs with Representative Examples

Table 2: List of Representative PTMs and Their Architectures

“Transformer Enc.” and “Transformer Dec.” mean the encoder and decoder parts of the standard Transformer architecture, respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending to their future (right) positions. “Transformer” means the standard encoder-decoder architecture.
The averaged score on 9 tasks of the GLUE benchmark (see Section 7.1).
Without the WNLI task.
Indicates an ensemble result.
Indicates whether the model is usually used in a fine-tuning fashion.
The MLM of UniLM is built on three versions of LMs: Unidirectional LM, Bidirectional LM, and Sequence-to-Sequence LM.


3.3.2 Contextual Embeddings

A large number of studies have probed and induced different types of knowledge in contextual embeddings. In general, there are two types of knowledge: linguistic knowledge and world knowledge.


Linguistic Knowledge

A wide range of probing tasks are designed to investigate the linguistic knowledge in PTMs. Tenney et al. [120] and Liu et al. [121] found that BERT performs well on many syntactic tasks such as part-of-speech tagging and constituent labeling. However, BERT is not good enough at semantic and fine-grained syntactic tasks, compared with simple syntactic tasks.

Besides, Tenney et al. [122] analyzed the roles of BERT's layers in different tasks and found that BERT solves tasks in a similar order to that in NLP pipelines. Furthermore, knowledge of subject-verb agreement [123] and semantic roles [124] has also been confirmed to exist in BERT. In addition, Hewitt and Manning [125], Jawahar et al. [126], and Kim et al. [127] proposed several methods to extract dependency trees and constituency trees from BERT, which demonstrated BERT's ability to encode syntactic structure. Reif et al. [128] explored the geometry of internal representations in BERT and found some evidence:

1) linguistic features seem to be represented in separate semantic and syntactic subspaces;

2) attention matrices contain grammatical representations;

3) BERT distinguishes word senses at a very fine level.




World Knowledge

Besides linguistic knowledge, PTMs may also store world knowledge presented in the training data. A straightforward method of probing world knowledge is to query BERT with “fill-in-the-blank” cloze statements, for example, “Dante was born in [MASK]”. Petroni et al. [129] constructed the LAMA (Language Model Analysis) task by manually creating single-token cloze statements (queries) from several knowledge sources. Their experiments show that BERT contains world knowledge competitive with traditional information extraction methods. Because of the simplicity of the query generation procedure in LAMA, Jiang et al. [130] argued that LAMA only measures a lower bound on what language models know, and proposed more advanced methods to generate more effective queries. Despite the surprising findings of LAMA, it has also been questioned by subsequent work [131, 132]. Similarly, several studies induce relational knowledge [133] and commonsense knowledge [134] from BERT for downstream tasks.
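A cloze-style probe can be sketched as a template that turns a (subject, relation) pair into a query, plus a precision-at-k score over ranked single-token predictions. The template, the relation name, and the "model predictions" below are all hypothetical placeholders; no real model is queried.

```python
# Hypothetical relation templates; [MASK] marks the single-token answer slot.
TEMPLATES = {"place_of_birth": "{subject} was born in [MASK]."}

def cloze_query(subject, relation):
    return TEMPLATES[relation].format(subject=subject)

def precision_at_k(results, k):
    """results: list of (ranked_predictions, gold_answer) pairs."""
    hits = sum(gold in preds[:k] for preds, gold in results)
    return hits / len(results)

print(cloze_query("Dante", "place_of_birth"))

# Made-up ranked predictions for two queries, for illustration only.
results = [
    (["Florence", "Rome", "Italy"], "Florence"),
    (["Paris", "London", "Vienna"], "Vienna"),
]
print(precision_at_k(results, k=1), precision_at_k(results, k=3))
```

Since the score depends on the wording of each template, a fixed hand-written template yields only a lower bound on what the model knows.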



4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

PTMs usually learn universal language representation from general-purpose large-scale text corpora but lack domain-specific knowledge. Incorporating domain knowledge from external knowledge bases into PTM has been shown to be effective. The external knowledge ranges from linguistic [135, 79, 77, 136], semantic [137], commonsense [138], factual [76–78, 57, 80], to domain-specific knowledge [139, 78].


On the one hand, external knowledge can be injected during pre-training. Early studies [140–143] focused on learning knowledge graph embeddings and word embeddings jointly. Since BERT, some auxiliary pre-training tasks have been designed to incorporate external knowledge into deep PTMs. LIBERT [135] (linguistically-informed BERT) incorporates linguistic knowledge via an additional linguistic constraint task. Ke et al. [79] integrated the sentiment polarity of each word to extend MLM to Label-Aware MLM (LA-MLM). As a result, their proposed model, SentiLR, achieves state-of-the-art performance on several sentence- and aspect-level sentiment classification tasks. Levine et al. [137] proposed SenseBERT, which is pre-trained to predict not only the masked tokens but also their supersenses in WordNet. ERNIE(THU) [76] integrates entity embeddings pre-trained on a knowledge graph with corresponding entity mentions in the text to enhance the text representation. Similarly, KnowBERT [77] trains BERT jointly with an entity linking model to incorporate entity representations in an end-to-end fashion. Wang et al. [80] proposed KEPLER, which jointly optimizes knowledge embedding and language modeling objectives. These works inject the structural information of knowledge graphs via entity embeddings. In contrast, K-BERT [78] explicitly injects related triples extracted from a KG into the sentence to obtain an extended tree-form input for BERT. CoLAKE [81] integrates knowledge context and language context into a unified graph, which is then pre-trained with MLM to obtain contextualized representations for both knowledge and language. Moreover, Xiong et al. [57] adopted entity replacement identification to encourage the model to be more aware of factual knowledge. However, most of these methods update the parameters of PTMs when injecting knowledge, which may cause catastrophic forgetting when multiple kinds of knowledge are injected. To address this, K-Adapter [136] injects multiple kinds of knowledge by training different adapters independently for different pre-training tasks, which allows continual knowledge infusion.


On the other hand, one can incorporate external knowledge into pre-trained models without retraining them from scratch. As an example, K-BERT [78] allows injecting factual knowledge during fine-tuning on downstream tasks. Guan et al. [138] employed commonsense knowledge bases, ConceptNet and ATOMIC, to enhance GPT-2 for story generation. Yang et al. [144] proposed a knowledge-text fusion model to acquire related linguistic and factual knowledge for machine reading comprehension.

Note: KT-NET, proposed by Baidu, is a pioneering model that deeply fuses language representation with knowledge representation, aiming to harness both language and knowledge to further improve machine reading comprehension (MRC).

Besides, Logan IV et al. [145] and Hayashi et al. [146] extended the language model to the knowledge graph language model (KGLM) and the latent relation language model (LRLM), respectively, both of which allow prediction conditioned on a knowledge graph. These novel KG-conditioned language models show potential for pre-training.


4.2 Multilingual and Language-Specific PTMs

4.2.1 Multilingual PTMs

Learning multilingual text representations shared across languages plays an important role in many cross-lingual NLP tasks.


Cross-Lingual Language Understanding (XLU)

Most of the early works focus on learning multilingual word embeddings [147–149], which represent text from multiple languages in a single semantic space. However, these methods usually need (weak) alignment between languages.



Multilingual BERT (mBERT) is pre-trained by MLM with a shared vocabulary and shared weights on Wikipedia text from the top 104 languages. Each training sample is a monolingual document, and there are neither specifically designed cross-lingual objectives nor any cross-lingual data. Even so, mBERT performs cross-lingual generalization surprisingly well [150]. K et al. [151] showed that the lexical overlap between languages plays a negligible role in cross-lingual success.

XLM [46] improves mBERT by incorporating a cross-lingual task, translation language modeling (TLM), which performs MLM on a concatenation of parallel bilingual sentence pairs. Unicoder [82] further proposes three new cross-lingual pre-training tasks: cross-lingual word recovery, cross-lingual paraphrase classification, and cross-lingual masked language model (XMLM).
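A TLM training example can be sketched as ordinary MLM masking applied to the concatenation of a sentence and its translation. The separator token `[/s]`, the masking rate, and the function name are illustrative assumptions rather than XLM's exact preprocessing.

```python
import random

MASK, SEP = "[MASK]", "[/s]"  # placeholder special tokens

def make_tlm_example(src_tokens, tgt_tokens, mask_prob, rng):
    """Concatenate a parallel pair and mask tokens on both sides."""
    tokens = src_tokens + [SEP] + tgt_tokens
    inputs, targets = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at this position
    return inputs, targets

rng = random.Random(0)
en = "the cat sleeps".split()
fr = "le chat dort".split()
inputs, targets = make_tlm_example(en, fr, mask_prob=0.3, rng=rng)
print(inputs)
print(targets)
```

Because the translation is visible in the same input, a masked word in one language can be predicted from its counterpart in the other, which encourages aligned representations.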

XLM-RoBERTa (XLM-R) [62] is a scaled multilingual encoder pre-trained on a significantly larger amount of training data: 2.5TB of clean CommonCrawl data in 100 different languages. The pre-training task of XLM-RoBERTa is monolingual MLM only. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks, including XNLI, MLQA, and NER.


Cross-Lingual Language Generation (XLG)

Multilingual generation refers to tasks that generate text in a language different from the input language, such as machine translation and cross-lingual abstractive summarization.

Different from the PTMs for multilingual classification, the PTMs for multilingual generation usually need to pre-train the encoder and decoder jointly, rather than focusing only on the encoder.

MASS [41] pre-trains a Seq2Seq model with monolingual Seq2Seq MLM on multiple languages and achieves significant improvement for unsupervised NMT. XNLG [60] performs two-stage pre-training for cross-lingual natural language generation. The first stage pre-trains the encoder with monolingual MLM and Cross-Lingual MLM (XMLM) tasks. The second stage pre-trains the decoder by using monolingual DAE and Cross-Lingual Auto-Encoding (XAE) tasks while keeping the encoder fixed. Experiments show the benefit of XNLG on cross-lingual question generation and cross-lingual abstractive summarization. mBART [61], a multilingual extension of BART [50], pre-trains the encoder and decoder jointly with the Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrate that mBART produces significant performance gains across a wide variety of machine translation (MT) tasks.





4.2.2 Language-Specific PTMs

Although multilingual PTMs perform well on many languages, recent work showed that PTMs trained on a single language significantly outperform the multilingual results [89, 90, 152].

For Chinese, which does not have explicit word boundaries, modeling larger-granularity (BERT-wwm-Chinese [85], ZEN [87], NEZHA [86]) and multi-granularity (ERNIE [84], [153]) word representations has shown great success. Kuratov and Arkhipov [154] used transfer learning techniques to adapt a multilingual PTM to a monolingual PTM for the Russian language. In addition, some monolingual PTMs have been released for different languages, such as CamemBERT [89] and FlauBERT [90] for French, FinBERT [152] for Finnish, BERTje [88] and RobBERT [91] for Dutch, and AraBERT [155] for Arabic.

4.3 Multi-Modal PTMs

Observing the success of PTMs across many NLP tasks, some research has focused on obtaining a cross-modal version of PTMs. A great majority of these models are designed for general visual and linguistic feature encoding. These models are pre-trained on huge corpora of cross-modal data, such as videos with spoken words or images with captions, and incorporate extended pre-training tasks to fully utilize the multi-modal features. Typically, tasks like visual-based MLM, masked visual-feature modeling and visual-linguistic matching are widely used in multi-modal pre-training, as in VideoBERT [97], VisualBERT [94], and ViLBERT [92].

4.3.1 Video-Text PTMs

VideoBERT [97] and CBT [98] are joint video and text models. To obtain the sequences of visual and linguistic tokens used for pre-training, the videos are pre-processed by CNN-based encoders and off-the-shelf speech recognition techniques, respectively. A single Transformer encoder is then trained on the processed data to learn vision-language representations for downstream tasks like video captioning. Furthermore, UniViLM [156] proposes to bring in generation tasks to further pre-train the decoder used in downstream tasks.


4.3.2 Image-Text PTMs

Besides methods for video-language pre-training, several works introduce PTMs on image-text pairs, aiming to fit downstream tasks like visual question answering (VQA) and visual commonsense reasoning (VCR). Several proposed models adopt two separate encoders for image and text representation independently, such as ViLBERT [92] and LXMERT [93], while other methods like VisualBERT [94], B2T2 [95], VL-BERT [96], Unicoder-VL [157] and UNITER [158] propose a single-stream unified Transformer. Though these model architectures are different, similar pre-training tasks, such as MLM and image-text matching, are introduced in these approaches. To better exploit visual elements, images are converted into sequences of regions by applying RoI or bounding box retrieval techniques before being encoded by pre-trained Transformers.

4.3.3 Audio-Text PTMs

Moreover, several methods have explored the potential of PTMs on audio-text pairs, such as SpeechBERT [99]. This work tries to build an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpora and fine-tuned on Question Answering.


4.4 Domain-Specific and Task-Specific PTMs

Most publicly available PTMs are trained on general domain corpora such as Wikipedia, which limits their applications to specific domains or tasks. Recently, some studies have proposed PTMs trained on specialty corpora, such as BioBERT [100] for biomedical text, SciBERT [101] for scientific text, and ClinicalBERT [159, 160] for clinical text.

In addition to pre-training a domain-specific PTM, some work attempts to adapt available pre-trained models to target applications, such as biomedical entity normalization [161], patent classification [102], progress notes classification and keyword extraction [162].

Some task-oriented pre-training tasks were also proposed, such as sentiment Label-Aware MLM in SentiLR [79] for sentiment analysis, Gap Sentence Generation (GSG) [163] for text summarization, and Noisy Words Detection for disfluency detection [164].


4.5 Model Compression

Since PTMs usually consist of at least hundreds of millions of parameters, they are difficult to deploy in online services for real-life applications and on resource-restricted devices. Model compression [165] is a potential approach to reduce the model size and increase computation efficiency.

There are five ways to compress PTMs [166]:

(1) model pruning, which removes less important parameters,

(2) weight quantization [167], which uses fewer bits to represent the parameters,

(3) parameter sharing across similar model units,

(4) knowledge distillation [168], which trains a smaller student model that learns from intermediate outputs from the original model and

(5) module replacing, which replaces the modules of original PTMs with more compact substitutes.

Table 3 gives a comparison of some representative compressed PTMs.









 Table 3: Comparison of Compressed PTMs

4.5.1 Model Pruning: Removing Less Important Parameters

Model pruning refers to removing part of a neural network (e.g., weights, neurons, layers, channels, attention heads), thereby reducing the model size and speeding up inference.

Gordon et al. [103] (Compressing BERT) explored the timing of pruning (e.g., pruning during pre-training or after downstream fine-tuning) and the pruning regimes. Michel et al. [174] and Voita et al. [175] tried to prune entire self-attention heads in the Transformer block.
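As one simple instance of weight pruning, the sketch below zeroes out the smallest-magnitude entries of a weight matrix. Magnitude pruning is used here only as an illustrative choice and is not necessarily the exact regime studied in the cited works; `magnitude_prune` is a hypothetical helper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest absolute value."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    # Entries tied with the threshold are pruned as well.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

weights = [[0.1, -2.0], [0.05, 1.5]]
print(magnitude_prune(weights, sparsity=0.5))
```

Pruning whole attention heads follows the same pattern at a coarser granularity: score each head's importance, then remove the lowest-scoring ones.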

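[174] quantifies the importance of each attention head in BERT and finds that 20-40% of the heads can be pruned. The paper "Are Sixteen Heads Really Better than One?" analyzes in depth how useful each head in BERT's multi-head mechanism actually is, and finds that many heads contribute little. The authors iteratively remove attention heads from the BERT model, using a gradient-based method (gradient estimates on the downstream task) to estimate the importance of each head, and test the model's robustness to head pruning by plotting performance as a function of the percentage of heads removed. In practice, they find that 20-40% of the heads can be pruned with negligible impact on model accuracy.

[175] quantifies the importance of the individual heads in multi-head self-attention and proposes pruning the redundant ones. The paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" proposes a method for quantifying the importance of attention heads. Most heads turn out to be redundant, and many of them can be removed.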

4.5.2 Quantization: Representing Parameters with Fewer Bits

Quantization refers to the compression of higher-precision parameters to lower precision. Works from Shen et al. [104] (Q-BERT) and Zafrir et al. [105] (Q8BERT) focus solely on this area. Note that quantization often requires compatible hardware.
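The basic idea of mapping higher-precision values to fewer bits can be sketched with symmetric linear 8-bit quantization. This is a generic textbook scheme shown for illustration, not the specific method of Q-BERT or Q8BERT; the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

Storing each value in 8 bits instead of 32 gives roughly a 4x reduction in size. The rounding error per value is bounded by half the scale, and actual speedups depend on hardware support for low-precision arithmetic.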

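[104] Q-BERT is a systematic approach to compressing BERT with mixed-precision quantization guided by second-order Hessian information. It applies ultra-low-precision quantization to BERT-based models, aiming to minimize the performance drop while remaining hardware-efficient, and can yield unprecedentedly small models for CV and NLP tasks.

[105] Q8BERT compresses 32-bit floating-point parameters to an 8-bit representation, achieving 4x compression with minimal accuracy loss. Its core idea is to quantize the weights of all FC layers and the embedding layer to 8 bits (these weights account for 99% of all weights). Low-bit representation is a highly hardware-dependent technique, so accelerating inference with the quantized model requires hardware optimized for 8-bit general matrix multiplication. The paper covers only the quantization work and not the hardware design.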

4.5.3 Parameter Sharing: Sharing Parameters Across Similar Units

Another well-known approach to reducing the number of parameters is parameter sharing, which is widely used in CNNs, RNNs, and Transformer [176]. ALBERT [63] uses cross-layer parameter sharing and factorized embedding parameterization to reduce the parameters of PTMs. Although the number of parameters is greatly reduced, the training and inference time of ALBERT are even longer than those of the standard BERT.

Generally, parameter sharing does not improve the computational efficiency at the inference phase, because the shared layers still have to be executed once per layer position.
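A back-of-the-envelope calculation shows why these two techniques cut parameters but not compute. The sizes below (vocabulary V, hidden size H, embedding size E, L layers, per-layer parameter count) are illustrative assumptions in the range of a base-sized model, not figures taken from the text.

```python
V, H, E, L = 30_000, 768, 128, 12   # assumed vocabulary, hidden, embedding sizes, layers
params_per_layer = 7_000_000        # assumed rough size of one Transformer layer

# Factorized embedding: V x H  becomes  V x E  plus  E x H.
full_embedding = V * H
factorized_embedding = V * E + E * H

# Cross-layer sharing: one set of layer weights is stored instead of L sets.
unshared_layers = L * params_per_layer
shared_layers = params_per_layer

print(full_embedding, factorized_embedding)  # 23040000 3938304
print(unshared_layers, shared_layers)        # 84000000 7000000

# The forward pass still applies a layer L times in both cases.
layer_applications = L
print(layer_applications)                    # 12
```

Storage drops sharply, but the number of layer applications per forward pass is unchanged, which is why inference is not faster.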



4.5.4 Knowledge Distillation: Training a Smaller Student Model

Knowledge distillation (KD) [168] is a compression technique in which a small model, called the student model, is trained to reproduce the behavior of a large model, called the teacher model. Here the teacher model can be an ensemble of many models and is usually well pre-trained. Unlike other model compression techniques, distillation learns a small student model from a fixed teacher model through some optimization objectives, whereas compression techniques aim at searching for a sparser architecture.

Generally, distillation mechanisms can be divided into three types:

(1) distillation from soft target probabilities,

(2) distillation from other knowledge, and

(3) distillation to other structures:



(1)、从Soft Target概率中蒸馏,



(1) Distillation from soft target probabilities. Bucilua et al. [165] showed that making the student approximate the teacher model can transfer knowledge from teacher to student. A common method is approximating the logits of the teacher model. DistilBERT [106] trained the student model with a distillation loss over the soft target probabilities of the teacher as:

$$\mathcal{L}_{\text{KD-CE}} = -\sum_i t_i \cdot \log(s_i),$$

where $t_i$ and $s_i$ are the probabilities estimated by the teacher model and the student, respectively.

Distillation from soft target probabilities can also be used in task-specific models, such as information retrieval [177], and sequence labeling [178].
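A distillation loss over soft target probabilities is commonly written as the cross-entropy between the teacher distribution t and the student distribution s. The sketch below assumes that form; the `temperature` parameter and the function names are assumptions added for illustration and are not taken from the text.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature (assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))


teacher = [2.0, 1.0, 0.0]
matched = soft_target_loss(teacher, [2.0, 1.0, 0.0])
mismatched = soft_target_loss(teacher, [0.0, 1.0, 2.0])
```

The loss is smallest when the student reproduces the teacher's distribution, so `matched` is lower than `mismatched`.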

(1)、Soft Target软目标概率蒸馏。Bucilua等[165]表明,使学生近似于教师模型可以将知识从教师转移到学生。一种常用的方法是近似教师模型的对数DistilBERT[106]对学生模型进行了训练,其对教师软目标概率的蒸馏损失为:



For supplementary background, see the note on hard targets versus soft targets later in this section.

(2) Distillation from other knowledge. Distillation from soft target probabilities regards the teacher model as a black box and only focuses on its outputs. Decomposing the teacher model and distilling more of its knowledge can bring further improvement to the student model.

TinyBERT [107] performs layer-to-layer distillation with embedding outputs, hidden states, and self-attention distributions. MobileBERT [171] also performs layer-to-layer distillation with soft target probabilities, hidden states, and self-attention distributions. MiniLM [108] distills self-attention distributions and self-attention value relations from the teacher model.

Besides, other models distill knowledge through a variety of approaches. Sun et al. [169] introduced a “patient” teacher-student mechanism, and Liu et al. [179] exploited KD to improve a pre-trained multi-task deep neural network.




(3) Distillation to other structures. Generally, the structure of the student model is the same as the teacher model, except for a smaller layer size and a smaller hidden size. However, not only decreasing parameters but also simplifying model structures from Transformer to RNN [180] or CNN [181] can reduce the computational complexity.


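[169] PKDBert: proposes Patient Knowledge Distillation (PKD) for compressing BERT, with two distillation strategies: PKD-Last (learning from the last k layers) and PKD-Skip (learning from every k-th layer). Building on vanilla knowledge distillation, the student also learns directly from the teacher's intermediate layers, which makes fuller use of the teacher's information and improves the student network's performance.
[179] DK_MT-DNN: extends knowledge distillation to multi-task learning in order to train MT-DNN, yielding a more robust and general natural language understanding model. The process is as follows:
Step 2: through multi-task learning, a single MT-DNN (the student model) is distilled from multiple ensemble teachers.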

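Supplementary note: Hard-target vs. Soft-target

Traditional neural network training defines a loss function whose goal is to bring the prediction as close as possible to the ground truth (the hard target), i.e. to make the loss as small as possible. This amounts to maximum likelihood estimation on the ground truth. In knowledge distillation, training instead uses the class probabilities of the large model as soft targets.
>> Hard-target: the one-hot label annotated in the original dataset; the positive label is 1 and all negative labels are 0.
>> Soft-target: the class probabilities output by the Teacher model's softmax layer; every class is assigned a probability, and the positive label has the highest one.
Why does it help to train the Student with the Teacher's soft targets alongside the hard targets? In the softmax output, the negative labels also carry a great deal of what the Teacher has inferred: if some negative label receives a much higher probability than the others, the Teacher considers the sample somewhat similar to that label. In traditional hard-target training, all negative labels are treated identically. In other words, knowledge distillation lets each sample convey more information to the Student than traditional training does.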


4.5.5 Module Replacing模块替换——用更紧凑替换

Module replacing is an interesting and simple way to reduce the model size, which replaces the large modules of the original PTMs with more compact substitutes. Xu et al. [109] proposed Theseus Compression, motivated by a famous thought experiment called “Ship of Theseus”, which progressively substitutes modules from the source model with modules of fewer parameters. Different from KD, Theseus Compression only requires one task-specific loss function. The compressed model, BERT-of-Theseus, is 1.94× faster while retaining more than 98% of the performance of the source model.

模块替换减小模型尺寸的一种有趣而简单的方法,它用更紧凑的替代品替换原始PTM的大模块。Xu等人[109]提出 Theseus 压缩的动机是一个名为“Theseus 之船”的著名思想实验,该实验逐步用更少参数的模块替代源模型中的模块。与KD不同,Theseus 压缩只需要一个特定任务的损失函数。压缩模型BERT-of-Theseus速度快1.94倍,同时保持源模型98%以上的性能

4.5.6 Early Exit早退

Another efficient way to reduce the inference time is early exit, which allows the model to exit early at an off-ramp instead of passing through the entire model. The number of layers to be executed is conditioned on the input.

The idea of early exit was first applied in computer vision, for example in BranchyNet [182] and Shallow-Deep Network [183]. With the emergence of deep pre-trained language models, early exit has recently been adopted to speed up Transformer-based models. As a prior work, Universal Transformer [176] uses the Adaptive Computation Time (ACT) mechanism [184] to achieve input-adaptive computation. Elbayad et al. [185] proposed the Depth-adaptive Transformer for machine translation, which learns to predict how many decoding layers are required for a particular sequence or token. Instead of learning how much computation is required, Liu et al. [186] proposed two estimation approaches, based on Mutual Information (MI) and Reconstruction Loss respectively, to directly allocate the appropriate computation to each sample.


早期退出的思想最早应用于计算机视觉,如BranchyNet[182]和Shallow-Deep Network[183]。随着深度预训练语言模型的出现,最近采用早期退出来加速基于transformer的模型。作为先前的工作,Universal Transformer[176]使用自适应计算时间(Adaptive Computation Time, ACT)机制[184]来实现输入自适应计算。Elbayad等人[185]提出了用于机器翻译DAdap Transformers,它可以学习预测特定序列或token需要多少解码层。Liu等[186](FDAdap Transformers)提出了两种分别基于互信息(MI)和重建损失(Reconstruction Loss)的估计方法,直接为每个样本分配适当的计算,而不是学习需要多少计算量。

More recently, DeeBERT [110], RightTool [111], FastBERT [112], ELBERT [187], and PABEE [113] have been proposed to reduce the computation of the Transformer encoder for natural language understanding tasks. Their methods usually contain two steps: (a) training the injected off-ramps (aka internal classifiers), and (b) designing an exiting strategy to decide whether or not to exit.

Reducing the computation of the Transformer encoder

最近,DeeBERT [110], RightTool [111], Fast-BERT [112], ELBERT [187], PABEE[113]被提出来减少用于自然语言理解任务的Transformer编码器的计算量。他们的方法通常包含两个步骤:



Typically, the training objective is a weighted sum of the cross-entropy losses at all off-ramps, i.e.

$$\mathcal{L} = \sum_{i=1}^{M} w_i \cdot \mathcal{L}_i,$$

where $\mathcal{L}_i$ is the cross-entropy loss at the i-th off-ramp, $w_i$ is its weight, and M is the number of off-ramps. FastBERT [112] adopted the self-distillation loss that trains each off-ramp with the soft target generated by the final classifier. Liao et al. [114] improved the objective by considering both the past and the future information. In particular, the off-ramps are trained to aggregate the hidden states of the past layers, and also to approximate the hidden states of the future layers. Moreover, Sun et al. [115] developed a novel training objective from the perspective of ensemble learning and mutual information, by which the off-ramps are trained as an ensemble. Their proposed objective optimizes not only the accuracy of each off-ramp but also the diversity of the off-ramps.

During inference, an exiting strategy is required to decide whether to exit early or continue to the next layer. DeeBERT [110], FastBERT [112], and Liao et al. [114] adopt the entropy of the prediction distribution as the exiting criterion. Similarly, RightTool [111] uses the maximum softmax score to decide whether to exit. PABEE [113] developed a patience-based strategy that allows a sample to exit when the prediction is unchanged for successive layers. Further, Sun et al. [115] adopt a voting-based strategy that lets all of the past off-ramps take a vote to decide whether or not to exit. Besides, Li et al. [116] proposed a window-based uncertainty as the exiting criterion to achieve token-level early exit (TokEE) for sequence labeling tasks.



在推理过程中,需要一个退出策略决定是提早退出还是继续到下一层。DeeBERT [110], FastBERT [112], Liao等[114](GPFEE)采用预测分布的熵作为现有的判据。类似地,RightTool[111]使用最大softmax分数来决定是否退出。PABEE开发了一种基于耐心的策略,允许样本在连续层的预测不变时退出。此外,Sun等人[115]采用基于投票的策略,让所有过去的出入口进行投票来决定是否退出。此外,Li等人[116](SentEE/TokEE)提出了基于窗口的不确定性作为退出标准,以实现序列标签任务的token级提前退出(TokEE)。
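Two of the exiting strategies described here, the entropy criterion and the patience criterion, can be sketched as follows. This is a simplified illustration rather than the DeeBERT or PABEE implementation: the off-ramp probabilities are passed in as a precomputed list (a real model would compute them layer by layer and stop), and the function names and threshold values are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a prediction distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def exit_by_entropy(offramp_probs, threshold):
    """Exit at the first off-ramp whose prediction entropy is below the threshold."""
    for layer, probs in enumerate(offramp_probs):
        if entropy(probs) < threshold:
            return layer, probs.index(max(probs))
    last = offramp_probs[-1]
    return len(offramp_probs) - 1, last.index(max(last))


def exit_by_patience(offramp_probs, patience):
    """Exit once the predicted label has repeated for `patience` successive off-ramps."""
    streak, previous = 0, None
    for layer, probs in enumerate(offramp_probs):
        prediction = probs.index(max(probs))
        streak = streak + 1 if prediction == previous else 0
        previous = prediction
        if streak >= patience:
            return layer, prediction
    return len(offramp_probs) - 1, previous


# Hypothetical off-ramp outputs for one sample, from shallow to deep.
probs_per_layer = [[0.4, 0.6], [0.3, 0.7], [0.05, 0.95], [0.01, 0.99]]
entropy_exit = exit_by_entropy(probs_per_layer, threshold=0.3)
patience_exit = exit_by_patience(probs_per_layer, patience=2)
```

Both functions return the index of the off-ramp where the sample leaves and the predicted class; a lower entropy threshold or a higher patience pushes the exit deeper into the model.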

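[114] Global Past-Future Early Exit (GPFEE) uses imitation learning: it draws on the sample representations from all shallower layers and, in addition, tries to predict the representations of deeper layers as auxiliary information, thereby improving classification.
[115] Early exiting with ensemble internal classifiers (EICEE).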

5 Adapting PTMs to Downstream Tasks使 PTM 适应下游任务

Although PTMs capture general language knowledge from a large corpus, how to effectively adapt that knowledge to the downstream task is still a key problem.


5.1 Transfer Learning迁移学习

Transfer learning [188] adapts the knowledge from a source task (or domain) to a target task (or domain). Figure 4 gives an illustration of transfer learning.

There are many types of transfer learning in NLP, such as domain adaptation, cross-lingual learning, and multi-task learning. Adapting PTMs to downstream tasks is a sequential transfer learning task, in which tasks are learned sequentially and the target task has labeled data.



5.2 How to Transfer?如何迁移

To transfer the knowledge of a PTM to the downstream NLP tasks, we need to consider the following issues:


5.2.1 Choosing appropriate pre-training task, model architecture and corpus选择合适的预训练任务、模型架构和语料库

Different PTMs usually have different effects on the same downstream task, since these PTMs are trained with various pre-training tasks, model architectures, and corpora.

(1) Currently, the language model is the most popular pre-training task and can more efficiently solve a wide range of NLP problems [58]. However, different pre-training tasks have their own biases and give different effects for different tasks. For example, the NSP task [16] makes a PTM understand the relationship between two sentences. Thus, the PTM can benefit downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI).

(2) The architecture of the PTM is also important for the downstream task. For example, although BERT helps with most natural language understanding tasks, it is hard to use for language generation.

(3) The data distribution of the downstream task should be close to that of the PTM's pre-training data. Currently, there are a large number of off-the-shelf PTMs, which can be conveniently used for various domain-specific or language-specific downstream tasks.

Therefore, given a target task, it is always a good solution to choose a PTM trained with an appropriate pre-training task, architecture, and corpus.


(1)、语言模型是目前最流行的预训练任务,可以更有效地解决广泛的NLP问题[58]。但是,不同的预训练任务有其自身的偏向性,对不同的任务有不同的效果。例如,NSP任务[16]让PTM理解两个句子之间的关系。因此,PTM 可以使下游任务受益,如问答(QA)和自然语言推理(NLI)等具体应用

(2) 、PTM的对下游任务也很重要。例如,尽管BERT有助于大多数自然语言理解任务,但它很难生成语言



5.2.2 Choosing appropriate layers选择合适的图层

Given a pre-trained deep model, different layers should capture different kinds of information, such as POS tagging, parsing, long-term dependencies, semantic roles, and coreference. For RNN-based models, Belinkov et al. [189] and Melamud et al. [34] showed that representations learned from different layers in a multi-layer LSTM encoder benefit different tasks (e.g., predicting POS tags and understanding word sense). For Transformer-based PTMs, Tenney et al. [122] found that BERT represents the steps of the traditional NLP pipeline: basic syntactic information appears earlier in the network, while high-level semantic information appears at higher layers.

给定一个预训练好的深度模型不同的层应该捕获不同类型的信息,例如POS词性标记、解析、长期依赖、语义角色、共引用。对于基于RNN的模型,Belinkov等人[189]和Melamud等人[34]表明,从多层LSTM编码器中的不同层学习的表示有利于不同的任务(例如,预测POS标签和理解词义)。对于基于Transformer的PTM, Tenney等[122]发现BERT代表了传统NLP管道的步骤:基本语法信息出现在网络的较早位置,而高级语义信息出现在较高层

Let $H^{(l)}$ ($1 \le l \le L$) denote the l-th layer representation of the pre-trained model with L layers, and let g(·) denote the task-specific model for the target task.

There are three ways to select the representation:

a、Embedding Only. One approach is to use only the pre-trained static embeddings, while the rest of the model still needs to be trained from scratch for a new target task. Static embeddings capture the semantic meanings of words, but they fail to capture higher-level information, such as word sense, that might be even more useful.

b、Top Layer. The simplest and most effective way is to feed the representation at the top layer into the task-specific model $g(H^{(L)})$.

c、All Layers. A more flexible way is to automatically choose the best layer in a soft version, like ELMo [14]:

设H(l)(1≤ l ≤L)表示l层预训练模型的第l层表示,g(·)表示目标任务的特定任务模型。




c,所有Layers。一个更灵活的方法是在soft 版本中自动选择最好的图层,比如ELMo [14]:

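$$r_t = \gamma \sum_{l=1}^{L} \alpha_l h_t^{(l)},$$

where $h_t^{(l)}$ is the layer-l representation of token t, $\alpha_l$ is the softmax-normalized weight for layer l, and $\gamma$ is a scalar to scale the vectors output by the pre-trained model. The mixed representation is fed into the task-specific model $g(r_t)$.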

其中αl为层l的softmax归一化权值,γ 是一个标量,用于缩放预训练模型输出的向量。混合表示被输入到特定任务的模型g(rt)。
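The ELMo-style soft layer selection can be sketched as a softmax-weighted sum of per-layer vectors scaled by γ. The sketch below works on one token at a time with plain Python lists; the function name `scalar_mix` and the toy numbers are assumptions for illustration.

```python
import math


def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Combine L per-layer vectors of one token into a single vector.

    layer_vectors: list of L vectors (one per layer) for the same token.
    raw_weights:   L unnormalized scalars; softmax turns them into alpha_l.
    gamma:         scalar that rescales the mixed vector.
    """
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [
        gamma * sum(a * vec[d] for a, vec in zip(alphas, layer_vectors))
        for d in range(dim)
    ]


# Two layers with equal raw weights: the mix is the layer average times gamma.
mixed = scalar_mix([[1.0, 2.0], [3.0, 4.0]], raw_weights=[0.0, 0.0], gamma=2.0)
```

In practice the raw weights and γ would be learned together with the task-specific model, so the downstream task decides how much each layer contributes.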

5.2.3 To tune or not to tune?是否微调?

Currently, there are two common ways of model transfer: feature extraction (where the pre-trained parameters are frozen), and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).

In the feature extraction approach, the pre-trained models are regarded as off-the-shelf feature extractors. Moreover, it is important to expose the internal layers, as they typically encode the most transferable representations [190].

Although both approaches can significantly benefit most NLP tasks, feature extraction requires a more complex task-specific architecture. Therefore, fine-tuning is usually more general and convenient for many different downstream tasks than feature extraction.






Table 4: Some common combinations of adapting PTMs.




5.3 Fine-Tuning Strategies微调策略

With the increase of the depth of PTMs, the representation captured by them makes the downstream task easier. Therefore, the task-specific layer of the whole model is simple. Since ULMFit and BERT, fine-tuning has become the main adaptation method of PTMs. However, the process of fine-tuning is often brittle: even with the same hyper-parameter values, distinct random seeds can lead to substantially different results [193].

Besides standard fine-tuning, there are also some useful fine-tuning strategies.



Two-stage fine-tuning

An alternative solution is two-stage transfer, which introduces an intermediate stage between pre-training and fine-tuning. In the first stage, the PTM is transferred into a model fine-tuned by an intermediate task or corpus. In the second stage, the transferred model is fine-tuned to the target task. Sun et al. [64] showed that “further pre-training” on a related-domain corpus can further improve the ability of BERT, and achieved state-of-the-art performance on eight widely-studied text classification datasets. Phang et al. [194] and Garg et al. [195] introduced an intermediate supervised task related to the target task, which brings a large improvement for BERT, GPT, and ELMo. Li et al. [65] also used a two-stage transfer for story ending prediction. The proposed TransBERT (transferable BERT) can transfer not only general language knowledge from large-scale unlabeled data but also specific kinds of knowledge from various semantically related supervised tasks.


>> 在第一阶段,PTM被转移为一个由中间任务或语料库微调的模型。
>> 在第二阶段,传输的模型被微调到目标任务。
Sun等[64]研究表明,在相关领域语料库上进行“进一步的预训练”可以进一步提高BERT的能力,并在8个被广泛研究的文本分类数据集上取得了最先进的性能。Phang等[194]和Garg等[195]引入了与目标任务相关的中间监督任务,为BERTGPTELMo带来了很大的改进。Li等人[65]也使用了两阶段转移来预测故事结局。TransBERT (transferable BERT)不仅可以从大规模的无标签数据中转移一般语言知识,还可以从各种语义相关的监督任务中转移特定类型的知识

Multi-task fine-tuning

Liu et al. [67] fine-tuned BERT under the multi-task learning framework, which demonstrates that multi-task learning and pre-training are complementary technologies.



Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. Therefore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.

Stickland and Murray [68] equipped a single shared BERT model with small additional task-specific adaptation modules, projected attention layers (PALs). The shared BERT with the PALs matches separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters. Similarly, Houlsby et al. [69] modified the architecture of pre-trained BERT by adding adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing.
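To illustrate why an adaptation module adds only a few trainable parameters, here is a minimal sketch of one common adapter shape: a down-projection, a nonlinearity, an up-projection, and a residual connection. The bottleneck form, the ReLU, the omission of biases, and all names are assumptions for this example; the text above does not specify the internal design.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter(hidden, w_down, w_up):
    """Bottleneck adapter: hidden + W_up * relu(W_down * hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


def adapter_param_count(hidden_size, bottleneck_size):
    """Trainable weights of one adapter (biases omitted in this sketch)."""
    return 2 * hidden_size * bottleneck_size


hidden = [1.0, -2.0, 3.0, 0.5]
w_down = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]   # 2 x 4
w_up_zero = [[0.0, 0.0] for _ in range(4)]                # 4 x 2, zero-initialized
unchanged = adapter(hidden, w_down, w_up_zero)
```

With a zero up-projection the adapter is the identity, so inserting it does not disturb the frozen PTM at the start of training; only the small `w_down` and `w_up` matrices are then trained per task.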




Others

Motivated by the success of widely-used ensemble models, Xu et al. [196] improved the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation, which can improve the performance of BERT on downstream tasks without leveraging external resources or significantly decreasing the training efficiency. They integrated ensemble and distillation within a single training process. The teacher model is an ensemble model obtained by parameter-averaging several student models from previous time steps.

Instead of fine-tuning all the layers simultaneously, gradual unfreezing [38] is also an effective method that gradually unfreezes layers of PTMs starting from the top layer. Chronopoulou et al. [197] proposed a simpler unfreezing method, sequential unfreezing, which first fine-tunes only the randomly-initialized task-specific layers, then unfreezes the hidden layers of the PTM, and finally unfreezes the embedding layer.

Li and Eisner [198] compressed ELMo embeddings using a variational information bottleneck while keeping only the information that helps the target task.

Generally, the above works show that the utility of PTMs can be further stimulated by better fine-tuning strategies.



与同时微调所有层不同,T6、逐渐解冻 [38] 也是一种从顶层开始逐渐解冻 PTM 层的有效方法。Chronopoulou等人[197]提出了一种更简单的解冻结方法——顺序解冻结,该方法首先只对随机初始化的特定任务层进行微调,然后解冻结PTM的隐藏层,最后解冻结嵌入层
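Gradual unfreezing can be sketched as a schedule that says which layers are trainable in each epoch. The one-layer-per-epoch pace, the layer names, and the function name are assumptions for illustration; an actual training loop would use the schedule to toggle which parameters receive gradient updates.

```python
def gradual_unfreezing_schedule(layer_names, num_epochs):
    """Return, for each epoch, the list of trainable layers.

    layer_names is ordered from bottom (embeddings) to top (task layer).
    One additional layer, counted from the top, is unfrozen per epoch.
    """
    schedule = []
    for epoch in range(num_epochs):
        num_trainable = min(epoch + 1, len(layer_names))
        schedule.append(layer_names[-num_trainable:])
    return schedule


layers = ["embeddings", "layer1", "layer2", "classifier"]
schedule = gradual_unfreezing_schedule(layers, num_epochs=3)
```

In this toy run the classifier trains alone in the first epoch, `layer2` joins in the second, and `layer1` in the third, while the embeddings stay frozen throughout.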



5.3.1 Prompt-based Tuning基于提示的微调

Narrowing the gap between pre-training and fine-tuning can further boost the performance of PTMs on downstream tasks. An alternative approach is reformulating the downstream tasks into an MLM task by designing appropriate prompts. Prompt-based methods have shown great power in the few-shot setting [199, 200, 70, 72], the zero-shot setting [129, 201], and even the fully-supervised setting [74, 75]. Current prompt-based methods can be categorized into two branches according to whether the prompt is discrete or continuous.



Discrete prompts

A discrete prompt is a sequence of words to be inserted into the input text, which helps the PTM to better model the downstream task. Sun et al. [202] constructed an auxiliary sentence by transforming the aspect-based sentiment analysis (ABSA) task into a sentence pair classification task, but its model parameters still need to be fine-tuned. GPT-3 [59] proposed in-context learning, which concatenates the original input with the task description and a few examples. By this, GPT-3 can achieve competitive performance without tuning the parameters. Besides, Petroni et al. [129] found that with a proper manual prompt, BERT can perform well on an entity prediction task (LAMA) without training. In addition to LAMA, Schick and Schütze [200, 70] proposed PET, which designs discrete prompts for various text classification and entailment tasks. However, manually designed prompts can be sub-optimal; as a result, many methods have been developed to automate the generation of prompts. LPAQA [201] uses two methods, i.e., mining-based generation and paraphrasing-based generation, to find the optimal patterns that express particular relations. AutoPrompt [71] finds the optimal prompt with gradient-guided search. LM-BFF [72] employs T5 [42] to automatically generate prompts.


离散提示符,是插入到输入文本中的一系列单词序列,它帮助PTM更好地对下游任务建模。Sun等[202]将基于aspect的情感分析(ABSA)任务转化为句子对分类任务,构建了辅助句,但其模型参数仍需微调。GPT-3[59]提出了上下文学习,将原始输入与任务描述和一些示例连接起来。通过这种方式,GPT-3可以在不调整参数的情况下实现具有竞争力的性能。此外,Petroni等[129]发现在适当的手动提示下,BERT无需训练即可在实体预测任务 (LAMA) 上表现良好。除了LAMA之外,Schick和sch¨utze[200,70]还提出了PET, PET为各种文本分类和隐含任务设计离散提示。然而,手动设计的提示可能不是最优的,因此开发了许多方法来自动生成提示LPAQA[201]((挖掘自动提示)采用了两种方法:,基于挖掘的生成和基于释义的生成,以找到表达特定关系的最佳模式。AutoPrompt[71](梯度搜索提示)使用梯度引导搜索找到最佳提示。LM-BFF[72]采用T5[42]自动生成提示符。
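A manually designed discrete prompt for classification can be sketched as a cloze template plus a mapping from candidate fill-in words to labels. The template text, the label words, and the scores below are all hypothetical; in a real setup the scores would come from an MLM's prediction at the mask position.

```python
def build_cloze_prompt(text, template="{text} It was {mask}.", mask_token="[MASK]"):
    """Wrap the input in a cloze-style template (hypothetical pattern)."""
    return template.format(text=text, mask=mask_token)


def predict_label(mask_word_scores, verbalizer):
    """Pick the label whose label word scores highest at the mask position."""
    best_word = max(
        verbalizer, key=lambda word: mask_word_scores.get(word, float("-inf"))
    )
    return verbalizer[best_word]


verbalizer = {"great": "positive", "terrible": "negative"}  # assumed label words
prompt = build_cloze_prompt("The movie is fun.")
# Hypothetical MLM scores for candidate words at the [MASK] position.
label = predict_label({"great": 3.2, "terrible": -1.0, "okay": 1.1}, verbalizer)
```

Because the task is phrased as filling a mask, the PTM is used in the same way as during MLM pre-training, which is the gap-narrowing idea behind prompt-based methods.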

Continuous prompts

Instead of finding the optimal concrete prompt, another alternative is to directly optimize the prompt in continuous space, i.e. the prompt vectors are not necessarily word type embeddings of the PTM. The optimized continuous prompt is concatenated with word type embeddings, which is then fed into the PTM. Qin and Eisner [203] and Zhong et al. [204] found that the optimized continuous prompt can outperform concrete prompts (including manual [129], mined (LPAQA [201]), and gradient-searched (AutoPrompt [71]) prompts) on relational tasks. WARP [73] inserts trainable continuous prompt tokens before, between, and after the input sequence while keeping the parameters of the PTM fixed, resulting in considerable performance on the GLUE benchmark. Prefix-Tuning [74] inserts a continuous prompt as a prefix of the input of GPT-2 for table-to-text generation and of BART for summarization. Prefix-Tuning, as a parameter-efficient tuning technique, achieved comparable performance in the fully-supervised setting and outperformed model fine-tuning in the few-shot setting. Further, P-Tuning [75] showed that, with continuous prompts, GPT can also achieve comparable or even better performance than similar-sized BERT on natural language understanding (NLU) tasks. Very recently, Lester et al. [205] showed that prompt tuning becomes more competitive with scale. When the PTM exceeds billions of parameters, the gap between model fine-tuning and prompt tuning can be closed, which makes prompt-based tuning a very promising method for efficient serving of large-scale PTMs.


另一种方法是直接在连续空间中优化提示,而不是寻找最佳的具体提示,即提示向量不一定是 PTM 的词类型嵌入优化后的连续提示符与词类型嵌入连接,然后将其输入PTM。Qin和Eisner[203]和Zhong等人[204]发现优化后的连续提示在关系任务上优于具体提示(包括手动提示[129]、挖掘提示(LPAQA[201])和梯度搜索提示(AutoPrompt[71])。WARP[73]在保持PTM参数固定的同时,在输入序列之前、之间和之后插入可训练的连续提示token,从而在GLUE基准测试中获得相当好的性能。Prefix-Tuning前缀调优[74]将连续提示符插入GPT-2的输入前缀,用于表到文本的生成,BART用于摘要。Prefix-Tuning作为一种参数高效的调优技术,在全监督环境下具有相当的性能,在少样本环境下优于模型微调。此外,P-Tuning[75]表明,在连续提示下,GPT也可以在自然语言理解(NLU)任务上实现与类似规模的BERT相当甚至更好的性能。最近,Lester等人[205]表明,随着规模的扩大,提示调优变得更具竞争力。当PTM参数超过数十亿个时模型微调提示调优之间的差距可以缩小,这使得基于提示的调优成为高效服务大规模PTM的一种很有前途的方法。
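The mechanics of a continuous prompt can be sketched in two steps: free vectors are concatenated in front of the token embeddings, and only those vectors are updated during training. The toy dimensions, the plain SGD step, and the function names are assumptions for illustration; the gradients would normally come from backpropagation through the frozen PTM.

```python
def prepend_prompt(prompt_vectors, token_embeddings):
    """Concatenate trainable prompt vectors with the input's token embeddings."""
    return [list(v) for v in prompt_vectors] + [list(v) for v in token_embeddings]


def sgd_update_prompt(prompt_vectors, prompt_grads, lr=0.1):
    """Update only the prompt vectors; the PTM's own weights stay untouched."""
    return [
        [p - lr * g for p, g in zip(vec, grad)]
        for vec, grad in zip(prompt_vectors, prompt_grads)
    ]


prompt = [[0.0, 0.0], [1.0, 1.0]]                 # 2 prompt vectors, dim 2
tokens = [[0.5, 0.5], [0.2, 0.3], [0.9, 0.1]]     # 3 token embeddings
sequence = prepend_prompt(prompt, tokens)
# Hypothetical gradients with respect to the prompt vectors.
updated = sgd_update_prompt(prompt, [[1.0, 1.0], [0.0, 2.0]], lr=0.5)
```

The number of trainable values is just prompt length times embedding dimension, which is why this style of tuning is attractive for serving many tasks from one large frozen model.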

6 Resources of PTMs—PTM 的资源

There are many related resources for PTMs available online. Table 5 provides some popular repositories, including third-party implementations, paper lists, visualization tools, and other related resources of PTMs.

Besides, there are some other good survey papers on PTMs for NLP [211, 212, 173].


此外,还有一些其他关于 NLP 的 PTM 的优秀调查论文 [211、212、173]。

Table 5: Resources of PTMs

7 Applications应用

In this section, we summarize some applications of PTMs in several classic NLP tasks.


7.1 General Evaluation Benchmark通用评价基准

An essential issue for the NLP community is how to evaluate PTMs with a comparable metric. Thus, a large-scale benchmark is necessary.

The General Language Understanding Evaluation (GLUE) benchmark [213] is a collection of nine natural language understanding tasks, including single-sentence classification tasks (CoLA and SST-2), pairwise text classification tasks (MNLI, RTE, WNLI, QQP, and MRPC), a text similarity task (STS-B), and a relevance ranking task (QNLI). The GLUE benchmark is well-designed for evaluating the robustness as well as the generalization of models. GLUE does not provide the labels for the test set but sets up an evaluation server.

However, motivated by the fact that the progress in recent years has eroded headroom on the GLUE benchmark dramatically, a new benchmark called SuperGLUE [214] was presented. Compared to GLUE, SuperGLUE has more challenging tasks and more diverse task formats (e.g., coreference resolution and question answering).

State-of-the-art PTMs are listed in the corresponding leaderboards. 4) 5)






7.2 Question Answering / MRC

Question answering (QA), or the narrower concept of machine reading comprehension (MRC), is an important application in the NLP community. From easy to hard, there are three types of QA tasks: single-round extractive QA (SQuAD) [215], multi-round generative QA (CoQA) [216], and multi-hop QA (HotpotQA) [217].

BERT creatively transforms the extractive QA task into a span prediction task that predicts the start position as well as the end position of the answer [16]. After that, using a PTM as an encoder for predicting spans has become a competitive baseline. For extractive QA, Zhang et al. [218] proposed a retrospective reader architecture and initialized the encoder with a PTM (e.g., ALBERT). For multi-round generative QA, Ju et al. [219] proposed a “PTM+Adversarial Training+Rationale Tagging+Knowledge Distillation” model. For multi-hop QA, Tu et al. [220] proposed an interpretable “Select, Answer, and Explain” (SAE) system in which a PTM acts as the encoder in the selection module.

>> 单轮提取式QA (SQuAD)[215]、
>> 多轮生成式QA (CoQA)[216]和
>> 多跳式QA (HotpotQA)[217]。

>> 对于提取性QA, Zhang等人[218]提出了一种回溯式阅读器架构,并使用PTM(例如ALBERT)初始化编码器。
>> 对于多轮生成式QA, Ju等人[219]提出了“PTM+对抗性训练+基本原理标记+知识蒸馏”模型。

>> 对于多跳QA, Tu等人[220]提出了一种可解释的“选择、回答和解释”(SAE)系统,PTM作为选择模块中的编码器。

Generally, encoder parameters in the proposed QA model are initialized through a PTM, and other parameters are randomly initialized. State-of-the-art models are listed in the corresponding leaderboard. 6) 7) 8)

通常,所提出的QA模型中的编码器参数通过PTM初始化其他参数随机初始化。最先进的模型被列在相应的排行榜上。6) 7) 8)
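The span prediction formulation of extractive QA described in this section can be sketched as a search over start and end positions. The sketch assumes the encoder has already produced one start score and one end score per token; the scores, the length limit, and the function name are hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start score + end score with end >= start."""
    best_start, best_end, best_score = None, None, float("-inf")
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best_score:
                best_start, best_end, best_score = i, j, score
    return best_start, best_end


# Hypothetical per-token scores for a four-token passage.
start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [3.0, 0.1, 1.5, 0.4]
span = best_span(start_scores, end_scores)
```

Token 0 has the highest end score, but pairing it with the best start would give an end before the start, so the constraint `end >= start` selects the span from token 1 to token 2 instead.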

7.3 Sentiment Analysis情感分析

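BERT plus transfer learning techniques sets a new state of the art in Japanese SA

Aspect detection and sentiment classification for end-to-end ABSA
SentiLR: SentiWordNet + Label-Aware MLM to capture sentiment-shift relations;
BERT-based “Mask and Infill” for disentangling sentiment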

BERT outperforms previous state-of-the-art models by simply fine-tuning on SST-2, which is a widely used dataset for sentiment analysis (SA) [16]. Bataa and Wu [221] utilized BERT with transfer learning techniques and achieved a new state of the art in Japanese SA.

Despite its success in simple sentiment classification, directly applying BERT to aspect-based sentiment analysis (ABSA), which is a fine-grained SA task, shows less significant improvement [202]. To better leverage the powerful representation of BERT, Sun et al. [202] constructed an auxiliary sentence by transforming ABSA from a single sentence classification task into a sentence pair classification task. Xu et al. [222] proposed post-training to adapt BERT from its source domain and tasks to the ABSA domain and tasks. Furthermore, Rietzler et al. [223] extended the work of [222] by analyzing the behavior of cross-domain post-training with ABSA performance. Karimi et al. [224] showed that the performance of post-trained BERT could be further improved via adversarial training. Song et al. [225] added an additional pooling module, which can be implemented as either an LSTM or an attention mechanism, to leverage BERT intermediate layers for ABSA. In addition, Li et al. [226] jointly learned aspect detection and sentiment classification towards end-to-end ABSA. SentiLR [79] acquires part-of-speech tags and prior sentiment polarity from SentiWordNet and adopts Label-Aware MLM to utilize the introduced linguistic knowledge to capture the relationship between sentence-level sentiment labels and word-level sentiment shifts. SentiLR achieves state-of-the-art performance on several sentence- and aspect-level sentiment classification tasks.

BERT通过对SST-2进行微调就超越了以前最先进的模型,这是一个广泛用于情感分析(SA)[16]的数据集。Bataa和Wu[221]利用BERT迁移学习技术,并在日语 SA 实现了新的SOA水平

尽管他们在简单的情感分类取得了成功,但直接将BERT应用于基于aspect的情感分析(ABSA),这是一项细粒度的SA任务,显示出不太显著的改进[202]。为了更好地利用BERT强大表示法,Sun等人[202]将ABSA单句分类任务转化为句对分类任务,构造了一个辅助句。Xu等人[222]提出后训练,使BERT从其源域和任务适应到ABSA域和任务。此外,Rietzler等人[223]通过使用ABSA性能分析跨域后训练的行为,扩展了[222]的工作。Karimi等人[224]研究表明,通过对抗性训练可以进一步提高训练后BERT的表现。Song等人[225]添加了一个额外的池化模块,可以作为 LSTM 或注意力机制来实现,以利用 BERT 中间层进行 ABSA。此外,Li等[226]共同学习了基于端到端ABSA的aspect 检测和情感分类SentiLR[79]从SentiWordNet中获取词性标签和先验情感极性,并采用Label-Aware MLM利用引入的语言知识来捕获句子级情感标签单词级情感转移之间的关系SentiLR在几个句子级aspect的情感分类任务上达到了最先进的性能。

For sentiment transfer, Wu et al. [227] proposed “Mask and Infill” based on BERT. In the mask step, the model disentangles sentiment from content by masking sentiment tokens. In the infill step, it uses BERT along with a target sentiment embedding to infill the masked positions.

对于情感转移,Wu等[227]提出了基于BERT的“Mask and Infill”。在屏蔽步骤中,模型通过屏蔽情感tokens情感从内容中分离出来。在填充步骤中,它使用BERT目标情感嵌入填充掩码位置


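Aspect-Based Sentiment Analysis (ABSA): an aspect can be a word that does not appear in the sentence. An aspect refers to the category of a noun or entity in the sentence, i.e. an abstracted facet, and is generally not a noun that literally occurs in the sentence. ABSA is a fine-grained SA task.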

7.4 Named Entity Recognition命名实体识别

Named Entity Recognition (NER) is a task in information extraction and plays an important role in many NLP downstream tasks. In deep learning, most NER methods use the sequence-labeling framework. The entity information in a sentence is transformed into a sequence of labels, with one label corresponding to one word. The model is used to predict the label of each word. Since ELMo and BERT have shown their power in NLP, there is much work on pre-trained models for NER.

Akbik et al. [37] used a pre-trained character-level language model to produce word-level embeddings for NER. TagLM [228] and ELMo [14] use a pre-trained language model’s last layer output and the weighted sum of each layer's output, respectively, as a part of the word embedding. Liu et al. [229] used layer-wise pruning and dense connections to speed up ELMo’s inference on NER. Devlin et al. [16] used the BERT representation of each word's first BPE token to predict the word’s label without a CRF. Pires et al. [150] realized zero-shot NER through multilingual BERT. Tsai et al. [178] leveraged knowledge distillation to run a small BERT for NER on a single CPU. Besides, BERT is also used for domain-specific NER, such as in biomedicine [230, 100].

命名实体识别(NER),在信息抽取和许多 NLP 下游任务中起着重要作用。在深度学习中,大多数NER方法都是在序列标签框架中。句子中的实体信息转化标签序列一个标签对应一个单词。该模型用于预测每个单词的标签。由于ELMoBERT已经在NLP中展示了它们的强大功能,因此有很多关于 NER 预训练模型的工作。

Akbik等人[37]使用预训练的字符级语言模型为NER生成词级嵌入TagLM[228]和ELMo[14]使用预训练语言模型的最后一层输出各层输出的加权和作为词嵌入的一部分。Liu等[229]使用分层剪枝密集连接加速了ELMoNER的推断。Devlin et al.[16]使用第一个BPE的BERT表示来预测没有CRF的每个单词的标签。Pires等[150]通过多语言BERT实现了零样本NER。Tsai等人[178]利用知识蒸馏在单个CPU上为NER运行一个小型BERT。此外,BERT还用于特定领域NER,如生物医学[230,100]等。
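The sequence-labeling view of NER, with one label per word, can be made concrete by decoding a label sequence back into entities. The sketch below assumes the common BIO tagging scheme (B- begins an entity, I- continues it, O is outside); the scheme, the example sentence, and the function name are assumptions for illustration.

```python
def bio_to_entities(tokens, tags):
    """Group tokens into (entity text, entity type) pairs from BIO tags."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag ends the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Barack", "Obama", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]  # per-word labels a model might predict
entities = bio_to_entities(tokens, tags)
```

A PTM-based tagger only has to predict the `tags` list; turning it into entity spans is a deterministic post-processing step like the one above.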

7.5 Machine Translation机器翻译

Machine Translation (MT) is an important task in the NLP community, which has attracted many researchers. Almost all Neural Machine Translation (NMT) models share the encoder-decoder framework, which first encodes input tokens into hidden representations with the encoder and then decodes output tokens in the target language with the decoder. Ramachandran et al. [36] found that encoder-decoder models can be significantly improved by initializing both encoder and decoder with the pre-trained weights of two language models. Edunov et al. [231] used ELMo to set the word embedding layer in the NMT model. This work shows performance improvements on English-Turkish and English-German NMT models by using a pre-trained language model for source word embedding initialization.

Given the superb performance of BERT on other NLP tasks, it is natural to investigate how to incorporate BERT into NMT models. Conneau and Lample [46] tried to initialize the entire encoder and decoder with a multilingual pre-trained BERT model and showed that a significant improvement could be achieved on unsupervised MT and English-Romanian supervised MT. Similarly, Clinchant et al. [232] devised a series of different experiments for examining the best strategy to utilize BERT on the encoder part of NMT models. They achieved some improvement by using BERT as an initialization of the encoder. Also, they found that these models can get better performance on the out-of-domain dataset. Imamura and Sumita [233] proposed a two-stage BERT fine-tuning method for NMT. In the first stage, the encoder is initialized by a pre-trained BERT model, and only the decoder is trained on the training set. In the second stage, the whole NMT model is jointly fine-tuned on the training set. By experiment, they show that this approach can surpass the one-stage fine-tuning method, which directly fine-tunes the whole model. Apart from that, Zhu et al. [192] suggested using pre-trained BERT as an extra memory to facilitate NMT models. Concretely, they first encode the input tokens with a pre-trained BERT and use the output of the last layer as extra memory. Then, the NMT model can access the memory via an extra attention module in each layer of both encoder and decoder. They show a noticeable improvement in supervised, semi-supervised, and unsupervised MT.



Instead of pre-training only the encoder, MASS (Masked Sequence-to-Sequence Pre-Training) [41] uses Seq2Seq MLM to pre-train the encoder and decoder jointly. In experiments, this approach surpasses the BERT-style pre-training proposed by Conneau and Lample [46] on both unsupervised MT and English-Romanian supervised MT. Unlike MASS, mBART [61], a multilingual extension of BART [50], pre-trains the encoder and decoder jointly with a Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrated that mBART significantly improves both supervised and unsupervised machine translation at both the sentence level and the document level.
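As a rough illustration of Seq2Seq MLM, the sketch below builds one training example by masking a contiguous span on the encoder side and asking the decoder to reproduce that span. The function name `mass_example`, the `[MASK]` string, and the choice to feed the decoder the span shifted right by one position are assumptions made for this sketch, not details given in the text above.

```python
def mass_example(tokens, start, length, mask="[MASK]"):
    """Build (encoder_input, decoder_input, decoder_target) for one masked span.

    Assumption: the decoder sees the span shifted right by one position,
    so it predicts each span token from the previous ones.
    """
    if start < 0 or length <= 0 or start + length > len(tokens):
        raise ValueError("span must lie inside the token sequence")
    span = tokens[start:start + length]
    # Encoder side: the span is hidden behind mask tokens.
    encoder_input = tokens[:start] + [mask] * length + tokens[start + length:]
    # Decoder side: autoregressive reconstruction of the hidden span.
    decoder_input = [mask] + span[:-1]
    decoder_target = span
    return encoder_input, decoder_input, decoder_target


tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_in, dec_in, dec_out = mass_example(tokens, start=2, length=4)
print(enc_in)
print(dec_in)
print(dec_out)
```

Because the decoder can only recover the span by attending to the encoder output, both components are trained together, which is the contrast with encoder-only pre-training drawn above.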


7.6 Summarization

Summarization, which aims to produce a shorter text that preserves most of the meaning of a longer text, has attracted the attention of the NLP community in recent years. The task has improved significantly since the widespread use of PTMs. Zhong et al. [191] introduced transferable knowledge (e.g., BERT) for summarization and surpassed previous models. Zhang et al. [234] tried to pre-train a document-level model that predicts sentences instead of words and then applied it to downstream tasks such as summarization. More elaborately, Zhang et al. [163] designed a Gap Sentence Generation (GSG) task for pre-training, whose objective involves generating summary-like text from the input. Furthermore, Liu and Lapata [235] proposed BERTSUM, which includes a novel document-level encoder and a general framework for both extractive and abstractive summarization. In the encoder, BERTSUM extends BERT by inserting multiple [CLS] tokens to learn sentence representations. For extractive summarization, BERTSUM stacks several inter-sentence Transformer layers. For abstractive summarization, BERTSUM proposes a two-stage fine-tuning approach using a new fine-tuning schedule. Zhong et al. [236] proposed a novel summary-level framework, MATCHSUM, and conceptualized extractive summarization as a semantic text matching problem. They proposed a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary and achieved a state-of-the-art result on CNN/DailyMail (44.41 in ROUGE-1) using only the base version of BERT.
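To make the "multiple [CLS] tokens" idea concrete, here is a minimal sketch of how a document might be laid out for a BERTSUM-style encoder. The helper `bertsum_inputs` is hypothetical, whitespace splitting stands in for real WordPiece tokenization, and the alternating segment ids follow the interval segment embeddings described in the BERTSUM paper rather than anything stated in the paragraph above.

```python
def bertsum_inputs(sentences):
    """Lay out sentences as [CLS] s1 [SEP] [CLS] s2 [SEP] ...

    Returns the token list, alternating 0/1 segment ids per sentence,
    and the position of each [CLS] token (one per sentence).
    """
    tokens, segment_ids, cls_positions = [], [], []
    for index, sentence in enumerate(sentences):
        cls_positions.append(len(tokens))
        # Whitespace split is a stand-in for a real subword tokenizer.
        piece = ["[CLS]"] + sentence.split() + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([index % 2] * len(piece))
    return tokens, segment_ids, cls_positions


doc = ["the cat sat", "it slept"]
tokens, segment_ids, cls_positions = bertsum_inputs(doc)
print(tokens)
print(segment_ids)
print(cls_positions)
```

The encoder output at each position in `cls_positions` can then serve as the representation of the corresponding sentence, which is what an extractive scorer would rank.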


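>> For extractive summarization, BERTSUM stacks several inter-sentence Transformer layers.

>> For abstractive summarization, BERTSUM proposes a two-stage fine-tuning approach using a new fine-tuning schedule.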


7.7 Adversarial Attacks and Defenses

Deep neural models are vulnerable to adversarial examples, which can mislead a model into producing a specific wrong prediction through imperceptible perturbations of the original input. In CV, adversarial attacks and defenses have been widely studied. However, they remain challenging for text due to the discrete nature of language. Adversarial samples for text need to possess the following qualities: (1) imperceptible to human judges yet misleading to neural models; (2) fluent in grammar and semantically consistent with the original inputs. Jin et al. [237] successfully attacked fine-tuned BERT on text classification and textual entailment with adversarial examples. Wallace et al. [238] defined universal adversarial triggers that can induce a model to produce a specific-purpose prediction when concatenated to any input. Some triggers can even cause the GPT-2 model to generate racist text. Sun et al. [239] showed that BERT is not robust to misspellings.

PTMs also have great potential to generate adversarial samples. Li et al. [240] proposed BERT-Attack, a BERT-based high-quality and effective attacker. They turned BERT against another fine-tuned BERT on downstream tasks and successfully misguided the target model to predict incorrectly, outperforming state-of-the-art attack strategies in both success rate and perturb percentage, while the generated adversarial samples are fluent and semantically preserved.






Besides, adversarial defenses for PTMs are also promising: they improve the robustness of PTMs and make them immune to adversarial attacks.

Adversarial training aims to improve generalization by minimizing the maximal risk for label-preserving perturbations in the embedding space. Recent work [241, 242] showed that adversarial pre-training or fine-tuning can improve both the generalization and the robustness of PTMs for NLP.
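The phrase "minimizing the maximal risk" is commonly written as a min-max objective. The formulation below is a sketch under assumed notation that does not appear in the text: $\theta$ denotes the model parameters, $\mathcal{D}$ the training distribution, $x$ the embedded input, $y$ its label, $\delta$ a perturbation bounded in norm by $\epsilon$, and $\mathcal{L}$ the task loss.

```latex
\min_{\theta} \;
\mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[
  \max_{\|\delta\| \le \epsilon}
  \mathcal{L}\big(f_{\theta}(x + \delta),\, y\big)
\right]
```

The inner maximization searches for the worst-case perturbation of the embeddings that keeps the label unchanged, and the outer minimization updates the parameters against that worst case.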



8 Future Directions

Though PTMs have proven their power for various NLP tasks, challenges still exist due to the complexity of language. In this section, we suggest five future directions of PTMs.



(1) Upper Bound of PTMs 

Currently, PTMs have not yet reached their upper bound. Most current PTMs can be further improved with more training steps and larger corpora.

The state of the art in NLP can be further advanced by increasing the depth of models, as in Megatron-LM [243] (8.3 billion parameters, 72 Transformer layers with a hidden size of 3072 and 32 attention heads) and Turing-NLG (17 billion parameters, 78 Transformer layers with a hidden size of 4256 and 28 attention heads).
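The quoted model sizes can be sanity-checked with a common back-of-the-envelope estimate of about 12 · L · h² parameters for L Transformer layers of hidden size h. This is a sketch under stated assumptions: each layer is taken to hold roughly 4h² attention projection weights plus 8h² feed-forward weights (inner size 4h), and embeddings, biases, and layer norms are ignored. The function name is hypothetical.

```python
def approx_transformer_params(num_layers: int, hidden_size: int) -> int:
    """Rough parameter count: 4*h^2 (attention) + 8*h^2 (feed-forward) per layer.

    Assumes a feed-forward inner size of 4*h and ignores embeddings,
    biases, and layer-norm parameters.
    """
    return 12 * num_layers * hidden_size ** 2


megatron_lm = approx_transformer_params(num_layers=72, hidden_size=3072)
turing_nlg = approx_transformer_params(num_layers=78, hidden_size=4256)

print(f"Megatron-LM estimate: {megatron_lm / 1e9:.2f}B")  # 8.15B
print(f"Turing-NLG estimate:  {turing_nlg / 1e9:.2f}B")   # 16.95B
```

Both estimates land close to the reported 8.3 billion and 17 billion parameters. The remaining gap plausibly comes from the terms this approximation leaves out.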

General-purpose PTMs are always our pursuit for learning the intrinsic universal knowledge of languages (even world knowledge). However, such PTMs usually need deeper architectures, larger corpora, and more challenging pre-training tasks, which in turn result in higher training costs. Training huge models is itself a challenging problem that requires more sophisticated and efficient training techniques such as distributed training, mixed precision, and gradient accumulation. Therefore, a more practical direction is to design more efficient model architectures, self-supervised pre-training tasks, optimizers, and training techniques using existing hardware and software. ELECTRA [56] is a good step in this direction.






(2) Architecture of PTMs

The Transformer has proved to be an effective architecture for pre-training. However, its main limitation is its computational complexity, which is quadratic in the input length. Limited by GPU memory, most current PTMs cannot handle sequences longer than 512 tokens. Breaking this limit requires improving the architecture of the Transformer. Although many works [25] have tried to improve the efficiency of the Transformer, there remains much room for improvement.
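The quadratic growth can be made tangible with a small calculation. The sketch below only counts the entries of the self-attention score matrix for one head in one layer; the function name and the chosen sequence lengths are illustrative assumptions, and all other costs are ignored.

```python
def attention_score_entries(seq_len: int, num_heads: int = 1) -> int:
    """Entries in the (seq_len x seq_len) attention score matrix, per layer."""
    return num_heads * seq_len * seq_len


baseline = attention_score_entries(512)
growth = {}
for seq_len in (512, 1024, 2048, 4096):
    entries = attention_score_entries(seq_len)
    growth[seq_len] = entries // baseline
    print(f"{seq_len:>5} tokens: {entries:>10,} entries ({growth[seq_len]}x)")
```

Doubling the input length quadruples the score matrix, so going from 512 to 4096 tokens multiplies this term by 64.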

Besides, searching for more efficient alternative non-Transformer architectures for PTMs is important for capturing longer-range contextual information. The design of deep architectures is challenging, and we may seek help from automatic methods such as neural architecture search (NAS) [245].





(3) Task-oriented Pre-training and Model Compression

In practice, different downstream tasks require different abilities of PTMs. The discrepancy between PTMs and downstream tasks usually lies in two aspects: model architecture and data distribution. A larger discrepancy may render the benefit of PTMs insignificant. For example, text generation usually needs a specific task to pre-train both the encoder and decoder, while text matching needs pre-training tasks designed for sentence pairs.

Besides, although larger PTMs usually lead to better performance, a practical problem is how to leverage these huge PTMs in special scenarios, such as low-capacity devices and low-latency applications. Therefore, we can carefully design specific model architectures and pre-training tasks for downstream tasks or extract partial task-specific knowledge from existing PTMs.

Instead of training task-oriented PTMs from scratch, we can teach them with existing general-purpose PTMs by using techniques such as model compression (see Section 4.5). Although model compression is widely studied for CNNs in CV [246], compression for PTMs for NLP is just beginning. The fully-connected structure of the Transformer also makes model compression more challenging.






(4) Knowledge Transfer Beyond Fine-tuning

Currently, fine-tuning is the dominant method to transfer PTMs' knowledge to downstream tasks, but one deficiency is its parameter inefficiency: every downstream task has its own fine-tuned parameters. An improved solution is to fix the original parameters of PTMs and add small fine-tunable adaptation modules for each specific task [68, 69]. Thus, we can use a shared PTM to serve multiple downstream tasks. Indeed, mining knowledge from PTMs can be more flexible, for example through feature extraction, knowledge distillation [210], data augmentation [247, 248], or using PTMs as external knowledge [129]. More efficient methods are expected.
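The parameter-efficiency argument can be illustrated with rough arithmetic. The sketch below assumes a bottleneck-style adapter (a down-projection followed by an up-projection, each with a bias), two adapters per Transformer layer, a 12-layer model with hidden size 768, a bottleneck size of 64, and a backbone of about 110 million parameters. All of these values and function names are assumptions chosen for illustration, not figures taken from the text.

```python
def adapter_params(hidden_size: int, bottleneck: int) -> int:
    """Parameters of one assumed bottleneck adapter (weights plus biases)."""
    down = hidden_size * bottleneck + bottleneck    # h -> m projection
    up = bottleneck * hidden_size + hidden_size     # m -> h projection
    return down + up


def per_task_adapter_params(num_layers, hidden_size, bottleneck, adapters_per_layer=2):
    """Task-specific parameters when the shared PTM itself stays frozen."""
    return num_layers * adapters_per_layer * adapter_params(hidden_size, bottleneck)


backbone = 110_000_000   # assumed size of the shared PTM
num_tasks = 10           # hypothetical number of downstream tasks

per_task = per_task_adapter_params(num_layers=12, hidden_size=768, bottleneck=64)
full_fine_tuning_total = num_tasks * backbone          # one full model copy per task
adapter_total = backbone + num_tasks * per_task        # one shared copy plus adapters

print(f"adapter parameters per task: {per_task:,}")
print(f"full fine-tuning, {num_tasks} tasks: {full_fine_tuning_total:,}")
print(f"adapters, {num_tasks} tasks:         {adapter_total:,}")
```

Under these assumptions each extra task costs about 2% of the backbone size instead of a full copy of the model, which is the sense in which one shared PTM can serve many tasks.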



(5) Interpretability and Reliability of PTMs

>> Transformer architectures are hard to interpret, and PTMs are vulnerable to adversarial attacks (which adversarial defenses aim to counter).

Although PTMs reach impressive performance, their deep non-linear architecture makes the procedure of decision-making highly non-transparent.

Recently, explainable artificial intelligence (XAI) [249] has become a hotspot in the general AI community. Unlike CNNs for images, PTMs are harder to interpret due to the complexities of both the Transformer-like architecture and language. Extensive efforts (see Section 3.3) have been made to analyze the linguistic and world knowledge included in PTMs, which help us understand these PTMs with some degree of transparency. However, much work on model analysis depends on the attention mechanism, and the effectiveness of attention for interpretability is still controversial [250, 251].

Besides, PTMs are also vulnerable to adversarial attacks (see Section 7.7). With the extensive use of PTMs in production systems, their reliability is becoming an issue of great concern. Studies of adversarial attacks against PTMs help us understand their capabilities by fully exposing their vulnerabilities. Adversarial defenses for PTMs are also promising: they improve the robustness of PTMs and make them immune to adversarial attacks.

Overall, since PTMs are key components in many NLP applications, their interpretability and reliability remain to be explored further in many respects. Such work would help us understand how PTMs work and provide a guide for better usage and further improvement.



9 Conclusion

In this survey, we conduct a comprehensive overview of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaptation approaches, related resources, and applications. Based on current PTMs, we propose a new taxonomy of PTMs from four different perspectives. We also suggest several possible future research directions for PTMs.



We thank Zhiyuan Liu, Wanxiang Che, Minlie Huang, Danqing Wang and Luyao Huang for their valuable feedback on this manuscript. This work was supported by the National Natural Science Foundation of China (No. 61751201 and 61672162), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab.



