We also propose a new pre-trained model called MacBERT, which replaces the original MLM task with the MLM as correction (Mac) task and mitigates the discrepancy between the pre-training and fine-tuning stages.
The contributions of this paper are listed as follows.
BERT consists of two pre-training tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
MLM: Randomly masks some of the tokens in the input; the objective is to predict the original words based only on their context (a minimal sketch of this masking follows the two task descriptions).
NSP: Predicts whether sentence B is the next sentence of sentence A.
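As a hedged illustration of the MLM masking rule above, here is a minimal Python sketch of the standard 15% / 80-10-10 scheme; the token list and vocabulary are placeholders, not code from BERT itself.

```python
import random

# Minimal sketch of the standard BERT MLM masking rule (placeholder code,
# not BERT's implementation): 15% of tokens are selected; of those,
# 80% become [MASK], 10% become a random token, 10% stay unchanged.
def mlm_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mlm_mask(["the", "cat", "sat", "on", "the", "mat"],
               vocab=["dog", "ran", "tree", "blue"]))
```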
Later, they further proposed a technique called Whole Word Masking (WWM) to optimize the original masking in the MLM task. In this setting, instead of randomly selecting WordPiece tokens to mask, all of the tokens corresponding to a whole word are masked at once.
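To make the contrast with random WordPiece masking concrete, the sketch below groups pieces into whole words using the standard "##" continuation convention and masks every piece of a selected word together; it is an illustration under that assumption, not the authors' implementation.

```python
import random

# Sketch of whole word masking: group WordPiece tokens into words
# ("##" marks a continuation piece), then mask whole words at once.
def wwm_mask(pieces, mask_prob=0.15, mask_token="[MASK]"):
    groups = []                     # each group is a list of piece indices forming one word
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)    # continuation of the previous word
        else:
            groups.append([i])      # start of a new word
    masked = list(pieces)
    for group in groups:
        if random.random() < mask_prob:
            for i in group:         # mask every piece of the selected word
                masked[i] = mask_token
    return masked

# "philammon" -> ["phil", "##am", "##mon"]: either all three pieces
# are masked together or none of them is.
print(wwm_mask(["phil", "##am", "##mon", "sang", "well"], mask_prob=0.5))
```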
Enhanced Representation through kNowledge IntEgration (ERNIE) is designed to optimize the masking process of BERT, adding entity-level masking and phrase-level masking. Unlike selecting random words from the input, entity-level masking masks named entities, which are often composed of several words. Phrase-level masking masks consecutive words, which is similar to the N-gram masking strategy.
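Phrase-level (N-gram-style) masking can be sketched roughly as follows; the span-length choice here is a placeholder, not ERNIE's exact sampling procedure.

```python
import random

# Rough sketch of phrase-level / N-gram masking over a word sequence:
# pick a random span length and mask that many consecutive words.
def ngram_mask(words, max_n=4, mask_token="[MASK]"):
    n = random.randint(1, max_n)                      # span length (placeholder distribution)
    start = random.randrange(0, max(1, len(words) - n + 1))
    masked = list(words)
    for i in range(start, min(start + n, len(words))):
        masked[i] = mask_token                        # mask the whole consecutive span
    return masked

print(ngram_mask(["new", "york", "is", "a", "big", "city"]))
```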
To alleviate this problem (the discrepancy between BERT's pre-training and fine-tuning stages), they proposed XLNet, which is based on Transformer-XL. XLNet mainly makes two modifications. The first is to maximize the expected likelihood over all permutations of the factorization order of the input, which they call the Permutation Language Model (PLM). The other is to change the autoencoding language model into an autoregressive one, similar to traditional statistical language models.
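For reference, the permutation language modeling objective roughly takes the form below, with $\mathcal{Z}_T$ the set of all permutations of a length-$T$ index sequence and $z_t$, $\mathbf{z}_{<t}$ the $t$-th element and the preceding elements of a sampled order $\mathbf{z}$ (notation paraphrased from the XLNet paper):

$$\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$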
We use a traditional Chinese Word Segmentation (CWS) tool to split the text into words. In this way, we can adopt whole word masking in Chinese and mask whole words instead of individual Chinese characters. MacBERT keeps the same pre-training tasks as BERT, with several modifications. We use LTP (Che et al., 2010) for Chinese word segmentation to identify word boundaries. Note that whole word masking only affects the selection of the tokens to mask in the pre-training stage; the input of BERT still uses the WordPiece tokenizer to split the text, which is identical to the original BERT.
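To show how CWS boundaries and character-level inputs interact, here is a hedged sketch: `segment()` stands in for an LTP-style segmenter (a placeholder, not LTP's real API), and masking decisions are made per CWS word while the input stays at the character level.

```python
import random

def segment(text):
    # Placeholder for an LTP-style CWS segmenter (not LTP's real API);
    # for this demo it simply returns a fixed segmentation of the example sentence.
    return ["使用", "语言", "模型", "来", "预测"]

# Whole word masking for Chinese: CWS word boundaries decide which characters
# are masked together, while the model input itself stays character-level.
def chinese_wwm(text, mask_prob=0.3, mask_token="[MASK]"):
    masked = []
    for word in segment(text):
        chars = list(word)                            # character-level (WordPiece) units
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(chars))  # mask every character of the word
        else:
            masked.extend(chars)
    return masked

print(chinese_wwm("使用语言模型来预测"))
```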
For the MLM task, we perform the following modifications.
In summary, MacBERT builds on the various improved versions of BERT and keeps refining the training strategy: (1) for the MLM task, it uses whole word masking and N-gram masking, and replaces the masked words with similar words instead of [MASK] tokens, which reduces the gap between pre-training and fine-tuning; (2) it adopts ALBERT's SOP task. On downstream tasks, it outperforms the earlier BERT variants.
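Below is a minimal sketch of the "MLM as correction" idea summarized above: selected words are replaced with similar words rather than [MASK], so pre-training inputs resemble fine-tuning inputs; `similar_word()` is a hypothetical lookup (for example, a nearest neighbour in a word-embedding space), not the authors' toolchain.

```python
import random

def similar_word(word):
    # Hypothetical similar-word lookup (e.g., a nearest neighbour in a word
    # embedding space); a stand-in, not the toolchain used in the paper.
    table = {"语言": "文字", "模型": "框架", "预测": "推断"}
    return table.get(word, word)

# Sketch of the Mac (MLM as correction) idea: selected words are replaced with
# similar words instead of [MASK]; the label is still the original word, so the
# model learns to "correct" the corrupted input back to the original text.
def mac_corrupt(words, replace_prob=0.3):
    inputs, labels = [], []
    for w in words:
        if random.random() < replace_prob:
            inputs.append(similar_word(w))   # no [MASK] token ever appears in the input
            labels.append(w)                 # target: recover the original word
        else:
            inputs.append(w)
            labels.append(None)              # position not used in the loss
    return inputs, labels

print(mac_corrupt(["使用", "语言", "模型", "来", "预测"]))
```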