NLP: RoBERTa - A Detailed Guide to Its Introduction, Installation, Usage, and Example Applications
Table of Contents
Translation and Interpretation of "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
Finetuning on the Winograd Schema Challenge (WSC)
Address | |
Date | July 26, 2019 |
Authors | Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov (Facebook AI) |
Summary | The paper proposes an optimized recipe for BERT pretraining, called RoBERTa.

Background and pain points:
>> The design choices and training strategies behind BERT pretraining had not been studied thoroughly, which may have limited its performance.
>> When a new training objective or model architecture is introduced, its actual contribution is hard to judge, because different models are trained under different conditions (e.g., data size, configuration).
>> Dataset size and diversity are also important factors in pretraining, but earlier work did not examine them sufficiently.

Proposed solution:
>> Analyze BERT's pretraining procedure in detail, adjusting the masked language model objective, input format, batch size, and other strategies.
>> Collect a large open corpus, CC-NEWS, to control for the effect of the training data.
>> Adopt improvements such as dynamic masking, full-sentence inputs without the next sentence prediction objective, and large-batch pretraining.
>> Scale up pretraining, training for longer on roughly 160GB of text.

Key characteristics:
>> With the training data held fixed, RoBERTa outperforms the original BERT and subsequent work.
>> After removing the next sentence prediction task, the masked language model objective remains highly competitive.
>> The training strategy is relatively simple and efficient, yet performance improves substantially by enlarging the dataset and training longer.

Advantages:
>> Achieves state-of-the-art results on the GLUE, SQuAD, and RACE language-understanding benchmarks, ranking first on some tasks.
>> Pretraining is more stable: the median result over five runs exceeds BERT.
>> The open-source implementation is released, improving reproducibility in this area.
>> The detailed study of design choices yields a deeper understanding of model design. |
Abstract | Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code. |
Conclusion | We carefully evaluate a number of design decisions when pretraining BERT models. We find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. Our improved pretraining procedure, which we call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results illustrate the importance of these previously overlooked design decisions and suggest that BERT's pretraining objective remains competitive with recently proposed alternatives. We additionally use a novel dataset, CC-NEWS, and release our models and code for pretraining and finetuning at: https://github.com/pytorch/fairseq. |
RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches, over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the paper for details.
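To make the dynamic-masking change concrete, here is a minimal illustrative sketch (not fairseq's actual implementation): rather than fixing each sequence's mask once during preprocessing, as the original BERT did, a fresh mask is sampled each time a sequence is fed to the model. The 15% selection rate and 80/10/10 replacement split follow the BERT/RoBERTa papers; the function name, toy tensor shapes and special-token IDs below are assumptions made purely for illustration.

import torch

def dynamic_mask(token_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Sample a fresh MLM mask for one batch (illustrative sketch, not fairseq's code).

    Of the ~15% selected positions: 80% become <mask>, 10% a random token, 10% are kept.
    """
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mlm_prob   # choose ~15% of positions to predict
    labels[~selected] = -100                             # ignore non-selected positions in the loss

    corrupted = token_ids.clone()
    rand = torch.rand(token_ids.shape)
    corrupted[selected & (rand < 0.8)] = mask_token_id   # 80% -> <mask>
    random_ids = torch.randint(vocab_size, token_ids.shape)
    replace = selected & (rand >= 0.8) & (rand < 0.9)    # 10% -> random token
    corrupted[replace] = random_ids[replace]
    # the remaining 10% of selected positions keep their original token
    return corrupted, labels

# Because the mask is resampled on every call, each pass over the same sentence
# sees a different masking pattern; that is the "dynamic masking" idea.
batch = torch.randint(5, 1000, (2, 16))                  # toy token IDs
inputs, labels = dynamic_mask(batch, mask_token_id=4, vocab_size=1000)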
GitHub: https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md
December 2020: German model (GottBERT) is available: GottBERT.
January 2020: Italian model (UmBERTo) from Musixmatch Research is available: UmBERTo.
November 2019: French model (CamemBERT) is available: CamemBERT.
November 2019: Multilingual encoder (XLM-RoBERTa) is available: XLM-R.
September 2019: TensorFlow and TPU support via the transformers library.
August 2019: RoBERTa is now supported in the pytorch-transformers library (see the sketch after this list).
August 2019: Added a tutorial for finetuning on WinoGrande.
August 2019: Added a tutorial for pretraining RoBERTa using your own data.
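As a complement to the fairseq hub interface used in the rest of this guide, RoBERTa can also be loaded through the Hugging Face transformers library mentioned in the changelog above. The following is a minimal sketch, assuming transformers and PyTorch are installed; 'roberta-base' is the standard checkpoint name on the Hugging Face hub.

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
model.eval()  # disable dropout for inference

inputs = tokenizer('Hello world!', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 5, 768]): 5 BPE tokens, hidden size 768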
Loading RoBERTa

Load RoBERTa from torch.hub (PyTorch >= 1.1):

import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)

Load RoBERTa (for PyTorch 1.0 or custom models):

# Download the roberta.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
Byte-Pair Encoding

Apply Byte-Pair Encoding (BPE) to input text:

tokens = roberta.encode('Hello world!')
assert tokens.tolist() == [0, 31414, 232, 328, 2]
roberta.decode(tokens)  # 'Hello world!'
Extracting features

Extract features from RoBERTa:

# Extract the last layer's features
last_layer_features = roberta.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 5, 1024])

# Extract features from all layers (layer 0 is the embedding layer)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
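A common follow-up (not part of the official README) is to collapse these per-token features into a single sentence vector, for example by taking the feature of the first token <s>, which plays a role similar to BERT's [CLS], or by mean-pooling over positions; which works better depends on the downstream task. A short sketch using the tensor above:

# Sentence-level vector from the <s> token (position 0) of the last layer
sentence_vector = last_layer_features[0, 0, :]    # shape: [1024] for roberta.large
# Alternative: mean-pool over all token positions
mean_vector = last_layer_features[0].mean(dim=0)  # shape: [1024]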
Sentence-pair classification

Use RoBERTa for sentence-pair classification tasks:

# Download RoBERTa already finetuned on MNLI
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()  # disable dropout for evaluation

# Encode a pair of sentences and make a prediction
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

# Encode another pair of sentences
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax()  # 2: entailment
Registering a new classification head

Register a new (randomly initialized) classification head:

roberta.register_classification_head('new_task', num_classes=3)
logprobs = roberta.predict('new_task', tokens)
# tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
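A freshly registered head is randomly initialized, so it has to be finetuned before its predictions mean anything. Below is a minimal, illustrative sketch of a single optimization step on one toy example; it assumes the hub interface behaves as a standard PyTorch nn.Module (so parameters() and train() are available), and the label, optimizer and learning rate are hypothetical placeholders rather than the recipe from the official finetuning tutorials.

import torch
import torch.nn.functional as F

roberta.train()  # enable dropout for finetuning
optimizer = torch.optim.Adam(roberta.parameters(), lr=1e-5)  # hypothetical optimizer and learning rate

# One toy labeled example for the 3-class 'new_task' head
tokens = roberta.encode('RoBERTa tweaks the BERT pretraining recipe.', 'It changes how BERT is pretrained.')
target = torch.tensor([0])  # hypothetical gold class index

optimizer.zero_grad()
logprobs = roberta.predict('new_task', tokens)  # log-probabilities, shape [1, 3]
loss = F.nll_loss(logprobs, target)             # predict() returns log-probs, so NLL loss applies
loss.backward()
optimizer.step()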
Batched prediction

Batched prediction:

import torch
from fairseq.data.data_utils import collate_tokens

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()

batch_of_pairs = [
    ['Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.'],
    ['Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.'],
    ['potatoes are awesome.', 'I like to run.'],
    ['Mars is very far from earth.', 'Mars is very close.'],
]

batch = collate_tokens(
    [roberta.encode(pair[0], pair[1]) for pair in batch_of_pairs], pad_idx=1
)

logprobs = roberta.predict('mnli', batch)
print(logprobs.argmax(dim=1))
# tensor([0, 2, 1, 0])
Using the GPU

Using the GPU:

roberta.cuda()
roberta.predict('new_task', tokens)
# tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
Filling masks

RoBERTa can be used to fill <mask> tokens in the input. Some examples from the Natural Questions dataset:

roberta.fill_mask('The first Star wars movie came out in <mask>', topk=3)
# [('The first Star wars movie came out in 1977', 0.9504708051681519, ' 1977'), ('The first Star wars movie came out in 1978', 0.009986862540245056, ' 1978'), ('The first Star wars movie came out in 1979', 0.009574787691235542, ' 1979')]

roberta.fill_mask('Vikram samvat calender is official in <mask>', topk=3)
# [('Vikram samvat calender is official in India', 0.21878819167613983, ' India'), ('Vikram samvat calender is official in Delhi', 0.08547237515449524, ' Delhi'), ('Vikram samvat calender is official in Gujarat', 0.07556215673685074, ' Gujarat')]

roberta.fill_mask('<mask> is the common currency of the European Union', topk=3)
# [('Euro is the common currency of the European Union', 0.9456493854522705, 'Euro'), ('euro is the common currency of the European Union', 0.025748178362846375, 'euro'), ('€ is the common currency of the European Union', 0.011183084920048714, '€')]
Pronoun disambiguation (Winograd Schema Challenge)

RoBERTa can be used to disambiguate pronouns. First install spaCy and download the English language model:

pip install spacy
python -m spacy download en_core_web_lg

Then load the roberta.large.wsc model and call the disambiguate_pronoun function. The pronoun should be surrounded by square brackets ([]) and the query referent surrounded by underscores (_), or left blank to return the predicted candidate text directly:

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.wsc', user_dir='examples/roberta/wsc')
roberta.cuda()  # use the GPU (optional)

roberta.disambiguate_pronoun('The _trophy_ would not fit in the brown suitcase because [it] was too big.')
# True
roberta.disambiguate_pronoun('The trophy would not fit in the brown _suitcase_ because [it] was too big.')
# False

roberta.disambiguate_pronoun('The city councilmen refused the demonstrators a permit because [they] feared violence.')
# 'The city councilmen'
roberta.disambiguate_pronoun('The city councilmen refused the demonstrators a permit because [they] advocated violence.')
# 'demonstrators'

See the RoBERTa Winograd Schema Challenge (WSC) README for details on how this model was trained.
Extracting features aligned to words

By default, RoBERTa outputs one feature vector per BPE token. You can realign the features to match spaCy's word-level tokenization with the extract_features_aligned_to_words method, which exposes them via spaCy's Token.vector attribute:

doc = roberta.extract_features_aligned_to_words('I said, "hello RoBERTa."')
assert len(doc) == 10
for tok in doc:
    print('{:10}{} (...)'.format(str(tok), tok.vector[:5]))
# <s>       tensor([-0.1316, -0.0386, -0.0832, -0.0477,  0.1943], grad_fn=<SliceBackward>) (...)
# I         tensor([ 0.0559,  0.1541, -0.4832,  0.0880,  0.0120], grad_fn=<SliceBackward>) (...)
# said      tensor([-0.1565, -0.0069, -0.8915,  0.0501, -0.0647], grad_fn=<SliceBackward>) (...)
# ,         tensor([-0.1318, -0.0387, -0.0834, -0.0477,  0.1944], grad_fn=<SliceBackward>) (...)
# "         tensor([-0.0486,  0.1818, -0.3946, -0.0553,  0.0981], grad_fn=<SliceBackward>) (...)
# hello     tensor([ 0.0079,  0.1799, -0.6204, -0.0777, -0.0923], grad_fn=<SliceBackward>) (...)
# RoBERTa   tensor([-0.2339, -0.1184, -0.7343, -0.0492,  0.5829], grad_fn=<SliceBackward>) (...)
# .         tensor([-0.1341, -0.1203, -0.1012, -0.0621,  0.1892], grad_fn=<SliceBackward>) (...)
# "         tensor([-0.1341, -0.1203, -0.1012, -0.0621,  0.1892], grad_fn=<SliceBackward>) (...)
# </s>      tensor([-0.0930, -0.0392, -0.0821,  0.0158,  0.0649], grad_fn=<SliceBackward>) (...)
Evaluating the model

Evaluate the roberta.large.mnli model. Example Python code snippet to evaluate accuracy on the MNLI dev_matched set:

label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('glue_data/MNLI/dev_matched.tsv') as fin:
    fin.readline()
    for index, line in enumerate(fin):
        tokens = line.strip().split('\t')
        sent1, sent2, target = tokens[8], tokens[9], tokens[-1]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('mnli', tokens).argmax().item()
        prediction_label = label_map[prediction]
        ncorrect += int(prediction_label == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
Finetuning on GLUE tasks: https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.glue.md
Finetuning on the Winograd Schema Challenge (WSC): https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/wsc/README.md
Finetuning on Commonsense QA: https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/commonsense_qa/README.md
Pretraining with your own data: https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.pretraining.md
Updates in progress...