Task06 BERT Applications
This study follows the Datawhale open-source course: https://github.com/datawhalechina/learn-nlp-with-transformers
The content is largely based on the original material, with some adjustments to fit my own learning path.
Personal summary: 1. BERT pre-training consists of Masked Language Model (MLM, which trains the model to understand a word's meaning from its context) and Next Sentence Prediction (NSP, which trains the model to understand and predict the relationship between sentences). 2. Fine-tuning covers sentence classification, multiple choice, token classification, question answering, and other tasks.
Based on the PyTorch BERT code in Transformers version 4.4.2 (released March 19, 2021), this article analyzes the code structure, the concrete implementation and its principles, and practical usage, covering:
Solving NLP tasks with BERT
BERT training and optimization
First, all of the models below are built on the abstract base class BertPreTrainedModel. Its job is to initialize model weights and to maintain a few class attributes, inherited from PreTrainedModel, that identify the model or are used when loading it.
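For intuition, the weight initialization amounts to something like the sketch below (a simplification of what _init_weights does, not the verbatim source; init_bert_weights is a hypothetical helper of mine, and 0.02 is the usual default of config.initializer_range):

import torch.nn as nn

def init_bert_weights(module, initializer_range=0.02):
    # Linear and embedding weights are drawn from N(0, initializer_range).
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=initializer_range)
    # Linear biases are zeroed.
    if isinstance(module, nn.Linear) and module.bias is not None:
        module.bias.data.zero_()
    # LayerNorm is reset to the identity transform.
    if isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)

# Usage: apply it recursively to every submodule, which is what init_weights() does internally.
# model.apply(init_bert_weights)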
Let's start the analysis with the pre-training model.
BERT pre-training consists of two tasks:
Masked Language Model (MLM): randomly replace part of the words in a sentence with [MASK], feed the sentence through BERT to encode every token, and finally use the encoding at each [MASK] position to predict the correct word there. This task trains the model to understand a word's meaning from its context.
Next Sentence Prediction (NSP): feed a sentence pair A and B into BERT and use the encoding of [CLS] to predict whether B is the sentence that follows A. This task trains the model to understand and predict the relationship between sentences.
In the code, the model that combines these two tasks is BertForPreTraining, which contains two components:
class BertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config)

        self.init_weights()
    # ...
BertModel was covered in detail in the previous chapter (note that the default add_pooling_layer=True is kept here, i.e. the output corresponding to [CLS] is extracted for the NSP task), while BertPreTrainingHeads is the module in charge of the predictions for both tasks:
class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score
Yet another layer of wrapping: BertPreTrainingHeads bundles a BertLMPredictionHead together with a linear layer representing the NSP task. The NSP part is not wrapped in a BertXXXPredictionHead of its own here.
(Such a class actually exists; it is called BertOnlyNSPHead, but it is not needed here.)
Digging further into BertPreTrainingHeads, here is BertLMPredictionHead:
class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = BertPredictionHeadTransform(config)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        self.bias = nn.Parameter(torch.zeros(config.vocab_size))

        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states
This class turns the output at each [MASK] position into a classification over the whole vocabulary (each word is a class). Note the transform applied first:
class BertPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        if isinstance(config.hidden_act, str):
            self.transform_act_fn = ACT2FN[config.hidden_act]
        else:
            self.transform_act_fn = config.hidden_act
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states
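As the comment in BertLMPredictionHead hints, the decoder's weight is tied to the input word-embedding matrix when the model is built, so no separate vocab_size x hidden_size output matrix is learned. A quick check (an illustrative snippet of my own, not part of the source):

from transformers import BertForPreTraining

model = BertForPreTraining.from_pretrained("bert-base-uncased")
emb = model.bert.embeddings.word_embeddings.weight
dec = model.cls.predictions.decoder.weight
print(emb.shape, dec.shape)               # both [30522, 768] for bert-base-uncased
print(emb.data_ptr() == dec.data_ptr())   # True: the two parameters share storage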
Back in BertForPreTraining, let's look at how the two losses are handled. Its forward pass differs from BertModel's in that it takes two extra inputs, labels and next_sentence_label:
labels: shape [batch_size, seq_length], the labels for the MLM task. Note that positions that were not masked are set to -100 and only the masked positions carry their token ids, which is the reverse of how the masking is applied to the input.
next_sentence_label: simply a 0/1 binary classification label.
# ...
def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    labels=None,
    next_sentence_label=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
): ...
Next, here is how the two losses are combined:
# ...
total_loss = None
if labels is not None and next_sentence_label is not None:
    loss_fct = CrossEntropyLoss()
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
    next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
    total_loss = masked_lm_loss + next_sentence_loss
# ...
They are simply added together; it is as plain as that.
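The -100 convention in labels works because CrossEntropyLoss ignores targets equal to -100 by default (ignore_index=-100), so unmasked positions contribute nothing to the MLM loss. A minimal sketch of building such labels (my own illustration; the toy token ids and mask positions are made up, and 103 is the [MASK] id in bert-base-uncased):

import torch
from torch.nn import CrossEntropyLoss

vocab_size = 30522
input_ids = torch.tensor([[101, 2023, 2003, 1037, 7592, 102]])                # toy token ids
mask_positions = torch.tensor([[False, False, True, False, True, False]])     # assume 2 tokens are chosen

labels = input_ids.clone()
labels[~mask_positions] = -100           # unmasked positions are ignored by the loss

masked_input_ids = input_ids.clone()
masked_input_ids[mask_positions] = 103   # replace chosen tokens with [MASK]

# With random logits, only the 2 masked positions contribute to the loss:
prediction_scores = torch.randn(1, 6, vocab_size)
loss = CrossEntropyLoss()(prediction_scores.view(-1, vocab_size), labels.view(-1))
print(loss)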
Of course, this codebase also includes BERT models pre-trained on only one of the two objectives (details not expanded here):
from transformers import BertTokenizer, BertForPreTraining
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
prediction_logits = outputs.prediction_logits
seq_relationship_logits = outputs.seq_relationship_logits
Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import BertTokenizer, BertLMHeadModel, BertConfig
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained("bert-base-uncased")
config.is_decoder = True
model = BertLMHeadModel.from_pretrained('bert-base-uncased', config=config)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
prediction_logits = outputs.logits
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
next_sentence = "The sky is blue due to the shorter wavelength of blue light."
encoding = tokenizer(prompt, next_sentence, return_tensors='pt')
outputs = model(**encoding, labels=torch.LongTensor([1]))
logits = outputs.logits
assert logits[0, 0] < logits[0, 1] # next sentence was random
Downloading: 100%|██████████| 440M/440M [00:30<00:00, 14.5MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
What follows are the various fine-tuning models, almost all of which are classification tasks.
BertForSequenceClassification is used for sentence classification (or regression) tasks, such as those in the GLUE benchmark.
Structurally it is very simple: a BertModel (with pooling) followed by a dropout and a linear layer that produces the classification output:
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()
    # ...
In the forward pass, a labels input has to be provided, just as with the pre-training model above.
If the model is initialized with num_labels=1, the task is treated as regression and MSELoss is used;
otherwise it is treated as classification.
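The loss selection in forward boils down to a branch like the following (a simplified sketch of the idea, not the verbatim 4.4.2 source; the helper name is mine):

from torch.nn import CrossEntropyLoss, MSELoss

def sequence_classification_loss(logits, labels, num_labels):
    if num_labels == 1:
        # Regression: compare the single logit with the float target.
        return MSELoss()(logits.view(-1), labels.view(-1))
    # Classification: standard cross entropy over num_labels classes.
    return CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))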
from transformers.models.bert.tokenization_bert import BertTokenizer
from transformers.models.bert.modeling_bert import BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = BertForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
Downloading: 100%|██████████| 213k/213k [00:00<00:00, 596kB/s]
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 12.4kB/s]
Downloading: 100%|██████████| 436k/436k [00:00<00:00, 808kB/s]
Downloading: 100%|██████████| 433/433 [00:00<00:00, 166kB/s]
Downloading: 100%|██████████| 433M/433M [00:29<00:00, 14.5MB/s]
not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%
BertForMultipleChoice is used for multiple-choice tasks such as RocStories/SWAG.
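A hedged usage sketch in the same spirit as the examples above (note that the multiple-choice head on top of bert-base-uncased is randomly initialized here, so the scores are only meaningful after fine-tuning); the key point is that the inputs carry an extra num_choices dimension:

from transformers import BertTokenizer, BertForMultipleChoice
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

prompt = "In Italy, pizza served in formal settings is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

# Encode the prompt against every choice, then add the num_choices dimension:
# each tensor ends up with shape [batch_size, num_choices, seq_length].
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
labels = torch.tensor(0).unsqueeze(0)  # choice0 is the "correct" one in this toy example

outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)
logits = outputs.logits                # shape [1, 2]: one score per choice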
BertForTokenClassification is used for sequence labeling (token classification) tasks such as NER.
One detail worth noting: the class attribute _keys_to_ignore_on_load_unexpected is set to [r"pooler"], so that no complaint is raised about these unneeded weights when loading the model.

from transformers import BertForTokenClassification, BertTokenizer
import torch

model = BertForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)
Downloading: 100%|██████████| 998/998 [00:00<00:00, 382kB/s]
Downloading: 100%|██████████| 1.33G/1.33G [01:30<00:00, 14.7MB/s]
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
BertForQuestionAnswering is used for question-answering tasks such as SQuAD.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
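The snippet above only loads the SQuAD-fine-tuned checkpoint. A hedged sketch of how it is typically used end to end follows (the passage and question are my own toy examples, not the ones from the original text):

# Continuing from the tokenizer/model loaded above.
text = "Hugging Face Inc. is a company based in New York City."
question = "Where is Hugging Face based?"

inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
outputs = model(**inputs)

# The model predicts start/end logits over the input tokens; take the best span.
answer_start = int(torch.argmax(outputs.start_logits))
answer_end = int(torch.argmax(outputs.end_logits)) + 1

input_ids = inputs["input_ids"][0].tolist()
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)
print(answer)  # expected to be something like "new york city"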
text = "声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/花生_TL007/article/detail/200350
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。