
Inputs (one sample or many) and outputs of BertForSequenceClassification in PyTorch BERT inference code


The code is adapted from huggingface (please contact me for removal if this infringes). Link: https://github.com/huggingface/ — note that the library has changed a lot in later releases.

Judging from the source code pasted at the end, for inference the input consists of three tensors: 1. input_ids, 2. token_type_ids, 3. attention_mask. For training, labels are passed in as well.
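In other words, the calling convention looks roughly like the sketch below (the argument order follows the forward signature in the source excerpt at the end of this post; the tensors themselves are built later in this article):

# inference: pass only the three feature tensors, the logits come back first
outputs = model(input_ids, token_type_ids, attention_mask)
logits = outputs[0]                    # shape (batch_size, num_labels)

# training: also pass labels; the loss is then prepended to the outputs
outputs = model(input_ids, token_type_ids, attention_mask, labels)
loss, logits = outputs[:2]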

Code for building the three features. The parameter example is a list of texts: [sentence, sentence, sentence, ...]; here each sentence is text_a + text_b, with no sentence-pair handling. max_seq_length is the maximum sequence length. tokenizer comes from tokenizer = BertTokenizer.from_pretrained(WORK_DIR), where WORK_DIR is the directory holding the BERT model's vocab.txt, used to load the tokenizer.

import numpy as np
from tqdm import tqdm

def convert_lines(example, max_seq_length, tokenizer):
    max_seq_length -= 2  # reserve two positions for [CLS] and [SEP]
    all_tokens = []
    all_segments = []
    all_masks = []
    longer = 0  # number of texts that had to be truncated
    for text in tqdm(example):
        tokens_a = tokenizer.tokenize(text)
        if len(tokens_a) > max_seq_length:
            tokens_a = tokens_a[:max_seq_length]
            longer += 1
        # input_ids: [CLS] + tokens + [SEP], zero-padded to the full length
        one_token = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens_a + ["[SEP]"]) + \
                    [0] * (max_seq_length - len(tokens_a))
        # token_type_ids: all zeros, since there is only a single segment
        one_segment = [0] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
        # attention_mask: 1 for real tokens, 0 for padding
        one_mask = [1] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
        all_tokens.append(one_token)
        all_segments.append(one_segment)
        all_masks.append(one_mask)
    print(longer)
    return np.array(all_tokens), np.array(all_segments), np.array(all_masks)

This version handles the case of many samples. The function returns three numpy arrays: all_tokens holds the input_ids of every sentence and is a 2D array with n rows and max_seq_length columns; the other two arrays have the same shape.
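A quick sanity check of the output shapes (a sketch; texts stands for any list of strings, and the tokenizer is loaded as described above):

texts = ["first document text", "second document text"]  # illustrative inputs
test_tokens, test_segments, test_masks = convert_lines(texts, 512, tokenizer)
print(test_tokens.shape)    # (2, 512): one row of input_ids per sentence
print(test_segments.shape)  # (2, 512)
print(test_masks.shape)     # (2, 512)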

Part of the main function:

import pandas as pd
import torch
# BertConfig and BertTokenizer come from the huggingface BERT package
# (pytorch_transformers / transformers, depending on the version installed)

if __name__ == '__main__':
    test = pd.read_csv("./dataset/3_abstracts.csv", encoding='utf-8')
    test['NAME'] = test['NAME'].fillna("无")
    test['CONTENT'] = test['CONTENT'].fillna("无")
    test['title_content'] = test['NAME'] + test['CONTENT']
    seed_everything()
    ####### config
    device = torch.device('cuda')
    WORK_DIR = "./bert_pretrain/"
    # three classes in my case
    bert_config = BertConfig.from_pretrained(WORK_DIR + 'bert_config.json', num_labels=3)
    tokenizer = BertTokenizer.from_pretrained(WORK_DIR)
    MAX_SEQUENCE_LENGTH = 512
    test_tokens, test_segments, test_masks = convert_lines(test["title_content"], MAX_SEQUENCE_LENGTH, tokenizer)
    # wrap the 2D arrays into tensors of the form tensor([[id, id, ...], [id, ...]]) and collect them in test_features
    test_features = [
        torch.tensor(test_tokens, dtype=torch.long),
        torch.tensor(test_segments, dtype=torch.long),
        torch.tensor(test_masks, dtype=torch.long)
    ]
    # standard PyTorch usage: pack the tensors into a TensorDataset for the DataLoader later on
    test_dataset = torch.utils.data.TensorDataset(*test_features)
    # call my prediction function to predict the labels
    test_preds = test_model(test_dataset)

The prediction function:

from torch.utils.data import DataLoader, SequentialSampler
from tqdm import tqdm_notebook

def test_model(test_dataset):
    WORK_DIR = "./bert_pretrain/"
    output_model_file = WORK_DIR + '423_model.bin'  # the model fine-tuned by myself
    model = BertForSequenceClassification.from_pretrained(WORK_DIR, config=bert_config)
    model.load_state_dict(torch.load(output_model_file))
    model.to(device)
    model.eval()
    # for param in model.parameters():
    #     param.requires_grad = False
    test_preds = np.zeros((len(test_dataset), 3))
    # SequentialSampler iterates over the test data in order; RandomSampler would sample it randomly
    test_sampler = SequentialSampler(test_dataset)
    # build the loader with batch_size 4 (any value works): the three feature tensors are consumed four rows at a time
    test_loader = DataLoader(test_dataset, sampler=test_sampler, batch_size=4)
    # the number of batches is the total number of samples divided by 4
    tk0 = tqdm_notebook(test_loader)
    # x_batch1 is a tensor of the form tensor([[id, id, ...], ...]); so are the other two.
    # This shows that the model expects 2D tensors; the source has num_choices = input_ids.shape[1],
    # which seems to be the length of the second dimension (512 here).
    for i, (x_batch1, x_batch2, x_batch3) in enumerate(tk0):
        # the batch must be moved to the GPU, otherwise an error is raised
        pred = model(x_batch1.to(device), x_batch2.to(device), x_batch3.to(device))
        test_preds[i * 4:(i + 1) * 4] = pred[0].detach().cpu().numpy()
    return test_preds

The prediction result is a tuple; the later elements are other model outputs (something CUDA-related), so just take the first element, pred[0], and convert it to a CPU numpy array. For my three-class task the result looks like [[float, float, float], [...], ..., [...]]. A for loop that takes the index of the largest of the three floats gives the final label.

predict = []
for prediction in test_preds:  # one prediction row per sample
    pred_label = np.argmax(prediction)
    predict.append(pred_label)
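The same thing can be done in a single numpy call (it returns an array instead of a list, but the labels are the same):

predict = np.argmax(test_preds, axis=1)  # index of the largest score in each row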

Below are the changes needed when only a single sample is fed in, e.g. when deploying the model behind an API:

Feature extraction now only needs to produce the three features, each as a single flat list: one_token = [id, id, id, ...], one_segment = [seg, seg, ...], one_mask = [...].

def convert_lines(example, max_seq_length, tokenizer):
    # example is now a single string instead of a list of strings
    max_seq_length -= 2
    longer = 0
    tokens_a = tokenizer.tokenize(example)
    if len(tokens_a) > max_seq_length:
        tokens_a = tokens_a[:max_seq_length]
        longer += 1
    one_token = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens_a + ["[SEP]"]) + \
                [0] * (max_seq_length - len(tokens_a))
    one_segment = [0] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
    one_mask = [1] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
    return one_token, one_segment, one_mask
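Called on one piece of text it looks roughly like this (a sketch; the input string is made up, the variable names match the next snippet):

text = "some title plus abstract text"  # hypothetical single input, e.g. NAME + CONTENT concatenated
test_token, test_segment, test_mask = convert_lines(text, 512, tokenizer)
print(len(test_token))  # 512 -- still a plain Python list, not yet a tensor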

Because the model expects a 2D tensor, wrap each list in an extra pair of brackets when converting it to a tensor. Alternatively, unsqueeze(0) can be used (shown after the snippet below).

test_token = torch.tensor([test_token])
test_segment = torch.tensor([test_segment])
test_mask = torch.tensor([test_mask])
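The unsqueeze(0) variant mentioned above gives the same result, a tensor of shape (1, max_seq_length):

test_token = torch.tensor(test_token).unsqueeze(0)
test_segment = torch.tensor(test_segment).unsqueeze(0)
test_mask = torch.tensor(test_mask).unsqueeze(0)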

Then at inference time the result is obtained as follows. This is not the full code; loading the model and so on is much the same as above.

pred = model(test_token.to(device), test_segment.to(device), test_mask.to(device))
predic = pred[0].detach().cpu().numpy()
res = np.argmax(predic)
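Since pred[0] holds raw scores before softmax (see the logits description in the docstring below), a softmax can be applied when class probabilities rather than just the label are needed. A minimal sketch:

import torch.nn.functional as F

probs = F.softmax(pred[0], dim=-1).detach().cpu().numpy()  # shape (1, 3), each row sums to 1
res = int(np.argmax(probs))  # same label as taking the argmax of the raw logits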

The Hugging Face source code (it seems the library has since been updated and is more nicely encapsulated now; a sketch of the newer API follows the excerpt):

@add_start_docstrings("""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. """,
                      BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForSequenceClassification(BertPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
            Labels for computing the sequence classification/regression loss.
            Indices should be in ``[0, ..., config.num_labels - 1]``.
            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).

    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Examples:

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
        loss, logits = outputs[:2]
    """
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

        self.init_weights()

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
                position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
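For reference, with a recent transformers release the same single-sample inference looks roughly like this (a sketch, assuming transformers 4.x, which is not the version used in this post; the tokenizer builds all three features in one call and the outputs carry a .logits attribute):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("./bert_pretrain/")
model = BertForSequenceClassification.from_pretrained("./bert_pretrain/", num_labels=3)
model.eval()

# one call produces input_ids, token_type_ids and attention_mask as 2D tensors
inputs = tokenizer("some title plus abstract text",
                   max_length=512, truncation=True, padding="max_length",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
label = outputs.logits.argmax(dim=-1).item()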

 
