Python/PyTorch basics: loading a BERT model to get token vectors (distilbert-base-uncased-finetuned-sst-2-english)


Install transformers

# !pip install transformers

Instantiate the tokenizer and model

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModel.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
Some weights of the model checkpoint at ./distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
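This warning is expected here: AutoModel builds only the bare DistilBertModel, so the checkpoint's classification head (pre_classifier / classifier) is dropped. If the goal were to use the fine-tuned sentiment head rather than raw hidden states, the same checkpoint could be loaded with AutoModelForSequenceClassification instead; a minimal sketch, assuming the same local checkpoint directory (the clf name is only illustrative):

from transformers import AutoModelForSequenceClassification

# Loads the fine-tuned classifier head as well, so no "weights were not used" warning appears
clf = AutoModelForSequenceClassification.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")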

Convert text to ids (string -> ids)

en = tokenizer.encode("how are you and you")
en, type(en)
([101, 2129, 2024, 2017, 1998, 2017, 102], list)
import torch

torch.tensor(en)
tensor([ 101, 2129, 2024, 2017, 1998, 2017,  102])
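Note that tokenizer.encode returns a plain Python list of ids. Calling the tokenizer object directly returns input_ids together with attention_mask (and token_type_ids for BERT-style models), and with return_tensors="pt" everything already comes back as batched tensors, so the manual torch.tensor conversion is not needed. A small sketch for reference (enc is just an illustrative name):

enc = tokenizer("how are you and you", return_tensors="pt")
enc["input_ids"], enc["attention_mask"]   # both tensors of shape (1, 7)
# out = model(**enc) would then pass ids and mask to the model together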

Feed the ids into the model

# Method 1: add the batch dimension with unsqueeze
# out = model(torch.tensor(en).unsqueeze(0))

# Method 2: wrap the list so the resulting tensor is already 2-D
out = model(torch.tensor([en]))
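For pure feature extraction it is also common to wrap the forward pass in torch.no_grad(), so PyTorch does not build the autograd graph; the grad_fn seen in the output below then disappears. A small optional sketch (out_nograd is only an illustrative name):

with torch.no_grad():
    out_nograd = model(torch.tensor([en]))  # same values as out, but with no grad_fn attached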

Inspect the output

print(out)
BaseModelOutput(last_hidden_state=tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)
type(out)
transformers.modeling_outputs.BaseModelOutput
out[0]
tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>)
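out[0] is the same tensor as out.last_hidden_state, with shape (batch_size, sequence_length, hidden_size), here (1, 7, 768) for this DistilBERT base checkpoint. Row i of that matrix is the vector of the i-th token, including the special [CLS] and [SEP] tokens. A quick sketch of pulling out individual token vectors (variable names are only for illustration):

token_vectors = out.last_hidden_state[0]   # shape (7, 768): one 768-d vector per token
cls_vector = token_vectors[0]              # vector of the [CLS] token
print(token_vectors.shape, cls_vector.shape)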

Summary

  1. AutoTokenizer is a wrapper that automatically loads the tokenizer class matching the checkpoint (a DistilBERT tokenizer here); calling the tokenizer object directly also produces attention_mask and, for BERT-style models, token_type_ids. See https://blog.csdn.net/m0_45478865/article/details/118219919
  2. The input passed to the model must be a tensor; if that tensor is built from a Python list, the list has to be two-dimensional (a batch dimension plus the sequence of ids).
  3. Different pretrained models expose different output fields: with distilbert-base-uncased-finetuned-sst-2-english, out has no pooler_output, whereas with the chinese-roberta-wwm-ext-large model it does.
  4. pooler_output is the sentence vector, while out[0] (last_hidden_state) holds the per-token vectors; see the sketch after this list.
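
As a sketch of points 3 and 4: for a checkpoint based on the full BERT architecture, such as chinese-roberta-wwm-ext-large (assumed here to be loaded via its Hub id hfl/chinese-roberta-wwm-ext-large; tok, mdl, enc and out2 are only illustrative names), the output carries both the token vectors and the pooler_output sentence vector:

from transformers import AutoModel, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
mdl = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
enc = tok("你好", return_tensors="pt")
with torch.no_grad():
    out2 = mdl(**enc)
print(out2.last_hidden_state.shape)  # token vectors: (1, seq_len, 1024)
print(out2.pooler_output.shape)      # sentence vector: (1, 1024)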