Python/PyTorch basics: loading a BERT model to get token vectors (distilbert-base-uncased-finetuned-sst-2-english)


Install transformers

# !pip install transformers

Instantiate the tokenizer and model

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModel.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
Some weights of the model checkpoint at ./distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
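This warning is expected here: AutoModel builds only the bare DistilBertModel, so the checkpoint's classification head (pre_classifier / classifier) is dropped. If the goal were to use the fine-tuned sentiment head rather than raw hidden states, the same checkpoint could be loaded with AutoModelForSequenceClassification instead; a minimal sketch, assuming the same local checkpoint directory (the clf name is only illustrative):

from transformers import AutoModelForSequenceClassification

# Loads the fine-tuned classifier head as well, so no "weights were not used" warning appears
clf = AutoModelForSequenceClassification.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")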

Convert text to ids (string -> ids)

en = tokenizer.encode("how are you and you")
en, type(en)
([101, 2129, 2024, 2017, 1998, 2017, 102], list)
import torch

torch.tensor(en)
tensor([ 101, 2129, 2024, 2017, 1998, 2017,  102])
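Note that tokenizer.encode returns a plain Python list of ids. Calling the tokenizer object directly returns input_ids together with attention_mask (and token_type_ids for BERT-style models), and with return_tensors="pt" everything already comes back as batched tensors, so the manual torch.tensor conversion is not needed. A small sketch for reference (enc is just an illustrative name):

enc = tokenizer("how are you and you", return_tensors="pt")
enc["input_ids"], enc["attention_mask"]   # both tensors of shape (1, 7)
# out = model(**enc) would then pass ids and mask to the model together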

Feed the ids into the model

# Method 1: add the batch dimension with unsqueeze
# out = model(torch.tensor(en).unsqueeze(0))

# Method 2: wrap the list so the resulting tensor is already 2-D
out = model(torch.tensor([en]))
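For pure feature extraction it is also common to wrap the forward pass in torch.no_grad(), so PyTorch does not build the autograd graph; the grad_fn seen in the output below then disappears. A small optional sketch (out_nograd is only an illustrative name):

with torch.no_grad():
    out_nograd = model(torch.tensor([en]))  # same values as out, but with no grad_fn attached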

Inspect the output

print(out)
BaseModelOutput(last_hidden_state=tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)
type(out)
transformers.modeling_outputs.BaseModelOutput
out[0]
tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>)
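out[0] is the same tensor as out.last_hidden_state, with shape (batch_size, sequence_length, hidden_size), here (1, 7, 768) for this DistilBERT base checkpoint. Row i of that matrix is the vector of the i-th token, including the special [CLS] and [SEP] tokens. A quick sketch of pulling out individual token vectors (variable names are only for illustration):

token_vectors = out.last_hidden_state[0]   # shape (7, 768): one 768-d vector per token
cls_vector = token_vectors[0]              # vector of the [CLS] token
print(token_vectors.shape, cls_vector.shape)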

Summary

  1. AutoTokenizer is a wrapper that automatically loads the tokenizer class matching the checkpoint (a DistilBERT tokenizer here); calling the tokenizer object directly also produces attention_mask and, for BERT-style models, token_type_ids. See https://blog.csdn.net/m0_45478865/article/details/118219919
  2. The input passed to the model must be a tensor; if that tensor is built from a Python list, the list has to be two-dimensional (a batch dimension plus the sequence of ids).
  3. Different pretrained models expose different output fields: with distilbert-base-uncased-finetuned-sst-2-english, out has no pooler_output, whereas with the chinese-roberta-wwm-ext-large model it does.
  4. pooler_output is the sentence vector, while out[0] (last_hidden_state) holds the per-token vectors; see the sketch after this list.
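
As a sketch of points 3 and 4: for a checkpoint based on the full BERT architecture, such as chinese-roberta-wwm-ext-large (assumed here to be loaded via its Hub id hfl/chinese-roberta-wwm-ext-large; tok, mdl, enc and out2 are only illustrative names), the output carries both the token vectors and the pooler_output sentence vector:

from transformers import AutoModel, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
mdl = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
enc = tok("你好", return_tensors="pt")
with torch.no_grad():
    out2 = mdl(**enc)
print(out2.last_hidden_state.shape)  # token vectors: (1, seq_len, 1024)
print(out2.pooler_output.shape)      # sentence vector: (1, 1024)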