
An analysis of AutoTokenizer and AutoModel in transformers (Part 1)


Let's dissect AutoTokenizer and AutoModel in transformers through a sentiment-analysis example.

Using IPython, we can inspect which library from Hugging Face (https://huggingface.co/) the sentiment-analysis pipeline depends on:

From the output above, we get the checkpoint that the pipeline uses as its pretrained model:

distilbert-base-uncased-finetuned-sst-2-english
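As a quick check (a minimal sketch; instantiating the pipeline without a model argument downloads this default checkpoint, and reading name_or_path off the loaded model object is an assumption):

from transformers import pipeline

clf = pipeline("sentiment-analysis")   # no model argument: transformers falls back to the default checkpoint
print(clf.model.name_or_path)          # expected: distilbert-base-uncased-finetuned-sst-2-english
print(clf("today is not that bad"))    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]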

How AutoTokenizer works:

If

test_sentences = ("today is not that bad", "today is so bad", "so good")

contains three elements, then

inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")

raises an error, because the input to encode_plus must be a single sentence or a tuple of exactly two sentences (a sentence pair).

Calling tokenizer() instead does not raise an error.
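A minimal sketch of the contrast, assuming the tokenizer for distilbert-base-uncased-finetuned-sst-2-english has been loaded:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# a single sentence works with encode_plus
tokenizer.encode_plus("today is not that bad", return_tensors="pt")

# a sentence pair works too: the second sentence goes in as text_pair
tokenizer.encode_plus("today is not that bad", "today is so bad", return_tensors="pt")

# a batch of three sentences: use tokenizer(...) (or batch_encode_plus), not encode_plus
tokenizer(["today is not that bad", "today is so bad", "so good"], padding=True, return_tensors="pt")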

Digging into the underlying code shows that:

when the input is a single string, tokenizer() is equivalent to tokenizer.encode_plus();
when the input is a list or tuple, tokenizer() is equivalent to tokenizer.batch_encode_plus().
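A small sketch that checks this equivalence directly (it mirrors the observation above rather than quoting the library source; both calls should return the same BatchEncoding mapping):

single = "today is not that bad"
batch = ["today is not that bad", "today is so bad", "so good"]

# one string: __call__ routes to encode_plus
assert tokenizer(single) == tokenizer.encode_plus(single)

# a list (or tuple) of strings: __call__ routes to batch_encode_plus
assert tokenizer(batch, padding=True) == tokenizer.batch_encode_plus(batch, padding=True)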

Now a note on the difference between padding=True and padding="max_length".

padding=True:

test_sentences = ("today is not that bad", "today is so bad", "so good")

When the sentences in test_sentences have different lengths, padding=True pads every sequence with [PAD] (id 0, so effectively zero-filling) up to the length of the longest sentence in the batch.

padding="max_length":

test_sentences = ("today is not that bad", "today is so bad", "so good")

 padding="max_length"一般配合max_length=XXXX 参数使用,以max_length长度进行补零。

The return values of tokenizer.batch_encode_plus() / tokenizer.encode_plus() are:

input_ids: the vocabulary ids that the tokenizer's encoding maps each token to;
attention_mask: 1 marks tokens that were not padded, 0 marks [PAD] tokens.
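For example (a sketch; the exact ids depend on the vocabulary, but the mask pattern is fixed by the sentence lengths):

enc = tokenizer(["so good", "today is so bad"], padding=True)
print(enc["input_ids"])       # the shorter sentence is padded with trailing 0s, the id of [PAD]
print(enc["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]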

The lower-level equivalent of tokenizer.batch_encode_plus() / tokenizer.encode_plus() is:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_sentences[0]))
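Spelled out step by step (a sketch; note that encode_plus additionally wraps the ids in the [CLS]/[SEP] special tokens, which the two-step version does not add):

tokens = tokenizer.tokenize(test_sentences[0])   # ['today', 'is', 'not', 'that', 'bad']
ids = tokenizer.convert_tokens_to_ids(tokens)    # raw vocabulary ids, no special tokens

full = tokenizer.encode_plus(test_sentences[0])["input_ids"]
assert full[1:-1] == ids                         # identical once [CLS]/[SEP] are stripped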

Inspecting the tokenizer's vocabulary:
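A few ways to peek at it (a sketch; the size and special-token ids below are those of the bert-base-uncased vocabulary that this checkpoint reuses):

vocab = tokenizer.get_vocab()    # dict mapping token string -> id
print(len(vocab))                # 30522
print(vocab["[PAD]"], vocab["[CLS]"], vocab["[SEP]"])   # 0 101 102
print(tokenizer.convert_ids_to_tokens([0, 101, 102]))   # ['[PAD]', '[CLS]', '[SEP]']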

How AutoModel works:

Here we use AutoModelForSequenceClassification.

The returned logits is a tensor of shape (3, 2); taking the argmax along the last dimension turns it into [1, 0, 1].

What does [1, 0, 1] mean?

We look at the model's config:

From it we can see that 1 stands for POSITIVE and 0 for NEGATIVE.

The complete code:

# ---encoding:utf-8---
# @Time : 2023/8/1 10:54
# @Author : CBAiotAigc
# @Email :1050100468@qq.com
# @Site :
# @File : tokenizer_sentiment_analysis.py
# @Project : AI_Review
# @Software: PyCharm
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "../model/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

test_sentences = ["today is not that bad", "today is so bad", "so good"]

# encode_plus would raise on a batch of three sentences:
# inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")

# tokenizer(...) and batch_encode_plus(...) return the same encoding for a list input
inputs_tensor = tokenizer(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)
inputs_tensor = tokenizer.batch_encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)

model.eval()              # switch off dropout before inference
with torch.no_grad():     # no gradients needed for inference
    outputs = model(**inputs_tensor)
print(outputs)

labels = torch.argmax(outputs.logits, dim=-1)
print(labels)                      # tensor([1, 0, 1])
print(model.config.id2label)       # {0: 'NEGATIVE', 1: 'POSITIVE'}
print([model.config.id2label[id] for id in labels.tolist()])  # ['POSITIVE', 'NEGATIVE', 'POSITIVE']
