Use IPython to check which model the sentiment-analysis pipeline pulls from Hugging Face (https://huggingface.co/):
The output above shows the pretrained model that the pipeline loads:
distilbert-base-uncased-finetuned-sst-2-english
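If test_sentences contains three elements, as in
test_sentences = ("today is not that bad", "today is so bad", "so good")
then the following call raises an error:
inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
This is because encode_plus() accepts only a single sentence or a tuple of exactly two elements (a sentence pair).
Calling tokenizer() directly on the same input does not raise an error.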
Reading the underlying code shows why:
when the input is a single string, tokenizer() is equivalent to tokenizer.encode_plus();
when the input is a list or tuple, tokenizer() is equivalent to tokenizer.batch_encode_plus().
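The dispatch rule above can be sketched in plain Python. This is a simplified illustration, not the library's code: pick_encoder is a hypothetical helper, and it ignores options such as sentence pairs passed as a second argument or pre-tokenized input.

```python
def pick_encoder(text):
    """Return the name of the method that tokenizer(text) is described as routing to.

    Hypothetical helper for illustration only; it mirrors the rule stated in
    the text and does not call into transformers.
    """
    if isinstance(text, str):
        # A single sentence is handled like encode_plus().
        return "encode_plus"
    if isinstance(text, (list, tuple)):
        # A list or tuple is treated as a batch of sentences.
        return "batch_encode_plus"
    raise TypeError("expected a str, list, or tuple")


test_sentences = ("today is not that bad", "today is so bad", "so good")
print(pick_encoder(test_sentences[0]))
print(pick_encoder(test_sentences))
```

Under this rule a three-element tuple is a batch of three sentences, which is why tokenizer() accepts it while a direct encode_plus() call does not.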
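padding=True:
test_sentences = ("today is not that bad", "today is so bad", "so good")
When the sentences in test_sentences have different lengths, the length of the longest sentence is used as max_length, and every shorter sentence is filled with [PAD] tokens (that is, zeros) up to that length.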
padding="max_length":
test_sentences = ("today is not that bad", "today is so bad", "so good")
padding="max_length" is normally used together with the max_length=XXXX argument; every sentence is then zero-padded up to max_length.
input_ids: the vocabulary ids of the tokens produced by the tokenizer.
attention_mask: 1 marks a real (unpadded) token, 0 marks a padding position.
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_sentences[0]))
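The two padding modes and the resulting attention_mask can be imitated without the library. The sketch below is only an illustration under stated assumptions: pad_batch is a hypothetical helper, the token ids are illustrative values rather than real tokenizer output, the pad id is assumed to be 0 as described above, and truncation is not modeled.

```python
def pad_batch(batch, padding=True, max_length=None, pad_id=0):
    """Pad lists of token ids and build the matching attention masks.

    Hypothetical helper: padding=True pads to the longest sequence in the
    batch, padding="max_length" pads to the given max_length.
    """
    if padding == "max_length":
        if max_length is None:
            raise ValueError('padding="max_length" needs max_length')
        target = max_length
    else:
        target = max(len(ids) for ids in batch)

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = target - len(ids)
        if n_pad < 0:
            raise ValueError("sequence longer than target; truncation is not modeled")
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids for three sentences of different lengths.
batch = [
    [101, 11, 12, 13, 14, 15, 102],
    [101, 11, 12, 13, 15, 102],
    [101, 13, 16, 102],
]

print(pad_batch(batch, padding=True))
print(pad_batch(batch, padding="max_length", max_length=10))
```

With padding=True every row ends up as long as the longest sentence; with padding="max_length" every row has exactly max_length entries, however short the batch is.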
The model class used here is AutoModelForSequenceClassification.
The returned logits are a tensor of shape (3, 2); taking the argmax over the last dimension gives [1, 0, 1].
What does [1, 0, 1] mean?
Look at the model's config:
From it we can see that 1 means positive and 0 means negative.
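The step from logits to labels can be reproduced in plain Python. The logit values below are invented for illustration (they are not model output), and the id2label dictionary is assumed to match what the model's config reports.

```python
# Assumed mapping, mirroring the config described above: 0 is negative, 1 is positive.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits of shape (3, 2): one row per sentence, one column per class.
logits = [
    [-3.0, 3.2],
    [4.1, -3.4],
    [-4.2, 4.5],
]


def argmax(row):
    """Index of the largest value in a row (plain-Python stand-in for torch.argmax)."""
    return max(range(len(row)), key=row.__getitem__)


labels = [argmax(row) for row in logits]
names = [id2label[i] for i in labels]
print(labels)
print(names)
```

Each row's larger logit picks the class id, and id2label turns that id into a readable label.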
Complete code:
# -*- coding: utf-8 -*-
# @Time : 2023/8/1 10:54
# @Author : CBAiotAigc
# @Email :1050100468@qq.com
# @Site :
# @File : tokenizer_sentiment_analysis.py
# @Project : AI_Review
# @Software: PyCharm
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Local copy of the checkpoint
model_name = "../model/distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # switch to inference mode before the forward pass

test_sentences = ["today is not that bad", "today is so bad", "so good"]

# encode_plus() takes one sentence or a sentence pair, so three sentences raise an error:
# inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
# print(inputs_tensor)

# tokenizer() treats a list as a batch
inputs_tensor = tokenizer(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)

# Equivalent explicit batch call
inputs_tensor = tokenizer.batch_encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs_tensor)
    print(outputs)
    # Highest logit per sentence gives the predicted class id
    labels = torch.argmax(outputs.logits, dim=-1)
    print(labels)

# Map class ids to label names
print(model.config.id2label)
print([model.config.id2label[i] for i in labels.tolist()])