
Transformers Learning Notes 2

"return_tensors=\"pt"

pipeline

Quick start

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
{'label': 'NEGATIVE', 'score': 0.9994558095932007}]
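
The task name alone picks a default checkpoint. pipeline() also accepts a model argument, so you can explicitly pin the same checkpoint used in the rest of these notes; a minimal sketch:

from transformers import pipeline

# Pin the checkpoint instead of relying on the task's default model
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I hate this so much!"))  # e.g. [{'label': 'NEGATIVE', 'score': ...}]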

The three main components

a. tokenizer: raw text ↔ input IDs (conversion in both directions)

  1. The raw text is split into a list of tokens, special tokens are added at the start and end to mark the sequence, and finally each token is looked up in the pretrained model's vocabulary to get its ID (the individual steps are shown in the sketch after the tokenizer output below).

  2. transformers provides the AutoTokenizer API for this:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# The sentences have different numbers of tokens; padding=True pads the shorter one with zeros
# truncation=True truncates any sentence that is longer than the model can handle
# return_tensors="pt" returns PyTorch tensors, since the model only accepts tensor inputs
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
{
'input_ids': tensor([
[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]
]),
# the attention_mask tells us where padding was applied
'attention_mask': tensor([
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
])
}
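
The steps from item 1 can also be run one at a time with the tokenizer's lower-level methods; a short sketch using the same tokenizer (the IDs match the second row of input_ids above):

sequence = "I hate this so much!"

# Step 1: split the raw text into (sub)word tokens
tokens = tokenizer.tokenize(sequence)
print(tokens)  # ['i', 'hate', 'this', 'so', 'much', '!']

# Step 2: look each token up in the vocabulary (no special tokens yet)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)     # [1045, 5223, 2023, 2061, 2172, 999]

# Step 3: tokenizer(...) additionally wraps the sequence in the special tokens 101/102;
# decode() maps IDs back to readable text
print(tokenizer.decode([101] + ids + [102]))  # [CLS] i hate this so much! [SEP]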

b. model: input IDs → logits

  1. transformers provides the AutoModel API:

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
# outputs.last_hidden_state holds the output vectors of the last hidden layer
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# The shape has three dimensions:
# 1. Batch size: the number of sequences processed at a time (2 in our example)
# 2. Sequence length: the length of the numerical representation of the sequence (16 in our example)
# 3. Hidden size: the vector dimension of each model input
torch.Size([2, 16, 768])
  2. Model architecture (a minimal head sketch follows this list):

  • The embedding layer converts each input ID into a vector.

  • The subsequent layers use the attention mechanism to manipulate these vectors and produce the final representation of the sentence.

  • The head is a network made of one or more linear layers; it maps the high-dimensional hidden states to a different dimension.
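
To make the head's role concrete, here is a toy classification head applied to the hidden states from the AutoModel outputs above. This is only an illustration with a randomly initialized linear layer, not the actual head of the checkpoint (DistilBERT's real sequence-classification head also includes a pre-classifier layer and dropout):

import torch.nn as nn

# Toy head: take the hidden state of the first ([CLS]) token of each sequence
# and project it from hidden size 768 down to 2 classes
toy_head = nn.Linear(768, 2)                  # randomly initialized, for illustration only
cls_hidden = outputs.last_hidden_state[:, 0]  # shape [2, 768]
print(toy_head(cls_hidden).shape)             # torch.Size([2, 2])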

Note: besides the bare *Model, transformers provides many other architectures with different heads:

  • *Model (retrieve the hidden states)

  • *ForCausalLM

  • *ForMaskedLM

  • *ForMultipleChoice

  • *ForQuestionAnswering

  • *ForSequenceClassification

  • *ForTokenClassification

# For example, to classify the sentences as positive or negative sentiment,
# we use AutoModelForSequenceClassification
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
torch.Size([2, 2])
print(outputs.logits)
tensor([[-1.5607, 1.6123],
[ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

c. post-processing: turning predictions into label results and scores

  1. Notice that the model does not output probabilities but raw scores (logits); we apply a softmax to convert them into probabilities (in the output below, each row now sums to 1). Mapping the class indices back to label names is shown in the sketch after the output.

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
tensor([[4.0195e-02, 9.5980e-01],
[9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
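
To get the final labels and scores, the model config stores the index-to-label mapping in its id2label attribute; a short sketch:

# The config maps class indices to label names for this checkpoint
print(model.config.id2label)  # {0: 'NEGATIVE', 1: 'POSITIVE'}

# Pick the highest-probability class per sentence and report label + score
pred_ids = predictions.argmax(dim=-1)
for probs, idx in zip(predictions, pred_ids):
    print(model.config.id2label[idx.item()], probs[idx].item())
# POSITIVE 0.9598...
# NEGATIVE 0.9994...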
