Take GPT-2 as an example:
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
import torch

# load a local copy of GPT-2
config = GPT2Config.from_pretrained("../model/gpt2")
model = GPT2LMHeadModel.from_pretrained("../model/gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("../model/gpt2")

prompt = "I thought this movie was glorious, I appreciated it. Conclusion: This movie is"
inputs = tokenizer(prompt, return_tensors="pt")
output = model(inputs.input_ids, output_hidden_states=True)
```
Looking at the source of modeling_gpt2.py, the import section contains:
```python
from ...modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
    QuestionAnsweringModelOutput,
    SequenceClassifierOutputWithPast,
    TokenClassifierOutput,
)
```
Going one step further into modeling_outputs.py, we can see the class of `output`:
```python
class CausalLMOutputWithCrossAttentions(ModelOutput):
    """
    Base class for causal language model (or autoregressive) outputs.

    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Cross attentions weights after the attention softmax, used to compute the weighted average in the
            cross-attention heads.
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `torch.FloatTensor` tuples of length `config.n_layers`, with each tuple containing the cached key,
            value states of the self-attention and the cross-attention layers if model is used in encoder-decoder
            setting. Only relevant if `config.is_decoder = True`.

            Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
    """

    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
    cross_attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
```
Therefore, `output` exposes `loss`, `logits`, `hidden_states` (which requires either passing `output_hidden_states=True` to `model()` or setting `config.output_hidden_states=True`), and so on.
`hidden_states` contains the output of the embedding layer plus the output of every transformer block, so `hidden_states[-1]` gives the last layer's output; passing that through a linear transformation yields the logits.
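As a quick sanity check on those shapes, the sketch below builds a *randomly initialized* GPT-2 with the default config (12 layers, d_model = 768), so no pretrained weights are needed; it only illustrates the structure of `hidden_states`, not real predictions:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Randomly initialized GPT-2 with the default config; fine for inspecting shapes.
config = GPT2Config(output_hidden_states=True)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 18))  # 18 dummy token ids
with torch.no_grad():
    output = model(input_ids)

# one entry for the embedding output + one per transformer block
print(len(output.hidden_states))       # 13
print(output.hidden_states[-1].shape)  # torch.Size([1, 18, 768])
```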
Taking GPT-2 (d_model = 768) and the prompt above (18 tokens) as an example, here is GPT-2's structure (for details see gpt2结构-CSDN博客):
```python
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
```
The linear layer on the last line, `(lm_head): Linear(in_features=768, out_features=50257, bias=False)`, is exactly the transformation that maps `hidden_states[-1]` to the logits.
`hidden_states[-1]` has shape `[1, 18, 768]` and the linear layer's weight `model.lm_head.weight` has shape `[50257, 768]`. Multiplying the two matrices:

```python
logits2 = torch.matmul(output.hidden_states[-1], model.lm_head.weight.transpose(0, 1))
```

yields `logits2`, which is identical to `output.logits` and has shape `[1, 18, 50257]`, where 50257 is the vocabulary size.
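That matmul is exactly what a bias-free `nn.Linear` computes. A minimal standalone check with random tensors (the shapes mirror the example above; no GPT-2 weights are needed):

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(1, 18, 768)  # stands in for output.hidden_states[-1]
lm_head = torch.nn.Linear(768, 50257, bias=False)  # same shape as model.lm_head

logits_a = lm_head(hidden)                                       # Linear-layer path
logits_b = torch.matmul(hidden, lm_head.weight.transpose(0, 1))  # manual matmul

print(torch.allclose(logits_a, logits_b))  # True
print(logits_a.shape)                      # torch.Size([1, 18, 50257])
```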
Once we have the logits, the highest-scoring entry corresponds to the next token in the vocabulary:
```python
# After obtaining the logits, keep only the distribution for the last position
last_logits = output.logits[0, -1, :]  # shape [50257]
probs = torch.softmax(last_logits, dim=-1)
print(probs.size())  # torch.Size([50257])

# index of the most likely token
next_token_index = torch.argmax(probs, dim=-1)

# use the tokenizer to turn the index back into a word
next_token = tokenizer.decode([next_token_index.item()])
print(next_token)  # "a" in the original run
```
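Repeating this argmax step and appending the chosen token back onto the input gives greedy decoding. A sketch with a randomly initialized GPT-2 (real text, of course, requires the pretrained weights loaded at the start of this post):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config()
model = GPT2LMHeadModel(config).eval()  # random weights: the chosen ids are meaningless

input_ids = torch.randint(0, config.vocab_size, (1, 5))
for _ in range(3):  # greedily generate three tokens
    with torch.no_grad():
        logits = model(input_ids).logits          # [1, seq_len, 50257]
    next_id = torch.argmax(logits[0, -1, :])      # best token at the last position
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(input_ids.shape)  # torch.Size([1, 8])
```

In practice `model.generate(input_ids, do_sample=False)` wraps this same loop (with caching via `past_key_values` to avoid recomputing earlier positions).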