
last_hidden_state vs pooler_output: what is the difference?

1. Origin of the question:

from transformers import AutoTokenizer, AutoModel
import torch
# Load model from HuggingFace Hub
MODEL_NAME_PATH = 'xxxx/model/bge-large-zh'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_PATH)
model = AutoModel.from_pretrained(MODEL_NAME_PATH)

Printing the model shows the following structure:

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 1024, padding_idx=0)
    (position_embeddings): Embedding(512, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-23): 24 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=1024, out_features=4096, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=4096, out_features=1024, bias=True)
          (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (activation): Tanh()
  )
)

Q1: Is the [CLS] vector the same as the pooler output?
Q2: What is the relationship between the final pooler layer and the hidden states?

2. Experimental verification:

Q1: Is the [CLS] vector the same as the pooler output?
# Sentences we want sentence embeddings for
sentences = ["开心", "快乐", "难过", "天气", "今天会有大大的台风吗?"]
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=200)
# for retrieval task, add an instruction to query
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

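As an aside, since the embeddings are L2-normalized, pairwise cosine similarities reduce to a matrix product. A minimal usage sketch, reusing sentence_embeddings from above:

# Cosine similarity between the 5 sentences; entry [i, j] compares sentence i with sentence j
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)   # shape (5, 5)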

print('cls:', model_output[0][:, 0, :])

cls: tensor([[ 0.3269, -0.6412, -0.2382,  ...,  0.0255, -0.1801, -0.3025],
        [ 0.1351, -0.5155, -0.1700,  ...,  0.1093, -0.3750, -0.1323],
        [ 0.2752, -0.1703, -0.2730,  ...,  0.0376, -0.0339, -0.3541],
        [ 0.1346, -0.0378, -0.5070,  ...,  0.0078,  0.0472, -0.1815],
        [-0.4051,  0.1123, -0.3873,  ...,  0.3585,  0.4913,  0.3192]])

print('pooler:', model_output[1])

pooler: tensor([[ 0.3888, -0.2329, -0.1749,  ...,  0.1678,  0.3938, -0.3191],
        [ 0.3949, -0.2882, -0.0945,  ...,  0.1802,  0.2705, -0.1891],
        [ 0.4765, -0.1235, -0.2330,  ...,  0.3005,  0.3487, -0.1290],
        [ 0.3851, -0.1853, -0.3189,  ...,  0.2757,  0.3601, -0.3220],
        [ 0.3008, -0.3742, -0.4550,  ...,  0.4318,  0.2130, -0.1575]])

The [CLS] vector and the pooler output are not the same.
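A quick programmatic check (a minimal sketch reusing model_output from above) confirms that the two tensors have the same shape but different values:

# [CLS] row of last_hidden_state vs. the pooler output
cls_embedding = model_output[0][:, 0]   # shape (5, 1024)
pooler_output = model_output[1]         # shape (5, 1024)
print(cls_embedding.shape, pooler_output.shape)
print(torch.equal(cls_embedding, pooler_output))   # False: the values differ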

Q2: What is the relationship between the final pooler layer and the hidden states?
At the theory level:

In the transformers.models.bert.modeling_bert.BertModel.forward method there are the following lines:

sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

The pooler is defined as:

self.pooler = BertPooler(config) if add_pooling_layer else None

The definition of BertPooler:

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

From the source above, pooler_output is simply the [CLS] embedding passed through one more fully connected layer (followed by a Tanh).

At the data level:
# Feed the full last_hidden_state to the pooler; it takes the [CLS] position ([:, 0]) itself
model.pooler(model_output[0])
tensor([[ 0.3888, -0.2329, -0.1749,  ...,  0.1678,  0.3938, -0.3191],
        [ 0.3949, -0.2882, -0.0945,  ...,  0.1802,  0.2705, -0.1891],
        [ 0.4765, -0.1235, -0.2330,  ...,  0.3005,  0.3487, -0.1290],
        [ 0.3851, -0.1853, -0.3189,  ...,  0.2757,  0.3601, -0.3220],
        [ 0.3008, -0.3742, -0.4550,  ...,  0.4318,  0.2130, -0.1575]],
       grad_fn=<TanhBackward0>)

The result matches model_output[1] exactly: pooler_output is the [CLS] embedding passed through the pooler's fully connected layer.
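Equivalently, the pooler module can be bypassed and its dense layer plus Tanh applied by hand (a minimal sketch reusing model and model_output from above):

# Reproduce pooler_output manually: take the [CLS] hidden state, then dense + Tanh
with torch.no_grad():
    first_token = model_output[0][:, 0]                         # (5, 1024)
    manual_pooled = torch.tanh(model.pooler.dense(first_token))
print(torch.allclose(manual_pooled, model_output[1], atol=1e-6))   # expected: True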

3. Conclusion:

The pooler takes the [CLS] token's hidden state, passes it through a fully connected layer plus a Tanh activation, and uses the result as the sentence's feature vector.
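For reference, the same two tensors can also be accessed by attribute name on the model output (assuming a recent transformers version, where the model returns a ModelOutput object):

last_hidden_state = model_output.last_hidden_state   # (batch, seq_len, 1024), per-token embeddings
pooler_output = model_output.pooler_output           # (batch, 1024), [CLS] passed through dense + Tanh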

4. Where BERT's pooler_output comes from

As we know, BERT is pretrained with two tasks: MLM (Masked Language Modeling) and NSP (Next Sentence Prediction). Readers unfamiliar with these tasks can refer to the two articles "BERT源码实现与解读(Pytorch)" and "【论文阅读】BERT".

MLM masks out tokens and asks BERT to predict what belongs in each blank; this prediction is made from the per-token embeddings (last_hidden_state).
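In transformers this corresponds to BertForMaskedLM, whose prediction head consumes last_hidden_state. A minimal sketch, assuming bert-base-chinese as an MLM-capable checkpoint (the bge-large-zh model above is not used here):

import torch
from transformers import BertTokenizer, BertForMaskedLM

mlm_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-chinese")

inputs = mlm_tokenizer("窗前明月[MASK]", return_tensors="pt")
with torch.no_grad():
    logits = mlm_model(**inputs).logits   # (1, seq_len, vocab_size), computed from last_hidden_state
mask_pos = (inputs["input_ids"][0] == mlm_tokenizer.mask_token_id).nonzero().item()
print(mlm_tokenizer.convert_ids_to_tokens(logits[0, mask_pos].argmax().item()))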

Next Sentence Prediction asks BERT to predict whether the two sentences it receives actually belong together as a pair. For example, "窗前明月光, 疑是地上霜" is True, while "窗前明月光, 李白打开窗" is False.
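For context, the sentence pair is fed to BERT as a single sequence, [CLS] A [SEP] B [SEP], with token_type_ids marking which segment each token belongs to. A small sketch reusing the tokenizer loaded above:

# Encode a sentence pair the way NSP sees it
pair = tokenizer("窗前明月光", "疑是地上霜", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"][0].tolist()))   # [CLS] ... [SEP] ... [SEP]
print(pair["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens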

So the NSP task needs sentence-level semantic information to make its prediction. Let's look at how the source code handles this.

class BertForNextSentencePrediction(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config)
        self.cls = BertOnlyNSPHead(config)  # essentially a single nn.Linear(config.hidden_size, 2)
        ...

    def forward(...):
        ...
        outputs = self.bert(...)
        pooled_output = outputs[1]  # take pooler_output
        seq_relationship_scores = self.cls(pooled_output)  # feed pooler_output to the downstream classification head
        ...


From the source above, during NSP training it is not the raw [CLS] token embedding that is passed to the downstream classification head, but pooler_output. Presumably this is because using the [CLS] embedding directly does not work as well.
For the MLM task, however, last_hidden_state is used directly; interested readers can check the source themselves.
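As an illustration of this path, here is a minimal sketch of running the NSP head end to end, assuming a checkpoint that still ships a pretrained NSP head (e.g. bert-base-chinese; the bge-large-zh model above does not include one):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

nsp_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-chinese")

encoding = nsp_tokenizer("窗前明月光", "疑是地上霜", return_tensors="pt")
with torch.no_grad():
    logits = nsp_model(**encoding).logits   # (1, 2), computed internally from pooler_output
# HF convention: index 0 = "sentence B follows A", index 1 = "sentence B is random"
print(logits.softmax(dim=-1))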
