
BERT Code Walkthrough, Part 2: The Full Model

A walkthrough of the model portion of the BERT code.

bert_config.json — the model's hyperparameter configuration

{
  "attention_probs_dropout_prob": 0.1,  # dropout applied to the attention probabilities after the softmax
  "hidden_act": "gelu",                 # activation function
  "hidden_dropout_prob": 0.1,           # dropout probability for the hidden layers
  "hidden_size": 768,                   # hidden size
  "initializer_range": 0.02,            # initializer range
  "intermediate_size": 3072,            # size of the feed-forward (up-projection) layer
  "max_position_embeddings": 512,       # must be >= seq_length; used to build the position embeddings
  "num_attention_heads": 12,            # number of attention heads per hidden layer
  "num_hidden_layers": 12,              # number of hidden layers
  "type_vocab_size": 2,                 # number of segment_ids classes [0, 1]
  "vocab_size": 30522                   # vocabulary size
}

Input arguments: input_ids, input_mask, and token_type_ids correspond to the input_ids, input_mask, and segment_ids produced in the previous article.

First, input_ids and token_type_ids are embedded; the embedding output is fed into the Transformer encoder, which produces the final encoded representation.

The model configuration class

class BertConfig(object):

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
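
As a quick sanity check, the configuration file shown at the top of this post can be loaded with from_json_file; the path below is just a placeholder for wherever the pre-trained model was unzipped:

  config = BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")  # hypothetical path
  print(config.hidden_size)        # 768
  print(config.num_hidden_layers)  # 12
  print(config.to_json_string())   # dumps the full configuration back out as JSON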

For the model as a whole, it is important to be clear about three things:

1. What are the inputs to the model?

2. What are the labels?

3. How is the loss computed?

1. The model inputs

The model input is produced by train_input_fn:

  1. [tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
  2. segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  3. is_random_next: False
  4. masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124
  5. masked_lm_labels: - - name is ##port , guardian and ##s lecture , sir pens stomach - of ##s - bucket
  6. ]

Loss for masked-word prediction:

The labels are masked_lm_ids (label_ids), i.e. the ids of the masked words. They are turned into one_hot_labels of shape (32*20, 30522), a 30522-dimensional one-hot vector for each of the 20 masked positions in each of the 32 sequences.

Input: model.get_sequence_output()

Loss for next-sentence prediction:

The label is next_sentence_labels, a binary label whose one-hot form is a 2-dimensional vector. Input: model.get_pooled_output()

self.sequence_output  Tensor("bert/encoder/Reshape_13:0", shape=(32, 128, 768), dtype=float32)
self.pooled_output    Tensor("bert/pooler/dense/Tanh:0", shape=(32, 768), dtype=float32)
self.embedding_table  <tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30522, 768) dtype=float32_ref>

self.sequence_output: input to the masked-word prediction head
self.pooled_output: input to the next-sentence prediction head
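
How get_sequence_output() feeds the masked-word loss can be sketched roughly as follows; this is a condensed reading of get_masked_lm_output in run_pretraining.py (the dense transform and output bias applied before the matmul are omitted here), where gather_indexes picks out the hidden states at masked_lm_positions:

  # sequence_output: (32, 128, 768); output_weights: the (30522, 768) word embedding table, reused
  masked_states = gather_indexes(sequence_output, masked_lm_positions)      # (32*20, 768)
  logits = tf.matmul(masked_states, output_weights, transpose_b=True)       # (32*20, 30522)
  log_probs = tf.nn.log_softmax(logits, axis=-1)

  label_ids = tf.reshape(masked_lm_ids, [-1])                               # (32*20,)
  label_weights = tf.reshape(masked_lm_weights, [-1])                       # 0 for padded predictions
  one_hot_labels = tf.one_hot(label_ids, depth=30522, dtype=tf.float32)     # (32*20, 30522)

  per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
  masked_lm_loss = (tf.reduce_sum(label_weights * per_example_loss) /
                    (tf.reduce_sum(label_weights) + 1e-5))

The next-sentence loss is built the same way from get_pooled_output(), with a 2-way softmax over next_sentence_labels.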

embedding_lookup()

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
    # print(input_ids)  # shape=(32, 128, 1)

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  # print(embedding_table)  # shape=(30522, 768)

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  # print(output)  # shape=(32, 128, 768): batch_size=32, seq_length=128, embedding_size=768
  # print(embedding_table)  # shape=(30522, 768)
  return (output, embedding_table)

The output has shape (32, 128, 768), i.e. batch_size=32, seq_length=128, and embedding_size (set to hidden_size) = 768.

embedding_table has shape (30522, 768).
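
The two branches inside embedding_lookup compute the same result; here is a minimal NumPy sketch of the idea, with toy sizes rather than the real tables:

  import numpy as np

  vocab_size, embedding_size = 10, 4
  table = np.random.randn(vocab_size, embedding_size)
  ids = np.array([3, 1, 7])                 # a toy "sequence" of word ids

  gathered = table[ids]                     # the tf.nn.embedding_lookup path
  one_hot = np.eye(vocab_size)[ids]         # shape (3, 10)
  matmul = one_hot @ table                  # the use_one_hot_embeddings path, shape (3, 4)

  assert np.allclose(gathered, matmul)      # both paths give identical embeddings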

embedding_postprocessor

embedding_postprocessor adds the token type embeddings and the position embeddings, i.e. the Segment Embeddings and Position Embeddings in the figure, to form the model's input representation.


Embedding structure (figure from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding").
Note that the Position Embeddings in this code differ from the original Transformer: here they are learned parameters, whereas the original Transformer uses fixed (sinusoidal) values.

As shown above, the input consists of sentence A, "my dog is cute", and sentence B, "he likes playing". Each word and special symbol must first be converted into an embedding vector, since the network can only operate on numbers. The special token [SEP] separates the two sentences; the first half of the input is given segment embedding A and the second half segment embedding B.

To model the relationship between sentences, BERT has a task that predicts whether sentence B actually follows sentence A. This classification task uses the special [CLS] token placed at the very front of the A/B pair, which can be seen as a summary representation of the whole input sequence.

Finally, the position encoding is required by the Transformer architecture itself: a purely attention-based model cannot encode word order the way a CNN or RNN does, and it is exactly this property that lets it model the relationship between two words regardless of their distance. To make the Transformer aware of word positions, position information is therefore added to each word via a position encoding.
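
For contrast, here is a minimal sketch of the fixed sinusoidal encodings used by the original Transformer; this is not part of the BERT code, which instead learns position_embeddings as a variable:

  import numpy as np

  def sinusoidal_position_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

  print(sinusoidal_position_encoding(128, 768).shape)   # (128, 768)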

# Adds token-type (segment) and position embeddings on top of the word embeddings.
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  # print(input_tensor)  # shape=(32, 128, 768)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]   # 32
  seq_length = input_shape[1]   # 128
  width = input_shape[2]        # 768

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  # print(output)  # shape=(32, 128, 768)
  return output
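
Inside BertModel the two embedding functions are chained together roughly like this (condensed from the "embeddings" variable scope, using the config values from the top of the post):

  with tf.variable_scope("embeddings"):
    # [batch_size, seq_length] word ids -> [batch_size, seq_length, hidden_size]
    embedding_output, embedding_table = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=False)

    # Add segment and position embeddings, then layer norm and dropout.
    embedding_output = embedding_postprocessor(
        input_tensor=embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        token_type_embedding_name="token_type_embeddings",
        use_position_embeddings=True,
        position_embedding_name="position_embeddings",
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)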
self.all_encoder_layers holds the outputs of all 12 encoder layers:
[<tf.Tensor 'bert/encoder/Reshape_2:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_3:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_4:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_5:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_6:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_7:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_8:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_9:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_10:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_11:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_12:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_13:0' shape=(32, 128, 768) dtype=float32>]

The Transformer structure

The figure above shows the full Transformer architecture; BERT only uses part of it.

That is, only the encoder stack is used (the decoder on the right is not), with N = 12.

Each block first applies multi-head attention to the embeddings, followed by a residual connection and layer_norm. The result is passed into the feed-forward layer, again followed by a residual connection and layer_norm.

Note that in this code, the multi-head attention output is first fed through a dense (fully connected) projection before the residual and layer_norm; this corresponds to the output projection of multi-head attention in the original formulation. The docstring below also makes clear that only the encoder part is implemented:

 This is almost an exact implementation of the original Transformer encoder
# The Transformer (encoder) model.
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]   # 32
  seq_length = input_shape[1]   # 128
  input_width = input_shape[2]  # 768

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      # Attention sub-layer.
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          # Build the (self-)attention layer.
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`. The attention output is fed into a dense layer.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)  # residual connection
        prev_output = layer_output              # this layer's output becomes the next layer's input
        all_layer_outputs.append(layer_output)  # collect every layer's output

  if do_return_all_layers:  # return all layers
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
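
Within BertModel the encoder is invoked roughly as follows (condensed from the "encoder" scope; create_attention_mask_from_input_mask and get_activation are helper functions defined elsewhere in modeling.py):

  with tf.variable_scope("encoder"):
    # Expand the [batch_size, seq_length] input mask into a
    # [batch_size, seq_length, seq_length] attention mask.
    attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

    # self.all_encoder_layers: a list of 12 tensors of shape (32, 128, 768).
    all_encoder_layers = transformer_model(
        input_tensor=embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)

  # The final layer is what get_sequence_output() returns.
  sequence_output = all_encoder_layers[-1]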

self_attention: the self-attention mechanism

The query, key, and value projections are first reshaped to [batch_size, num_heads, seq_length, size_per_head]. Scaled dot-product attention is computed per head: the scores are softmaxed and then multiplied by the values. The result is reshaped back and returned as a tensor of shape [batch_size*seq_length, hidden_size].
 

# Multi-headed attention.
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=256,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-width vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention are done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
  1. Tensor("bert/encoder/layer_0/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  2. Tensor("bert/encoder/layer_1/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  3. Tensor("bert/encoder/layer_2/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  4. Tensor("bert/encoder/layer_3/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  5. Tensor("bert/encoder/layer_4/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  6. Tensor("bert/encoder/layer_5/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  7. Tensor("bert/encoder/layer_6/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  8. Tensor("bert/encoder/layer_7/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  9. Tensor("bert/encoder/layer_8/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  10. Tensor("bert/encoder/layer_9/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  11. Tensor("bert/encoder/layer_10/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
  12. Tensor("bert/encoder/layer_11/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)

Each layer's self-attention output has shape [32*128, 12*64] = [4096, 768], i.e. [batch_size * from_seq_length, num_attention_heads * size_per_head], because do_return_2d_tensor=True.
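
A quick NumPy sketch of the reshape/transpose bookkeeping with these numbers (B=32, F=T=128, N=12, H=64), just to make the shape flow concrete:

  import numpy as np

  B, F, N, H = 32, 128, 12, 64              # batch, seq_length, heads, size_per_head
  x = np.zeros((B * F, N * H))              # the 2D [B*F, N*H] layout used in attention_layer

  # transpose_for_scores: [B*F, N*H] -> [B, F, N, H] -> [B, N, F, H]
  per_head = x.reshape(B, F, N, H).transpose(0, 2, 1, 3)
  print(per_head.shape)                     # (32, 12, 128, 64)

  # attention scores per head: [B, N, F, H] x [B, N, H, T] -> [B, N, F, T] = (32, 12, 128, 128)

  # the context is transposed back and flattened to [B*F, N*H]
  context = per_head.transpose(0, 2, 1, 3).reshape(B * F, N * H)
  print(context.shape)                      # (4096, 768)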

Once the model is trained, the key question is how to use it.

The BertModel class provides two accessors for this. get_pooled_output returns the representation of the first token ([CLS]) of each sequence in the batch; BERT treats this token as a summary of the entire input, so it suits sentence-level classification. get_sequence_output returns BERT's final encoder output with shape [batch_size, seq_length, hidden_size], i.e. a representation of every token, which suits token-level and seq2seq-style tasks.
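
For example, a sentence-level classifier can be built on the pooled output roughly along the lines of create_model in run_classifier.py; num_labels and the labels tensor here are task-specific placeholders:

  model = BertModel(config=bert_config, is_training=True,
                    input_ids=input_ids, input_mask=input_mask,
                    token_type_ids=segment_ids)

  pooled = model.get_pooled_output()                      # [batch_size, hidden_size]
  hidden_size = pooled.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  logits = tf.nn.bias_add(tf.matmul(pooled, output_weights, transpose_b=True), output_bias)
  log_probs = tf.nn.log_softmax(logits, axis=-1)          # P = softmax(C * W) from the paper

  one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
  loss = tf.reduce_mean(-tf.reduce_sum(one_hot_labels * log_probs, axis=-1))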

BERT stands for Bidirectional Encoder Representations from Transformers, where "bidirectional" means that when processing a word the model can use information from both the words before it and the words after it. This bidirectionality comes from the pre-training objective: unlike a traditional language model that predicts the most likely current word given all preceding words, BERT randomly masks some words and predicts them from all of the unmasked words. The figure in the paper compares three pre-training models: BERT and ELMo both use bidirectional information, while OpenAI GPT uses only unidirectional information.

As shown there, BERT can be seen as combining the strengths of OpenAI GPT and ELMo: ELMo obtains bidirectional information from two independently trained LSTMs, while OpenAI GPT uses a Transformer but, as a classical language model, captures only unidirectional information. BERT's main goal is to improve the pre-training tasks on top of OpenAI GPT so that the depth of the Transformer and bidirectional information can be exploited at the same time.

The fine-tuning process

After pre-training, the model can be applied to various NLP tasks with only light fine-tuning. The details differ across tasks, but BERT already provides strong, general-purpose features for most of them. For classification problems, e.g. predicting whether an A/B sentence pair is a question-answer pair or whether a single sentence is grammatical, one can directly use the vector C output at the special [CLS] token: P = softmax(C * W), so a new task only needs to add and fine-tune the weight matrix W.

For other sequence labelling or generation tasks, the corresponding BERT outputs can likewise be used for prediction, e.g. emitting one tag or token per time step. The figure in the paper shows how BERT is fine-tuned on 11 tasks by adding just one extra output layer; there, Tok denotes an input token, E the input embedding, and T_i the contextual vector BERT outputs for the i-th token.

As shown in that figure, sentence-level classification only needs the C vector at [CLS]: in (a), tasks such as QNLI (does a question-answer pair contain the correct answer) and STS-B (how similar are two sentences) model relations between sentences; in (b), SST-2 (sentiment) and CoLA (grammatical acceptability) classify a single sentence.

For the SQuAD v1.1 question-answering dataset, the question and the answer-containing paragraph are fed in as sentence A and sentence B, and the output vectors of the B tokens are used to predict the position and length of the correct answer span. Finally, for named entity recognition on CoNLL, the output vector T of every token is used to predict its tag, e.g. person or location.

Fine-tuning pre-trained BERT

According to the project, all 11 fine-tuning results in the paper were obtained on a single Cloud TPU (64 GB RAM). Most of the BERT-Large results cannot currently be reproduced on a 12-16 GB GPU, because the largest batch size that fits in memory is too small. With the given hyperparameters, however, BERT-Base can be fine-tuned on the different tasks with a single GPU that has at least 12 GB of memory.

This section describes how to fine-tune BERT-Base on sentence-level classification tasks and on the standard question-answering dataset (SQuAD) using a single GPU; for fine-tuning BERT-Large, refer to the original project.

Below is the sentence-level classification example from the original project. Before running it, download the GLUE data with the provided script and place it under $GLUE_DIR, then download the pre-trained BERT-Base model and unzip it to $BERT_BASE_DIR.

GLUE data script: https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e

The example fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC), which contains only about 3,600 training examples, so fine-tuning takes just a few minutes on most GPUs.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

The output looks like this:

***** Eval results *****

eval_accuracy = 0.845588

eval_loss = 0.505248

global_step = 343

loss = 0.505248

The dev-set accuracy here is 84.55%. Small datasets like MRPC have high variance in dev accuracy, even when starting from the same pre-trained checkpoint: if you re-run several times (using a different output_dir each time), results will range between 84% and 88%. Note: you may see the message "Running train on CPU."; this only means the model is not running on a Cloud TPU.

Extracting semantic features with pre-trained BERT

For experiments beyond the 11 tasks in the paper, fixed-length semantic feature vectors can also be extracted from pre-trained BERT. In some cases, using pre-computed contextual embeddings is more effective than fine-tuning the whole model end to end, and it also sidesteps most out-of-memory problems. Here, the contextual embedding of each input token is the fixed-length representation produced by the hidden layers of the pre-trained model.

For example, features can be extracted with the script extract_features.py:

# Sentence A and Sentence B are separated by the ||| delimiter.
# For single sentence inputs, don't use the delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

The script above creates a JSON file (one line per input line) containing the BERT activations for each Transformer layer specified by --layers (-1 is the final hidden layer). Note that this script produces very large output files; by default each input token takes roughly 15 KB.
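
The output can then be read back line by line; the field names below ("features", "token", "layers", "values") reflect my reading of extract_features.py, so double-check them against your own output file:

  import json

  with open("/tmp/output.jsonl") as f:
    example = json.loads(f.readline())            # one JSON object per input line

  for feature in example["features"]:             # one entry per input token
    token = feature["token"]                      # e.g. "[CLS]", "who", ...
    last_layer = feature["layers"][0]["values"]   # layer -1, the last hidden layer
    print(token, len(last_layer))                 # a 768-dimensional vector per token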

Finally, the project authors say they will soon address the high GPU memory usage and release multilingual pre-trained BERT models. They state that for any language with a reasonably large Wikipedia they can provide a pre-trained model, so a Chinese pre-trained BERT from Google can be expected.

def get_pooled_output(self):
  # print("self.pooled_output", self.pooled_output)  # shape=(32, 768)
  return self.pooled_output

def get_sequence_output(self):
  """Gets final hidden layer of encoder.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
    to the final hidden of the transformer encoder.
  """
  # print("self.sequence_output", self.sequence_output)
  # self.sequence_output Tensor("bert/encoder/Reshape_13:0", shape=(32, 128, 768), dtype=float32)
  return self.sequence_output

def get_all_encoder_layers(self):
  return self.all_encoder_layers

def get_embedding_output(self):
  """Gets output of the embedding lookup (i.e., input to the transformer).

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
    to the output of the embedding layer, after summing the word
    embeddings with the positional embeddings and the token type embeddings,
    then performing layer normalization. This is the input to the transformer.
  """
  # print("self.embedding_output", self.embedding_output)
  return self.embedding_output

 
