bert_config.json: the model's parameter configuration
- {
-   "attention_probs_dropout_prob": 0.1,  # dropout probability applied to the attention probabilities after the softmax
-   "hidden_act": "gelu",                 # activation function
-   "hidden_dropout_prob": 0.1,           # dropout probability for the hidden layers
-   "hidden_size": 768,                   # number of hidden units
-   "initializer_range": 0.02,            # initializer range
-   "intermediate_size": 3072,            # feed-forward (up-projection) dimension
-   "max_position_embeddings": 512,       # must be at least seq_length; used to generate the position embeddings
-   "num_attention_heads": 12,            # number of attention heads in each layer
-   "num_hidden_layers": 12,              # number of hidden layers
-   "type_vocab_size": 2,                 # number of segment_ids classes: [0, 1]
-   "vocab_size": 30522                   # vocabulary size
- }
Input parameters: input_ids, input_mask, and token_type_ids correspond to the input_ids, input_mask, and segment_ids produced in the previous article.
The model first embeds input_ids and token_type_ids, feeds the embedding result through the Transformer, and finally obtains the encoded representation.
The model configuration class:
- def __init__(self,
- vocab_size,
- hidden_size=768,
- num_hidden_layers=12,
- num_attention_heads=12,
- intermediate_size=3072,
- hidden_act="gelu",
- hidden_dropout_prob=0.1,
- attention_probs_dropout_prob=0.1,
- max_position_embeddings=512,
- type_vocab_size=16,
- initializer_range=0.02):
- """Constructs BertConfig.
- Args:
- vocab_size: Vocabulary size of `inputs_ids` in `BertModel` (size of the dictionary).
- hidden_size: Size of the encoder layers and the pooler layer (number of hidden units).
- num_hidden_layers: Number of hidden layers in the Transformer encoder.
- num_attention_heads: Number of attention heads for each attention layer in
- the Transformer encoder (number of multi-head attention heads).
- intermediate_size: The size of the "intermediate" (i.e., feed-forward)
- layer in the Transformer encoder.
- hidden_act: The non-linear activation function (function or string) in the
- encoder and pooler.
- hidden_dropout_prob: The dropout probability for all fully connected
- layers in the embeddings, encoder, and pooler.
- attention_probs_dropout_prob: The dropout ratio for the attention
- probabilities.
- max_position_embeddings: The maximum sequence length that this model might
- ever be used with. Typically set this to something large just in case
- (e.g., 512 or 1024 or 2048).
- type_vocab_size: The vocabulary size of the `token_type_ids` passed into
- `BertModel`.
- initializer_range: The stdev of the truncated_normal_initializer for
- initializing all weight matrices.
- """
- self.vocab_size = vocab_size
- self.hidden_size = hidden_size
- self.num_hidden_layers = num_hidden_layers
- self.num_attention_heads = num_attention_heads
- self.hidden_act = hidden_act
- self.intermediate_size = intermediate_size
- self.hidden_dropout_prob = hidden_dropout_prob
- self.attention_probs_dropout_prob = attention_probs_dropout_prob
- self.max_position_embeddings = max_position_embeddings
- self.type_vocab_size = type_vocab_size
- self.initializer_range = initializer_range
-
- @classmethod
- def from_dict(cls, json_object):
- """Constructs a `BertConfig` from a Python dictionary of parameters."""
- config = BertConfig(vocab_size=None)
- for (key, value) in six.iteritems(json_object):
- config.__dict__[key] = value
- return config
-
- @classmethod
- def from_json_file(cls, json_file):
- """Constructs a `BertConfig` from a json file of parameters."""
- with tf.gfile.GFile(json_file, "r") as reader:
- text = reader.read()
- return cls.from_dict(json.loads(text))
-
- def to_dict(self):
- """Serializes this instance to a Python dictionary."""
- output = copy.deepcopy(self.__dict__)
- return output
-
- def to_json_string(self):
- """Serializes this instance to a JSON string."""
- return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
-
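For reference, here is a minimal sketch of loading this configuration from bert_config.json with the class above (the file path is an assumption; point it at wherever the pre-trained model was unzipped):
- # Minimal sketch: load the config shown at the top of this article.
- # The path below is hypothetical; adjust it to your own bert_config.json.
- import modeling
- 
- config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
- print(config.hidden_size)        # 768
- print(config.num_hidden_layers)  # 12
- print(config.to_json_string())   # serializes back to the JSON shown above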
For the model as a whole, it is important to be clear about the following:
1. What is fed into the model?
2. What are the model's labels?
3. How is the loss computed?
The model's input comes from train_input_fn:
- [tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
- segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
- is_random_next: False
- masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124
- masked_lm_labels: - - name is ##port , guardian and ##s lecture , sir pens stomach - of ##s - bucket
-
- ]
- self.sequence_output Tensor("bert/encoder/Reshape_13:0", shape=(32, 128, 768), dtype=float32)
- self.pooled_output Tensor("bert/pooler/dense/Tanh:0", shape=(32, 768), dtype=float32)
- self.embedding_table <tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30522, 768) dtype=float32_ref>
self.sequence_output: the input for predicting the masked words
self.pooled_output: the input for predicting the next sentence
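As a rough sketch of how these two tensors feed BERT's two pre-training heads (the real heads are built in run_pretraining.py; the function below is illustrative only and assumes `import tensorflow as tf`):
- # Illustrative sketch; the actual heads are defined in run_pretraining.py.
- def pretraining_heads_sketch(sequence_output, pooled_output, vocab_size):
-     # sequence_output: [batch_size, seq_length, hidden_size] -> masked-LM prediction
-     # pooled_output:   [batch_size, hidden_size]             -> next-sentence prediction
-     mlm_logits = tf.layers.dense(sequence_output, vocab_size)  # per-token vocabulary logits
-     nsp_logits = tf.layers.dense(pooled_output, 2)             # is_random_next: yes / no
-     return mlm_logits, nsp_logits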
-
- def embedding_lookup(input_ids,
- vocab_size,
- embedding_size=128,
- initializer_range=0.02,
- word_embedding_name="word_embeddings",
- use_one_hot_embeddings=False):
- """Looks up words embeddings for id tensor.
- Args:
- input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
- ids.
- vocab_size: int. Size of the embedding vocabulary.
- embedding_size: int. Width of the word embeddings.
- initializer_range: float. Embedding initialization range.
- word_embedding_name: string. Name of the embedding table.
- use_one_hot_embeddings: bool. If True, use one-hot method for word
- embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
- for TPUs.
- Returns:
- float Tensor of shape [batch_size, seq_length, embedding_size].
- """
- # This function assumes that the input is of shape [batch_size, seq_length,
- # num_inputs].
- #
- # If the input is a 2D tensor of shape [batch_size, seq_length], we
- # reshape to [batch_size, seq_length, 1].
- if input_ids.shape.ndims == 2:
- input_ids = tf.expand_dims(input_ids, axis=[-1])
- #print(input_ids) #shape=(32, 128, 1)
-
- embedding_table = tf.get_variable(
- name=word_embedding_name,
- shape=[vocab_size, embedding_size],
- initializer=create_initializer(initializer_range))
- #print(embedding_table) #shape=(30522, 768)
-
- if use_one_hot_embeddings:
- flat_input_ids = tf.reshape(input_ids, [-1])
- one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
- output = tf.matmul(one_hot_input_ids, embedding_table)
- else:
- output = tf.nn.embedding_lookup(embedding_table, input_ids)
-
- input_shape = get_shape_list(input_ids)
-
- output = tf.reshape(output,
- input_shape[0:-1] + [input_shape[-1] * embedding_size])
- #print(output) #shape=(32, 128, 768): batch_size=32, seq_length=128, embedding_size=hidden_size=768
- #print(embedding_table) #shape=(30522, 768)
- return (output, embedding_table)
The printed output has shape (32, 128, 768): batch_size=32, seq_length=128, hidden_size=768.
The embedding_table has shape (30522, 768).
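A small usage sketch of embedding_lookup with a dummy batch (assuming the function above is in scope and tensorflow is imported as tf; the token ids are arbitrary and only the shapes matter):
- # Sketch: look up embeddings for a dummy batch of token ids.
- input_ids = tf.constant([[101, 2054, 2001, 102]])  # [batch_size=1, seq_length=4]
- embedding_output, embedding_table = embedding_lookup(
-     input_ids=input_ids,
-     vocab_size=30522,
-     embedding_size=768,
-     initializer_range=0.02,
-     word_embedding_name="word_embeddings",
-     use_one_hot_embeddings=False)
- # embedding_output: [1, 4, 768]; embedding_table: [30522, 768]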
embedding_postprocessor adds the token_type_embedding and the position_embedding, i.e. the Segment Embeddings and Position Embeddings in the figure, which together form the structure of the model input.
Embedding structure diagram: taken from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
However, the Position Embeddings in this code differ from the original Transformer: here they are learned, whereas the original Transformer (below) uses fixed values.
[Figure: the fixed sinusoidal position encoding of the original Transformer]
As shown above, the input consists of two natural sentences, sentence A "my dog is cute" and sentence B "he likes playing". Each word and special symbol must first be converted into an embedding vector, because the network can only operate on numbers. The special token [SEP] separates the two sentences; the first half receives segment encoding A and the second half receives segment encoding B.
Because BERT also models the relationship between sentences, one of its tasks is to predict whether sentence B actually follows sentence A. This classification task uses the special token [CLS] placed at the front of the A/B pair, which can be viewed as an aggregate representation of the whole input sequence.
Finally, the position encoding is required by the Transformer architecture itself: a purely attention-based model cannot encode word order the way a CNN or RNN can, yet it is exactly this property that lets it relate two words regardless of their distance. To make the Transformer aware of word positions, we therefore add positional information to each word via position encodings.
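For comparison, the fixed (non-learned) position encoding of the original Transformer can be sketched as follows; this is for illustration only and is not part of the BERT code:
- import numpy as np
- 
- def sinusoidal_position_encoding(seq_length, hidden_size):
-     """Fixed sine/cosine position encodings from "Attention Is All You Need" (not used by BERT)."""
-     positions = np.arange(seq_length)[:, np.newaxis]   # [seq_length, 1]
-     dims = np.arange(hidden_size)[np.newaxis, :]       # [1, hidden_size]
-     angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(hidden_size))
-     angles = positions * angle_rates
-     encoding = np.zeros((seq_length, hidden_size), dtype=np.float32)
-     encoding[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions use sine
-     encoding[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions use cosine
-     return encoding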
-
- # Applies token-type and position embeddings on top of the word embeddings
- def embedding_postprocessor(input_tensor,
- use_token_type=False,
- token_type_ids=None,
- token_type_vocab_size=16,
- token_type_embedding_name="token_type_embeddings",
- use_position_embeddings=True,
- position_embedding_name="position_embeddings",
- initializer_range=0.02,
- max_position_embeddings=512,
- dropout_prob=0.1):
- #print(input_tensor) #shape=(32, 128, 768)
- """Performs various post-processing on a word embedding tensor.
- Args:
- input_tensor: float Tensor of shape [batch_size, seq_length,embedding_size].
- use_token_type: bool. Whether to add embeddings for `token_type_ids`.
- token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
- Must be specified if `use_token_type` is True.
- token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
- token_type_embedding_name: string. The name of the embedding table variable
- for token type ids.
- use_position_embeddings: bool. Whether to add position embeddings for the
- position of each token in the sequence.
- position_embedding_name: string. The name of the embedding table variable
- for positional embeddings.
- initializer_range: float. Range of the weight initialization.
- max_position_embeddings: int. Maximum sequence length that might ever be
- used with this model. This can be longer than the sequence length of
- input_tensor, but cannot be shorter.
- dropout_prob: float. Dropout probability applied to the final output tensor.
- Returns:
- float tensor with same shape as `input_tensor`.
- Raises:
- ValueError: One of the tensor shapes or input values is invalid.
- """
- input_shape = get_shape_list(input_tensor, expected_rank=3)
- batch_size = input_shape[0] #32
- seq_length = input_shape[1] #128
- width = input_shape[2] #768
-
- output = input_tensor
-
- if use_token_type:
- if token_type_ids is None:
- raise ValueError("`token_type_ids` must be specified if"
- "`use_token_type` is True.")
- token_type_table = tf.get_variable(
- name=token_type_embedding_name,
- shape=[token_type_vocab_size, width],
- initializer=create_initializer(initializer_range))
- # This vocab will be small so we always do one-hot here, since it is always
- # faster for a small vocabulary.
- flat_token_type_ids = tf.reshape(token_type_ids, [-1])
- one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
- token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
- token_type_embeddings = tf.reshape(token_type_embeddings,
- [batch_size, seq_length, width])
- output += token_type_embeddings
-
- if use_position_embeddings:
- assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
- with tf.control_dependencies([assert_op]):
- full_position_embeddings = tf.get_variable(
- name=position_embedding_name,
- shape=[max_position_embeddings, width],
- initializer=create_initializer(initializer_range))
- # Since the position embedding table is a learned variable, we create it
- # using a (long) sequence length `max_position_embeddings`. The actual
- # sequence length might be shorter than this, for faster training of
- # tasks that do not have long sequences.
- #
- # So `full_position_embeddings` is effectively an embedding table
- # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
- # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
- # perform a slice.
- position_embeddings = tf.slice(full_position_embeddings, [0, 0],
- [seq_length, -1])
- num_dims = len(output.shape.as_list())
-
- # Only the last two dimensions are relevant (`seq_length` and `width`), so
- # we broadcast among the first dimensions, which is typically just
- # the batch size.
- position_broadcast_shape = []
- for _ in range(num_dims - 2):
- position_broadcast_shape.append(1)
- position_broadcast_shape.extend([seq_length, width])
- position_embeddings = tf.reshape(position_embeddings,
- position_broadcast_shape)
- output += position_embeddings
-
- output = layer_norm_and_dropout(output, dropout_prob)
- #print(output) #shape=(32, 128, 768)
- return output
-
self.all_encoder_layers: the outputs of all 12 encoder layers
- [<tf.Tensor 'bert/encoder/Reshape_2:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_3:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_4:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_5:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_6:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_7:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_8:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_9:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_10:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_11:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_12:0' shape=(32, 128, 768) dtype=float32>,
- <tf.Tensor 'bert/encoder/Reshape_13:0' shape=(32, 128, 768) dtype=float32>]
The figure above shows the complete Transformer architecture; BERT uses only the part shown below.
That is, only the encoder side is used; the decoder on the right is not, and N = 12.
The embeddings first pass through multi-head attention, followed by a residual connection and layer_norm; the result then goes through the feed-forward network, again followed by a residual connection and layer_norm.
One detail worth noting: after the multi-head attention, this code first applies a fully connected projection before the residual and layer_norm. This projection corresponds to the output matrix W^O of multi-head attention in the original paper. The docstring below also confirms that only the encoder part is implemented:
This is almost an exact implementation of the original Transformer encoder
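Condensed into a sketch, each of the N = 12 encoder blocks built by the code below does the following (not a drop-in replacement; it assumes the helpers attention_layer, layer_norm, dropout, and gelu defined in modeling.py are in scope):
- # Condensed sketch of one encoder block; tensors are kept 2D, i.e. [batch_size*seq_length, hidden_size].
- def encoder_block_sketch(x, attention_mask, batch_size, seq_length,
-                          hidden_size=768, num_heads=12, intermediate_size=3072):
-     attn = attention_layer(
-         from_tensor=x, to_tensor=x, attention_mask=attention_mask,
-         num_attention_heads=num_heads, size_per_head=hidden_size // num_heads,
-         do_return_2d_tensor=True, batch_size=batch_size,
-         from_seq_length=seq_length, to_seq_length=seq_length)
-     attn = tf.layers.dense(attn, hidden_size)                      # output projection (W^O in the paper)
-     x = layer_norm(dropout(attn, 0.1) + x)                         # residual + layer norm
-     ffn = tf.layers.dense(x, intermediate_size, activation=gelu)   # feed-forward up-projection
-     ffn = tf.layers.dense(ffn, hidden_size)                        # down-projection
-     return layer_norm(dropout(ffn, 0.1) + x)                       # residual + layer norm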
-
- # The Transformer (encoder) model
- def transformer_model(input_tensor,
- attention_mask=None,
- hidden_size=768,
- num_hidden_layers=12,
- num_attention_heads=12,
- intermediate_size=3072,
- intermediate_act_fn=gelu,
- hidden_dropout_prob=0.1,
- attention_probs_dropout_prob=0.1,
- initializer_range=0.02,
- do_return_all_layers=False):
- """Multi-headed, multi-layer Transformer from "Attention is All You Need".
- This is almost an exact implementation of the original Transformer encoder.
- See the original paper:
- https://arxiv.org/abs/1706.03762
- Also see:
- https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
- Args:
- input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
- attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
- seq_length], with 1 for positions that can be attended to and 0 in
- positions that should not be.
- hidden_size: int. Hidden size of the Transformer.
- num_hidden_layers: int. Number of layers (blocks) in the Transformer.
- num_attention_heads: int. Number of attention heads in the Transformer.
- intermediate_size: int. The size of the "intermediate" (a.k.a., feed
- forward) layer.
- intermediate_act_fn: function. The non-linear activation function to apply
- to the output of the intermediate/feed-forward layer.
- hidden_dropout_prob: float. Dropout probability for the hidden layers.
- attention_probs_dropout_prob: float. Dropout probability of the attention
- probabilities.
- initializer_range: float. Range of the initializer (stddev of truncated
- normal).
- do_return_all_layers: Whether to also return all layers or just the final
- layer.
- Returns:
- float Tensor of shape [batch_size, seq_length, hidden_size], the final
- hidden layer of the Transformer (the tensor from the last layer).
- Raises:
- ValueError: A Tensor shape or parameter is invalid.
- """
- if hidden_size % num_attention_heads != 0:
- raise ValueError(
- "The hidden size (%d) is not a multiple of the number of attention "
- "heads (%d)" % (hidden_size, num_attention_heads))
-
- attention_head_size = int(hidden_size / num_attention_heads)
- input_shape = get_shape_list(input_tensor, expected_rank=3)
- batch_size = input_shape[0] #32
- seq_length = input_shape[1] #128
- input_width = input_shape[2] #768
-
- # The Transformer performs sum residuals on all layers so the input needs
- # to be the same as the hidden size.
- if input_width != hidden_size:
- raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
- (input_width, hidden_size))
-
- # We keep the representation as a 2D tensor to avoid re-shaping it back and
- # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
- # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
- # help the optimizer.
- prev_output = reshape_to_matrix(input_tensor)
-
- all_layer_outputs = []
- for layer_idx in range(num_hidden_layers):
- with tf.variable_scope("layer_%d" % layer_idx):
- layer_input = prev_output
- # Attention layer
- with tf.variable_scope("attention"):
- attention_heads = []
- with tf.variable_scope("self"):
- # Build the attention layer
- attention_head = attention_layer(
- from_tensor=layer_input,
- to_tensor=layer_input,
- attention_mask=attention_mask,
- num_attention_heads=num_attention_heads,
- size_per_head=attention_head_size,
- attention_probs_dropout_prob=attention_probs_dropout_prob,
- initializer_range=initializer_range,
- do_return_2d_tensor=True,
- batch_size=batch_size,
- from_seq_length=seq_length,
- to_seq_length=seq_length)
-
- attention_heads.append(attention_head)
-
- attention_output = None
- if len(attention_heads) == 1:
- attention_output = attention_heads[0]
- else:
- # In the case where we have other sequences, we just concatenate
- # them to the self-attention head before the projection.
- attention_output = tf.concat(attention_heads, axis=-1)
- # Run a linear projection of `hidden_size` then add a residual
- # with `layer_input`.
- # Project the attention output through a fully connected (dense) layer
- with tf.variable_scope("output"):
- attention_output = tf.layers.dense(
- attention_output,
- hidden_size,
- kernel_initializer=create_initializer(initializer_range))
- attention_output = dropout(attention_output, hidden_dropout_prob)
- attention_output = layer_norm(attention_output + layer_input)
-
- # The activation is only applied to the "intermediate" hidden layer.
- # Intermediate (feed-forward) layer
- with tf.variable_scope("intermediate"):
- intermediate_output = tf.layers.dense(
- attention_output,
- intermediate_size,
- activation=intermediate_act_fn,
- kernel_initializer=create_initializer(initializer_range))
-
- # Down-project back to `hidden_size` then add the residual.
- with tf.variable_scope("output"):
- layer_output = tf.layers.dense(
- intermediate_output,
- hidden_size,
- kernel_initializer=create_initializer(initializer_range))
- layer_output = dropout(layer_output, hidden_dropout_prob)
- layer_output = layer_norm(layer_output + attention_output)  # add the residual connection
- prev_output = layer_output  # this layer's output becomes the next layer's input
- all_layer_outputs.append(layer_output)  # collect every layer's output
-
- if do_return_all_layers:  # return all layers
- final_outputs = []
- for layer_output in all_layer_outputs:
- final_output = reshape_from_matrix(layer_output, input_shape)
- final_outputs.append(final_output)
- return final_outputs
- else:
- final_output = reshape_from_matrix(prev_output, input_shape)
- return final_output
The queries, keys, and values are first reshaped to [batch_size, num_heads, seq_length, size_per_head]. Scaled dot-product attention is then computed over these heads; after the softmax, the probabilities are multiplied by the values. The function finally returns a tensor of shape [batch_size*seq_length, hidden_size].
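Stripped of the head reshaping, the core of attention_layer below is standard scaled dot-product attention. A small self-contained sketch (Q, K, V are assumed to already be shaped [batch, num_heads, seq_length, size_per_head], and the mask is assumed already broadcast to the score shape):
- import math
- import tensorflow as tf
- 
- def scaled_dot_product_attention_sketch(Q, K, V, mask, size_per_head, dropout_prob=0.1):
-     # Raw scores: [batch, heads, from_seq, to_seq], scaled by 1/sqrt(size_per_head).
-     scores = tf.matmul(Q, K, transpose_b=True) / math.sqrt(float(size_per_head))
-     # Masked positions (mask == 0) receive a large negative bias, so the softmax assigns them ~0.
-     scores += (1.0 - tf.cast(mask, tf.float32)) * -10000.0
-     probs = tf.nn.softmax(scores)                               # attention probabilities
-     probs = tf.nn.dropout(probs, keep_prob=1.0 - dropout_prob)  # as in the original Transformer
-     return tf.matmul(probs, V)                                  # weighted sum of values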
-
- #multi-headed attention
- def attention_layer(from_tensor,
- to_tensor,
- attention_mask=None,
- num_attention_heads=1,
- size_per_head=256,
- query_act=None,
- key_act=None,
- value_act=None,
- attention_probs_dropout_prob=0.0,
- initializer_range=0.02,
- do_return_2d_tensor=False,
- batch_size=None,
- from_seq_length=None,
- to_seq_length=None):
- """Performs multi-headed attention from `from_tensor` to `to_tensor`.
- This is an implementation of multi-headed attention based on "Attention
- is all you Need". If `from_tensor` and `to_tensor` are the same, then
- this is self-attention. Each timestep in `from_tensor` attends to the
- corresponding sequence in `to_tensor`, and returns a fixed-width vector.
- This function first projects `from_tensor` into a "query" tensor and
- `to_tensor` into "key" and "value" tensors. These are (effectively) a list
- of tensors of length `num_attention_heads`, where each tensor is of shape
- [batch_size, seq_length, size_per_head].
- Then, the query and key tensors are dot-producted and scaled. These are
- softmaxed to obtain attention probabilities. The value tensors are then
- interpolated by these probabilities, then concatenated back to a single
- tensor and returned.
- In practice, the multi-headed attention are done with transposes and
- reshapes rather than actual separate tensors.
- Args:
- from_tensor: float Tensor of shape [batch_size, from_seq_length,
- from_width].
- to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
- attention_mask: (optional) int32 Tensor of shape [batch_size,
- from_seq_length, to_seq_length]. The values should be 1 or 0. The
- attention scores will effectively be set to -infinity for any positions in
- the mask that are 0, and will be unchanged for positions that are 1.
- num_attention_heads: int. Number of attention heads.
- size_per_head: int. Size of each attention head.
- query_act: (optional) Activation function for the query transform.
- key_act: (optional) Activation function for the key transform.
- value_act: (optional) Activation function for the value transform.
- attention_probs_dropout_prob: (optional) float. Dropout probability of the
- attention probabilities.
- initializer_range: float. Range of the weight initializer.
- do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
- * from_seq_length, num_attention_heads * size_per_head]. If False, the
- output will be of shape [batch_size, from_seq_length, num_attention_heads
- * size_per_head].
- batch_size: (Optional) int. If the input is 2D, this might be the batch size
- of the 3D version of the `from_tensor` and `to_tensor`.
- from_seq_length: (Optional) If the input is 2D, this might be the seq length
- of the 3D version of the `from_tensor`.
- to_seq_length: (Optional) If the input is 2D, this might be the seq length
- of the 3D version of the `to_tensor`.
- Returns:
- float Tensor of shape [batch_size, from_seq_length,
- num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
- true, this will be of shape [batch_size * from_seq_length,
- num_attention_heads * size_per_head]).
- Raises:
- ValueError: Any of the arguments or tensor shapes are invalid.
- """
-
- def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
- seq_length, width):
- output_tensor = tf.reshape(
- input_tensor, [batch_size, seq_length, num_attention_heads, width])
-
- output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
- return output_tensor
-
- from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
- to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
-
- if len(from_shape) != len(to_shape):
- raise ValueError(
- "The rank of `from_tensor` must match the rank of `to_tensor`.")
-
- if len(from_shape) == 3:
- batch_size = from_shape[0]
- from_seq_length = from_shape[1]
- to_seq_length = to_shape[1]
- elif len(from_shape) == 2:
- if (batch_size is None or from_seq_length is None or to_seq_length is None):
- raise ValueError(
- "When passing in rank 2 tensors to attention_layer, the values "
- "for `batch_size`, `from_seq_length`, and `to_seq_length` "
- "must all be specified.")
-
- # Scalar dimensions referenced here:
- # B = batch size (number of sequences)
- # F = `from_tensor` sequence length
- # T = `to_tensor` sequence length
- # N = `num_attention_heads`
- # H = `size_per_head`
-
- from_tensor_2d = reshape_to_matrix(from_tensor)
- to_tensor_2d = reshape_to_matrix(to_tensor)
-
- # `query_layer` = [B*F, N*H]
- query_layer = tf.layers.dense(
- from_tensor_2d,
- num_attention_heads * size_per_head,
- activation=query_act,
- name="query",
- kernel_initializer=create_initializer(initializer_range))
-
- # `key_layer` = [B*T, N*H]
- key_layer = tf.layers.dense(
- to_tensor_2d,
- num_attention_heads * size_per_head,
- activation=key_act,
- name="key",
- kernel_initializer=create_initializer(initializer_range))
-
- # `value_layer` = [B*T, N*H]
- value_layer = tf.layers.dense(
- to_tensor_2d,
- num_attention_heads * size_per_head,
- activation=value_act,
- name="value",
- kernel_initializer=create_initializer(initializer_range))
-
- # `query_layer` = [B, N, F, H]
- query_layer = transpose_for_scores(query_layer, batch_size,
- num_attention_heads, from_seq_length,
- size_per_head)
-
- # `key_layer` = [B, N, T, H]
- key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
- to_seq_length, size_per_head)
-
- # Take the dot product between "query" and "key" to get the raw
- # attention scores.
- # `attention_scores` = [B, N, F, T]
- attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
- attention_scores = tf.multiply(attention_scores,
- 1.0 / math.sqrt(float(size_per_head)))
-
- if attention_mask is not None:
- # `attention_mask` = [B, 1, F, T]
- attention_mask = tf.expand_dims(attention_mask, axis=[1])
-
- # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
- # masked positions, this operation will create a tensor which is 0.0 for
- # positions we want to attend and -10000.0 for masked positions.
- adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
-
- # Since we are adding it to the raw scores before the softmax, this is
- # effectively the same as removing these entirely.
- attention_scores += adder
-
- # Normalize the attention scores to probabilities.
- # `attention_probs` = [B, N, F, T]
- attention_probs = tf.nn.softmax(attention_scores)
-
- # This is actually dropping out entire tokens to attend to, which might
- # seem a bit unusual, but is taken from the original Transformer paper.
- attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
-
- # `value_layer` = [B, T, N, H]
- value_layer = tf.reshape(
- value_layer,
- [batch_size, to_seq_length, num_attention_heads, size_per_head])
-
- # `value_layer` = [B, N, T, H]
- value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
-
- # `context_layer` = [B, N, F, H]
- context_layer = tf.matmul(attention_probs, value_layer)
-
- # `context_layer` = [B, F, N, H]
- context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
-
- if do_return_2d_tensor:
- # `context_layer` = [B*F, N*V]
- context_layer = tf.reshape(
- context_layer,
- [batch_size * from_seq_length, num_attention_heads * size_per_head])
- else:
- # `context_layer` = [B, F, N*V]
- context_layer = tf.reshape(
- context_layer,
- [batch_size, from_seq_length, num_attention_heads * size_per_head])
-
- return context_layer
- Tensor("bert/encoder/layer_0/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_1/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_2/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_3/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_4/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_5/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_6/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_7/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_8/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_9/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_10/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
- Tensor("bert/encoder/layer_11/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
The shape (4096, 768) above corresponds to [batch_size * from_seq_length, num_attention_heads * size_per_head] = [32 * 128, 12 * 64].
How is the model used? The BertModel class provides two functions. get_pooled_output returns the representation of the first token ([CLS]) of each example in the batch; BERT treats this token as summarizing the whole sequence, so it is suitable for sentence-level classification problems. get_sequence_output returns BERT's final output with shape [batch_size, seq_length, hidden_size], which can be read as the final representation of every token in each example and is suitable for seq2seq-style problems.
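A hedged usage sketch (TensorFlow 1.x) showing how to build a BertModel and fetch both outputs; the placeholder shapes and the config path are assumptions:
- import tensorflow as tf
- import modeling
- 
- config = modeling.BertConfig.from_json_file("bert_config.json")  # path is hypothetical
- input_ids      = tf.placeholder(tf.int32, [None, 128])
- input_mask     = tf.placeholder(tf.int32, [None, 128])
- token_type_ids = tf.placeholder(tf.int32, [None, 128])
- 
- model = modeling.BertModel(
-     config=config,
-     is_training=False,   # disables dropout for inference
-     input_ids=input_ids,
-     input_mask=input_mask,
-     token_type_ids=token_type_ids)
- 
- pooled   = model.get_pooled_output()    # [batch_size, 768] -> sentence-level tasks
- sequence = model.get_sequence_output()  # [batch_size, 128, 768] -> token-level tasks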
BERT stands for Bidirectional Encoder Representations from Transformers, where "bidirectional" means that when processing a word the model can use information from both the preceding and the following words. This bidirectionality comes from the fact that, unlike a traditional language model that predicts the most likely current word given all preceding words, BERT randomly masks some words and predicts them using all of the unmasked words. The figure below shows three pre-training models: BERT and ELMo both use bidirectional information, while OpenAI GPT uses only unidirectional information.
As shown above, BERT can be seen as a new model combining the strengths of OpenAI GPT and ELMo. ELMo obtains bidirectional information from two independently trained LSTMs, while OpenAI GPT uses the newer Transformer but, being a classical language model, can only use unidirectional information. BERT's main goal is to improve the pre-training tasks on top of OpenAI GPT so that it enjoys both the deep Transformer architecture and bidirectional information.
The fine-tuning process
After pre-training, the model is applied to various NLP tasks with a simple fine-tuning step. The details differ from task to task, but BERT is powerful enough to provide effective representations for most NLP tasks. For classification problems, such as predicting whether an A/B sentence pair is a question-answer pair or whether a single sentence is grammatical, we can directly use the vector C output at the special token [CLS], i.e. P = softmax(C * W); a new task only needs to fine-tune the weight matrix W.
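In code, that classification head is just one projection of the [CLS] vector C; a sketch in the spirit of run_classifier.py (variable names are illustrative, and C = model.get_pooled_output()):
- # Sketch of the sentence-level classification head; num_labels is task-specific.
- def classification_head_sketch(C, hidden_size, num_labels):
-     W = tf.get_variable("output_weights", [num_labels, hidden_size],
-                         initializer=tf.truncated_normal_initializer(stddev=0.02))
-     b = tf.get_variable("output_bias", [num_labels], initializer=tf.zeros_initializer())
-     logits = tf.matmul(C, W, transpose_b=True) + b  # [batch_size, num_labels]
-     return tf.nn.softmax(logits, axis=-1)           # P = softmax(C * W)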
For other sequence-labelling or generation tasks, we can likewise make predictions from the corresponding BERT outputs, e.g. emitting one label or word per time step. The figure below shows how BERT is fine-tuned on 11 tasks; each adds only one extra output layer. In the figure, Tok denotes a token, E the input embedding vector, and T_i the contextual vector output for the i-th token after BERT processing.
As shown above, sentence-level classification only needs the C vector corresponding to [CLS]; examples in (a) include QNLI (does a question-answer pair contain the correct answer?) and STS-B (how similar are two sentences?), which both model the relationship between sentences. Sentence-level classification also covers (b): SST-2 (the sentiment of a single sentence) and CoLA (grammatical acceptability), which model relations within a sentence.
For the SQuAD v1.1 question-answering dataset, the researchers feed the question as sentence A and the passage containing the answer as sentence B. From the output vectors of sentence B, the model predicts the position and length of the correct answer. Finally, for the named-entity-recognition dataset CoNLL, the output vector T of every token predicts its label, e.g. person or location.
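For the SQuAD-style span prediction just described, one hedged sketch is a start/end projection applied to every token vector T_i (run_squad.py does essentially this, though the details differ):
- # Sketch of a SQuAD-style span head; sequence_output = model.get_sequence_output().
- def span_head_sketch(sequence_output):
-     span_logits = tf.layers.dense(sequence_output, 2)            # one start and one end logit per token
-     start_logits, end_logits = tf.unstack(span_logits, axis=-1)  # each [batch_size, seq_length]
-     start_probs = tf.nn.softmax(start_logits, axis=-1)           # distribution over answer start positions
-     end_probs = tf.nn.softmax(end_logits, axis=-1)               # distribution over answer end positions
-     return start_probs, end_probs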
Fine-tuning pre-trained BERT
The project notes that all 11 fine-tuning experiments in the original paper were run on a single Cloud TPU (64 GB RAM). Most of the paper's BERT-Large results cannot currently be reproduced on a GPU with 12 GB to 16 GB of memory, because the largest batch size that fits is too small. However, with the given hyperparameters, fine-tuning the BERT-Base model on the various tasks should run on a single GPU with at least 12 GB of memory.
This section mainly covers fine-tuning the BERT-Base model on sentence-level classification tasks and on the standard question-answering dataset (SQuAD), using a single GPU. For fine-tuning the BERT-Large model, refer to the original project.
The following is the sentence-level classification fine-tuning shown in the original project. Before running this example, you must download the GLUE data with the script below and place it in $GLUE_DIR, then download the pre-trained BERT-Base model and unzip it into $BERT_BASE_DIR.
GLUE data script: https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e
The example fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 examples, so fine-tuning takes only a few minutes on most GPUs.
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/
The output is as follows:
***** Eval results *****
eval_accuracy = 0.845588
eval_loss = 0.505248
global_step = 343
loss = 0.505248
As you can see, the dev-set accuracy is 84.55%. Small datasets like MRPC show high variance in dev-set accuracy, even when starting from the same pre-trained checkpoint. If you rerun several times (making sure to use a different output_dir each time), the results will range between 84% and 88%. Note: you may see the message "Running train on CPU."; this simply means the model is not running on a Cloud TPU.
Extracting semantic features with pre-trained BERT
For experiments beyond the 11 tasks in the original paper, we can also use pre-trained BERT to extract fixed-length semantic feature vectors. In certain cases, directly obtaining pre-trained contextual embeddings is more effective than fine-tuning the whole model end to end, and it also alleviates most out-of-memory problems. In this setting, the contextual embedding of each input token is the fixed-length contextual representation produced by the hidden layers of the pre-trained model.
For example, we can extract features with the script extract_features.py:
# Sentence A and Sentence B are separated by the ||| delimiter.
# For single sentence inputs, don't use the delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt
python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8
The script above creates a JSON file (one line per input line) containing the BERT activations from each Transformer layer specified by --layers (-1 is the last hidden layer of the Transformer, and so on). Note that this script produces very large output files; by default, each input token takes about 15 KB.
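A small sketch of reading that JSONL output and pulling the last-layer vector of the first token; the field names ("features", "layers", "values") follow my reading of the script's output format and should be treated as assumptions:
- import json
- 
- # Field names are assumed from extract_features.py's output; verify against your own file.
- with open("/tmp/output.jsonl") as f:
-     record = json.loads(f.readline())             # one JSON object per input line
- first_token = record["features"][0]               # entry for the first token (e.g. [CLS])
- last_layer = first_token["layers"][0]["values"]   # layer -1 was requested first above
- print(first_token["token"], len(last_layer))      # e.g. "[CLS]" 768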
Finally, the project authors say they will soon address the issue of high GPU memory usage and will release multilingual BERT pre-trained models. They state that as long as a language has a reasonably large Wikipedia, they can provide a pre-trained model, so we can look forward to Google releasing a BERT model pre-trained on Chinese corpora.
-
- def get_pooled_output(self):
- #print("self.pooled_output",self.pooled_output) # shape=(32, 768)
- return self.pooled_output
-
- def get_sequence_output(self):
- """Gets final hidden layer of encoder.
- Returns:
- float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
- to the final hidden of the transformer encoder.
- """
- #print("self.sequence_output",self.sequence_output)
- # #self.sequence_output Tensor("bert/encoder/Reshape_13:0", shape=(32, 128, 768), dtype=float32)
- return self.sequence_output
-
- def get_all_encoder_layers(self):
- return self.all_encoder_layers
-
- def get_embedding_output(self):
- """Gets output of the embedding lookup (i.e., input to the transformer).
- Returns:
- float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
- to the output of the embedding layer, after summing the word
- embeddings with the positional embeddings and the token type embeddings,
- then performing layer normalization. This is the input to the transformer.
- """
- #print("self.embedding_output",self.embedding_output)
- return self.embedding_output