
A Detailed Walkthrough of BERT's Key Source Code

This article walks through the model's core code, modeling.py. Before reading it, you should have some familiarity with the theory behind BERT, especially the structure of the Transformer; there is plenty of material online, so this article does not revisit the theory in depth.

One of my goals in writing this up is to organize my own learning: only when you try to explain something do you discover what you do and don't understand. My understanding is limited, so corrections and criticism are welcome.

1. Configuration

 

```python
class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
```

The model configuration is straightforward. In order, the fields are: vocabulary size, hidden-layer size, number of Transformer layers, number of attention heads, activation function, intermediate-layer size, hidden-layer dropout rate, attention dropout rate, maximum sequence length, vocabulary size for token_type_ids, and the stddev of the truncated_normal_initializer.
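As a quick sanity check of these defaults, here is a minimal sketch in plain Python. The config is mimicked as a dict rather than the real BertConfig class, and vocab_size=30522 is assumed here to match the released English BERT-Base vocabulary. It shows the constraint that transformer_model enforces later: hidden_size must divide evenly across the attention heads.

```python
# Hypothetical stand-in for a BertConfig with BERT-Base-like defaults.
config = {
    "vocab_size": 30522,   # assumed: released English BERT-Base vocab size
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
}

# transformer_model() later raises ValueError if this does not divide evenly.
assert config["hidden_size"] % config["num_attention_heads"] == 0
size_per_head = config["hidden_size"] // config["num_attention_heads"]
print(size_per_head)  # 64
```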

2. Word embedding

 

```python
def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
```

This builds the embedding_table and performs the word-embedding lookup, with an optional one-hot path, returning both the embedded output and the embedding_table.
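To make the two lookup paths concrete, here is a minimal pure-Python sketch (plain lists stand in for the TF tensors, and the tiny table values are made up). It shows that multiplying a one-hot matrix by the embedding table selects exactly the same rows as a direct lookup; the two branches above differ only in how that selection is executed.

```python
# Toy embedding table: 3 ids, embedding_size 2 (values chosen arbitrarily).
embedding_table = [
    [0.1, 0.2],   # id 0
    [0.3, 0.4],   # id 1
    [0.5, 0.6],   # id 2
]
vocab_size = len(embedding_table)
emb_size = len(embedding_table[0])

def one_hot_lookup(ids):
    """Mimics the tf.one_hot + tf.matmul branch."""
    out = []
    for i in ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        # Row of one_hot x table: a weighted sum that picks out row i.
        row = [sum(one_hot[j] * embedding_table[j][d] for j in range(vocab_size))
               for d in range(emb_size)]
        out.append(row)
    return out

def direct_lookup(ids):
    """Mimics the tf.nn.embedding_lookup branch."""
    return [embedding_table[i] for i in ids]

print(one_hot_lookup([2, 0]) == direct_lookup([2, 0]))  # True
```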

3. Post-processing the word embeddings

 

```python
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output
```

This step adds extra information to the word embeddings: each token's position embedding and token-type embedding are added element-wise, and the result is returned after layer normalization and dropout.
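The additive step can be sketched in plain Python with tiny made-up shapes (values chosen to be exact in binary floating point). Note how the same positional offset is broadcast to every example in the batch, which is what the [1, seq_length, width] reshape achieves in the real code; layer norm and dropout are omitted here.

```python
# Tiny stand-in shapes: batch of 2, sequence of 3, width 2.
batch_size, seq_length, width = 2, 3, 2

# Hypothetical values: word embeddings all 1.0, one token type shared by all
# tokens (embedding 0.25), position embedding 0.125 * position.
word_emb = [[[1.0] * width for _ in range(seq_length)] for _ in range(batch_size)]
token_type_emb = [[0.25] * width for _ in range(seq_length)]
position_emb = [[0.125 * s] * width for s in range(seq_length)]

# output = word + token_type + position, element-wise; position and
# token-type rows are reused (broadcast) across the batch dimension.
output = [
    [[word_emb[b][s][w] + token_type_emb[s][w] + position_emb[s][w]
      for w in range(width)]
     for s in range(seq_length)]
    for b in range(batch_size)
]
print(output[1][2])  # [1.5, 1.5]: 1.0 + 0.25 + 0.125*2, same for every batch item
```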

4. Building the attention mask

 

```python
def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  mask = broadcast_ones * to_mask
  return mask
```

This converts a 2D mask of shape [batch_size, to_seq_length] into the 3D mask of shape [batch_size, from_seq_length, to_seq_length] used inside attention.
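A pure-Python re-enactment of this broadcast, with plain lists in place of TF tensors: the same per-key padding row is simply copied for every query ("from") position, so every query sees the same set of valid keys.

```python
def create_attention_mask(to_mask, from_seq_length):
    """[batch, to_seq] 0/1 padding mask -> [batch, from_seq, to_seq] floats."""
    return [
        [[float(m) for m in row] for _ in range(from_seq_length)]
        for row in to_mask
    ]

to_mask = [[1, 1, 0]]  # batch of 1: the third token is padding
mask = create_attention_mask(to_mask, from_seq_length=2)
print(mask)  # [[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]]
```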

5. Attention layer

 

```python
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
    attention_scores += adder

  attention_probs = tf.nn.softmax(attention_scores)
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
```

Now for the centerpiece of the network: the core of the Transformer lives here. The input from_tensor serves as the query, and to_tensor serves as the key and value. For self-attention, from_tensor and to_tensor are the same tensor.

(1) The function first validates the input shapes and obtains batch_size, from_seq_length, and to_seq_length. A 3D input tensor is flattened into a 2D matrix (for a word-embedding input, for example, [batch_size, seq_length, hidden_size] -> [batch_size*seq_length, hidden_size]).

(2) Fully connected linear projections produce query_layer, key_layer, and value_layer, whose last dimension becomes num_attention_heads * size_per_head (throughout the model, hidden_size = num_attention_heads * size_per_head by default). transpose_for_scores then splits them into multiple heads.
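The head-splitting reshape and transpose can be sketched with nested lists and hypothetical tiny sizes: the packed N*H axis is split into heads, then the head axis is moved in front of the sequence axis, [B, S, N*H] -> [B, N, S, H].

```python
def transpose_for_scores(x, num_heads, size_per_head):
    """List-based sketch of the reshape + transpose in attention_layer."""
    batch = len(x)
    seq = len(x[0])
    return [
        [[[x[b][s][n * size_per_head + h] for h in range(size_per_head)]
          for s in range(seq)]
         for n in range(num_heads)]
        for b in range(batch)
    ]

# 1 sequence of 2 tokens, 2 heads of size 2 (packed width N*H = 4).
x = [[[0, 1, 2, 3],
      [4, 5, 6, 7]]]
out = transpose_for_scores(x, num_heads=2, size_per_head=2)
print(out[0][1])  # head 1 across both tokens: [[2, 3], [6, 7]]
```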

(3) The attention probabilities (attention scores) are computed according to the formula:

 

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,  where d_k = size_per_head

 

If attention_mask is not None, a large negative number is added to the masked positions, so that after softmax their probabilities are effectively 0; dropout is then applied to the probabilities.

(4) Finally, value is multiplied by attention_probs, and the result is returned as a 3D tensor or a 2D matrix.
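Putting steps (2)-(4) together for a single head, here is a hedged pure-Python sketch of the masked, scaled dot-product attention math: scores = QKᵀ/√d, masked positions get -10000, softmax, then a weighted sum of V. The Q/K/V values are made up, and the learned projections and dropout are omitted.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v, mask, d):
    """Single-head attention over one example: q=[F,d], k=[T,d], v=[T,w]."""
    out = []
    for qi, mrow in zip(q, mask):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # masked (0) positions get a large negative additive penalty
        scores = [s + (1.0 - m) * -10000.0 for s, m in zip(scores, mrow)]
        probs = softmax(scores)
        out.append([sum(p * vj[t] for p, vj in zip(probs, v))
                    for t in range(len(v[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
v = [[1.0], [2.0], [3.0]]
mask = [[1.0, 1.0, 0.0]]   # last key is padding -> its probability ~0
ctx = attention(q, k, v, mask, d=2)
# Even though the padded key has the largest raw score, it is masked out,
# so the context is a blend of v[0] and v[1] only.
print(1.0 < ctx[0][0] < 2.0)  # True
```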

Summary:

It helps to read this code side by side with the network architecture diagram:

Attention Layer


Compared with other Transformer implementations, this function is simplified in four ways:

 

(1) There is no separate embedding-scaling step (multiplying the embeddings by √d_model, as in the original Transformer paper); note that the attention scores themselves are still scaled by 1/√(size_per_head);

(2) There is no causality mask. My guess is that since BERT has no decoder, a lower-triangular mask is unnecessary; from another angle, this is exactly what makes the Transformer bidirectional;

(3) There is no query mask. The reason is similar to (2): the encoder only does self-attention, where query and key are the same, so a single key mask suffices;

(4) There is no residual connection or normalization applied to the query inside this function (they are applied outside, in transformer_model).

6. Transformer

 

```python
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          attention_output = tf.concat(attention_heads, axis=-1)

        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
```

transformer_model puts the attention layer to work, in the following steps:

(1) Compute attention_head_size = int(hidden_size / num_attention_heads), i.e., the hidden size is split evenly across the attention heads; then flatten input_tensor into a 2D matrix;

(2) Run multi-head attention over input_tensor, then in sequence: linear projection -> dropout -> residual with layer_input and layer norm -> intermediate linear projection (with activation) -> linear projection -> dropout -> residual with attention_output and layer norm.

The intermediate projection's size (intermediate_size) can be chosen freely; all the other projections must produce hidden_size so that the residual connections line up.

(3) Repeat this block num_hidden_layers times, saving each layer's output, and finally return either all layers' outputs or just the last one.
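The "add & norm" that closes each sub-layer can be sketched as follows, assuming layer_norm normalizes over the last axis to mean 0 and variance 1 (with gamma=1, beta=0, i.e., the learned scale and shift at their initial values); the input and sub-layer output values are hypothetical.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a single vector over its last (only) axis."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

layer_input = [1.0, 2.0, 3.0, 4.0]          # hypothetical sub-layer input
attention_output = [0.5, -0.5, 0.5, -0.5]   # hypothetical sub-layer output

# layer_norm(attention_output + layer_input), as in transformer_model
residual = [a + b for a, b in zip(attention_output, layer_input)]
normed = layer_norm(residual)
print(abs(sum(normed)) < 1e-9)  # True: the normalized values have mean ~0
```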

Summary:

This further confirms that this transformer function implements only the encoder, with no decoder, so every layer's multi-head attention is self-attention. It corresponds to the encoder portion of the architecture figure from the paper:

The Transformer - model architecture

7. BertModel

 

```python
class BertModel(object):

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=True,
               scope=None):
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]

      with tf.variable_scope("pooler"):
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))
```

At last, the model's entry point.

(1) Set up the parameters: if input_mask is None, it defaults to all ones, i.e., nothing is masked out; if token_type_ids is None, it defaults to all zeros;

(2) Embed input_ids, then run embedding_postprocessor, which (as described earlier) adds position and token-type information to the word embeddings;

(3) Convert the mask with create_attention_mask_from_input_mask, then encode by calling transformer_model;

(4) Take the last layer's output as sequence_output; pooled_output is obtained by slicing out the first token of sequence_output and applying a linear projection with tanh (useful for classification tasks).
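A minimal sketch of that pooler step for one example: take the hidden state of the first token ([CLS]) from sequence_output, apply a dense layer, then tanh. The identity kernel and zero bias here are hypothetical stand-ins for the trained weights.

```python
import math

# [seq_length, hidden] for one example, with hidden=2 (made-up values).
sequence_output = [
    [0.2, -0.1],   # position 0 -> the [CLS] token
    [0.9, 0.9],
]
kernel = [[1.0, 0.0],
          [0.0, 1.0]]   # hypothetical: identity weights for the sketch
bias = [0.0, 0.0]

first_token = sequence_output[0]   # the slice sequence_output[:, 0:1, :]
pooled = [math.tanh(sum(first_token[i] * kernel[i][j] for i in range(2)) + bias[j])
          for j in range(2)]
print([round(p, 4) for p in pooled])  # [0.1974, -0.0997]
```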

8. Summary

(1) BERT's overall flow is: embedding (including position and token-type embeddings), then the Transformer. The embedding output, the embedding_table, all Transformer layer outputs, the final layer output, and pooled_output are all exposed, for fine-tuning and prediction in transfer learning;

(2) BERT uses only the Transformer's encoder, with no decoder step. This is because the model exists purely for pre-training, which is done via language modeling rather than any specific downstream NLP task; a decoder can be added yourself when doing transfer learning;

(3) Precisely because there is no decoder, the attention function correspondingly drops many features that would otherwise be needed.

Other auxiliary functions are not covered here; interested readers can consult the source code.

In the next article we will continue with the other modules of the BERT source code, including training, prediction, and input/output handling.



Author: 西溪雷神
Link: https://www.jianshu.com/p/d7ce41b58801
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reuse, please contact the author for authorization; for non-commercial reuse, please credit the source.
