
TensorFlow | Transformer-based Natural Language Inference (SNLI)

After reading the paper, the source code of existing implementations, and the BERT source code, I sorted out my thoughts, implemented a Transformer myself, and then built a small Transformer to run the SNLI task.

Having also learned from past lessons, this time I made a point of writing proper comments.

1.Transformer

The underlying theory is not repeated here; other posts explain it very well,

for example: https://jalammar.github.io/illustrated-transformer/

and its Chinese translation: https://blog.csdn.net/qq_41664845/article/details/84969266

Let's go straight to the code.

1.1.Activation Function

The original Transformer uses ReLU throughout, but BERT and most of the work that followed use GELU (the Gaussian Error Linear Unit), which performs better (judging only from the numbers reported in the papers; I have not run the comparison myself).

In the spirit of choosing the stronger candidate over the familiar one, and even though I normally use ReLU, the default activation here is set to GELU.

The original GELU paper: https://arxiv.org/abs/1606.08415
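
For reference, GELU is defined as x·Φ(x), where Φ is the standard normal CDF; the implementation below uses the tanh approximation from that paper:

$$\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\Bigl(1 + \tanh\bigl(\sqrt{2/\pi}\,(x + 0.044715\,x^{3})\bigr)\Bigr)$$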

GELU (the imports here are assumed to be shared by the remaining snippets):

import math

import numpy as np
import tensorflow as tf


def gelu(inputs):
    """
    gelu: https://arxiv.org/abs/1606.08415
    :param inputs: [Tensor]
    :return: [Tensor] outputs after activation
    """
    # np.sqrt keeps the constant a plain float so it matches the float32 inputs
    cdf = 0.5 * (1.0 + tf.tanh(np.sqrt(2 / np.pi) * (inputs + 0.044715 * tf.pow(inputs, 3))))
    return inputs * cdf

The helper that returns the activation function (with GELU as the default):

def get_activation(activation_name):
    """
    get activation function
    :param activation_name: [String] name of the activation function
    :return: [Function] activation function
    """
    if activation_name is None:
        return gelu
    else:
        act = activation_name.lower()
        if act == "relu":
            return tf.nn.relu
        elif act == "gelu":
            return gelu
        elif act == "tanh":
            return tf.tanh
        else:
            raise ValueError("Unsupported activation: %s" % act)

1.2.Embedding

Besides word embeddings, the Transformer also adds positional encodings so that every token carries information about its position; without them, the model would essentially be nothing more than a somewhat elaborate bag-of-words model that learns a weight for each token.
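
For reference, the fixed sinusoidal encoding from the paper, which the get_timing_signal_1d function in 1.2.2 reproduces (except that it concatenates the sine and cosine halves instead of interleaving them), is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$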

At the same time, to handle tasks like SNLI that need a fixed-shape final output, I borrowed BERT's idea: a [CLS] token is prepended to every input and the final output at that token is used for the prediction. Doing this also calls for a segment embedding so that the model can better tell the two sentences apart (see BERT). A concrete example of the resulting input layout is sketched below.
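
Here is a small, hypothetical example (the word ids are invented; [SEP] = 4 and [CLS] = 3 follow the vocabulary dictionary that appears in section 2, and the segment ids are built by the data pipeline in section 3):

# hypothetical ids, for illustration only
# premise:    "a man sleeps"      -> [11, 12, 13]
# hypothesis: "a man is resting"  -> [11, 12, 14, 15]
text_ids = [11, 12, 13, 4, 11, 12, 14, 15, 4, 0, 0]  # premise [SEP] hypothesis [SEP] [PAD] ...
seg_ids = [1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0]          # 1 = first sentence, 2 = second sentence, 0 = [PAD]
# inside the model a [CLS] token (id 3, segment id 1) is prepended:
# [3, 11, 12, 13, 4, 11, 12, 14, 15, 4, 0, 0]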

1.2.1.Word Embedding

The embedding matrix can be randomly initialized here, or this part can be filled with an embedding matrix produced by another method (e.g. GloVe or fastText); it only needs to be declared at restore time. The paper mentions scaling the embeddings, which is done here as well.

def get_embedding(inputs, vocab_size, channels, scale=True, scope="embedding", reuse=None):
    """
    word embedding
    :param inputs: [Tensor] Tensor with first dimension of "batch_size"
    :param vocab_size: [Int] Vocabulary size
    :param channels: [Int] Embedding size
    :param scale: [Boolean] If True, the output will be multiplied by sqrt(channels)
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] embedded sentence with shape "batch_size * length * channels"
    """
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, channels],
                                       initializer=tf.contrib.layers.xavier_initializer())
        # row 0 is reserved for [PAD] and forced to an all-zero vector
        lookup_table = tf.concat((tf.zeros(shape=[1, channels], dtype=tf.float32),
                                  lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)
        if scale:
            outputs = outputs * math.sqrt(channels)
        return outputs
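
As mentioned above, the lookup table can also be initialized from pretrained vectors such as GloVe; the post only says to declare this at restore time. One possible alternative, sketched here under the assumption that a numpy array embedding_matrix of shape [vocab_size, channels], aligned with word_dict_en, has already been built (this helper is not part of the original code), is to overwrite the variable after initialization:

def load_pretrained_embedding(sess, embedding_matrix):
    # find the variable created by get_embedding(..., scope="en_embed") and overwrite it;
    # row 0 is still zeroed by the concat inside get_embedding, so [PAD] stays a zero vector
    lookup_table = [v for v in tf.global_variables()
                    if "en_embed/lookup_table" in v.name][0]
    sess.run(lookup_table.assign(embedding_matrix))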

1.2.2.Position Embedding (Positional Encoding)

This returns a positional embedding with the same shape as the inputs after word embedding. The raw input ids, rather than the word-embedded tensor, are used as the argument because that makes the masking later on more convenient.

def get_positional_encoding(inputs, channels, scale=False, scope="positional_embedding", reuse=None):
    """
    positional encoding
    :param inputs: [Tensor] with dimension of "batch_size * max_length"
    :param channels: [Int] Embedding size
    :param scale: [Boolean] If True, the output will be multiplied by sqrt(channels)
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after positional encoding
    """
    batch_size = tf.shape(inputs)[0]
    max_length = tf.shape(inputs)[1]
    with tf.variable_scope(scope, reuse=reuse):
        # positions are 1-based so that index 0 can be reserved for [PAD]
        position_ind = tf.tile(tf.expand_dims(tf.range(tf.to_int32(1), tf.add(max_length, 1)), 0), [batch_size, 1])
        # Convert to a tensor and prepend a zero row for [PAD]
        lookup_table = tf.convert_to_tensor(get_timing_signal_1d(max_length, channels))
        lookup_table = tf.concat((tf.zeros(shape=[1, channels]),
                                  lookup_table[:, :]), 0)
        # [PAD] tokens (id 0) are mapped to position 0, i.e. the all-zero row
        position_inputs = tf.where(tf.equal(inputs, 0), tf.zeros_like(inputs), position_ind)
        outputs = tf.nn.embedding_lookup(lookup_table, position_inputs)
        if scale:
            outputs = outputs * math.sqrt(channels)
        return tf.cast(outputs, tf.float32)

The get_timing_signal_1d() method builds the [ sentence length * embedding dimension ] matrix:

def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4, start_index=0):
    """
    sinusoidal positional encoding
    :param length: [Int] max_length size
    :param channels: [Int] Embedding size
    :param min_timescale: [Float]
    :param max_timescale: [Float]
    :param start_index: [Int] index of first position
    :return: [Tensor] positional encoding of shape "length * channels"
    """
    position = tf.to_float(tf.range(start_index, length))
    num_timescales = channels // 2
    # note: the tensor2tensor reference implementation uses log(max_timescale / min_timescale) here,
    # so that the timescales decrease across channels; this code keeps the inverted fraction as written
    log_timescale_increment = (math.log(float(min_timescale) / float(max_timescale)) /
                               (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    signal = tf.pad(signal, [[0, 0], [0, tf.mod(channels, 2)]])
    return signal

1.2.3.Segment Embedding

This embedding exists only to help the model tell the two input sentences apart. In principle the [SEP] token already separates them, but that is clearly not enough for the model: without the segment embedding it performed noticeably worse.

For the [PAD] token, all of the embeddings (segment, position) are set to all-zero vectors so that a mask can be applied later in the attention step.

def get_seg_embedding(inputs, channels, scale=True, scope="seg_embedding", reuse=None):
    """
    segment embedding
    :param inputs: [Tensor] with first dimension of "batch_size", like [1 1 1 2 2 2 2 0 0 0 ...]
    :param channels: [Int] Embedding size
    :param scale: [Boolean] If True, the output will be multiplied by sqrt(channels)
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs of embedding of sentence with shape of "batch_size * length * channels"
    """
    with tf.variable_scope(scope, reuse=reuse):
        # three rows: 0 = [PAD], 1 = first sentence, 2 = second sentence
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[3, channels],
                                       initializer=tf.contrib.layers.xavier_initializer())
        # the [PAD] row is forced to an all-zero vector
        lookup_table = tf.concat((tf.zeros(shape=[1, channels], dtype=tf.float32),
                                  lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)
        if scale:
            outputs = outputs * math.sqrt(channels)
        return outputs

1.3.Self-Attention and Encoder-Decoder Attention

With the input processing out of the way, we arrive at the main event: the attention mechanism.

The two input tensors never felt like something I could explain in a single line of English in the docstring, so here it is: from_tensor is the same for both kinds of attention, it is simply the layer input; to_tensor is also the layer input for self-attention, but for encoder-decoder attention it is the output of the last encoder layer, used to capture the attention relationship between the decoder and the encoder.
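
For reference, the per-head computation implemented below is the scaled dot-product attention from the paper,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

with the heads computed in parallel and concatenated afterwards. (One small difference: the code scales by sqrt(channels) rather than by the per-head dimension d_k = channels / num_heads.)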

Because of the earlier processing, every [PAD] token has an all-zero embedding, so summing the embedding over the last dimension and taking sign(abs(...)) yields zero exactly at the [PAD] positions; this way no extra mask ids need to be fed in as input.
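
A minimal illustration of that trick (values made up): after summing the embedding over the last dimension, sign(abs(...)) gives 1 at real tokens and 0 at [PAD] positions.

# illustration only: batch of 1, length 3, channels 2; the last token is [PAD]
emb = tf.constant([[[0.3, -0.1],
                    [0.5,  0.2],
                    [0.0,  0.0]]])
key_mask = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1)))  # -> [[1., 1., 0.]]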

Following the description in the paper:

def multi_head_attention(from_tensor: tf.Tensor, to_tensor: tf.Tensor, channels=None, num_units=None, num_heads=8,
                         dropout_rate=0, is_training=True, attention_mask_flag=False, scope="multihead_attention",
                         activation=None, reuse=None):
    """
    multi-head attention
    :param from_tensor: [Tensor]
    :param to_tensor: [Tensor]
    :param channels: [Int] channel of last dimension of output
    :param num_units: [Int] channel size of matrix Q, K, V
    :param num_heads: [Int] head number of attention
    :param dropout_rate: [Float] dropout rate, 0 means no dropout
    :param is_training: [Boolean] whether it is training; if True, use dropout
    :param attention_mask_flag: [Boolean] if True, units that reference the future are masked
    :param scope: [String] name of "variable_scope"
    :param activation: [String] name of activation function
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after multi-head attention with shape "batch_size * max_length * channels"
    """
    with tf.variable_scope(scope, reuse=reuse):
        if channels is None:
            channels = from_tensor.get_shape().as_list()[-1]
        if num_units is None:
            num_units = channels // num_heads
        activation_fn = get_activation(activation)
        # shape [batch_size, max_length, num_units*num_heads]
        query_layer = tf.layers.dense(from_tensor, num_units * num_heads, activation=activation_fn)
        key_layer = tf.layers.dense(to_tensor, num_units * num_heads, activation=activation_fn)
        value_layer = tf.layers.dense(to_tensor, num_units * num_heads, activation=activation_fn)
        # shape [batch_size*num_heads, max_length, num_units]
        query_layer_ = tf.concat(tf.split(query_layer, num_heads, axis=2), axis=0)
        key_layer_ = tf.concat(tf.split(key_layer, num_heads, axis=2), axis=0)
        value_layer_ = tf.concat(tf.split(value_layer, num_heads, axis=2), axis=0)
        # shape [batch_size*num_heads, max_length, max_length]
        attention_scores = tf.matmul(query_layer_, tf.transpose(key_layer_, [0, 2, 1]))
        # scale (the paper divides by sqrt(d_k); sqrt(channels) is used here)
        attention_scores = tf.multiply(attention_scores, 1.0 / tf.sqrt(float(channels)))
        # key mask: all-zero embeddings mark [PAD] positions in to_tensor
        attention_masks = tf.sign(tf.abs(tf.reduce_sum(to_tensor, axis=-1)))
        attention_masks = tf.tile(attention_masks, [num_heads, 1])
        attention_masks = tf.tile(tf.expand_dims(attention_masks, axis=1), [1, tf.shape(from_tensor)[1], 1])
        neg_inf_matrix = tf.multiply(tf.ones_like(attention_scores), (-math.pow(2, 32) + 1))
        attention_scores = tf.where(tf.equal(attention_masks, 0), neg_inf_matrix, attention_scores)
        if attention_mask_flag:
            # lower-triangular mask so that a position cannot attend to the future (decoder only)
            diag_vals = tf.ones_like(attention_scores[0, :, :])
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(attention_scores)[0], 1, 1])
            neg_inf_matrix = tf.multiply(tf.ones_like(masks), (-math.pow(2, 32) + 1))
            attention_scores = tf.where(tf.equal(masks, 0), neg_inf_matrix, attention_scores)
        # attention probability
        attention_probs = tf.nn.softmax(attention_scores)
        # query mask: zero out rows that correspond to [PAD] positions in from_tensor
        query_masks = tf.sign(tf.abs(tf.reduce_sum(from_tensor, axis=-1)))
        query_masks = tf.tile(query_masks, [num_heads, 1])
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(to_tensor)[1]])
        attention_probs *= query_masks
        # dropout
        attention_probs = tf.layers.dropout(attention_probs, rate=dropout_rate,
                                            training=tf.convert_to_tensor(is_training))
        outputs = tf.matmul(attention_probs, value_layer_)
        # shape [batch_size, max_length, num_units*num_heads]
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)
        # project back to "channels" so the residual connection matches from_tensor
        outputs = tf.layers.dense(outputs, channels, activation=activation_fn)
        # residual connection
        outputs += from_tensor
        # layer normalization
        outputs = group_norm(outputs)
        return outputs

1.4.Feed Forward

This is the Position-wise Feed-Forward Network from the paper. In the paper the second layer has a linear activation, so changing the second layer's activation_fn to None would match the original; for experimental reasons I have not done so here.
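
For reference, the network from the paper is

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

i.e. an activated first layer and a linear second layer, applied to each position independently; the paper uses an inner dimension of 4·d_model, while the default below is 2·channels.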

def feed_forward(inputs, channels, hidden_dims=None, scope="multihead_attention", activation=None, reuse=None):
    """
    position-wise feed-forward network
    :param inputs: [Tensor] with first dimension of "batch_size"
    :param channels: [Int] Embedding size
    :param hidden_dims: [Int] hidden dimension (defaults to 2*channels)
    :param scope: [String] name of "variable_scope"
    :param activation: [String] name of activation function
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after feed forward with shape "batch_size * max_length * channels"
    """
    if hidden_dims is None:
        hidden_dims = 2 * channels
    with tf.variable_scope(scope, reuse=reuse):
        activation_fn = get_activation(activation)
        params = {"inputs": inputs, "num_outputs": hidden_dims, "activation_fn": activation_fn}
        outputs = tf.contrib.layers.fully_connected(**params)
        params = {"inputs": outputs, "num_outputs": channels, "activation_fn": activation_fn}  # set activation_fn to None to match the paper
        outputs = tf.contrib.layers.fully_connected(**params)
        outputs += inputs
        outputs = group_norm(outputs)
        return outputs

1.5.Layer Normalization

Oh, and one more thing: layer normalization.
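
For reference, layer normalization normalizes each position over its feature dimension:

$$y = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$

where μ and σ² are the mean and variance over the last axis and γ, β are learned scale and shift parameters, which is exactly what the function below (named group_norm here) computes.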

def group_norm(inputs: tf.Tensor, epsilon=1e-8, scope="layer_normalization", reuse=None):
    """
    layer normalization
    :param inputs: [Tensor] with first dimension of "batch_size"
    :param epsilon: [Float] a small number for preventing division by zero
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after normalization
    """
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) * tf.rsqrt(variance + epsilon)
        outputs = gamma * normalized + beta
        return outputs

2.Transformer for SNLI

With the basic building blocks done, the next step is to use the code above to assemble a 6-layer Transformer.

First, define the model configuration:

class ConfigModel(object):
    vocab_size_en = len(word_dict_en)
    channels = 400
    learning_rate = 0.0005
    layer_num = 6
    is_training = True
    is_transfer_learning = False
    restore_embedding = False
    shuffle_pool_size = 2560
    dropout_rate = 0.1
    num_heads = 8
    batch_size = 64
    max_length = 100
    num_tags = 3

Then build the model. The overall structure follows BERT, and the output at the [CLS] token is taken at the end:

class TransformerSNLICls():
    def __init__(self, inputs, segs, label, config):
        self.inputs = tf.to_int32(inputs)  # batch_size * max_length
        self.segs = tf.to_int32(segs)      # marks which sentence each token belongs to ([1 1 2 2 2 0 0 0 ...])
        self.target = tf.to_int32(label)
        self.vocab_size_en = config.vocab_size_en
        self.channels = config.channels
        self.num_heads = config.num_heads
        self.dropout_rate = config.dropout_rate
        self.is_training = config.is_training
        self.num_layer = config.layer_num
        self.learning_rate = config.learning_rate
        # {'_PAD': 0, '_BEGIN': 1, '_EOS': 2, '_CLS': 3, '_SEP': 4, '_MASK': 5}
        self.inputs = tf.concat((tf.ones_like(self.inputs[:, :1]) * 3, self.inputs), axis=-1)
        self.segs = tf.concat((tf.ones_like(self.segs[:, :1]), self.segs), axis=-1)
        with tf.variable_scope("encoder"):
            self.encode = get_embedding(self.inputs, self.vocab_size_en, self.channels, scope="en_embed")
            self.encode += get_positional_encoding(self.inputs, self.channels, scope="en_pe")
            self.encode += get_seg_embedding(self.segs, self.channels, scope="en_se")
            self.encode = tf.layers.dropout(self.encode, rate=self.dropout_rate,
                                            training=tf.convert_to_tensor(self.is_training))
            for i in range(self.num_layer):
                with tf.variable_scope("encoder_layer_{}".format(i)):
                    self.encode = multi_head_attention(self.encode, self.encode, self.channels,
                                                       num_heads=self.num_heads,
                                                       dropout_rate=self.dropout_rate,
                                                       is_training=self.is_training,
                                                       attention_mask_flag=False)
                    self.encode = feed_forward(self.encode, self.channels)
        self.encode_cls = tf.reshape(self.encode[:, :1, :], [-1, self.channels])
        self.output = tf.layers.dense(self.encode_cls, config.num_tags)
        self.preds = tf.to_int32(tf.argmax(self.output, axis=-1))
        self.acc = tf.reduce_mean(tf.to_float(tf.equal(self.preds, self.target)))
        if self.is_training:
            self.loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.output, labels=self.target))
            self.global_step = tf.Variable(0, name='global_step', trainable=False)
            self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate,
                                                    beta1=0.9, beta2=0.98, epsilon=1e-8)
            self.grads = self.optimizer.compute_gradients(self.loss)
            if config.is_transfer_learning:
                # scale down (0.2x) the gradients of the lower layers and the embeddings
                var_list = tf.trainable_variables()
                layer_name_list = ["encoder_layer_" + str(i) for i in range(4)]
                var_list_ = [v for v in var_list if v.name.split("/")[1] in layer_name_list]
                var_list_ += [v for v in var_list if "lookup_table" in v.name]
                for index, grad in enumerate(self.grads):
                    if grad[1] in var_list_:
                        self.grads[index] = (grad[0] * 0.2, grad[1])
            self.train_op = self.optimizer.apply_gradients(self.grads, global_step=self.global_step)

With that, the model is ready; next the data needs some processing.

3.SNLI to TFRecord

Process all of the data and generate the segment ids:

def get_data(snli_name, max_length=config.max_length//2, word_dict=word_dict_en):
    sentence_1 = list()
    sentence_2 = list()
    label = list()
    texts = list()
    seg_ids = list()
    with open(os.path.join(data_path, snli_name), 'r') as f:
        for item in jsonlines.Reader(f):
            try:
                label.append(label_to_num_dict[item["gold_label"]])
            except KeyError:
                continue
            sentence_1.append(normalize_text(item["sentence1"]))
            sentence_2.append(normalize_text(item["sentence2"]))
    en_data_num_1 = text_to_numbers(sentence_1, word_dict_en, max_length=max_length)
    en_data_num_2 = text_to_numbers(sentence_2, word_dict_en, max_length=max_length)
    for i_ in range(len(en_data_num_1)):
        texts.append(en_data_num_1[i_] + [word_dict["_SEP"]] + en_data_num_2[i_] + [word_dict["_SEP"]])
        seg_ids.append((len(en_data_num_1[i_]) + 1) * [1] + (len(en_data_num_2[i_]) + 1) * [2])
    return texts, label, seg_ids

Then convert everything to the TFRecord format so that TensorFlow can read it more efficiently:

def write_binary(record_name, texts_, label_, seg_ids_):
    writer = tf.python_io.TFRecordWriter(record_name)
    for it, text in tqdm(enumerate(texts_)):
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    "text_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=text)),
                    "seg_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=seg_ids_[it])),
                    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label_[it]])),
                }
            )
        )
        serialized = example.SerializeToString()
        writer.write(serialized)
    writer.close()
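
The get_dataset helper used in the training script below is not shown in the post; a minimal sketch that parses back the features written by write_binary could look like this (the exact implementation is an assumption):

def get_dataset(record_name):
    # parse the features written by write_binary above
    def _parse(serialized):
        features = tf.parse_single_example(
            serialized,
            features={
                "text_ids": tf.VarLenFeature(tf.int64),
                "seg_ids": tf.VarLenFeature(tf.int64),
                "label": tf.FixedLenFeature([], tf.int64),
            })
        text_ids = tf.sparse_tensor_to_dense(features["text_ids"])
        seg_ids = tf.sparse_tensor_to_dense(features["seg_ids"])
        return text_ids, seg_ids, features["label"]

    return tf.data.TFRecordDataset(record_name).map(_parse)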

4.Training

Set up the model and load the data:

if __name__ == '__main__':
    with tf.Session() as sess:
        data_set_train = get_dataset(train_snli_name_tf)
        data_set_train = data_set_train.shuffle(config.shuffle_pool_size).repeat(). \
            padded_batch(config.batch_size, padded_shapes=([config.max_length], [config.max_length], []))
        data_set_train_iter = data_set_train.make_one_shot_iterator()
        train_handle = sess.run(data_set_train_iter.string_handle())
        data_set_test = get_dataset(os.path.join(test_snli_name_tf))
        if test_total_acc:
            data_set_test = data_set_test.shuffle(config.shuffle_pool_size). \
                padded_batch(config.batch_size, padded_shapes=([config.max_length], [config.max_length], []))
        else:
            data_set_test = data_set_test.shuffle(config.shuffle_pool_size).repeat(). \
                padded_batch(config.batch_size, padded_shapes=([config.max_length], [config.max_length], []))
        data_set_test_iter = data_set_test.make_one_shot_iterator()
        test_handle = sess.run(data_set_test_iter.string_handle())
        handle = tf.placeholder(tf.string, shape=[])
        iterator = tf.data.Iterator.from_string_handle(handle, data_set_train.output_types,
                                                       data_set_train.output_shapes)
        inputs, segs, target = iterator.get_next()
        tsl = TransformerSNLICls(inputs, segs, target, config)
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(max_to_keep=1)

Start training (this continues inside the tf.Session block above):

  1. print("starting training")
  2. for i in range(12000):
  3. train_feed = {handle: train_handle}
  4. sess.run(tsl.train_op, train_feed)
  5. if (i+1) % 100 == 0:
  6. pred, acc, loss = sess.run([tsl.preds, tsl.acc, tsl.loss], train_feed)
  7. print("Generation train {} : acc: {} loss: {} ".format(i, acc, loss))
  8. if (i+1) % 200 == 0:
  9. tpred, tacc, tloss = sess.run([tsl.preds, tsl.acc, tsl.loss], {handle: test_handle})
  10. print("Generation test {} : acc: {} loss: {} ".format(i, tacc, tloss))
  11. if (i+1) % 2000 == 0:
  12. print("Generation train {} model saved ".format(i))
  13. saver.save(sess, os.path.join(model_save_path, model_name.format(model_choose)))
  14. saver.save(sess, os.path.join(model_save_path, model_name.format(model_choose)))

In the end, this first attempt reaches 78.7% accuracy on the full test set.
