
Fun with Deep Learning (深度有趣) | 26 Seq2Seq Machine Translation

Introduction

This post shows how to implement Neural Machine Translation (NMT) with Sequence to Sequence Learning (Seq2Seq).

Principle

Earlier in this series we implemented Chinese word segmentation with a sequence labeling model; sequence labeling is one special case of Seq2Seq.

This time we use Seq2Seq to implement NMT. Since both the input sentence and the output sentence contain multiple words, and their lengths are generally different, this corresponds to the fourth case in the figure above: a many-to-many mapping in which input and output lengths differ.

The simplest approach is to first encode the whole input sentence into a fixed-length vector representation, and then decode it step by step to produce the translated sentence. Both the encoder and the decoder can be implemented with RNNs.

For the RNN cell you can choose LSTM or GRU, and extensions such as multi-layer LSTMs or bidirectional LSTMs are also worth considering.

You can also add an attention mechanism: for the encoder output at every input position, compute an attention weight and take a weighted combination.

  • Instead of using only the encoder's final output, use the encoder output at every step, similar to the image patches in the image captioning post.
  • At each generation step, the decoder first computes attention weights from the relationship between its current state and every encoder output.
  • The encoder outputs are then summed, weighted by these attention weights, giving the context vector used at the current step.
  • From the context and its previous output, the decoder updates its state and produces the next output.
\alpha_{ts}=\frac{\exp(\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_s))}{\sum^{S}_{s'=1}\exp(\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_{s'}))}
\mathbf{c}_t=\sum^{S}_{s=1}\alpha_{ts}\mathbf{\bar{h}}_s
\mathbf{a}_t=f(\mathbf{c}_t,\mathbf{h}_t)=\tanh(\mathbf{W}_c[\mathbf{c}_t;\mathbf{h}_t])

There are two main families of score functions for computing the attention weights: the multiplicative form, known as Luong's multiplicative style, and the additive form, known as Bahdanau's additive style.

\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_s)=\mathbf{h}_t^T \mathbf{W} \mathbf{\bar{h}}_s
\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_s)=\mathbf{v}_a^T \tanh(\mathbf{W}_1 \mathbf{h}_t+\mathbf{W}_2 \mathbf{\bar{h}}_s)
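To make the formulas above concrete, here is a minimal NumPy sketch of the scoring, weighting and context computation. It is illustrative only: the function and matrix names are assumptions, not part of the TensorFlow implementation later in this post.

import numpy as np

def luong_score(h_t, h_s, W):
    # multiplicative style: h_t^T W h_s
    return h_t @ W @ h_s

def bahdanau_score(h_t, h_s, W1, W2, v):
    # additive style: v^T tanh(W1 h_t + W2 h_s)
    return v @ np.tanh(W1 @ h_t + W2 @ h_s)

def attention(h_t, encoder_outputs, score_fn, W_c):
    # encoder_outputs: array of shape (S, hidden), one \bar{h}_s per source position
    scores = np.array([score_fn(h_t, h_s) for h_s in encoder_outputs])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # alpha_{ts}, summing to 1 over source positions
    context = weights @ encoder_outputs    # c_t: weighted sum of encoder outputs
    a_t = np.tanh(W_c @ np.concatenate([context, h_t]))  # attention vector a_t
    return a_t, weights

Calling, for example, attention(h_t, outs, lambda a, b: luong_score(a, b, W), W_c) returns the attention vector together with the weights; the TensorFlow code below delegates all of this to tf.contrib.seq2seq.LuongAttention and AttentionWrapper.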

Data

We use the Chinese-English parallel corpus provided by the NiuTrans open-source community, www.niutrans.com/. After cleaning, the training set contains 100K sentence pairs, the validation set 1K pairs and the test set 400 pairs.

Implementation

Here we mainly use the APIs provided by TensorFlow to implement Seq2Seq learning, attention and beam search, following the implementation of the project at github.com/tensorflow/…

The code consists of three parts: training, evaluation and inference.

  • Training: train the model on the training set and compute the loss.
  • Evaluation: run the model on the validation set, without training, and compute the loss.
  • Inference: apply the model to the test set without computing a loss; sequences are generated with beam search and evaluated with the BLEU metric.

Load the libraries

# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.utils import shuffle
from keras.preprocessing.sequence import pad_sequences
import os
from tqdm import tqdm
import pickle

Load the Chinese and English vocabularies, keeping the 20K most frequent words; every other word is represented by <unk>.

def load_vocab(path):
    with open(path, 'r') as fr:
        vocab = fr.readlines()
    vocab = [w.strip('\n') for w in vocab]
    return vocab

vocab_ch = load_vocab('data/vocab.ch')
vocab_en = load_vocab('data/vocab.en')
print(len(vocab_ch), vocab_ch[:20])
print(len(vocab_en), vocab_en[:20])

word2id_ch = {w: i for i, w in enumerate(vocab_ch)}
id2word_ch = {i: w for i, w in enumerate(vocab_ch)}
word2id_en = {w: i for i, w in enumerate(vocab_en)}
id2word_en = {i: w for i, w in enumerate(vocab_en)}

Load the training, validation and test sets, compute the maximum sequence lengths for the Chinese and English data, and pad the corresponding data according to mode.

def load_data(path, word2id):
    with open(path, 'r') as fr:
        lines = fr.readlines()
    sentences = [line.strip('\n').split(' ') for line in lines]
    sentences = [[word2id['<s>']] + [word2id[w] for w in sentence] + [word2id['</s>']]
                 for sentence in sentences]
    lens = [len(sentence) for sentence in sentences]
    maxlen = np.max(lens)
    return sentences, lens, maxlen

# train: training, no beam search, calculate loss
# eval:  no training, no beam search, calculate loss
# infer: no training, beam search, calculate bleu
mode = 'train'

train_ch, len_train_ch, maxlen_train_ch = load_data('data/train.ch', word2id_ch)
train_en, len_train_en, maxlen_train_en = load_data('data/train.en', word2id_en)
dev_ch, len_dev_ch, maxlen_dev_ch = load_data('data/dev.ch', word2id_ch)
dev_en, len_dev_en, maxlen_dev_en = load_data('data/dev.en', word2id_en)
test_ch, len_test_ch, maxlen_test_ch = load_data('data/test.ch', word2id_ch)
test_en, len_test_en, maxlen_test_en = load_data('data/test.en', word2id_en)

maxlen_ch = np.max([maxlen_train_ch, maxlen_dev_ch, maxlen_test_ch])
maxlen_en = np.max([maxlen_train_en, maxlen_dev_en, maxlen_test_en])
print(maxlen_ch, maxlen_en)

if mode == 'train':
    train_ch = pad_sequences(train_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    train_en = pad_sequences(train_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(train_ch.shape, train_en.shape)
elif mode == 'eval':
    dev_ch = pad_sequences(dev_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    dev_en = pad_sequences(dev_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(dev_ch.shape, dev_en.shape)
elif mode == 'infer':
    test_ch = pad_sequences(test_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    test_en = pad_sequences(test_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(test_ch.shape, test_en.shape)

Define four placeholders and embed the inputs.

X = tf.placeholder(tf.int32, [None, maxlen_ch])
X_len = tf.placeholder(tf.int32, [None])
Y = tf.placeholder(tf.int32, [None, maxlen_en])
Y_len = tf.placeholder(tf.int32, [None])
Y_in = Y[:, :-1]
Y_out = Y[:, 1:]

k_initializer = tf.contrib.layers.xavier_initializer()
e_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embedding_size = 512
hidden_size = 512

if mode == 'train':
    batch_size = 128
else:
    batch_size = 16

with tf.variable_scope('embedding_X'):
    embeddings_X = tf.get_variable('weights_X', [len(word2id_ch), embedding_size], initializer=e_initializer)
    embedded_X = tf.nn.embedding_lookup(embeddings_X, X)     # batch_size, seq_len, embedding_size

with tf.variable_scope('embedding_Y'):
    embeddings_Y = tf.get_variable('weights_Y', [len(word2id_en), embedding_size], initializer=e_initializer)
    embedded_Y = tf.nn.embedding_lookup(embeddings_Y, Y_in)  # batch_size, seq_len, embedding_size

Define the encoder, using a bidirectional LSTM.

def single_cell(mode=mode):
    if mode == 'train':
        keep_prob = 0.8
    else:
        keep_prob = 1.0
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)
    return cell

def multi_cells(num_layers):
    cells = []
    for i in range(num_layers):
        cell = single_cell()
        cells.append(cell)
    return tf.nn.rnn_cell.MultiRNNCell(cells)

with tf.variable_scope('encoder'):
    num_layers = 1
    fw_cell = multi_cells(num_layers)
    bw_cell = multi_cells(num_layers)
    bi_outputs, bi_state = tf.nn.bidirectional_dynamic_rnn(fw_cell, bw_cell, embedded_X, dtype=tf.float32,
                                                           sequence_length=X_len)
    # fw: batch_size, seq_len, hidden_size
    # bw: batch_size, seq_len, hidden_size
    print('=' * 100, '\n', bi_outputs)
    encoder_outputs = tf.concat(bi_outputs, -1)
    print('=' * 100, '\n', encoder_outputs)  # batch_size, seq_len, 2 * hidden_size

    # 2 tuple(fw & bw), 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100, '\n', bi_state)
    encoder_state = []
    for i in range(num_layers):
        encoder_state.append(bi_state[0][i])  # forward
        encoder_state.append(bi_state[1][i])  # backward
    encoder_state = tuple(encoder_state)      # 2 tuple, 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100)
    for i in range(len(encoder_state)):
        print(i, encoder_state[i])

Define the decoder, using a two-layer LSTM.

with tf.variable_scope('decoder'):
    beam_width = 10
    memory = encoder_outputs

    if mode == 'infer':
        memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
        X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
        encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
        bs = batch_size * beam_width
    else:
        bs = batch_size

    attention = tf.contrib.seq2seq.LuongAttention(hidden_size, memory, X_len, scale=True)  # multiplicative
    # attention = tf.contrib.seq2seq.BahdanauAttention(hidden_size, memory, X_len, normalize=True)  # additive
    cell = multi_cells(num_layers * 2)
    cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hidden_size, name='attention')
    decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)

    with tf.variable_scope('projected'):
        output_layer = tf.layers.Dense(len(word2id_en), use_bias=False, kernel_initializer=k_initializer)

    if mode == 'infer':
        start = tf.fill([batch_size], word2id_en['<s>'])
        decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_en['</s>'],
                                                       decoder_initial_state, beam_width, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                            output_time_major=True,
                                                                            maximum_iterations=2 * tf.reduce_max(X_len))
        sample_id = outputs.predicted_ids
    else:
        helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y, [maxlen_en - 1 for b in range(batch_size)])
        decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                            output_time_major=True)
        logits = outputs.rnn_output
        logits = tf.transpose(logits, (1, 0, 2))
        print(logits)

Depending on mode, define the loss function and the optimizer where needed.

if mode != 'infer':
    with tf.variable_scope('loss'):
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y_out, logits=logits)
        mask = tf.sequence_mask(Y_len, tf.shape(Y_out)[1], tf.float32)
        loss = tf.reduce_sum(loss * mask) / batch_size

if mode == 'train':
    learning_rate = tf.Variable(0.0, trainable=False)
    params = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, params), 5.0)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).apply_gradients(zip(grads, params))

Training code. After 20 epochs of training, the training loss drops from above 200 to 52.19 and the perplexity drops to 5.53.
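The perplexity reported here is the exponential of the average per-token cross-entropy, which is exactly what np.exp(total_loss / total_count) computes in the loop below:

\mathrm{perplexity}=\exp\left(\frac{\sum\mathrm{loss}}{\sum\mathrm{target\ tokens}}\right)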

sess = tf.Session()
sess.run(tf.global_variables_initializer())

if mode == 'train':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    if not os.path.exists(OUTPUT_DIR):
        os.mkdir(OUTPUT_DIR)

    tf.summary.scalar('loss', loss)
    summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(OUTPUT_DIR)

    epochs = 20
    for e in range(epochs):
        total_loss = 0
        total_count = 0

        start_decay = int(epochs * 2 / 3)
        if e <= start_decay:
            lr = 1.0
        else:
            decay = 0.5 ** (int(4 * (e - start_decay) / (epochs - start_decay)))
            lr = 1.0 * decay
        sess.run(tf.assign(learning_rate, lr))

        train_ch, len_train_ch, train_en, len_train_en = shuffle(train_ch, len_train_ch, train_en, len_train_en)
        for i in tqdm(range(train_ch.shape[0] // batch_size)):
            X_batch = train_ch[i * batch_size: i * batch_size + batch_size]
            X_len_batch = len_train_ch[i * batch_size: i * batch_size + batch_size]
            Y_batch = train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = len_train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = [l - 1 for l in Y_len_batch]

            feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
            _, ls_ = sess.run([optimizer, loss], feed_dict=feed_dict)
            total_loss += ls_ * batch_size
            total_count += np.sum(Y_len_batch)

            if i > 0 and i % 100 == 0:
                writer.add_summary(sess.run(summary, feed_dict=feed_dict),
                                   e * train_ch.shape[0] // batch_size + i)
                writer.flush()

        print('Epoch %d lr %.3f perplexity %.2f' % (e, lr, np.exp(total_loss / total_count)))
        saver.save(sess, os.path.join(OUTPUT_DIR, 'nmt'))

Evaluation code. The perplexity on the validation set is 11.56.

if mode == 'eval':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    total_loss = 0
    total_count = 0
    for i in tqdm(range(dev_ch.shape[0] // batch_size)):
        X_batch = dev_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_dev_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ls_ = sess.run(loss, feed_dict=feed_dict)
        total_loss += ls_ * batch_size
        total_count += np.sum(Y_len_batch)

    print('Dev perplexity %.2f' % np.exp(total_loss / total_count))

Inference code. The BLEU score on the test set is 0.2069, and the generated English translations are written to output_test_diy (an alternative NLTK-based BLEU check is sketched after the code).

if mode == 'infer':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    def translate(ids):
        words = [id2word_en[i] for i in ids]
        if words[0] == '<s>':
            words = words[1:]
        if '</s>' in words:
            words = words[:words.index('</s>')]
        return ' '.join(words)

    fw = open('output_test_diy', 'w')
    for i in tqdm(range(test_ch.shape[0] // batch_size)):
        X_batch = test_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_test_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ids = sess.run(sample_id, feed_dict=feed_dict)  # seq_len, batch_size, beam_width
        ids = np.transpose(ids, (1, 2, 0))              # batch_size, beam_width, seq_len
        ids = ids[:, 0, :]                              # batch_size, seq_len

        for j in range(ids.shape[0]):
            sentence = translate(ids[j])
            fw.write(sentence + '\n')
    fw.close()

    from nmt.utils.evaluation_utils import evaluate
    for metric in ['bleu', 'rouge']:
        score = evaluate('data/test.en', 'output_test_diy', metric)
        print(metric, score / 100)
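If the nmt evaluation utility is not available, a corpus-level BLEU score can also be computed with NLTK as a rough cross-check. This is a hedged sketch, not part of the original code; it reads the reference file and the generated output file used above:

from nltk.translate.bleu_score import corpus_bleu

with open('data/test.en') as fr:
    references = [[line.split()] for line in fr]   # one reference translation per sentence
with open('output_test_diy') as fr:
    hypotheses = [line.split() for line in fr]

print('BLEU: %.4f' % corpus_bleu(references, hypotheses))

Note that NLTK and the nmt toolkit differ in tokenization and smoothing, so the two BLEU numbers will not match exactly.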

The ready-made wheel

The following project provides a very complete interface, github.com/tensorflow/…; different models can be customized with simple configuration, and more than 70 options are supported. A few examples (an illustrative command combining some of them follows the list):

  • --num_units: number of hidden units in the RNN
  • --unit_type: RNN cell type, one of lstm, gru, layer_norm_lstm, nas
  • --num_layers: number of RNN layers
  • --encoder_type: encoder structure, one of uni, bi, gnmt
  • --residual: whether to use residual connections
  • --attention: attention type, one of luong, scaled_luong, bahdanau, normed_bahdanau, or empty to disable attention
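As a purely illustrative example of combining several of these flags (this exact invocation is not from the original post; model_custom is a placeholder output directory and the values are arbitrary choices):

python -m nmt.nmt --src=ch --tgt=en --vocab_prefix=data/vocab --train_prefix=data/train --dev_prefix=data/dev --test_prefix=data/test --out_dir=model_custom --num_units=512 --num_layers=2 --unit_type=lstm --encoder_type=bi --attention=scaled_luong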

If all these options feel tedious, the project also ships four hyperparameter templates: iwslt15.json is suited to small datasets (IWSLT English-Vietnamese, 130K pairs), while the other three templates target large datasets (WMT German-English, 4.5M pairs).

To train a Chinese-to-English model with this project, just run the following command; to train an English-to-Chinese model, simply swap the values of src and tgt.

python -m nmt.nmt --src=ch --tgt=en --vocab_prefix=data/vocab --train_prefix=data/train --dev_prefix=data/dev --test_prefix=data/test --out_dir=model_nmt --hparams_path=nmt/standard_hparams/iwslt15.json

The training output includes the following:

  • the last five saved model checkpoints
  • train_log, containing events files that can be viewed with TensorBoard
  • output_dev and output_test, the translations of the validation set and the test set respectively
  • best_bleu, containing the five model versions with the highest BLEU score on the validation set

The model reaches a BLEU of 0.233 on the validation set and 0.224 on the test set.

Use the following command for inference; just write the text to translate into the corresponding file, and the generated English translations are written to output_test_nmt.

python -m nmt.nmt --out_dir=model_nmt --inference_input_file=test.ch --inference_output_file=output_test_nmt

Couplet generation

We use the following dataset, github.com/wb14123/cou…, which contains 700K couplet pairs.

Train the model with the following command. Copy iwslt15.json to couplet.json and, since there is much more data, increase the number of training steps accordingly, i.e. set num_train_steps to 100000.

It does not matter that there is no validation set; just reuse the test set for it, since leaving this required argument empty would raise an error.

python -m nmt.nmt --src=in --tgt=out --vocab_prefix=couplet/vocab --train_prefix=couplet/train --dev_prefix=couplet/test --test_prefix=couplet/test --out_dir=model_couplet --hparams_path=nmt/standard_hparams/couplet.json

Some example results from output_test. Each group of three lines is, in order, the first line of the couplet, the reference second line, and the generated second line; the character counts, parts of speech and meanings largely match.

腾 飞 上 铁 , 锐 意 改 革 谋 发 展 , 勇 当 千 里 马
和 谐 南 供 , 安 全 送 电 保 畅 通 , 争 做 领 头 羊
改 革 开 放 , 科 学 发 展 促 繁 荣 , 争 做 领 头 羊

风 弦 未 拨 心 先 乱
夜 幕 已 沉 梦 更 闲
雪 韵 初 融 意 更 浓

彩 屏 如 画 , 望 秀 美 崤 函 , 花 团 锦 簇
短 信 报 春 , 喜 和 谐 社 会 , 物 阜 民 康
妙 笔 生 花 , 书 辉 煌 史 册 , 虎 啸 龙 吟

To generate a second line for an unseen first line, i.e. to run inference, just use the method described earlier.

