Introduction
This article shows how to implement Neural Machine Translation (NMT) with Sequence to Sequence Learning (Seq2Seq).
How it works
Previously we implemented Chinese word segmentation with a sequence labeling model; sequence labeling is one special case of Seq2Seq.
This time we use Seq2Seq for NMT. Since both the input sentence and the output sentence contain multiple words, and their lengths are generally different, this corresponds to the fourth case in the figure above.
The simplest approach is to first encode the entire input sentence into a fixed-length vector representation and then decode it step by step to produce the translated sentence; both the Encoder and the Decoder can be implemented with RNNs.
For the RNN cell you can choose LSTM or GRU, and extensions such as multi-layer LSTMs or bidirectional LSTMs are also worth considering.
You can also add an attention mechanism: for the encoder output at each position of the input sequence, compute an attention weight and form a weighted combination.
- Instead of using only the Encoder's output at the last step, use its output at every step, similar to the image patches in image caption generation.
- Every time the Decoder generates a token, it first computes attention weights from the relation between the current Decoder state and each Encoder output.
- The Encoder outputs are then summed, weighted by these attention weights, to obtain the context used at the current step.
- Based on the context and the previous output, the Decoder updates its state and produces the next output.
When computing the attention weights there are two main families of scoring functions, multiplicative and additive; the former is known as Luong's multiplicative style and the latter as Bahdanau's additive style. A minimal sketch of both follows.
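Below is a small NumPy sketch of the two scoring schemes and of the weighting step described above. The shapes and function names are made up purely for illustration; the actual model in this article uses TensorFlow's LuongAttention / BahdanauAttention wrappers instead.

```python
# Illustrative attention scoring and weighting (not the TensorFlow implementation used later).
import numpy as np

def luong_score(s, h, W):
    # multiplicative: score_t = h_t^T W s
    return h @ (W @ s)                          # (seq_len,)

def bahdanau_score(s, h, W1, W2, v):
    # additive: score_t = v^T tanh(W1 h_t + W2 s)
    return np.tanh(h @ W1.T + W2 @ s) @ v       # (seq_len,)

def attend(scores, h):
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax -> attention weights
    context = weights @ h                       # weighted sum of encoder outputs
    return weights, context

# toy example: 5 encoder steps, hidden size 4
h = np.random.randn(5, 4)                       # encoder output at every step
s = np.random.randn(4)                          # current decoder state
weights, context = attend(luong_score(s, h, np.random.randn(4, 4)), h)
print(weights, context)
```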
Data
We use the Chinese-English parallel corpus provided by the NiuTrans open-source community, www.niutrans.com/. After cleanup, the training set contains 100K sentence pairs, the validation set 1K pairs, and the test set 400 pairs.
Implementation
Here we mainly use the APIs provided by TensorFlow to implement Seq2Seq learning, attention, and beam search, following this project: github.com/tensorflow/…
The code consists of three parts: training, validation, and inference.
- Training: train the model on the training set and compute the loss
- Validation: evaluate the model on the validation set and compute the loss
- Inference: apply the model to the test set without computing a loss, generating sequences with beam search and evaluating them with the BLEU metric (a small BLEU illustration follows)
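As a hedged illustration of what BLEU measures, here is a toy example using NLTK's corpus_bleu; the actual evaluation at the end of this article uses the evaluate() helper from the tensorflow/nmt project instead.

```python
# BLEU compares n-gram overlap (up to 4-grams by default) between a hypothesis
# and one or more references; scores range from 0 to 1, higher is better.
from nltk.translate.bleu_score import corpus_bleu

references = [[['the', 'cat', 'sat', 'on', 'the', 'mat']]]   # one list of references per sentence
hypotheses = [['the', 'cat', 'sat', 'on', 'a', 'mat']]
print(corpus_bleu(references, hypotheses))                   # prints a score between 0 and 1
```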
Load the libraries
```python
# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.utils import shuffle
from keras.preprocessing.sequence import pad_sequences
import os
from tqdm import tqdm
import pickle
```
Load the Chinese and English vocabularies, keeping only the 20K most frequent words; all other words are represented by <unk>.
```python
def load_vocab(path):
    with open(path, 'r') as fr:
        vocab = fr.readlines()
        vocab = [w.strip('\n') for w in vocab]
    return vocab

vocab_ch = load_vocab('data/vocab.ch')
vocab_en = load_vocab('data/vocab.en')
print(len(vocab_ch), vocab_ch[:20])
print(len(vocab_en), vocab_en[:20])

word2id_ch = {w: i for i, w in enumerate(vocab_ch)}
id2word_ch = {i: w for i, w in enumerate(vocab_ch)}
word2id_en = {w: i for i, w in enumerate(vocab_en)}
id2word_en = {i: w for i, w in enumerate(vocab_en)}
```
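The vocabulary files themselves are assumed to have been prepared offline. A hedged sketch of how such a 20K-word vocabulary could be built from the tokenized training corpus is shown below; build_vocab is a hypothetical helper, and the actual preprocessing of the NiuTrans data may differ.

```python
from collections import Counter

def build_vocab(corpus_path, vocab_path, size=20000):
    counter = Counter()
    with open(corpus_path, 'r') as fr:
        for line in fr:
            counter.update(line.strip('\n').split(' '))
    # special tokens first, then the `size` most frequent words
    words = ['<unk>', '<s>', '</s>'] + [w for w, _ in counter.most_common(size)]
    with open(vocab_path, 'w') as fw:
        fw.write('\n'.join(words) + '\n')

# build_vocab('data/train.ch', 'data/vocab.ch')
```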
Load the training, validation, and test sets, compute the maximum sequence lengths of the Chinese and English data, and pad the data required by the current mode.
```python
def load_data(path, word2id):
    with open(path, 'r') as fr:
        lines = fr.readlines()
    sentences = [line.strip('\n').split(' ') for line in lines]
    sentences = [[word2id['<s>']] + [word2id[w] for w in sentence] + [word2id['</s>']]
                 for sentence in sentences]

    lens = [len(sentence) for sentence in sentences]
    maxlen = np.max(lens)
    return sentences, lens, maxlen

# train: training, no beam search, calculate loss
# eval: no training, no beam search, calculate loss
# infer: no training, beam search, calculate bleu
mode = 'train'

train_ch, len_train_ch, maxlen_train_ch = load_data('data/train.ch', word2id_ch)
train_en, len_train_en, maxlen_train_en = load_data('data/train.en', word2id_en)
dev_ch, len_dev_ch, maxlen_dev_ch = load_data('data/dev.ch', word2id_ch)
dev_en, len_dev_en, maxlen_dev_en = load_data('data/dev.en', word2id_en)
test_ch, len_test_ch, maxlen_test_ch = load_data('data/test.ch', word2id_ch)
test_en, len_test_en, maxlen_test_en = load_data('data/test.en', word2id_en)

maxlen_ch = np.max([maxlen_train_ch, maxlen_dev_ch, maxlen_test_ch])
maxlen_en = np.max([maxlen_train_en, maxlen_dev_en, maxlen_test_en])
print(maxlen_ch, maxlen_en)

if mode == 'train':
    train_ch = pad_sequences(train_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    train_en = pad_sequences(train_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(train_ch.shape, train_en.shape)
elif mode == 'eval':
    dev_ch = pad_sequences(dev_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    dev_en = pad_sequences(dev_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(dev_ch.shape, dev_en.shape)
elif mode == 'infer':
    test_ch = pad_sequences(test_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    test_en = pad_sequences(test_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(test_ch.shape, test_en.shape)
```
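As a quick illustration of the post-padding used above (toy ids, assuming 2 is the id of </s>):

```python
# Post-padding appends the padding value on the right until every sequence reaches maxlen.
from keras.preprocessing.sequence import pad_sequences

toy = [[1, 5, 7, 2], [1, 9, 2]]
print(pad_sequences(toy, maxlen=6, padding='post', value=2))
# [[1 5 7 2 2 2]
#  [1 9 2 2 2 2]]
```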
Define four placeholders and embed the inputs.
```python
X = tf.placeholder(tf.int32, [None, maxlen_ch])
X_len = tf.placeholder(tf.int32, [None])
Y = tf.placeholder(tf.int32, [None, maxlen_en])
Y_len = tf.placeholder(tf.int32, [None])
Y_in = Y[:, :-1]
Y_out = Y[:, 1:]

k_initializer = tf.contrib.layers.xavier_initializer()
e_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embedding_size = 512
hidden_size = 512

if mode == 'train':
    batch_size = 128
else:
    batch_size = 16

with tf.variable_scope('embedding_X'):
    embeddings_X = tf.get_variable('weights_X', [len(word2id_ch), embedding_size], initializer=e_initializer)
    embedded_X = tf.nn.embedding_lookup(embeddings_X, X)  # batch_size, seq_len, embedding_size

with tf.variable_scope('embedding_Y'):
    embeddings_Y = tf.get_variable('weights_Y', [len(word2id_en), embedding_size], initializer=e_initializer)
    embedded_Y = tf.nn.embedding_lookup(embeddings_Y, Y_in)  # batch_size, seq_len, embedding_size
```
Define the encoder, using a bidirectional LSTM.
```python
def single_cell(mode=mode):
    if mode == 'train':
        keep_prob = 0.8
    else:
        keep_prob = 1.0
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)
    return cell

def multi_cells(num_layers):
    cells = []
    for i in range(num_layers):
        cell = single_cell()
        cells.append(cell)
    return tf.nn.rnn_cell.MultiRNNCell(cells)

with tf.variable_scope('encoder'):
    num_layers = 1
    fw_cell = multi_cells(num_layers)
    bw_cell = multi_cells(num_layers)
    bi_outputs, bi_state = tf.nn.bidirectional_dynamic_rnn(fw_cell, bw_cell, embedded_X, dtype=tf.float32,
                                                           sequence_length=X_len)
    # fw: batch_size, seq_len, hidden_size
    # bw: batch_size, seq_len, hidden_size
    print('=' * 100, '\n', bi_outputs)

    encoder_outputs = tf.concat(bi_outputs, -1)
    print('=' * 100, '\n', encoder_outputs)  # batch_size, seq_len, 2 * hidden_size

    # 2 tuple(fw & bw), 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100, '\n', bi_state)

    encoder_state = []
    for i in range(num_layers):
        encoder_state.append(bi_state[0][i])  # forward
        encoder_state.append(bi_state[1][i])  # backward
    encoder_state = tuple(encoder_state)  # 2 tuple, 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100)
    for i in range(len(encoder_state)):
        print(i, encoder_state[i])
```
Define the decoder, using a two-layer LSTM.
```python
with tf.variable_scope('decoder'):
    beam_width = 10
    memory = encoder_outputs

    if mode == 'infer':
        memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
        X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
        encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
        bs = batch_size * beam_width
    else:
        bs = batch_size

    attention = tf.contrib.seq2seq.LuongAttention(hidden_size, memory, X_len, scale=True)  # multiplicative
    # attention = tf.contrib.seq2seq.BahdanauAttention(hidden_size, memory, X_len, normalize=True)  # additive
    cell = multi_cells(num_layers * 2)
    cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hidden_size, name='attention')
    decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)

    with tf.variable_scope('projected'):
        output_layer = tf.layers.Dense(len(word2id_en), use_bias=False, kernel_initializer=k_initializer)

    if mode == 'infer':
        start = tf.fill([batch_size], word2id_en['<s>'])
        decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_en['</s>'],
                                                       decoder_initial_state, beam_width, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                            output_time_major=True,
                                                                            maximum_iterations=2 * tf.reduce_max(X_len))
        sample_id = outputs.predicted_ids
    else:
        helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y, [maxlen_en - 1 for b in range(batch_size)])
        decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)

        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                            output_time_major=True)
        logits = outputs.rnn_output
        logits = tf.transpose(logits, (1, 0, 2))
        print(logits)
```
Depending on the mode, decide whether to define the loss function and the optimizer.
```python
if mode != 'infer':
    with tf.variable_scope('loss'):
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y_out, logits=logits)
        mask = tf.sequence_mask(Y_len, tf.shape(Y_out)[1], tf.float32)
        loss = tf.reduce_sum(loss * mask) / batch_size

if mode == 'train':
    learning_rate = tf.Variable(0.0, trainable=False)
    params = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, params), 5.0)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).apply_gradients(zip(grads, params))
```
The training code. After 20 epochs, the training loss drops from over 200 to 52.19, and the perplexity drops to 5.53.
```python
sess = tf.Session()
sess.run(tf.global_variables_initializer())

if mode == 'train':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    if not os.path.exists(OUTPUT_DIR):
        os.mkdir(OUTPUT_DIR)

    tf.summary.scalar('loss', loss)
    summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(OUTPUT_DIR)

    epochs = 20
    for e in range(epochs):
        total_loss = 0
        total_count = 0

        start_decay = int(epochs * 2 / 3)
        if e <= start_decay:
            lr = 1.0
        else:
            decay = 0.5 ** (int(4 * (e - start_decay) / (epochs - start_decay)))
            lr = 1.0 * decay
        sess.run(tf.assign(learning_rate, lr))

        train_ch, len_train_ch, train_en, len_train_en = shuffle(train_ch, len_train_ch, train_en, len_train_en)

        for i in tqdm(range(train_ch.shape[0] // batch_size)):
            X_batch = train_ch[i * batch_size: i * batch_size + batch_size]
            X_len_batch = len_train_ch[i * batch_size: i * batch_size + batch_size]
            Y_batch = train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = len_train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = [l - 1 for l in Y_len_batch]

            feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
            _, ls_ = sess.run([optimizer, loss], feed_dict=feed_dict)

            total_loss += ls_ * batch_size
            total_count += np.sum(Y_len_batch)

            if i > 0 and i % 100 == 0:
                writer.add_summary(sess.run(summary, feed_dict=feed_dict),
                                   e * train_ch.shape[0] // batch_size + i)
                writer.flush()

        print('Epoch %d lr %.3f perplexity %.2f' % (e, lr, np.exp(total_loss / total_count)))
        saver.save(sess, os.path.join(OUTPUT_DIR, 'nmt'))
```
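For reference, the perplexity printed above is the exponential of the average per-token cross-entropy: total_loss accumulates the masked token-level losses over the epoch, total_count accumulates the number of target tokens, and the reported value is np.exp(total_loss / total_count).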
The validation code; the perplexity on the validation set is 11.56.
```python
if mode == 'eval':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    total_loss = 0
    total_count = 0
    for i in tqdm(range(dev_ch.shape[0] // batch_size)):
        X_batch = dev_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_dev_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ls_ = sess.run(loss, feed_dict=feed_dict)

        total_loss += ls_ * batch_size
        total_count += np.sum(Y_len_batch)

    print('Dev perplexity %.2f' % np.exp(total_loss / total_count))
```
The inference code; the BLEU score on the test set is 0.2069, and the generated English translations are written to output_test_diy.
```python
if mode == 'infer':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    def translate(ids):
        words = [id2word_en[i] for i in ids]
        if words[0] == '<s>':
            words = words[1:]
        if '</s>' in words:
            words = words[:words.index('</s>')]
        return ' '.join(words)

    fw = open('output_test_diy', 'w')
    for i in tqdm(range(test_ch.shape[0] // batch_size)):
        X_batch = test_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_test_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ids = sess.run(sample_id, feed_dict=feed_dict)  # seq_len, batch_size, beam_width
        ids = np.transpose(ids, (1, 2, 0))  # batch_size, beam_width, seq_len
        ids = ids[:, 0, :]  # batch_size, seq_len

        for j in range(ids.shape[0]):
            sentence = translate(ids[j])
            fw.write(sentence + '\n')
    fw.close()

    from nmt.utils.evaluation_utils import evaluate

    for metric in ['bleu', 'rouge']:
        score = evaluate('data/test.en', 'output_test_diy', metric)
        print(metric, score / 100)
```
A ready-made implementation
The following project, github.com/tensorflow/…, provides a very complete interface: different models can be customized through simple configuration, with more than 70 configuration options supported. A few examples:
- --num_units: number of hidden units in the RNN
- --unit_type: type of RNN cell, one of lstm, gru, layer_norm_lstm, nas
- --num_layers: number of RNN layers
- --encoder_type: type of encoder, one of uni, bi, gnmt
- --residual: whether to use residual connections
- --attention: attention type, one of luong, scaled_luong, bahdanau, normed_bahdanau, or empty to disable attention
If the configuration options feel too cumbersome, the project also provides four ready-made hyperparameter templates: iwslt15.json targets a small dataset (IWSLT English-Vietnamese, 130K pairs), while the other three templates target a large dataset (WMT German-English, 4.5M pairs).
To train a Chinese-to-English model with this project, just run the following command; to train an English-to-Chinese model instead, swap the values of src and tgt.
```
python -m nmt.nmt --src=ch --tgt=en --vocab_prefix=data/vocab --train_prefix=data/train --dev_prefix=data/dev --test_prefix=data/test --out_dir=model_nmt --hparams_path=nmt/standard_hparams/iwslt15.json
```
Training produces the following outputs:
- the last five saved checkpoints of the model
- train_log, containing event files that can be inspected with TensorBoard
- output_dev and output_test, the translations of the validation set and the test set respectively
- best_bleu, containing the five model versions with the highest BLEU score on the validation set
The model reaches a BLEU score of 0.233 on the validation set and 0.224 on the test set.
Use the following command for inference. Write the text to be translated into the corresponding file; the generated English translations will be in output_test_nmt.
```
python -m nmt.nmt --out_dir=model_nmt --inference_input_file=test.ch --inference_output_file=output_test_nmt
```
Couplet generation
We use the following dataset, github.com/wb14123/cou…, which contains 700K couplet pairs.
Train the model with the following command. Copy iwslt15.json to couplet.json and, since there is much more data, increase the number of training steps accordingly by setting num_train_steps to 100000.
Not having a validation set is fine; simply substitute the test set for it, because the required arguments will raise an error if left unset.
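As a hedged sketch, the preparation of couplet.json could be scripted as below. The file paths follow the commands in this article, and only num_train_steps comes from the text above; the rest of the template is left untouched.

```python
# Hypothetical helper: copy the iwslt15 template to couplet.json and raise
# num_train_steps to 100000, as described above.
import json
import shutil

shutil.copy('nmt/standard_hparams/iwslt15.json', 'nmt/standard_hparams/couplet.json')

with open('nmt/standard_hparams/couplet.json') as fr:
    hparams = json.load(fr)
hparams['num_train_steps'] = 100000

with open('nmt/standard_hparams/couplet.json', 'w') as fw:
    json.dump(hparams, fw, indent=2)
```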
```
python -m nmt.nmt --src=in --tgt=out --vocab_prefix=couplet/vocab --train_prefix=couplet/train --dev_prefix=couplet/test --test_prefix=couplet/test --out_dir=model_couplet --hparams_path=nmt/standard_hparams/couplet.json
```
Some examples from output_test are shown below; in each group of three lines, the first is the given first line of the couplet, the second is the reference second line, and the third is the generated second line. The number of characters, the parts of speech, and the meanings mostly match.
```
腾 飞 上 铁 , 锐 意 改 革 谋 发 展 , 勇 当 千 里 马
和 谐 南 供 , 安 全 送 电 保 畅 通 , 争 做 领 头 羊
改 革 开 放 , 科 学 发 展 促 繁 荣 , 争 做 领 头 羊

风 弦 未 拨 心 先 乱
夜 幕 已 沉 梦 更 闲
雪 韵 初 融 意 更 浓

彩 屏 如 画 , 望 秀 美 崤 函 , 花 团 锦 簇
短 信 报 春 , 喜 和 谐 社 会 , 物 阜 民 康
妙 笔 生 花 , 书 辉 煌 史 册 , 虎 啸 龙 吟
```
To generate second lines for first lines the model has never seen, i.e., to run inference, simply use the method introduced earlier.
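Concretely (a hedged example; the file names here are assumptions), this means pointing --out_dir at model_couplet, writing the unseen first lines, one per line and space-separated into characters, to a file of your choice, and passing it via --inference_input_file together with an --inference_output_file path, exactly as in the nmt.nmt inference command shown above.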
References
- Neural Machine Translation (seq2seq) Tutorial: github.com/tensorflow/…
- Couplet dataset: github.com/wb14123/cou…