Machine translation is the process of using a computer to convert text in one natural language (the source language) into another natural language (the target language).
This project is a PaddlePaddle implementation of Transformer, the mainstream model in machine translation. It covers model training, prediction, and the use of custom data, so that users can build their own translation models on top of the released code.
Transformer is a network architecture proposed in the paper Attention Is All You Need for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation. It models the mapping from input sequence to output sequence entirely with attention mechanisms.
Compared with the recurrent neural networks (RNNs) that were previously widely used in Seq2Seq models, using self-attention to transform the input sequence into the output sequence has two main advantages: the computation within each layer can be fully parallelized, which speeds up training, and the path between any two positions in a sequence has constant length, which makes long-range dependencies easier to model.
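To make the self-attention operation concrete, here is a minimal NumPy sketch of scaled dot-product attention, the building block that Transformer stacks in place of recurrence. The function and variable names are illustrative only and are not part of this project's code.

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_model); every position attends to every position
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key positions
    return weights @ v                                   # weighted sum of the values

# Toy usage: 3 tokens with 4-dimensional representations, self-attention (q = k = v)
x = np.random.rand(3, 4)
print(scaled_dot_product_attention(x, x, x).shape)       # (3, 4)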
PaddlePaddle framework: the AI Studio platform already has the latest 2.1 version installed by default.
PaddleNLP: deeply compatible with framework 2.1, it is the best practice of the PaddlePaddle 2.1 framework for NLP.
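A quick way to confirm which versions the environment actually provides (a small check added here, not part of the original notebook):

import paddle
import paddlenlp

# Print the installed framework and library versions
print(paddle.__version__)      # expected to be 2.1.x on AI Studio
print(paddlenlp.__version__)   # 2.1.0 after the pip upgrade below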
!unzip -o transformer_mt.zip
%cd transformer_mt/
[Errno 2] No such file or directory: 'transformer_mt/'
/home/aistudio/transformer_mt
# Install dependencies
!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
import os
import time
import yaml
import logging
import argparse
import numpy as np
from pprint import pprint
from attrdict import AttrDict
import jieba
from functools import partial

import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, BatchSampler
from paddlenlp.data import Vocab, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import TransformerModel, InferTransformerModel, CrossEntropyCriterion, position_encoding_init
from paddlenlp.utils.log import logger

from utils import post_process_seq
This tutorial uses the Chinese-English portion of the CWMT dataset as the training corpus. The CWMT dataset contains over 9 million sentence pairs of relatively high quality, which makes it well suited for training a Transformer translation model.
Chinese text needs Jieba word segmentation followed by BPE; English text needs BPE only.
Advantages of BPE: it compresses the vocabulary to a manageable size and greatly reduces the out-of-vocabulary (OOV) problem, since rare words are split into subword units that remain in the vocabulary.
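As a rough illustration of what the Jieba + BPE step looks like (this is not the actual contents of preprocess.sh, and the BPE codes-file path below is only an assumed example):

import jieba
from subword_nmt.apply_bpe import BPE

# Jieba word segmentation for a Chinese sentence
sentence = "机器翻译是人工智能的重要方向。"
segmented = " ".join(jieba.cut(sentence))
print(segmented)

# Apply BPE merges learned on the training corpus (path is an assumed example)
with open("train_dev_test/bpe.ch.codes", encoding="utf8") as codes_file:
    bpe = BPE(codes_file)
print(bpe.process_line(segmented))   # rare words are split into subwords marked with @@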
# Data preprocessing: Jieba word segmentation, BPE tokenization, and vocabulary building
!bash preprocess.sh
jieba tokenize...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.705 seconds.
Prefix dict has been built successfully.
source learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
target learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
source get-vocab. if loading pretrained model, use its vocab.
target get-vocab. if loading pretrained model, use its vocab.
Over.
# Download the pretrained model
!bash get_data_and_model.sh
Over.
The create_data_loader function below builds the DataLoader objects needed for the training and validation sets, and the create_infer_loader function builds the DataLoader needed for the prediction set; a DataLoader yields the data batch by batch. The PaddleNLP built-in APIs called inside these functions are briefly described below:
paddlenlp.data.Vocab.load_vocabulary: the Vocab vocabulary class, which collects the methods for mapping between text tokens and ids and supports building a vocabulary from a file, a dict, JSON, and other sources.
paddlenlp.datasets.load_dataset: when creating a dataset from local files, the recommended approach is to write a read function matching the local data format and pass it to load_dataset().
paddlenlp.data.Pad: the padding operation.

# Custom function for reading local data
def read(src_path, tgt_path, is_predict=False):
    if is_predict:
        with open(src_path, 'r', encoding='utf8') as src_f:
            for src_line in src_f.readlines():
                src_line = src_line.strip()
                if not src_line:
                    continue
                yield {'src': src_line, 'tgt': ''}
    else:
        with open(src_path, 'r', encoding='utf8') as src_f, open(tgt_path, 'r', encoding='utf8') as tgt_f:
            for src_line, tgt_line in zip(src_f.readlines(), tgt_f.readlines()):
                src_line = src_line.strip()
                if not src_line:
                    continue
                tgt_line = tgt_line.strip()
                if not tgt_line:
                    continue
                yield {'src': src_line, 'tgt': tgt_line}

# Filter out samples shorter than min_len or longer than max_len
def min_max_filer(data, max_len, min_len=0):
    # 1 for special tokens.
    data_min_len = min(len(data[0]), len(data[1])) + 1
    data_max_len = max(len(data[0]), len(data[1])) + 1
    return (data_min_len >= min_len) and (data_max_len <= max_len)
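As a quick illustration of the two data utilities described above (the vocabulary path and special tokens are taken from the configuration printed later; the example sentence is made up):

from paddlenlp.data import Vocab, Pad

# Vocab: map tokens to ids and back
src_vocab = Vocab.load_vocabulary(
    'train_dev_test/vocab.ch.src',
    bos_token='<s>', eos_token='<e>', unk_token='<unk>')
print(src_vocab.to_indices('机器 翻译'.split()))   # token ids; unknown tokens map to <unk>

# Pad: pad a batch of variable-length id sequences to the same length
word_pad = Pad(pad_val=0)
print(word_pad([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]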
# Build the DataLoaders for the training and validation sets
def create_data_loader(args):
    train_dataset = load_dataset(
        read,
        src_path=args.training_file.split(',')[0],
        tgt_path=args.training_file.split(',')[1],
        lazy=False)
    dev_dataset = load_dataset(
        read,
        src_path=args.validation_file.split(',')[0],
        tgt_path=args.validation_file.split(',')[1],
        lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()
        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)
        return source, target

    # DataLoaders for the training set and the validation set
    data_loaders = []
    for i, dataset in enumerate([train_dataset, dev_dataset]):
        dataset = dataset.map(convert_samples, lazy=False).filter(
            partial(min_max_filer, max_len=args.max_length))
        # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
        batch_sampler = BatchSampler(
            dataset, batch_size=args.batch_size, shuffle=True, drop_last=False)
        # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
        data_loader = DataLoader(
            dataset=dataset,
            batch_sampler=batch_sampler,
            collate_fn=partial(
                prepare_train_input,
                bos_idx=args.bos_idx,
                eos_idx=args.eos_idx,
                pad_idx=args.bos_idx),
            num_workers=0,
            return_list=True)
        data_loaders.append(data_loader)
    return data_loaders


def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by training into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
    lbl_word = np.expand_dims(
        word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)

    data_inputs = [src_word, trg_word, lbl_word]
    return data_inputs
# Build the DataLoader for the test set; the steps mirror create_data_loader above
def create_infer_loader(args):
    dataset = load_dataset(
        read,
        src_path=args.predict_file,
        tgt_path=None,
        is_predict=True,
        lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()
        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)
        return source, target

    dataset = dataset.map(convert_samples, lazy=False)

    # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
    batch_sampler = BatchSampler(
        dataset, batch_size=args.infer_batch_size, drop_last=False)

    # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
    data_loader = DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=partial(
            prepare_infer_input,
            bos_idx=args.bos_idx,
            eos_idx=args.eos_idx,
            pad_idx=args.bos_idx),
        num_workers=0,
        return_list=True)
    return data_loader, trg_vocab.to_tokens


def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by beam search decoder into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    return [src_word, ]
PaddleNLP provides Transformer APIs that can be called directly:

paddlenlp.transformers.TransformerModel: the implementation of the Transformer model.
paddlenlp.transformers.InferTransformerModel: the Transformer model used for generation.
paddlenlp.transformers.CrossEntropyCriterion: computes the cross-entropy loss.
paddlenlp.transformers.position_encoding_init: initializes the Transformer position encoding.

Run the do_train function. Inside do_train, the optimizer, the loss function, and the evaluation metric (perplexity) are configured.
Perplexity is a standard measure of language model quality, i.e., of how fluent the generated sentences are; it is widely used in machine translation, text generation, and related fields. The lower the perplexity, the more fluent the sentences and the better the model.
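Concretely, the training loop below reports perplexity as the exponential of the average per-token cross-entropy loss, so the two numbers in each log line are tied together. A one-line check against the first training log line shown further down:

import numpy as np

avg_loss = 10.526473     # "avg loss" from the first training log line below
print(np.exp(avg_loss))  # ~37289.7, the "ppl" value reported in the same line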
def do_train(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)

    # Set seed for CE
    random_seed = eval(str(args.random_seed))
    if random_seed is not None:
        paddle.seed(random_seed)

    # Define data loader
    (train_loader), (eval_loader) = create_data_loader(args)

    # Define model
    transformer = TransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx)

    # Define loss
    criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)

    scheduler = paddle.optimizer.lr.NoamDecay(
        args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)

    # Define optimizer
    optimizer = paddle.optimizer.Adam(
        learning_rate=scheduler,
        beta1=args.beta1,
        beta2=args.beta2,
        epsilon=float(args.eps),
        parameters=transformer.parameters())

    step_idx = 0

    # Train loop
    for pass_id in range(args.epoch):
        batch_id = 0
        for input_data in train_loader:
            (src_word, trg_word, lbl_word) = input_data

            logits = transformer(src_word=src_word, trg_word=trg_word)

            sum_cost, avg_cost, token_num = criterion(logits, lbl_word)

            # Compute gradients
            avg_cost.backward()
            # Update parameters
            optimizer.step()
            # Clear gradients
            optimizer.clear_grad()

            if (step_idx + 1) % args.print_step == 0 or step_idx == 0:
                total_avg_cost = avg_cost.numpy()
                logger.info(
                    "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
                    " ppl: %f " %
                    (step_idx, pass_id, batch_id, total_avg_cost,
                     np.exp([min(total_avg_cost, 100)])))

            if (step_idx + 1) % args.save_step == 0:
                # Validation
                transformer.eval()
                total_sum_cost = 0
                total_token_num = 0
                with paddle.no_grad():
                    for input_data in eval_loader:
                        (src_word, trg_word, lbl_word) = input_data
                        logits = transformer(
                            src_word=src_word, trg_word=trg_word)
                        sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
                        total_sum_cost += sum_cost.numpy()
                        total_token_num += token_num.numpy()
                        total_avg_cost = total_sum_cost / total_token_num
                    logger.info("validation, step_idx: %d, avg loss: %f, "
                                " ppl: %f" %
                                (step_idx, total_avg_cost,
                                 np.exp([min(total_avg_cost, 100)])))
                transformer.train()

                if args.save_model:
                    model_dir = os.path.join(args.save_model,
                                             "step_" + str(step_idx))
                    if not os.path.exists(model_dir):
                        os.makedirs(model_dir)
                    paddle.save(transformer.state_dict(),
                                os.path.join(model_dir, "transformer.pdparams"))
                    paddle.save(optimizer.state_dict(),
                                os.path.join(model_dir, "transformer.pdopt"))

            batch_id += 1
            step_idx += 1
            scheduler.step()

    if args.save_model:
        model_dir = os.path.join(args.save_model, "step_final")
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
        paddle.save(transformer.state_dict(),
                    os.path.join(model_dir, "transformer.pdparams"))
        paddle.save(optimizer.state_dict(),
                    os.path.join(model_dir, "transformer.pdopt"))
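The learning-rate schedule configured above is the Noam decay from Attention Is All You Need: the rate warms up linearly for warmup_steps steps and then decays with the inverse square root of the step number, scaled by d_model^-0.5. A minimal sketch of that formula, assuming paddle.optimizer.lr.NoamDecay follows the paper (the helper name below is illustrative; the default values come from the configuration printed further down):

def noam_lr(step, d_model=512, warmup_steps=8000, learning_rate=2.0):
    # lr = learning_rate * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return learning_rate * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(noam_lr(1))        # tiny at the start of warmup
print(noam_lr(8000))     # peaks around the end of warmup
print(noam_lr(80000))    # then decays as training continues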
# Read in the configuration parameters
yaml_file = 'transformer.base.yaml'
with open(yaml_file, 'rt') as f:
args = AttrDict(yaml.safe_load(f))
pprint(args)
{'batch_size': 50,
 'beam_size': 5,
 'beta1': 0.9,
 'beta2': 0.997,
 'bos_idx': 0,
 'd_inner_hid': 2048,
 'd_model': 512,
 'dropout': 0.1,
 'eos_idx': 1,
 'epoch': 1,
 'eps': '1e-9',
 'infer_batch_size': 50,
 'init_from_params': 'trained_models/CWMT2021_step_345000/',
 'label_smooth_eps': 0.1,
 'learning_rate': 2.0,
 'max_length': 256,
 'max_out_len': 256,
 'n_best': 1,
 'n_head': 8,
 'n_layer': 6,
 'output_file': 'train_dev_test/predict.txt',
 'pad_factor': 8,
 'predict_file': 'train_dev_test/ccmt2019-news.zh2en.source_bpe',
 'print_step': 10,
 'random_seed': 'None',
 'save_model': 'trained_models',
 'save_step': 20,
 'special_token': ['<s>', '<e>', '<unk>'],
 'src_vocab_fpath': 'train_dev_test/vocab.ch.src',
 'src_vocab_size': 10000,
 'training_file': 'train_dev_test/train.ch.bpe,train_dev_test/train.en.bpe',
 'trg_vocab_fpath': 'train_dev_test/vocab.en.tgt',
 'trg_vocab_size': 10000,
 'unk_idx': 2,
 'use_gpu': True,
 'validation_file': 'train_dev_test/dev.ch.bpe,train_dev_test/dev.en.bpe',
 'warmup_steps': 8000,
 'weight_sharing': False}
do_train(args)
[2021-10-20 18:45:23,800] [ INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.526473, ppl: 37289.726562
[2021-10-20 18:45:24,991] [ INFO] - step_idx: 9, epoch: 0, batch: 9, avg loss: 10.517828, ppl: 36968.742188
[2021-10-20 18:45:26,296] [ INFO] - step_idx: 19, epoch: 0, batch: 19, avg loss: 10.475711, ppl: 35444.054688
[2021-10-20 18:45:26,404] [ INFO] - validation, step_idx: 19, avg loss: 10.480215, ppl: 35604.062500
The final quality of the trained model is usually measured on a test set; in machine translation, the standard metric is the BLEU score.
def do_predict(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)

    # Define data loader
    test_loader, to_tokens = create_infer_loader(args)

    # Define model
    transformer = InferTransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx,
        beam_size=args.beam_size,
        max_out_len=args.max_out_len)

    # Load the trained model
    assert args.init_from_params, (
        "Please set init_from_params to load the infer model.")

    model_dict = paddle.load(
        os.path.join(args.init_from_params, "transformer.pdparams"))

    # To avoid a longer length than training, reset the size of position
    # encoding to max_length
    model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    transformer.load_dict(model_dict)

    # Set evaluate mode
    transformer.eval()

    f = open(args.output_file, "w")
    with paddle.no_grad():
        for (src_word, ) in test_loader:
            finished_seq = transformer(src_word=src_word)
            finished_seq = finished_seq.numpy().transpose([0, 2, 1])
            for ins in finished_seq:
                for beam_idx, beam in enumerate(ins):
                    if beam_idx >= args.n_best:
                        break
                    id_list = post_process_seq(beam, args.bos_idx, args.eos_idx)
                    word_list = to_tokens(id_list)
                    sequence = " ".join(word_list) + "\n"
                    f.write(sequence)
    f.close()
do_predict(args)
Each line of the prediction output is the highest-scoring translation of the corresponding input line. For data processed with BPE, the predicted translations are also in BPE form and must be restored to the original form (here, the tokenized text) before they can be evaluated correctly.
# Restore the predictions in predict.txt to tokenized text
! sed -r 's/(@@ )|(@@ ?$)//g' train_dev_test/predict.txt > train_dev_test/predict.tok.txt
# The BLEU evaluation tool comes from https://github.com/moses-smt/mosesdecoder.git
! tar -zxf mosesdecoder.tar.gz
# Compute multi-bleu
! perl mosesdecoder/scripts/generic/multi-bleu.perl train_dev_test/ccmt2019-news.zh2en.ref*.txt < train_dev_test/predict.tok.txt
BLEU = 38.11, 74.5/49.1/32.5/21.7 (BP=0.951, ratio=0.952, hyp_len=22252, ref_len=23371)
It is not advisable to publish scores from multi-bleu.perl. The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups. Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization. Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.
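As a sanity check on how multi-bleu.perl arrives at its score: BLEU is the brevity penalty multiplied by the geometric mean of the 1- to 4-gram precisions, all of which appear in the output line above. A minimal verification using those printed numbers:

import numpy as np

precisions = [74.5, 49.1, 32.5, 21.7]    # 1- to 4-gram precisions from the output above
bp = 0.951                               # brevity penalty (BP) from the same line
bleu = bp * np.exp(np.mean(np.log(precisions)))
print(round(bleu, 2))                    # ~38.12, matching the reported BLEU = 38.11 up to rounding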