https://github.com/OpenNMT/OpenNMT-py
GitHub - OpenNMT/CTranslate2: Fast inference engine for Transformer models
Install the packages:

```shell
pip install OpenNMT-py
pip install ctranslate2
```
Comparing CTranslate2 and OpenNMT-py performance (without using the official Docker images)
Download a pretrained model from the links on the OpenNMT-py GitHub page: the English-German Transformer trained on WMT. The archive contains two files, averaged-10-epoch.pt and sentencepiece.model; the former is the saved model, the latter the SentencePiece tokenizer model.
For CTranslate2 model conversion and quantization, see:
https://github.com/OpenNMT/CTranslate2/blob/master/docs/quantization.md
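The conversion step can also be scripted with CTranslate2's converter API. A minimal sketch, assuming the pretrained archive has been extracted into transformer-ende-wmt-pyOnmt/ (the paths are illustrative; it cannot run without the downloaded checkpoint):

```python
from ctranslate2.converters import OpenNMTPyConverter

# Paths assume the pretrained archive was extracted here (illustrative).
model_path = "transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt"
output_dir = "transformer-ende-wmt-pyOnmt/ende_ctranslate2"

# Convert the OpenNMT-py checkpoint to the CTranslate2 format.
# quantization="int8" is optional; it shrinks the model and speeds up inference.
OpenNMTPyConverter(model_path).convert(output_dir, quantization="int8")
```

The output directory matches the model_path used by the CTranslate2 inference script later in this post.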
Download the test data (the following is taken from CTranslate2/tools/benchmark/benchmark_all.py):
```python
import sacrebleu

# Benchmark configuration
test_set = "wmt14"
langpair = "en-de"

print("Downloading the test files...")
source_file = sacrebleu.get_source_file(test_set, langpair=langpair)
target_file = sacrebleu.get_reference_files(test_set, langpair=langpair)[0]
print("source_file:", source_file)
print("target_file:", target_file)
```
This downloads the source and reference texts, en-de.en and en-de.de respectively.
Benchmarking: see the ctranslate2 and opennmt_py directories under CTranslate2/tools/benchmark/opennmt_ende_wmt14/; each contains a tokenize.sh and a translate.sh.
Starting with the opennmt_py directory, create custom_tokenize.py based on tokenize.sh:
```python
import pyonmttok

sp_model_path = "transformer-ende-wmt-pyOnmt/sentencepiece.model"

src_file = "sacrebleu/wmt14/en-de.en"
tgt_file = src_file + ".tok"

pyonmttok.Tokenizer("none", sp_model_path=sp_model_path).tokenize_file(src_file, tgt_file)
```
Running it writes a tokenized version of the input text. Next, run the translation script to translate the tokenized input. The modified translate.sh:
```shell
#!/bin/bash

# Flags from the original benchmark script, unused here:
# EXTRA_ARGS=""
# if [ $DEVICE = "GPU" ]; then
#     EXTRA_ARGS+=" -gpu 0"
# fi
# if [ ${INT8:-0} = "1" ]; then
#     EXTRA_ARGS+=" -int8"
# fi

model_path=transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt
src_file=sacrebleu/wmt14/en-de.en.tok
out_file=${src_file}.onmt.out

# onmt_translate is equivalent to invoking the module directly
python -m onmt.bin.translate \
    -model ${model_path} \
    -src ${src_file} \
    -out ${out_file} \
    -batch_size 32 \
    -beam_size 4 -gpu 0
```
Finally, detokenize. Following detokenize.sh, write custom_detokenize.py:
```python
import pyonmttok

sp_model_path = "transformer-ende-wmt-pyOnmt/sentencepiece.model"

src_file = "sacrebleu/wmt14/en-de.en.tok.onmt.out"
tgt_file = src_file + ".detok"

pyonmttok.Tokenizer("none", sp_model_path=sp_model_path).detokenize_file(src_file, tgt_file)
```
Running it detokenizes the translated tokenized text; the result should be similar to en-de.de.
The CTranslate2 flow is identical except for the translate step. benchmark/ctranslate2/translate.sh invokes a translate binary, but the pip-installed ctranslate2 package does not ship it. If you are not using the official image (or building your own) and run inference from the pip package, you have to read the input file yourself and call the Python API. For a fair comparison, the OpenNMT-py inference can be changed to the same approach, with a warmup added.
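The warmup-then-time pattern is engine-agnostic, so one helper can wrap both engines' batch-translate calls. A sketch; timed_translate is a hypothetical name, not part of either library:

```python
import time

def timed_translate(translate_fn, batches, warmup_batches=4):
    """Time translate_fn over all batches, excluding a short warmup.

    translate_fn is any callable taking one batch of tokenized sentences;
    the warmup runs absorb model loading, CUDA context creation and kernel
    autotuning so they don't skew the measured time.
    """
    for batch in batches[:warmup_batches]:
        translate_fn(batch)
    start = time.time()
    results = [translate_fn(batch) for batch in batches]
    elapsed = time.time() - start
    return results, elapsed
```

Wrapping OpenNMT-py's translator and CTranslate2's `Translator.translate_batch` in the same helper keeps the timing methodology identical for both.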
For the CTranslate2 Python API, see:
CTranslate2/python.md at master · OpenNMT/CTranslate2 · GitHub
Here, a custom ctrans_custom_translate.py reads the file and runs inference:
```python
import time

import ctranslate2

src_file = "sacrebleu/wmt14/en-de.en.tok"
tgt_file = src_file + ".ctrans.out"

model_path = "transformer-ende-wmt-pyOnmt/ende_ctranslate2/"
device = "cuda"  # "cpu" or "cuda"
max_batch_size = 32
beam_size = 4

# Read the tokenized input: one sentence per line, tokens separated by spaces.
with open(src_file, "r") as f:
    lines = f.readlines()
lines = [line.strip("\n").split(" ") for line in lines]

translator = ctranslate2.Translator(model_path, device=device)

# Warmup so model loading and CUDA initialization don't skew the timing.
translator.translate_batch(lines[:max_batch_size * 4], max_batch_size=max_batch_size, beam_size=beam_size)

time1 = time.time()
trans_results = translator.translate_batch(lines, max_batch_size=max_batch_size, beam_size=beam_size)
time2 = time.time()
print("ctranslate2 translate time:", time2 - time1)

# Keep the best hypothesis for each sentence and join its tokens back with spaces.
result_lines = [" ".join(result.hypotheses[0]) + "\n" for result in trans_results]

with open(tgt_file, "w") as f:
    f.writelines(result_lines)
```