
Terminology-Dictionary-Constrained Machine Translation Challenge: Baseline and Improvements (Datawhale AI Summer Camp)

Competition background

Neural machine translation (NMT) has made remarkable progress in both translation quality and speed. In specific domains and industries, however, NMT still faces challenges, particularly with terminology consistency. For domain terms, person names, place names, and other specialized vocabulary, machine translation often produces inaccurate output, which leads to confusion or ambiguity. Introducing a terminology dictionary makes it possible to correct such errors and thereby improve translation quality.

Competition task

The task of this competition is English-to-Chinese machine translation with terminology-dictionary intervention. The organizers provide the following data:

  • Training set: Chinese-English bilingual data, more than 140,000 sentence pairs.
  • Development set: English-Chinese bilingual data, 1,000 sentence pairs.
  • Test set: English-Chinese bilingual data, 1,000 sentence pairs.
  • Terminology dictionary: 2,226 English-Chinese term pairs.

Participating teams must use the provided training data to build and train a machine translation model, and then produce final translations of the test set with the help of the terminology dictionary.

Data description

All files are UTF-8 encoded. The training set, development set, test set, and terminology dictionary are formatted as follows:

  • Training set: each line is one sentence-pair sample, as shown in Figure 1.

[Figure 1: training set format]

The terminology dictionary format is shown in Figure 2.

[Figure 2: terminology dictionary format]
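Since the figures themselves are not reproduced here, the following is a minimal sketch of reading one line from each file, assuming the tab-separated layout that the baseline code below relies on (the file paths are the ones used in the baseline):

# Minimal sketch: inspect the first line of the training file and of the dictionary.
# The tab-separated "English<TAB>Chinese" layout is an assumption inferred from the
# split('\t') calls in the baseline code below.
with open('../dataset/train.txt', 'r', encoding='utf-8') as f:
    en, zh = f.readline().strip().split('\t')    # English sentence <TAB> Chinese sentence
    print(en, '=>', zh)

with open('../dataset/en-zh.dic', 'r', encoding='utf-8') as f:
    en_term, zh_term = f.readline().strip().split('\t')  # English term <TAB> Chinese term
    print(en_term, '=>', zh_term)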

Evaluation metric

Submitted test-set translations are scored with the automatic metric BLEU-4, computed with the open-source sacrebleu toolkit.
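For reference, here is a minimal sketch of computing BLEU-4 with sacrebleu's Python API. The hypothesis and reference strings are made-up placeholders; the same BLEU().corpus_score call is used in the baseline's evaluate_bleu function below.

from sacrebleu.metrics import BLEU

# Hypothetical example: one hypothesis per source sentence, and one list of
# references aligned with the hypotheses (wrapped in an outer list).
hypotheses = ['这是 一个 测试', '术语 翻译 保持 一致']
references = [['这是 一个 测试', '术语 翻译 应当 保持 一致']]

bleu = BLEU()                              # BLEU-4 by default
score = bleu.corpus_score(hypotheses, references)
print(score.score)                         # corpus-level BLEU-4 score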

Baseline

The baseline works as follows; the full code listing is given after this list.

  • Load and preprocess the training data and the terminology dictionary.
  • Define a sequence-to-sequence neural network model with an encoder and a decoder.
  • Train the model on the training data and save the trained parameters.
  • Run inference on the test set and apply the terminology dictionary to the output to keep term translations consistent.
# Install dependencies (sacrebleu is needed for the BLEU evaluation below)
!pip install torchtext sacrebleu

import random
import time
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, Subset
from torchtext.data.utils import get_tokenizer
from sacrebleu.metrics import BLEU


# Dataset class: loads the parallel data and handles the terminology dictionary
class TranslationDataset(Dataset):
    def __init__(self, filename, terminology):
        self.data = []
        with open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                en, zh = line.strip().split('\t')
                self.data.append((en, zh))
        self.terminology = terminology
        # Build vocabularies and make sure terminology entries are included
        self.en_tokenizer = get_tokenizer('basic_english')
        self.zh_tokenizer = list  # character-level tokenization for Chinese
        en_vocab = Counter(self.terminology.keys())  # seed the vocabulary with the terms
        zh_vocab = Counter()
        for en, zh in self.data:
            en_vocab.update(self.en_tokenizer(en))
            zh_vocab.update(self.zh_tokenizer(zh))
        self.en_vocab = ['<pad>', '<sos>', '<eos>'] + list(self.terminology.keys()) + [word for word, _ in en_vocab.most_common(10000)]
        self.zh_vocab = ['<pad>', '<sos>', '<eos>'] + [word for word, _ in zh_vocab.most_common(10000)]
        self.en_word2idx = {word: idx for idx, word in enumerate(self.en_vocab)}
        self.zh_word2idx = {word: idx for idx, word in enumerate(self.zh_vocab)}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        en, zh = self.data[idx]
        # Out-of-vocabulary words fall back to the <sos> index; <eos> is appended
        en_tensor = torch.tensor([self.en_word2idx.get(word, self.en_word2idx['<sos>']) for word in self.en_tokenizer(en)] + [self.en_word2idx['<eos>']])
        zh_tensor = torch.tensor([self.zh_word2idx.get(word, self.zh_word2idx['<sos>']) for word in self.zh_tokenizer(zh)] + [self.zh_word2idx['<eos>']])
        return en_tensor, zh_tensor


def collate_fn(batch):
    en_batch, zh_batch = [], []
    for en_item, zh_item in batch:
        en_batch.append(en_item)
        zh_batch.append(zh_item)
    # Pad to the longest sequence in the batch (<pad> has index 0)
    en_batch = nn.utils.rnn.pad_sequence(en_batch, padding_value=0, batch_first=True)
    zh_batch = nn.utils.rnn.pad_sequence(zh_batch, padding_value=0, batch_first=True)
    return en_batch, zh_batch


class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden


class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        _, hidden = self.encoder(src)
        input = trg[:, 0].unsqueeze(1)  # first decoder input is <sos>
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t].unsqueeze(1) if teacher_force else top1.unsqueeze(1)
        return outputs


def load_terminology_dictionary(dict_file):
    terminology = {}
    with open(dict_file, 'r', encoding='utf-8') as f:
        for line in f:
            en_term, ch_term = line.strip().split('\t')
            terminology[en_term] = ch_term
    return terminology


def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, (src, trg) in enumerate(iterator):
        src, trg = src.to(device), trg.to(device)
        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        # Drop the first (all-zero) time step before computing the loss
        output = output[:, 1:].contiguous().view(-1, output_dim)
        trg = trg[:, 1:].contiguous().view(-1)
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)


def evaluate_bleu(model, dataset, src_file, ref_file, terminology, device):
    model.eval()
    src_sentences = load_sentences(src_file)
    ref_sentences = load_sentences(ref_file)
    translated_sentences = []
    for src in src_sentences:
        translated = translate_sentence(src, model, dataset, terminology, device)
        translated_sentences.append(translated)
    bleu = BLEU()
    score = bleu.corpus_score(translated_sentences, [ref_sentences])
    return score


def translate_sentence(sentence, model, dataset, terminology, device, max_length=50):
    model.eval()
    tokens = dataset.en_tokenizer(sentence)
    tensor = torch.LongTensor([dataset.en_word2idx.get(token, dataset.en_word2idx['<sos>']) for token in tokens]).unsqueeze(0).to(device)
    with torch.no_grad():
        _, hidden = model.encoder(tensor)
        translated_tokens = []
        input_token = torch.LongTensor([[dataset.zh_word2idx['<sos>']]]).to(device)
        for _ in range(max_length):
            output, hidden = model.decoder(input_token, hidden)
            top_token = output.argmax(1)
            translated_token = dataset.zh_vocab[top_token.item()]
            if translated_token == '<eos>':
                break
            # Terminology intervention: if the generated token matches a dictionary
            # value, map it back to the corresponding dictionary entry
            if translated_token in terminology.values():
                for en_term, ch_term in terminology.items():
                    if translated_token == ch_term:
                        translated_token = en_term
                        break
            translated_tokens.append(translated_token)
            input_token = top_token.unsqueeze(1)
    return ''.join(translated_tokens)


def inference(model, dataset, src_file, save_dir, terminology, device):
    model.eval()
    src_sentences = load_sentences(src_file)
    translated_sentences = []
    for src in src_sentences:
        translated = translate_sentence(src, model, dataset, terminology, device)
        translated_sentences.append(translated)
    text = '\n'.join(translated_sentences)
    with open(save_dir, 'w', encoding='utf-8') as f:
        f.write(text)


def load_sentences(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]


# Main
if __name__ == '__main__':
    start_time = time.time()
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load the terminology dictionary
    terminology = load_terminology_dictionary('../dataset/en-zh.dic')

    # Load the data and train on the first N samples only
    dataset = TranslationDataset('../dataset/train.txt', terminology)
    N = 1000
    subset_indices = list(range(N))
    subset_dataset = Subset(dataset, subset_indices)
    train_loader = DataLoader(subset_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

    # Model hyperparameters
    INPUT_DIM = len(dataset.en_vocab)
    OUTPUT_DIM = len(dataset.zh_vocab)
    ENC_EMB_DIM = 256
    DEC_EMB_DIM = 256
    HID_DIM = 512
    N_LAYERS = 2
    ENC_DROPOUT = 0.5
    DEC_DROPOUT = 0.5

    enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
    dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
    model = Seq2Seq(enc, dec, device).to(device)

    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss(ignore_index=dataset.zh_word2idx['<pad>'])

    # Train and save the model
    N_EPOCHS = 10
    CLIP = 1
    for epoch in range(N_EPOCHS):
        train_loss = train(model, train_loader, optimizer, criterion, CLIP)
        print(f'Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f}')
    torch.save(model.state_dict(), './translation_model_GRU.pth')

    end_time = time.time()
    elapsed_time_minute = (end_time - start_time) / 60
    print(f"Total running time: {elapsed_time_minute:.2f} minutes")

    # Evaluate on the development set
    bleu_score = evaluate_bleu(model, dataset, '../dataset/dev_en.txt', '../dataset/dev_zh.txt', terminology, device)
    print(f'BLEU-4 score: {bleu_score.score:.2f}')

    # Translate the test set and save the submission file
    save_dir = '../dataset/submit.txt'
    inference(model, dataset, "../dataset/test_en.txt", save_dir, terminology, device)
    print(f"Translation finished! Results saved to {save_dir}")
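Because a full training run takes time, here is a minimal sketch of reloading the saved checkpoint for inference only. It assumes the class and function definitions above are already in scope and that the dataset (and hence the vocabularies) is rebuilt from the same training file; the hyperparameters mirror the ones used in the main block.

# Minimal sketch: reload the saved weights and rerun test-set inference without retraining.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
terminology = load_terminology_dictionary('../dataset/en-zh.dic')
dataset = TranslationDataset('../dataset/train.txt', terminology)  # rebuild vocabularies

enc = Encoder(len(dataset.en_vocab), 256, 512, 2, 0.5)
dec = Decoder(len(dataset.zh_vocab), 256, 512, 2, 0.5)
model = Seq2Seq(enc, dec, device).to(device)
model.load_state_dict(torch.load('./translation_model_GRU.pth', map_location=device))

inference(model, dataset, '../dataset/test_en.txt', '../dataset/submit.txt', terminology, device)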

Baseline improvements

  • Use more training samples.
  • Train for more epochs.

# Train on the first 80% of the dataset instead of only 1,000 samples
N = int(len(dataset) * 0.8)
# Train for more epochs
N_EPOCHS = 100

If you found this post helpful, please like, bookmark, and follow me; tips are also welcome!

Stay tuned for my future posts, where I will share more on artificial intelligence, natural language processing, and computer vision.

Thank you all for your support!
