[KEY: > input, = target, < output]

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .

> vous etes trop maigre .
= you re too skinny .
< you re all alone .
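This is made possible by the sequence-to-sequence network, in which two recurrent neural networks work together to transform one sequence into another. An encoder network condenses the input sequence into a vector, and a decoder network unfolds that vector into a new sequence.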
https://pytorch.org/: PyTorch installation instructions
Deep Learning with PyTorch: A 60 Minute Blitz: a basic introduction to PyTorch
Learning PyTorch with Examples: a broad and deep overview
PyTorch for Former Torch Users: if you are a former Lua Torch user
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Sequence to Sequence Learning with Neural Networks
Neural Machine Translation by Jointly Learning to Align and Translate
A Neural Conversational Model
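You may also find the earlier tutorials Classifying Names with a Character-Level RNN and Generating Names with a Character-Level RNN helpful, since those concepts are very similar to the encoder and decoder models, respectively.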
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
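A question on Open Data Stack Exchange pointed to the open translation site https://tatoeba.org/, whose downloads are available at https://tatoeba.org/eng/downloads
The English-to-French pairs are too large to include in the repo, so download them to data/eng-fra.txt before continuing. The file is a tab-separated list of translation pairs: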
I am cold. J'ai froid.
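Similar to the character encoding used in the character-level RNN tutorials, we will represent each word in a language as a one-hot vector: a giant vector of zeros except for a single one (at the index of the word). Compared to the dozens of characters that might exist in a language, there are many more words, so the encoding vector is much larger. We will, however, cheat a bit and trim the data to use only a few thousand words per language.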
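To make the one-hot representation concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the `one_hot` helper are hypothetical and exist only for this illustration; they are not part of the tutorial code.

```python
# Hypothetical five-word vocabulary, used only for illustration.
vocab = ["i", "am", "cold", "warm", "."]
word2index = {word: i for i, word in enumerate(vocab)}


def one_hot(word, word2index):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(word2index)
    vector[word2index[word]] = 1
    return vector


print(one_hot("cold", word2index))  # [0, 0, 1, 0, 0]
```

The vector length equals the vocabulary size, which is why trimming each language to a few thousand words keeps the encoding manageable.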
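We will need a unique index per word to use as the inputs and targets of the networks later. To keep track of all this we will use a helper class called Lang, which has word → index (word2index) and index → word (index2word) dictionaries, as well as a count of each word (word2count) that is used later to replace rare words.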
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
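Per-word counts are what make it possible to replace rare words later. The following is only a sketch of that idea, not the tutorial's own method: the `UNK` placeholder token, the `replace_rare` helper, and the `min_count` threshold are all assumptions introduced here, and a `Counter` stands in for `word2count`.

```python
from collections import Counter

UNK = "UNK"  # hypothetical placeholder token, not part of the Lang class


def replace_rare(sentences, min_count=2):
    """Replace words seen fewer than min_count times with UNK."""
    counts = Counter(word for s in sentences for word in s.split(' '))
    return [
        ' '.join(w if counts[w] >= min_count else UNK for w in s.split(' '))
        for s in sentences
    ]


sentences = ["i am cold .", "i am warm .", "i am here ."]
print(replace_rare(sentences))
# ['i am UNK .', 'i am UNK .', 'i am UNK .']
```

Here "cold", "warm", and "here" each occur once, so all three fall below the threshold and collapse to the same token.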
# Turn a Unicode string into plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
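The accent stripping in unicodeToAscii relies on NFD normalization, which splits a precomposed accented character into a base letter plus a combining mark of category 'Mn'. A small standalone illustration using only the standard library (the sample character is chosen here for demonstration):

```python
import unicodedata

s = "\u00e9"  # a single precomposed character: e with acute accent
decomposed = unicodedata.normalize('NFD', s)

# After NFD the accent is a separate combining character of category 'Mn'.
print(len(s), len(decomposed))                         # 1 2
print([unicodedata.category(c) for c in decomposed])   # ['Ll', 'Mn']

# Dropping every 'Mn' character leaves only the base letter.
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)  # e
```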
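To read the data file we will split the file into lines, and then split the lines into pairs. The files are all English → Other Language, so to translate from Other Language → English I added the reverse flag to reverse the pairs.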
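As a rough illustration of the splitting and reversing steps, the sketch below works on an in-memory string instead of the data file. The sample text and the `reverse_pairs` helper are hypothetical and are not taken from the tutorial's code.

```python
# Hypothetical in-memory stand-in for the tab-separated data file.
raw = "I am cold.\tJ'ai froid.\nI am here.\tJe suis ici.\n"

# One line per translation pair, one tab between the two languages.
lines = raw.strip().split('\n')
pairs = [line.split('\t') for line in lines]


def reverse_pairs(pairs):
    """Swap each [source, target] pair to [target, source]."""
    return [list(reversed(p)) for p in pairs]


print(pairs[0])                 # ['I am cold.', "J'ai froid."]
print(reverse_pairs(pairs)[0])  # ["J'ai froid.", 'I am cold.']
```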
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [lis
