Encoder-Decoder models can be applied to machine translation, text summarization, and dialogue systems. Without an AM (attention model), every word in the source sentence X has the same influence on generating a given target word $y_i$; nothing distinguishes them, so each word's own information gets washed out and many details are lost.
The question an AM has to answer: how do we compute the attention probability distribution over the input words? For example, for the sentence "Tom chase Jerry", when translating "Tom" the distribution might be (Tom, 0.6), (chase, 0.2), (Jerry, 0.2).
Take the decoder hidden state from step $i-1$, $H_{i-1}$, and compare it against the encoder hidden state $h_j$ of each input word; this comparison is the alignment function. A weighted sum then gives a context vector $C_t$ that changes with the target word being generated.
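Written out, under the usual formulation (a sketch of the standard equations, not taken verbatim from the book), this is:

$e_{tj} = a(H_{t-1}, h_j), \qquad \alpha_{tj} = \dfrac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}, \qquad C_t = \sum_j \alpha_{tj} h_j$

where $a(\cdot,\cdot)$ is the alignment (scoring) function; a dot product and a small feed-forward network are both common choices.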
Clearly the formula given in the book only says that from $y_{t-1}$, $C_t$ and $S_t$ we can obtain $y_t$; it does not say how they are actually combined, so the conclusion is that it depends on the implementation (different codebases do it differently; there is no single convention). Besides this there are other attention mechanisms: local attention, global attention, and so on.
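For concreteness, here is a minimal, self-contained sketch (hypothetical layer names and toy sizes, not the book's or the tutorial's exact code) of one common way to go from $y_{t-1}$, $C_t$ and $S_{t-1}$ to $S_t$ and $y_t$: embed the previous word, mix it with the context vector, update the decoder state, then project onto the vocabulary.

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, vocab_size = 8, 20                      # toy sizes
embed   = nn.Embedding(vocab_size, hidden_size)
combine = nn.Linear(hidden_size * 2, hidden_size)    # mixes embedding and context
gru     = nn.GRU(hidden_size, hidden_size)
out     = nn.Linear(hidden_size, vocab_size)

y_prev = torch.tensor([[3]])                         # index of y_{t-1}, shape (1, 1)
C_t    = torch.zeros(1, 1, hidden_size)              # context vector from attention
S_prev = torch.zeros(1, 1, hidden_size)              # previous decoder state S_{t-1}

x = torch.cat((embed(y_prev), C_t), dim=2)           # (1, 1, 2*hidden_size)
S_t, _ = gru(F.relu(combine(x)), S_prev)             # new decoder state S_t
y_t = F.log_softmax(out(S_t[0]), dim=1)              # log-probabilities over the vocabulary

The tutorial's AttnDecoderRNN below does essentially this, with dropout added and the attention weights computed inside the same forward pass.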
Enough rambling; let's get back to the official tutorial.
The previous section focused on the model and did not include code; here we use the tutorial's model as-is.
The embedding layer turns a word's one-hot vector into a word embedding vector (untrained, randomly initialized).
The encoder is an RNN: for every input word it outputs a vector and a hidden state, and that hidden state is fed in again at the next time step.
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)  # (1, 1, hidden_size)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
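A quick, hypothetical sanity check of the encoder on a dummy "sentence" (the vocabulary size of 10 and the token indices are made up, just to see the shapes; device is defined the same way as further down):

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

encoder = EncoderRNN(input_size=10, hidden_size=256).to(device)       # 10 = toy vocabulary size
hidden = encoder.initHidden()
dummy_sentence = torch.tensor([[1], [4], [7], [2]], device=device)    # 4 arbitrary token indices
for ei in range(dummy_sentence.size(0)):
    output, hidden = encoder(dummy_sentence[ei], hidden)
print(output.shape, hidden.shape)   # torch.Size([1, 1, 256]) torch.Size([1, 1, 256])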
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # input is the decoder's previous prediction, or the true previous word (teacher forcing)
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),      # (1, 1, max_length)
                                 encoder_outputs.unsqueeze(0))   # (1, max_length, hidden_size)
        # attn_applied: (1, 1, hidden_size); embedded: (1, 1, hidden_size)

        output = torch.cat((embedded[0], attn_applied[0]), 1)  # the embedding and the context vector C
        output = self.attn_combine(output).unsqueeze(0)        # fully connected layer
        output = F.relu(output)                                # feed the combined information into the decoder GRU

        output, hidden = self.gru(output, hidden)              # hidden: (1, 1, self.hidden_size)
        output = F.log_softmax(self.out(output[0]), dim=1)     # gru output shape: (seq_len, batch, features)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
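And a single, hypothetical decoding step to check the attention decoder's shapes (MAX_LENGTH is 10 in the tutorial; output_size=12 and the zero tensors are placeholders standing in for a real vocabulary and real encoder states):

decoder = AttnDecoderRNN(hidden_size=256, output_size=12).to(device)   # 12 = toy target vocabulary size
decoder_input = torch.tensor([[0]], device=device)                     # SOS_token (0 in the tutorial)
decoder_hidden = torch.zeros(1, 1, 256, device=device)                 # stands in for the encoder's last hidden state
encoder_outputs = torch.zeros(MAX_LENGTH, 256, device=device)          # padded encoder outputs, shape (10, 256)
output, hidden, attn_weights = decoder(decoder_input, decoder_hidden, encoder_outputs)
print(output.shape, attn_weights.shape)    # torch.Size([1, 12]) torch.Size([1, 10])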
The differences between F.log_softmax, F.softmax, nn.CrossEntropyLoss and nn.NLLLoss:
Link 1
Link 2
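In short (to the best of my understanding): F.softmax gives probabilities, F.log_softmax gives their logarithms, nn.NLLLoss expects log-probabilities, and nn.CrossEntropyLoss applies log_softmax and NLLLoss in one step. A quick check:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(3, 5)                          # batch of 3 examples, 5 classes
target = torch.tensor([1, 0, 4])

probs     = F.softmax(logits, dim=1)                # rows sum to 1
log_probs = F.log_softmax(logits, dim=1)            # equal to probs.log(), but numerically stabler
loss_nll = nn.NLLLoss()(log_probs, target)          # expects log-probabilities
loss_ce  = nn.CrossEntropyLoss()(logits, target)    # expects raw logits
print(loss_nll.item(), loss_ce.item())              # the two values match

This is why the decoder ends with F.log_softmax and trainIters below uses nn.NLLLoss as its criterion.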
To train we run the input sentence through the encoder, and keep track of every output and the latest hidden state. Then the decoder is given the <SOS> token as its first input, and the last hidden state of the encoder as its first hidden state.
“Teacher forcing” is the concept of using the real target outputs as each next input, instead of using the decoder’s guess as the next input. Using teacher forcing causes it to converge faster but when the trained network is exploited, it may exhibit instability.
It has learned to represent the output grammar and can "pick up" the meaning once the teacher tells it the first few words, but it has not properly learned how to create the sentence from the translation in the first place.
A ratio is used to control how often teacher forcing is applied.
To train, for each pair we will need an input tensor (indexes of the words in the input sentence) and target tensor (indexes of the words in the target sentence). While creating these vectors we will append the EOS token to both sequences.
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the list of word indices for a sentence
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

# Append the EOS token and turn the index list into a column tensor
def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

# Build the (input, target) tensors for a sentence pair
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

sample_pairs = random.choice(pairs)
print(sample_pairs)
input_tensor, target_tensor = tensorsFromPair(sample_pairs)
print('input:', input_tensor)
print('target:', target_tensor)
Result:
['tu es plus grand que moi .', 'you re taller than i am .']
input: tensor([[210], [211], [152], [213], [902], [ 42], [ 5], [ 1]], device='cuda:0')
target: tensor([[ 129], [ 78], [ 150], [1166], [ 2], [ 16], [ 4], [ 1]], device='cuda:0')
This is a helper function to print time elapsed and estimated time remaining given the current time and progress %.
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    # percent = iter / n_iters
    now = time.time()
    s = now - since     # time elapsed so far
    es = s / (percent)  # estimated total time
    rs = es - s         # estimated time remaining
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
%matplotlib inline  # needed in Jupyter
import matplotlib.pyplot as plt
# plt.switch_backend('agg')  # not needed here
import matplotlib.ticker as ticker
import numpy as np
def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)
teacher_forcing_ratio = 0.5

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer,
          decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)    # source sentence length
    target_length = target_tensor.size(0)  # target sentence length

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]  # store each encoder step's output state; [0, 0] == [0][0]

    decoder_input = torch.tensor([[SOS_token]], device=device)  # the decoder's first input is SOS
    decoder_hidden = encoder_hidden                              # the encoder's final hidden state

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: force-feed the target word as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]
    else:
        # No teacher forcing: feed the prediction as the next input
        # topk returns (values, indices); indices are 2-D; decoder_output: (batch, output_size)
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()
            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
Then we call train many times and occasionally print the progress (% of examples, time so far, estimated time) and average loss.
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0
    plot_loss_total = 0

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)  # the original tutorial uses optim.SGD here
    training_pairs = [tensorsFromPair(random.choice(pairs)) for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):  # n_iters: 75000 sampled pairs; dataset size: 10599
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]  # batch = 1

        loss = train(input_tensor, target_tensor, encoder, decoder,
                     encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:  # 5000
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print("%s (%d %d%%) %.4f" % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:  # 100
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)
PyTorch's .detach() is used to cut off backpropagation. I'm not familiar with the PyTorch internals, so I'll skip the details for now.
Link
When training a network you may want to keep some of the parameters fixed and only adjust the rest, or train only one branch of the network without letting its gradients affect the main network; in those cases detach() is used to cut off backpropagation through certain branches.
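A toy illustration (hypothetical, not from the tutorial) of detach() cutting a branch out of the graph:

import torch

x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()            # gradient flows back through this branch
z = (x * 2).detach().sum()   # detach() removes this branch from the graph
(y + z).backward()
print(x.grad)                # tensor([2., 2., 2.]) -- only y contributed

In train() above, topi.squeeze().detach() serves a related purpose: the predicted index fed back as the next decoder input is treated as a constant, so no gradient is propagated through that feedback path.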
Without teacher forcing, the decoding loop runs for up to target_length steps; it can stop early (on EOS), so the maximum is the target sentence length.
In evaluate there is no teacher forcing either, and the loop runs for at most max_length steps.
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)
trainIters(encoder1,attn_decoder1,75000,print_every=5000)
With SGD the learning rate is 0.01; with Adam it is 0.001 (the default).
When I set Adam's learning rate to 0.01 and trained on 5000 pairs, the loss was around 9 and kept growing.
Because the training pairs are sampled at random, the loss values are not strictly comparable between runs.
In these experiments SGD at 0.01 did better than Adam at 0.001; toward the end the Adam loss stops decreasing.
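For reference, switching between the two only means changing the optimizer lines in trainIters (a sketch; encoder and decoder stand for the models built below):

from torch import optim

# SGD at lr=0.01 (what the official tutorial uses):
encoder_optimizer = optim.SGD(encoder.parameters(), lr=0.01)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=0.01)

# Adam at its default lr=0.001:
# encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)
# decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001)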
Adam
2m 27s (- 34m 30s) (5000 6%) 2.5476
4m 50s (- 31m 29s) (10000 13%) 1.8332
7m 13s (- 28m 54s) (15000 20%) 1.5085
9m 36s (- 26m 24s) (20000 26%) 1.2806
11m 59s (- 23m 58s) (25000 33%) 1.1640
14m 22s (- 21m 33s) (30000 40%) 1.0529
16m 46s (- 19m 9s) (35000 46%) 0.9597
19m 9s (- 16m 45s) (40000 53%) 0.8905
21m 33s (- 14m 22s) (45000 60%) 0.8717
23m 57s (- 11m 58s) (50000 66%) 0.8538
26m 21s (- 9m 34s) (55000 73%) 0.7900
28m 43s (- 7m 10s) (60000 80%) 0.7374
31m 6s (- 4m 47s) (65000 86%) 0.7014
33m 30s (- 2m 23s) (70000 93%) 0.7213
35m 54s (- 0m 0s) (75000 100%) 0.6933
saving seq-to-seq model…
SGD
1m 43s (- 24m 13s) (5000 6%) 2.8603
3m 19s (- 21m 36s) (10000 13%) 2.2830
5m 1s (- 20m 5s) (15000 20%) 1.9711
6m 41s (- 18m 22s) (20000 26%) 1.7312
8m 18s (- 16m 37s) (25000 33%) 1.5236
9m 57s (- 14m 56s) (30000 40%) 1.3548
11m 37s (- 13m 16s) (35000 46%) 1.2258
13m 17s (- 11m 37s) (40000 53%) 1.1104
14m 57s (- 9m 58s) (45000 60%) 0.9972
16m 36s (- 8m 18s) (50000 66%) 0.9301
18m 15s (- 6m 38s) (55000 73%) 0.8304
19m 53s (- 4m 58s) (60000 80%) 0.7546
21m 31s (- 3m 18s) (65000 86%) 0.7037
23m 10s (- 1m 39s) (70000 93%) 0.6258
24m 46s (- 0m 0s) (75000 100%) 0.5900
saving seq-to-seq model…
hidden_size = 256
encoder1 = torch.load("./last-encode1.model",map_location=device)
attn_decoder1 = torch.load("./last-attndecoder1.model",map_location=device)
trainIters(encoder1,attn_decoder1,75000,print_every=5000)
1m 40s (- 23m 22s) (5000 6%) 0.6629
3m 15s (- 21m 13s) (10000 13%) 0.5663
4m 50s (- 19m 20s) (15000 20%) 0.5271
6m 25s (- 17m 39s) (20000 26%) 0.4796
8m 1s (- 16m 3s) (25000 33%) 0.4327
9m 38s (- 14m 28s) (30000 40%) 0.3903
11m 13s (- 12m 49s) (35000 46%) 0.3546
12m 48s (- 11m 12s) (40000 53%) 0.3365
14m 25s (- 9m 36s) (45000 60%) 0.3220
16m 0s (- 8m 0s) (50000 66%) 0.2992
17m 36s (- 6m 24s) (55000 73%) 0.3070
19m 12s (- 4m 48s) (60000 80%) 0.3165
20m 47s (- 3m 11s) (65000 86%) 0.3164
22m 23s (- 1m 35s) (70000 93%) 0.3129
23m 58s (- 0m 0s) (75000 100%) 0.2779
saving seq-to-seq model…
np.random.choice with replace=False guarantees no repeated samples; np.random.permutation is another option.
import random
import numpy as np
a = [1,2,3,4,5,6,7,8,9]
# print(random.choice(a,3,replace=False))
print(np.random.choice(a,6,replace=True))
print(np.random.choice(a,6,replace=False))
b = [random.choice(a) for i in range(6)]
print(b)
# Result
[6 3 6 9 1 2]
[8 4 3 1 2 6]
[7, 2, 6, 3, 6, 6]
random.random() returns a random float in the range [0, 1).
The expression True if random.random() < teacher_forcing_ratio else False is what toggles teacher forcing.
The batch size is 1, so no padding is needed; there is no train/test split, no saving and loading of the model, and no beam search.
There is no teacher forcing during evaluation.
Every time it predicts a word we add it to the output string, and if it predicts the EOS token we stop there. We also store the decoder’s attention outputs for display later.
We can evaluate random sentences from the training set and print out the input, target, and output to make some subjective quality judgements:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)
        decoder_hidden = encoder_hidden
        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

evaluateRandomly(encoder1, attn_decoder1)
> encore es tu plus grand que moi .
= you re still taller than me .
< you re taller taller than me . <EOS>

> elle donne une fete ce soir .
= she is giving a party tonight .
< she is a a tonight tonight . <EOS>

> je me fais a nouveau pousser la barbe .
= i m growing a beard again .
< i m growing a again . . <EOS>

> vous etes ma princesse .
= you re my princess .
< you re my princess . <EOS>

> c est un avocat competent .
= he is a capable lawyer .
< he is a lawyer child . <EOS>

> je n en ai pas termine .
= i m not finished .
< i m not done . <EOS>

> je suis trop occupe pour l aider .
= i m too busy to help him .
< i am too busy to help . <EOS>

> tu n es pas seul .
= you re not alone .
< you re not alone . <EOS>

> je suis en train de me concentrer .
= i m concentrating .
< i m concentrating . <EOS>

> je suis vraiment concerne par votre avenir .
= i m really concerned about your future .
< i m really concerned about this . . <EOS>
Comment out the lines where the encoder and decoder are initialized and run trainIters again.
Attention is used to weight specific encoder outputs of the input sequence, so we can imagine looking where the network is focused most at each time step.
You could simply run plt.matshow(attentions) to see attention output displayed as a matrix, with the columns being input steps and rows being output steps
output_words, attentions = evaluate(
encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.cpu().numpy())
def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') + ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()

def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(encoder1, attn_decoder1, input_sentence)
    print('input = ', input_sentence)
    print('output = ', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)

evaluateAndShowAttention("elle a cinq ans de moins que moi .")
evaluateAndShowAttention("elle est trop petit .")
evaluateAndShowAttention("je ne crains pas de mourir .")
evaluateAndShowAttention("c est un jeune directeur plein de talent .")
input = elle a cinq ans de moins que moi .
output = she is five years younger than me .
input = elle est trop petit .
output = she s too skinny .
input = je ne crains pas de mourir .
output = i m not scared of dying .
input = c est un jeune directeur plein de talent .
output = he s a talented writer .