
Transformer-Based Machine Translation in Practice

I. Basic Modules

1. Positional embedding (position embedding)

(1) Why introduce positional embeddings?

        The words in a text sequence are ordered, and the position a word occupies in the sequence matters for understanding its meaning and its relation to the context. A plain word embedding, however, carries no positional information, so a dedicated positional vector is introduced.

(2) How are positional embeddings implemented?

There are two main approaches:

  • Learnable positional embedding: initialize one embedding vector per position, treat it as a model parameter, and keep updating it during training.
  • Absolute positional embedding: the positional vectors are fixed after initialization. They are usually built from trigonometric functions (Sinusoidal Position Encoding), with the following formula:

PE_{k,\,2i} = \sin\!\left(\frac{k}{10000^{2i/d_{model}}}\right) \qquad PE_{k,\,2i+1} = \cos\!\left(\frac{k}{10000^{2i/d_{model}}}\right)

        The sin and cos terms give the 2i-th and (2i+1)-th components of the encoding vector for position k, and d_{model} is the dimension of the positional vector. Both the even and the odd components share the exponent 2i/d_{model}, matching the original Transformer paper and the implementation in modules.py below.

Comparison of the two approaches: published experiments report that absolute and learnable positional embeddings end up performing similarly, but learnable embeddings add extra parameters and training overhead, so this project uses the sinusoidal absolute positional embedding.
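
To make the formula concrete, here is a minimal standalone sketch (not part of the project code) that builds the sinusoidal table for a toy case; it follows the same even/odd split that positional_encoding in modules.py uses below.

```python
# A minimal check of the sinusoidal formula on a toy case.
import numpy as np

d_model, max_len = 8, 4
pos = np.arange(max_len)[:, None]                       # (max_len, 1): positions k
i = np.arange(d_model)[None, :]                         # (1, d_model): component indices
angle = pos / np.power(10000, 2 * (i // 2) / d_model)   # k / 10000^(2i/d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angle[:, 0::2])                    # even components -> sin
pe[:, 1::2] = np.cos(angle[:, 1::2])                    # odd components  -> cos
print(pe.round(3))   # row k is the vector added to the word embedding at position k
```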

2. Masking (mask)

(1) What is the mask for, and when is it needed?

        The mask keeps the model from using information it should not have, which would otherwise look like overfitting: without it, the model can see the whole target sentence during training, so training accuracy climbs quickly while validation accuracy rises at first and then falls.

        The first case is that input sequences have different lengths, so shorter ones are padded with a "pad" token to make all sequences the same length. When computing attention, these "pad" positions must be masked out.

        The second case is that, for training to work, the decoder must not see the whole sentence at once; it may only look at the current position and the positions before it, which can be enforced with a triangular mask matrix.

(2) How the masks are built

         For the first case, first look up the index of "pad" in the vocabulary. Suppose pad = 1 and the batch of index sequences is seq = [[1,2,3],[2,2,2],[1,0,0]]. Build the auxiliary matrix p = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} (every entry is 1 because pad = 1), then compare seq with p element by element: equal positions are set to 1 and all others to 0, which gives the mask matrix (see the short sketch below):

mask = [[1,0,0],[0,0,0],[1,0,0]]
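
As a quick illustration of this padding-mask construction, here is a small PyTorch sketch (my own toy example; pad is assumed to be 1 here to match the numbers above, whereas the project code in modules.py uses pad = 0):

```python
import torch

pad = 1
seq = torch.tensor([[1, 2, 3], [2, 2, 2], [1, 0, 0]])
mask = torch.eq(seq, pad).int()   # 1 where the token equals pad, 0 elsewhere
print(mask)
# tensor([[1, 0, 0],
#         [0, 0, 0],
#         [1, 0, 0]])
```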

        This project uses self-attention, which is where the second case arises. Because it is self-attention, Q = K = V; suppose Q = K = V = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \end{bmatrix}

        According to the attention formula we first need QK^T:

QK^T = \begin{bmatrix} s_1 s_1^T & s_1 s_2^T & s_1 s_3^T & s_1 s_4^T \\ s_2 s_1^T & s_2 s_2^T & s_2 s_3^T & s_2 s_4^T \\ s_3 s_1^T & s_3 s_2^T & s_3 s_3^T & s_3 s_4^T \\ s_4 s_1^T & s_4 s_2^T & s_4 s_3^T & s_4 s_4^T \end{bmatrix}

        When we are at position 2 we should only know s_1 and s_2, not s_3 or s_4, so s_2 s_3^T and s_2 s_4^T cannot legitimately be computed and must be masked out. Reasoning the same way for every row, the mask matrix is:

mask = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}
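
The same idea in code: a minimal sketch (not from the project) that builds the lower-triangular look-ahead mask with torch.tril and applies it to a stand-in score matrix, the way the attention layer later applies its mask.

```python
import torch

seq_len = 4
look_ahead_mask = torch.tril(torch.ones(seq_len, seq_len))   # the matrix shown above
scores = torch.randn(seq_len, seq_len)                       # stand-in for Q @ K.T / sqrt(d_k)
scores = scores.masked_fill(look_ahead_mask == 0, -1e10)     # hide future positions
attn = torch.softmax(scores, dim=-1)                         # row i only attends to positions <= i
print(look_ahead_mask)
```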

II. Code

1. Directory structure

```
Machine_translation
├── data                 # datasets
│   └── eng-fra.txt      # English-French dataset
├── save                 # saved model parameters
├── data_process.py      # data preprocessing
├── decoder.py           # transformer decoder
├── encoder.py           # transformer encoder
├── layer.py             # transformer layers
├── modules.py           # positional embedding, masks, token/index conversion
├── optimizer.py         # dynamic learning rate
├── train.py             # configuration and training
└── transformer.py       # assembling the transformer model
```

2. data_process.py: data preprocessing

Dataset download: see the top of the article.

Text normalization pipeline: lowercase —> convert Unicode to ASCII —> insert a space before punctuation —> strip digits and other invalid characters —> collapse extra spaces.

```python
import unicodedata
import re
import pandas as pd
import torchtext
import torch
from tqdm import tqdm
from sklearn.model_selection import train_test_split


class DataLoader:
    def __init__(self, data_iter):
        self.data_iter = data_iter
        self.length = len(data_iter)  # total number of batches

    def __len__(self):
        return self.length

    def __iter__(self):
        # note: transpose here so that the tensors are batch-first
        for batch in self.data_iter:
            yield (torch.transpose(batch.src, 0, 1), torch.transpose(batch.targ, 0, 1))


# convert a unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')


# normalize a sentence
def normalizeString(s):
    s = s.lower().strip()                  # lowercase
    s = unicodeToAscii(s)
    s = re.sub(r"([.!?])", r" \1", s)      # \1 refers to group(1): insert a space before '.', '!' or '?'
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # replace anything that is not a letter or .!? with a space
    s = re.sub(r'[\s]+', " ", s)           # collapse multiple spaces, e.g. 'abc  aa   bb' -> 'abc aa bb'
    return s


# the file is English->French; we translate French->English, so the pairs are reversed and pair[1] is English
def exchangepairs(pairs):
    # swap the order to get French-English pairs (originally English-French)
    return [[pair[1], pair[0]] for pair in pairs]


def get_dataset(pairs, src, targ):
    fields = [('src', src), ('targ', targ)]  # field info: fields dict[str, Field]
    examples = []                            # list(Example)
    for fra, eng in tqdm(pairs):             # progress bar
        # field.preprocess is called when each Example is created
        examples.append(torchtext.legacy.data.Example.fromlist([fra, eng], fields))
    return examples, fields


def get_datapipe(opt, src, tar):
    # data format: English\tFrench; our source language is French and the target is English
    # opt.data_path matches the -data_path argument defined in train.py
    data_df = pd.read_csv(opt.data_path + 'eng-fra.txt',
                          encoding='UTF-8', sep='\t', header=None,
                          names=['eng', 'fra'], index_col=False)
    pairs = [[normalizeString(s) for s in line] for line in data_df.values]
    pairs = exchangepairs(pairs)
    train_pairs, val_pairs = train_test_split(pairs, test_size=0.2, random_state=1234)
    ds_train = torchtext.legacy.data.Dataset(*get_dataset(train_pairs, src, tar))
    ds_val = torchtext.legacy.data.Dataset(*get_dataset(val_pairs, src, tar))
    train_iter, val_iter = torchtext.legacy.data.Iterator.splits(
        (ds_train, ds_val),
        sort_within_batch=True,
        sort_key=lambda x: len(x.src),
        batch_sizes=(opt.batch_size, opt.batch_size)
    )
    train_dataloader = DataLoader(train_iter)
    val_dataloader = DataLoader(val_iter)
    return train_dataloader, val_dataloader, ds_train
```
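
For reference, a small usage sketch of normalizeString (the example sentence is my own):

```python
from data_process import normalizeString

print(normalizeString("Je suis déjà là!"))   # -> 'je suis deja la !'
```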

3. modules.py: positional embedding, masks, and token/index conversion

```python
import torch
from torch import nn
import numpy as np
from data_process import normalizeString

"""
Positional encoding: there are learnable positional encodings and fixed absolute positional
encodings. Experiments show the two perform similarly, but the absolute encoding adds no
extra parameters. This project uses the sinusoidal absolute encoding: sin and cos give the
2i-th and (2i+1)-th components of the encoding vector for position k.
"""
def positional_encoding(max_seq_len, d_word_vec):
    """
    max_seq_len: sequence length, i.e. the number of tokens
    d_word_vec: dimension of the positional encoding vector
    """
    # compute the position/angle matrix
    pos_enc = np.array(
        [[pos / np.power(10000, 2.0 * (j // 2) / d_word_vec) for j in range(d_word_vec)]
         for pos in range(max_seq_len)])
    pos_enc[:, 0::2] = np.sin(pos_enc[:, 0::2])
    pos_enc[:, 1::2] = np.cos(pos_enc[:, 1::2])
    # add a batch dimension
    pos_enc = pos_enc[np.newaxis, :]  # (max_seq_len, d_word_vec) -> (1, max_seq_len, d_word_vec)
    return torch.tensor(pos_enc, dtype=torch.float32)


"""
Masking
The encoder uses a padding mask.
The decoder uses a look-ahead mask together with a padding mask.
"""
pad = 0  # important: must match the index of the pad token in the vocabulary (check it in train.py)

def create_look_ahead_mask(seq_len):
    look_ahead_mask = torch.tril(torch.ones(seq_len, seq_len), diagonal=0)
    return look_ahead_mask

def create_padding_mask(pad, seq):
    seq = torch.eq(seq, torch.tensor(pad)).float()
    return seq[:, np.newaxis, np.newaxis, :]

# loss that ignores pad positions
def mask_loss_func(real, pred):
    loss_object = torch.nn.CrossEntropyLoss(reduction='none')
    _loss = loss_object(pred.transpose(-1, -2), real)
    # logical_not inverts the boolean tensor
    # mask is False wherever real contains pad
    # mask = torch.logical_not(real.eq(0)).type(_loss.dtype)  # [b, targ_seq_len], if pad == 0
    mask = torch.logical_not(real.eq(pad)).type(_loss.dtype)   # [b, targ_seq_len], general case
    # element-wise product: losses on real tokens are kept, losses on pad positions become 0
    _loss *= mask
    return _loss.sum() / mask.sum().item()

# accuracy that ignores pad positions
# real [b, targ_seq_len]
# pred [b, targ_seq_len, target_vocab_size]
def mask_accuracy_func(real, pred):
    _pred = pred.argmax(dim=-1)  # [b, targ_seq_len, target_vocab_size] => [b, targ_seq_len]
    corrects = _pred.eq(real)    # [b, targ_seq_len], boolean
    # mask is False wherever real contains pad
    # mask = torch.logical_not(real.eq(0))  # [b, targ_seq_len], if pad == 0
    mask = torch.logical_not(real.eq(pad))  # [b, targ_seq_len], general case
    # element-wise product: correctness on real tokens is kept, pad positions are dropped
    corrects *= mask
    return corrects.sum().float() / mask.sum().item()

# inp  [b, inp_seq_len]   already padded
# targ [b, targ_seq_len]  already padded
def create_mask(inp, targ):
    # encoder padding mask
    enc_padding_mask = create_padding_mask(pad, inp)  # => [b,1,1,inp_seq_len], mask = 1 at pad positions
    # decoder's first attention block (self-attention)
    # uses both the padding mask and the look-ahead mask
    look_ahead_mask = create_look_ahead_mask(targ.shape[-1])  # => [targ_seq_len, targ_seq_len]
    dec_targ_padding_mask = create_padding_mask(pad, targ)    # => [b,1,1,targ_seq_len]
    combined_mask = torch.max(look_ahead_mask, dec_targ_padding_mask)  # combines both masks => [b,1,targ_seq_len,targ_seq_len]
    # decoder's second attention block (encoder-decoder attention) uses a padding mask
    # note: this mask hides the pad positions of the encoder output, whose shape (like its input)
    # is [b,inp_seq_len,d_model], so its length is inp_seq_len rather than targ_seq_len
    dec_padding_mask = create_padding_mask(pad, inp)  # => [b,1,1,inp_seq_len], mask = 1 at pad positions
    return enc_padding_mask, combined_mask, dec_padding_mask
    # [b,1,1,inp_seq_len], [b,1,targ_seq_len,targ_seq_len], [b,1,1,inp_seq_len]


"""
Encoding and decoding between tokens and indices
"""
# tokenizer = lambda x: x.split()

# token -> index
def tokenzier_encode(tokenize, sentence, vocab):
    sentence = normalizeString(sentence)   # normalize the sentence
    sentence = tokenize(sentence)          # tokenize: str -> list
    sentence = ['<start>'] + sentence + ['<end>']
    sentence_ids = [vocab.stoi[token] for token in sentence]  # vocab.stoi maps a token to its index in the vocabulary
    return sentence_ids

# index -> token
def tokenzier_decode(sentence_ids, vocab):
    sentence = [vocab.itos[id] for id in sentence_ids if id < len(vocab)]
    return " ".join(sentence)  # join the tokens with spaces
```
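
A quick shape check for create_mask (my own toy tensors; it relies on pad = 0 as set above):

```python
import torch
from modules import create_mask

inp  = torch.tensor([[5, 6, 7, 0, 0]])   # (b=1, inp_seq_len=5), 0 marks padding
targ = torch.tensor([[4, 9, 0, 0]])      # (b=1, targ_seq_len=4)
enc_padding_mask, combined_mask, dec_padding_mask = create_mask(inp, targ)
print(enc_padding_mask.shape)   # torch.Size([1, 1, 1, 5])
print(combined_mask.shape)      # torch.Size([1, 1, 4, 4])
print(dec_padding_mask.shape)   # torch.Size([1, 1, 1, 5])
```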

4. layer.py: transformer layers

The feed-forward sublayer (Feed Forward) consists of two linear layers with a ReLU activation in between; its role is to extract richer features.

```python
import torch
from torch import nn


# multi-head attention layer
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_word_vec, num_heads, dropout):
        """
        d_word_vec: word-vector dimension
        num_heads: number of attention heads
        dropout: value in [0, 1], fraction of neurons randomly zeroed
        """
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_word_vec = d_word_vec
        assert d_word_vec % self.num_heads == 0
        self.wq = nn.Linear(d_word_vec, d_word_vec)
        self.wk = nn.Linear(d_word_vec, d_word_vec)
        self.wv = nn.Linear(d_word_vec, d_word_vec)
        self.final_linear = nn.Linear(d_word_vec, d_word_vec)
        self.dropout = nn.Dropout(dropout)
        # scaling factor sqrt(d_k) used in the attention computation (assumes a GPU is available)
        self.scale = torch.sqrt(torch.FloatTensor([d_word_vec // self.num_heads])).cuda()

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.d_word_vec // self.num_heads)  # (batch_size, seq_len, d_word_vec) -> (batch_size, seq_len, num_heads, depth)
        x = x.permute(0, 2, 1, 3)  # (batch_size, seq_len, num_heads, depth) -> (batch_size, num_heads, seq_len, depth)
        return x

    def forward(self, q, k, v, mask):
        batch_size = q.shape[0]
        # compute Q, K, V
        Q = self.wq(q)  # (batch_size, seq_len, d_word_vec)
        K = self.wk(k)  # (batch_size, seq_len, d_word_vec)
        V = self.wv(v)  # (batch_size, seq_len, d_word_vec)
        # split Q, K, V along d_word_vec into multiple heads
        Q = self.split_heads(Q, batch_size)  # (batch_size, num_heads, seq_len, depth)
        K = self.split_heads(K, batch_size)  # (batch_size, num_heads, seq_len, depth)
        V = self.split_heads(V, batch_size)  # (batch_size, num_heads, seq_len, depth)
        # attention scores
        attention = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale  # (batch_size, num_heads, seq_len, seq_len)
        # masking: if a mask is given, set the attention at positions where the mask is 0 to -1e10
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        attention = self.dropout(torch.softmax(attention, dim=-1))
        # multiply the attention weights with V
        x = torch.matmul(attention, V)  # (batch_size, num_heads, seq_len, depth)
        # concatenate the heads: (batch_size, num_heads, seq_len, depth) -> (batch_size, seq_len, num_heads, depth) -> (batch_size, seq_len, d_word_vec)
        x = x.permute(0, 2, 1, 3).reshape(batch_size, -1, self.d_word_vec)
        return self.final_linear(x)


# position-wise feed-forward network
def point_wise_feed_forward_network(d_word_vec, d_hidden):
    feed_forward_net = nn.Sequential(
        nn.Linear(d_word_vec, d_hidden),
        nn.ReLU(),
        nn.Linear(d_hidden, d_word_vec)
    )
    return feed_forward_net


class EncoderLayer(nn.Module):
    def __init__(self, d_word_vec, num_heads, d_hidden, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.multiheadlayer = MultiHeadAttention(d_word_vec, num_heads, dropout)
        self.ffn = point_wise_feed_forward_network(d_word_vec, d_hidden)
        self.layernorm1 = nn.LayerNorm(normalized_shape=d_word_vec, eps=1e-6)
        self.layernorm2 = nn.LayerNorm(normalized_shape=d_word_vec, eps=1e-6)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.multiheadlayer(x, x, x, mask)  # (batch_size, seq_len, d_word_vec)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(x + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2  # (batch_size, seq_len, d_word_vec)


class DecoderLayer(nn.Module):
    def __init__(self, d_word_vec, num_heads, d_hidden, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.multiheadlayer1 = MultiHeadAttention(d_word_vec, num_heads, dropout)
        self.multiheadlayer2 = MultiHeadAttention(d_word_vec, num_heads, dropout)
        self.ffn = point_wise_feed_forward_network(d_word_vec, d_hidden)
        self.layernorm1 = nn.LayerNorm(normalized_shape=d_word_vec, eps=1e-6)
        self.layernorm2 = nn.LayerNorm(normalized_shape=d_word_vec, eps=1e-6)
        self.layernorm3 = nn.LayerNorm(normalized_shape=d_word_vec, eps=1e-6)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, enc_output, look_ahead_mask, padding_mask):
        # masked self-attention over the decoder input
        attn1 = self.multiheadlayer1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1)
        out1 = self.layernorm1(x + attn1)
        # encoder-decoder attention: queries from the decoder, keys/values from the encoder output
        attn2 = self.multiheadlayer2(out1, enc_output, enc_output, padding_mask)
        attn2 = self.dropout2(attn2)
        out2 = self.layernorm2(out1 + attn2)
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output)
        out3 = self.layernorm3(out2 + ffn_output)
        return out3
```
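
A minimal shape check for these layers (my own sketch; it assumes a CUDA GPU is available because MultiHeadAttention creates its scaling constant with .cuda()):

```python
import torch
from layer import EncoderLayer, DecoderLayer

x = torch.randn(2, 10, 128).cuda()              # (batch, seq_len, d_word_vec)
enc_layer = EncoderLayer(d_word_vec=128, num_heads=8, d_hidden=512).cuda()
enc_out = enc_layer(x, mask=None)               # mask omitted in this sketch
print(enc_out.shape)                            # torch.Size([2, 10, 128])

y = torch.randn(2, 7, 128).cuda()               # decoder input (batch, targ_seq_len, d_word_vec)
look_ahead = torch.tril(torch.ones(7, 7)).cuda()
dec_layer = DecoderLayer(d_word_vec=128, num_heads=8, d_hidden=512).cuda()
dec_out = dec_layer(y, enc_out, look_ahead, None)
print(dec_out.shape)                            # torch.Size([2, 7, 128])
```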

5. encoder.py: the encoder

```python
import torch
from torch import nn
from modules import positional_encoding
from layer import EncoderLayer


class Encoder(nn.Module):
    def __init__(self,
                 num_layers,
                 d_word_vec,
                 num_heads,
                 d_hidden,
                 vocab_size,
                 max_seq_len,
                 dropout=0.1):
        """
        num_layers: number of encoder layers
        vocab_size: source vocabulary size (the language to be translated)
        """
        super(Encoder, self).__init__()
        self.num_layers = num_layers
        self.d_word_vec = d_word_vec
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_word_vec)
        self.pos_encoding = positional_encoding(max_seq_len, d_word_vec)  # (1, max_seq_len, d_word_vec)
        self.enc_layers = nn.ModuleList([EncoderLayer(d_word_vec, num_heads, d_hidden, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        input_seq_len = x.shape[-1]
        # word embedding plus positional embedding
        x = self.embedding(x)  # (batch_size, input_seq_len) -> (batch_size, input_seq_len, d_word_vec)
        x *= torch.sqrt(torch.tensor(self.d_word_vec, dtype=torch.float32))
        pos_enc = self.pos_encoding[:, :input_seq_len, :]  # (1, input_seq_len, d_word_vec)
        pos_enc = pos_enc.cuda()  # assumes a GPU is available
        x += pos_enc
        x = self.dropout(x)  # (batch_size, input_seq_len, d_word_vec)
        # pass through num_layers encoder layers
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, mask)
        return x  # (batch_size, input_seq_len, d_word_vec)
```

6. decoder.py: the decoder

```python
import torch
from torch import nn
from modules import positional_encoding
from layer import DecoderLayer


class Decoder(nn.Module):
    def __init__(self,
                 num_layers,
                 d_word_vec,
                 num_heads,
                 d_hidden,
                 vocab_size,
                 max_seq_len,
                 dropout=0.1):
        """
        vocab_size: target vocabulary size
        """
        super(Decoder, self).__init__()
        self.num_layers = num_layers
        self.d_word_vec = d_word_vec
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_word_vec)
        self.pos_encoding = positional_encoding(max_seq_len, d_word_vec)  # (1, max_seq_len, d_word_vec)
        self.dec_layers = nn.ModuleList([DecoderLayer(d_word_vec, num_heads, d_hidden, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, look_ahead_mask, padding_mask):
        target_seq_len = x.shape[-1]
        # word embedding plus positional embedding
        x = self.embedding(x)  # (batch_size, target_seq_len) -> (batch_size, target_seq_len, d_word_vec)
        x *= torch.sqrt(torch.tensor(self.d_word_vec, dtype=torch.float32))  # (batch_size, target_seq_len, d_word_vec)
        pos_enc = self.pos_encoding[:, :target_seq_len, :]  # (1, target_seq_len, d_word_vec)
        pos_enc = pos_enc.cuda()  # assumes a GPU is available
        x += pos_enc  # (batch_size, target_seq_len, d_word_vec)
        x = self.dropout(x)
        for i in range(self.num_layers):
            x = self.dec_layers[i](x, enc_output, look_ahead_mask, padding_mask)
        return x  # (batch_size, target_seq_len, d_word_vec)
```

7. transformer.py: assembling the transformer model

```python
from torch import nn
from encoder import Encoder
from decoder import Decoder


class Transformer(nn.Module):
    def __init__(self,
                 num_layers,
                 d_word_vec,
                 num_heads,
                 d_hidden,
                 input_vocab_size,
                 target_vocab_size,
                 input_seq_len,
                 target_seq_len,
                 dropout=0.1):
        """
        input_vocab_size: source-language vocabulary size
        target_vocab_size: target-language vocabulary size
        input_seq_len: maximum length of source sequences
        target_seq_len: maximum length of target sequences
        """
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_layers, d_word_vec, num_heads, d_hidden, input_vocab_size, input_seq_len, dropout)
        self.decoder = Decoder(num_layers, d_word_vec, num_heads, d_hidden, target_vocab_size, target_seq_len, dropout)
        self.final_layer = nn.Linear(d_word_vec, target_vocab_size)

    def forward(self, input, target, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(input, enc_padding_mask)
        dec_output = self.decoder(target, enc_output, look_ahead_mask, dec_padding_mask)
        final_output = self.final_layer(dec_output)  # (batch_size, target_seq_len, target_vocab_size)
        return final_output
```
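
A quick end-to-end shape check of the assembled model (my own sketch with made-up sizes; it assumes a CUDA GPU because the layers and positional encodings call .cuda()):

```python
import torch
from transformer import Transformer
from modules import create_mask

model = Transformer(num_layers=2, d_word_vec=128, num_heads=8, d_hidden=512,
                    input_vocab_size=100, target_vocab_size=120,
                    input_seq_len=12, target_seq_len=12).cuda()
inp  = torch.randint(2, 100, (4, 12))   # (batch, inp_seq_len), random token indices
targ = torch.randint(2, 120, (4, 12))   # (batch, targ_seq_len)
enc_mask, combined_mask, dec_mask = create_mask(inp, targ)
out = model(inp.cuda(), targ.cuda(), enc_mask.cuda(), combined_mask.cuda(), dec_mask.cuda())
print(out.shape)   # torch.Size([4, 12, 120])
```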

8. optimizer.py: dynamic learning rate

```python
import torch


class CustomSchedule(torch.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, d_word_vec, warm_steps=4):
        """
        warm_steps: number of warmup steps, i.e. the step at which the learning rate peaks
        """
        self.optimizer = optimizer
        self.d_word_vec = d_word_vec
        self.warmup_steps = warm_steps
        super(CustomSchedule, self).__init__(optimizer)

    # dynamic learning rate: lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    def get_lr(self):
        arg1 = self._step_count ** (-0.5)
        arg2 = self._step_count * (self.warmup_steps ** -1.5)
        dynamic_lr = (self.d_word_vec ** (-0.5)) * min(arg1, arg2)
        return [dynamic_lr for group in self.optimizer.param_groups]
```
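
To see the shape of this schedule, here is a small standalone sketch that evaluates lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5) directly, without an optimizer:

```python
# The learning rate grows roughly linearly for the first `warmup` steps, then decays as step^-0.5.
d_word_vec, warmup = 128, 4000
for step in [1, 1000, 4000, 8000, 20000]:
    lr = (d_word_vec ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
    print(step, round(lr, 6))
```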

9. train.py: training configuration

To start training, simply set 'mode' to 'train' and set 'model_path' to an empty string so that no checkpoint is loaded (the defaults below load a saved checkpoint and run in 'eval' mode).

```python
import torch
import torchtext
import argparse
import pandas as pd
from matplotlib import pyplot as plt
import datetime
import time
import copy
import os
from transformer import Transformer
from optimizer import CustomSchedule
from data_process import get_datapipe
from modules import create_mask, mask_loss_func, mask_accuracy_func, tokenzier_encode, tokenzier_decode

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")


# print a separator line with a timestamp
def printbar():
    nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print('\n' + "==========" * 8 + '%s' % nowtime)


def train_step(model, optimizer, inp, targ):
    # the target is split into targ_inp and targ_real
    # targ_inp is fed to the decoder as input
    # targ_real is the same sequence shifted by one: at every position of targ_inp,
    # targ_real holds the next token that should be predicted
    targ_inp = targ[:, :-1]
    targ_real = targ[:, 1:]
    enc_padding_mask, combined_mask, dec_padding_mask = create_mask(inp, targ_inp)
    inp = inp.to(device)
    targ_inp = targ_inp.to(device)
    targ_real = targ_real.to(device)
    enc_padding_mask = enc_padding_mask.to(device)
    combined_mask = combined_mask.to(device)
    dec_padding_mask = dec_padding_mask.to(device)
    model.train()           # train mode
    optimizer.zero_grad()   # reset gradients
    # forward
    prediction = model(inp, targ_inp, enc_padding_mask, combined_mask, dec_padding_mask)
    # [b, targ_seq_len, target_vocab_size]
    loss = mask_loss_func(targ_real, prediction)
    metric = mask_accuracy_func(targ_real, prediction)
    # backward
    loss.backward()    # backpropagate
    optimizer.step()   # update parameters
    return loss.item(), metric.item()


df_history = pd.DataFrame(columns=['epoch', 'loss', 'acc', 'val_loss', 'val_' + 'acc'])
tokenizer = lambda x: x.split()  # tokenization rule


def validate_step(model, inp, targ):
    targ_inp = targ[:, :-1]
    targ_real = targ[:, 1:]
    enc_padding_mask, combined_mask, dec_padding_mask = create_mask(inp, targ_inp)
    inp = inp.to(device)
    targ_inp = targ_inp.to(device)
    targ_real = targ_real.to(device)
    enc_padding_mask = enc_padding_mask.to(device)
    combined_mask = combined_mask.to(device)
    dec_padding_mask = dec_padding_mask.to(device)
    model.eval()  # eval mode
    with torch.no_grad():
        # forward
        prediction = model(inp, targ_inp, enc_padding_mask, combined_mask, dec_padding_mask)
        val_loss = mask_loss_func(targ_real, prediction)
        val_metric = mask_accuracy_func(targ_real, prediction)
    return val_loss.item(), val_metric.item()


def train_model(model, optimizer, train_dataloader, val_dataloader, model_state):
    opt = model_state['opt']
    starttime = time.time()
    print('*' * 27, 'start training...')
    printbar()
    best_acc = 0.
    for epoch in range(1, opt.max_epochs + 1):
        loss_sum = 0.
        metric_sum = 0.
        for step, (inp, targ) in enumerate(train_dataloader, start=1):
            # inp [64, 10], targ [64, 10]
            loss, metric = train_step(model, optimizer, inp, targ)
            loss_sum += loss
            metric_sum += metric
            # batch-level logging
            if step % opt.print_trainstep_every == 0:
                print('*' * 8, f'[step = {step}] loss: {loss_sum / step:.3f}, {opt.metric_name}: {metric_sum / step:.3f}')
            opt.lr_scheduler.step()  # update the learning rate
        # validate once after each training epoch
        # test(model, train_dataloader)
        val_loss_sum = 0.
        val_metric_sum = 0.
        for val_step, (inp, targ) in enumerate(val_dataloader, start=1):
            # inp [64, 10], targ [64, 10]
            loss, metric = validate_step(model, inp, targ)
            val_loss_sum += loss
            val_metric_sum += metric
        # record the training (and validation) metrics for this epoch
        # record = (epoch, loss_sum/step, metric_sum/step)
        record = (epoch, loss_sum / step, metric_sum / step, val_loss_sum / val_step, val_metric_sum / val_step)
        df_history.loc[epoch - 1] = record
        # epoch-level logging
        print('EPOCH = {} loss: {:.3f}, {}: {:.3f}, val_loss: {:.3f}, val_{}: {:.3f}'.format(
            record[0], record[1], 'acc', record[2], record[3], 'acc', record[4]))
        printbar()
        # save the model
        current_acc_avg = val_metric_sum / val_step  # track the validation metric
        if current_acc_avg > best_acc:  # keep the best model so far
            best_acc = current_acc_avg
            checkpoint = './save/' + '{:03d}_{:.2f}_ckpt.tar'.format(epoch, current_acc_avg)
            if device.type == 'cuda' and opt.ngpu > 1:
                model_sd = copy.deepcopy(model.module.state_dict())
            else:
                model_sd = copy.deepcopy(model.state_dict())
            torch.save({
                'loss': loss_sum / step,
                'epoch': epoch,
                'net': model_sd,
                'opt': optimizer.state_dict(),
                # 'lr_scheduler': lr_scheduler.state_dict()
            }, checkpoint)
    print('finishing training...')
    endtime = time.time()
    time_elapsed = endtime - starttime
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    return df_history


def test(model, dataloader):
    model.eval()  # eval mode
    test_loss_sum = 0.
    test_metric_sum = 0.
    for test_step, (inp, targ) in enumerate(dataloader, start=1):
        # inp [64, 10], targ [64, 10]
        loss, metric = validate_step(model, inp, targ)
        test_loss_sum += loss
        test_metric_sum += metric
    # print the result
    print('*' * 8,
          'Test: loss: {:.3f}, {}: {:.3f}'.format(test_loss_sum / test_step, 'test_acc', test_metric_sum / test_step))


def evaluate(model, inp_sentence, src_vocab, targ_vocab, model_state):
    model.eval()
    opt = model_state['opt']
    inp_sentence_ids: list = tokenzier_encode(tokenizer, inp_sentence, src_vocab)  # tokens -> indices
    encoder_input = torch.tensor(inp_sentence_ids).unsqueeze(dim=0)  # (inp_seq_len) -> (1, inp_seq_len)
    decoder_input = [targ_vocab.stoi['<start>']]  # start decoding from the <start> token
    decoder_input = torch.tensor(decoder_input).unsqueeze(0)  # (1, 1)
    with torch.no_grad():
        for i in range(opt.max_length + 2):
            enc_padding_mask, combined_mask, dec_padding_mask = create_mask(encoder_input.cpu(), decoder_input.cpu())
            # (b, 1, 1, inp_seq_len), (b, 1, targ_seq_len, targ_seq_len), (b, 1, 1, inp_seq_len)
            encoder_input = encoder_input.to(device)
            decoder_input = decoder_input.to(device)
            enc_padding_mask = enc_padding_mask.to(device)
            combined_mask = combined_mask.to(device)
            dec_padding_mask = dec_padding_mask.to(device)
            predictions = model(encoder_input,
                                decoder_input,
                                enc_padding_mask,
                                combined_mask,
                                dec_padding_mask)  # (b, targ_seq_len, target_vocab_size)
            prediction = predictions[:, -1:, :]  # (b, 1, target_vocab_size)
            # torch.argmax returns the index of the maximum along the given dimension,
            # here the most likely vocabulary index for the last predicted token
            prediction_id = torch.argmax(prediction, dim=-1)  # (b, 1)
            if prediction_id.squeeze().item() == targ_vocab.stoi['<end>']:
                return decoder_input.squeeze(dim=0)
            # append the predicted token to decoder_input
            decoder_input = torch.cat([decoder_input, prediction_id], dim=-1)
    return decoder_input.squeeze(dim=0)


def create_model(opt):
    SRC_TEXT = torchtext.legacy.data.Field(sequential=True,
                                           tokenize=tokenizer,
                                           # lower=True,
                                           fix_length=opt.max_length + 2,
                                           preprocessing=lambda x: ['<start>'] + x + ['<end>'],
                                           # preprocessing runs after tokenizing but before numericalizing
                                           # postprocessing runs after numericalizing but before turning numbers into a Tensor
                                           )
    TARG_TEXT = torchtext.legacy.data.Field(sequential=True,
                                            tokenize=tokenizer,
                                            # lower=True,
                                            fix_length=opt.max_length + 2,
                                            preprocessing=lambda x: ['<start>'] + x + ['<end>'],
                                            )
    # build the train and validation pipelines
    train_dataloader, val_dataloader, ds_train = get_datapipe(opt, SRC_TEXT, TARG_TEXT)
    # build the vocabularies
    SRC_TEXT.build_vocab(ds_train)
    TARG_TEXT.build_vocab(ds_train)
    opt.input_vocab_size = len(SRC_TEXT.vocab)    # 3901
    opt.target_vocab_size = len(TARG_TEXT.vocab)  # 2591
    model = Transformer(opt.num_layers,
                        opt.d_word_vec,
                        opt.num_heads,
                        opt.d_hidden,
                        opt.input_vocab_size,
                        opt.target_vocab_size,
                        input_seq_len=opt.input_vocab_size,    # used as an upper bound for the positional-encoding table; any value >= opt.max_length + 2 works
                        target_seq_len=opt.target_vocab_size,
                        dropout=opt.dropout).to(device)
    if opt.ngpu > 1:
        model = torch.nn.DataParallel(model, device_ids=list(range(opt.ngpu)))
    if os.path.exists(opt.model_path):
        ckpt = torch.load(opt.model_path)
        model.load_state_dict(ckpt['net'])
    model_state = {'opt': opt, 'curr_epochs': 0, 'train_steps': 0}
    return model, model_state, train_dataloader, val_dataloader, SRC_TEXT.vocab, TARG_TEXT.vocab


def main(opt):
    model, model_state, train_dataloader, val_dataloader, src_vocab, targ_vocab = create_model(opt)
    print(src_vocab.stoi['pad'])  # check the index used for padding in the vocabulary
    """
    Define the Adam optimizer
    """
    if opt.dynamic_lr:  # use the dynamic learning rate
        optimizer = torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
        opt.lr_scheduler = CustomSchedule(optimizer, opt.d_word_vec, warm_steps=opt.warm_steps)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=opt.lr, betas=(0.9, 0.98), eps=1e-9)
    if opt.mode == 'train':
        # start training
        df_history = pd.DataFrame(columns=['epoch', 'loss', 'acc', 'val_loss', 'val_' + 'acc'])
        df_history = train_model(model, optimizer, train_dataloader, val_dataloader, model_state)
        print(df_history)
    elif opt.mode == 'test':
        # evaluate on a test set; val_dataloader is used as a stand-in here
        print('*' * 8, 'final test...')
        test(model, val_dataloader)
    elif opt.mode == 'eval':
        # batch translation
        sentence_pairs = [
            ['je pars en vacances pour quelques jours .', 'i m taking a couple of days off .'],
            ['je ne me panique pas .', 'i m not panicking .'],
            ['je recherche un assistant .', 'i am looking for an assistant .'],
            ['je suis loin de chez moi .', 'i m a long way from home .'],
            ['vous etes en retard .', 'you re very late .'],
            ['j ai soif .', 'i am thirsty .'],
            ['je suis fou de vous .', 'i m crazy about you .'],
            ['vous etes vilain .', 'you are naughty .'],
            ['il est vieux et laid .', 'he s old and ugly .'],
            ['je suis terrifiee .', 'i m terrified .'],
        ]
        for pair in sentence_pairs:
            print('input:', pair[0])
            print('target:', pair[1])
            pred_result = evaluate(model, pair[0], src_vocab, targ_vocab, model_state)
            pred_sentence = tokenzier_decode(pred_result, targ_vocab)
            print('pred:', pred_sentence)
            print('')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Training Hyperparams')
    # data loading params
    parser.add_argument('-data_path', help='Path to the preprocessed data', default='./data/')
    # network params
    parser.add_argument('-d_word_vec', type=int, default=128)
    parser.add_argument('-num_layers', type=int, default=4)
    parser.add_argument('-num_heads', type=int, default=8)
    parser.add_argument('-d_hidden', type=int, default=512)
    parser.add_argument('-dropout', type=float, default=0.1)
    parser.add_argument('-model_path', default='./save/039_0.85_ckpt.tar', help='load pretrained model parameters if a checkpoint exists')
    # training params
    parser.add_argument('-mode', default='eval')
    parser.add_argument('-ngpu', type=int, default=1)
    parser.add_argument('-dynamic_lr', type=bool, default=True, help='whether to use the dynamic learning rate')
    parser.add_argument('-warm_steps', type=int, default=4000, help='number of steps until the dynamic learning rate peaks')
    parser.add_argument('-lr', type=float, default=0.00001, help='learning rate; ignored when the dynamic learning rate is used')
    parser.add_argument('-batch_size', type=int, default=64)
    parser.add_argument('-max_epochs', type=int, default=40)
    parser.add_argument('-max_length', type=int, default=10, help='set for faster training: sequences are fixed to max_length + 2 tokens')
    parser.add_argument('-input_vocab_size', type=int)
    parser.add_argument('-target_vocab_size', type=int)
    parser.add_argument('-print_trainstep_every', type=int, default=50, help='print a log line every 50 steps')
    parser.add_argument('-metric_name', default='acc')
    opt = parser.parse_args()
    main(opt)
```
