Transformer Classic Models in Practice: Training a Chinese T5 (Text-to-Text Transfer Transformer) Model from Scratch

scient

scient is a Python package that implements algorithms for scientific computing, with modules for natural language, images, neural networks, optimization algorithms, machine learning, graph computing, and more.

The source code and binary installers for the latest released version are available at the Python Package Index:

https://pypi.org/project/scient

You can install scient with pip:

pip install scient

Alternatively, from the scient source directory, run setup.py:

python setup.py install

scient.neuralnet

Neural network module, including attention, transformer, bert, lstm, resnet, crf, dataset, fit, and others.

scient.neuralnet.transformer

Implements several Transformer models: Transformer, T5Transformer, ViTransformer, DecodeTransformer, Encoder, and Decoder.

scient.neuralnet.transformer.T5Transformer(vocab_size: int, seq_len: int = 512, embed_size: int = 768,
										   n_head: int = 12, n_encode_layer: int = 12, n_decode_layer: int = 12, n_bucket: int = 32,
										   max_dist: int = 128, norm_first: bool = True, bias: bool = False, attn_scale: bool = False,
										   **kwargs)

Parameters

  • vocab_size : int
    Vocabulary size.
  • seq_len : int, optional
    Sequence length. The default is 512.
  • embed_size : int, optional
    Length of the embedding vector. The default is 768.
  • n_head : int, optional
    Number of heads in multi_head_attention. The default is 12.
  • n_encode_layer : int, optional
    Number of encoder layers. The default is 12.
  • n_decode_layer : int, optional
    Number of decoder layers. The default is 12.
  • n_bucket : int, optional
    Number of buckets for the relative position encoding in multi_head_attention. The default is 32.
  • max_dist : int, optional
    Maximum distance for the relative position encoding in multi_head_attention. The default is 128.
  • norm_first : bool, optional
    Whether each encoder/decoder layer applies layer normalization first (pre-norm). The default is True.
  • bias : bool, optional
    Whether the layers in the model use a bias term. The default is False.
  • attn_scale : bool, optional
    Whether to scale the attention matrix in multi_head_attention. The default is False.
  • kwargs : other parameters; everything in kwargs is passed through to the Encoder and Decoder layers.

Algorithms

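T5 handles position information with relative position buckets (relative_position_bucket).
In the Encoder, which uses bidirectional attention, the bucketing formula is:

[Formula image not preserved in this copy]

In the Decoder, which uses unidirectional attention, the bucketing formula is:

[Formula image not preserved in this copy]

In these formulas, $n_b$ is the number of relative position buckets (n_bucket) and $max\_distance$ is the maximum relative distance (max_dist).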
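
Since the formula images are missing here, the following is a minimal sketch of the bucketing scheme as it appears in the publicly available reference T5 implementations. It is an assumption that scient's BucketPosition matches this term for term; the function name and the convention that relative_position means key position minus query position are illustrative choices, not taken from scient.

```python
import math

def relative_position_bucket(relative_position, bidirectional=True,
                             n_bucket=32, max_dist=128):
    """Map a relative position (key index minus query index) to a bucket id.

    Sketch of the standard T5 scheme; not taken from the scient source.
    """
    bucket = 0
    if bidirectional:
        # Encoder: half of the buckets for keys after the query, half for keys before it.
        n_bucket //= 2
        if relative_position > 0:
            bucket += n_bucket
        distance = abs(relative_position)
    else:
        # Decoder: only keys at or before the query are visible.
        distance = max(-relative_position, 0)

    # The first half of the available buckets covers small distances exactly.
    max_exact = n_bucket // 2
    if distance < max_exact:
        return bucket + distance

    # The second half covers distances up to max_dist on a logarithmic scale.
    log_bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_dist / max_exact)
        * (n_bucket - max_exact)
    )
    # Everything at or beyond max_dist falls into the last bucket.
    return bucket + min(log_bucket, n_bucket - 1)

for rel in (-200, -3, 0, 3, 200):
    print(rel, relative_position_bucket(rel), relative_position_bucket(rel, bidirectional=False))
```

Nearby tokens each get their own bucket, while distant tokens share progressively coarser buckets, so the model only needs n_bucket learned biases per attention head.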

T5 model structure

T5Transformer(
  (encoder_position): BucketPosition(
    (projection): Embedding(32, 12)
  )
  (decoder_position): BucketPosition(
    (projection): Embedding(32, 12)
  )
  (embedding): Embedding(32128, 768)
  (encoder): ModuleList(
    (0-11): 12 x Encoder(
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=False)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=3072, out_features=768, bias=False)
      )
      (layernorm1): T5LayerNorm()
      (layernorm2): T5LayerNorm()
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
  (decoder): ModuleList(
    (0-11): 12 x Decoder(
      (mask_multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=False)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=3072, out_features=768, bias=False)
      )
      (layernorm1): T5LayerNorm()
      (layernorm2): T5LayerNorm()
      (layernorm3): T5LayerNorm()
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
      (dropout3): Dropout(p=0.1, inplace=False)
    )
  )
  (encoder_layernorm): T5LayerNorm()
  (decoder_layernorm): T5LayerNorm()
  (linear): Linear(in_features=768, out_features=32128, bias=False)
)
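
As a rough sanity check on model size, the parameter count can be estimated from the shapes in the printed structure. This is a back-of-the-envelope sketch: it assumes each T5LayerNorm holds a single scale vector of length embed_size and that the output linear layer does not share weights with the embedding, neither of which is confirmed by the printout.

```python
# Shapes read off the printed structure above.
vocab_size, embed_size, ffn_size = 32128, 768, 3072
n_head, n_bucket, n_layer = 12, 32, 12

attention = 4 * embed_size * embed_size   # query, key, value, linear; bias=False
feedforward = 2 * embed_size * ffn_size   # two bias-free Linear layers
layernorm = embed_size                    # ASSUMPTION: one scale vector per T5LayerNorm

encoder_layer = attention + feedforward + 2 * layernorm
decoder_layer = 2 * attention + feedforward + 3 * layernorm

total = (
    2 * n_bucket * n_head                 # encoder_position + decoder_position
    + vocab_size * embed_size             # embedding
    + n_layer * (encoder_layer + decoder_layer)
    + 2 * layernorm                       # encoder_layernorm + decoder_layernorm
    + embed_size * vocab_size             # output linear, ASSUMED not tied to the embedding
)
print(f"{total:,}")
```

Under these assumptions the model has roughly 250 million parameters, most of them in the 24 encoder and decoder layers.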

Examples

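The code below trains a model that rewrites a sentence without changing its meaning. For example, “鹿跳过篱笆。” ("A deer jumps over the fence.") can be rewritten as “一只鹿跳过篱笆。” ("One deer jumps over the fence.").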

import torch
from scient.neuralnet import transformer,fit
from scient.neuralnet import dataset
import sentencepiece
import pandas
from tqdm import tqdm

tqdm.pandas()

data_path='d:\\rewrite_train3.xlsx'
tokenizer_path='d:\\spiece.model'

#%%model
vocab_size=32128
seq_len_upper=32

tokenizer=sentencepiece.SentencePieceProcessor(tokenizer_path)
model=transformer.T5Transformer(vocab_size=vocab_size,dropout=0.1,ffn_size=3072)

#%% data
data=pandas.read_excel(data_path)

#tokenize
data['source_token']=data['input'].progress_apply(lambda x:tokenizer.encode(x))
data['target_token']=data['label'].progress_apply(lambda x:tokenizer.encode(x))

#filter: keep only pairs whose source and target are both shorter than seq_len_upper tokens
#.copy() makes the column assignments below write to an independent frame rather than a slice
data=data[(data['source_token'].apply(len)<seq_len_upper)&(data['target_token'].apply(len)<seq_len_upper)].copy()

#truncate and add special tokens
data['source_token']=data['source_token'].progress_apply(lambda x:x[:seq_len_upper]+[tokenizer.eos_id()])#append <eos>
data['target_input_token']=data['target_token'].progress_apply(lambda x:[tokenizer.pad_id()]+x[:seq_len_upper])#prepend <bos>; pad_id is used as <bos> here
data['target_output_token']=data['target_token'].progress_apply(lambda x:x[:seq_len_upper]+[tokenizer.eos_id()])#append <eos>

#mask: True marks padding positions
data['source_pad_mask']=data['source_token'].progress_apply(lambda x:[False]*len(x)+[True]*(seq_len_upper-len(x)))
data['target_pad_mask']=data['target_input_token'].progress_apply(lambda x:[False]*len(x)+[True]*(seq_len_upper-len(x)))

#pad every sequence to seq_len_upper
data['source_token']=data['source_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_input_token']=data['target_input_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_output_token']=data['target_output_token'].progress_apply(lambda x:x+[tokenizer.pad_id()]*(seq_len_upper-len(x)))

batch_size=8
#data loaders: 70% train, 21% eval (70% of the remainder), 9% val
data_train=data.sample(frac=0.7)
data_eval=data.drop(data_train.index).sample(frac=0.7)
data_val=data.drop(data_train.index).drop(data_eval.index)
train_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_train,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=batch_size,shuffle=True)
eval_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_eval,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=batch_size,shuffle=False)
val_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_val,tensor_vars=['source_token','target_input_token','source_pad_mask','target_pad_mask'],target_var='target_output_token'),batch_size=1,shuffle=False)
#%% training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

#loss function; ignore_index=0 skips padding positions (this assumes tokenizer.pad_id() is 0)
loss_func_ = torch.nn.CrossEntropyLoss(ignore_index=0)
def loss_func(y_hat,y):
    return loss_func_(y_hat.reshape(-1, vocab_size),y.reshape(-1).to(torch.int64).to(device))  # compute the loss

#optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

def perform_func(y_hat,y):#perform_func takes the predictions y_hat and the ground truth y
    #y_hat and y are collected batch by batch from the loader, so each is a list of batch-sized tensors; concatenate them first
    y_hat,y=torch.concat(y_hat).reshape(-1, vocab_size).numpy(),torch.concat(y).reshape(-1).numpy()
    y_hat=y_hat.argmax(axis=1)
    y_hat=y_hat[y!=0]#drop padding positions
    y=y[y!=0]
    return round((y_hat==y).sum()/len(y),4)#token-level accuracy, rounded to 4 decimal places

model=fit.set(model,optimizer=optimizer,loss_func=loss_func,perform_func=perform_func,device=device,n_iter=5)
model.fit(train_loader,eval_loader,mode=('inputs','target'))
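
To make the data preparation steps concrete, here is a small worked example on toy token ids. It is only a sketch: PAD_ID = 0 and EOS_ID = 1 are assumed values standing in for tokenizer.pad_id() and tokenizer.eos_id(), the sequence length is shortened to 8, and the helper names are hypothetical.

```python
PAD_ID, EOS_ID = 0, 1      # ASSUMED ids, standing in for tokenizer.pad_id() / eos_id()
SEQ_LEN = 8                # shortened stand-in for seq_len_upper

def pad(ids):
    """Right-pad a token list to SEQ_LEN with PAD_ID."""
    return ids + [PAD_ID] * (SEQ_LEN - len(ids))

def pad_mask(ids):
    """True marks the positions that padding will fill."""
    return [False] * len(ids) + [True] * (SEQ_LEN - len(ids))

def prepare(source_ids, target_ids):
    source = source_ids + [EOS_ID]           # encoder input ends with <eos>
    target_input = [PAD_ID] + target_ids     # decoder input starts with <bos> (pad id reused)
    target_output = target_ids + [EOS_ID]    # decoder target is the input shifted left by one
    return {
        "source_token": pad(source),
        "source_pad_mask": pad_mask(source),
        "target_input_token": pad(target_input),
        "target_pad_mask": pad_mask(target_input),
        "target_output_token": pad(target_output),
    }

example = prepare([11, 12, 13], [21, 22, 23, 24])
for name, value in example.items():
    print(name, value)
```

At every position the decoder sees target_input_token up to that position and is trained to predict the token at the same index in target_output_token, which is why the two sequences are offset by one.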

The tokenizer model spiece.model and the training data rewrite_train3.xlsx used in the code can be downloaded here:
Link: https://pan.baidu.com/s/12vEZBYldXvPrJTiFUEKGUw?pwd=DTFM
Extraction code: DTFM

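After 5 training epochs, the token-level accuracy reported by perform_func (padding positions excluded) exceeds 99% on both the training set and the evaluation set.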

train iter 0: avg_batch_loss=3.88477 perform=0.5023: 100%|██████████| 140/140 [06:43<00:00,  2.88s/it]    
eval iter 0: avg_batch_loss=0.56695 perform=0.8973: 100%|██████████| 42/42 [00:28<00:00,  1.47it/s]    
train iter 1: avg_batch_loss=0.27674 perform=0.9539: 100%|██████████| 140/140 [08:02<00:00,  3.45s/it]    
eval iter 1: avg_batch_loss=0.08557 perform=0.9808: 100%|██████████| 42/42 [00:46<00:00,  1.10s/it]    
train iter 2: avg_batch_loss=0.05592 perform=0.9897: 100%|██████████| 140/140 [09:33<00:00,  4.10s/it]    
eval iter 2: avg_batch_loss=0.01999 perform=0.9957: 100%|██████████| 42/42 [00:28<00:00,  1.45it/s]    
train iter 3: avg_batch_loss=0.02244 perform=0.9964: 100%|██████████| 140/140 [07:58<00:00,  3.42s/it]    
eval iter 3: avg_batch_loss=0.01343 perform=0.996: 100%|██████████| 42/42 [00:32<00:00,  1.31it/s]     
train iter 4: avg_batch_loss=0.01273 perform=0.9981: 100%|██████████| 140/140 [07:44<00:00,  3.32s/it]    
eval iter 4: avg_batch_loss=0.01047 perform=0.9977: 100%|██████████| 42/42 [00:29<00:00,  1.41it/s]    
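
The perform value in the log is a token-level accuracy that ignores padding. A minimal pure-Python sketch of the same calculation, with toy ids and a hypothetical function name, looks like this:

```python
def masked_accuracy(pred_ids, target_ids, pad_id=0):
    """Share of non-padding target positions where the predicted id matches."""
    pairs = [(p, t) for p, t in zip(pred_ids, target_ids) if t != pad_id]
    correct = sum(p == t for p, t in pairs)
    return round(correct / len(pairs), 4)

target = [21, 22, 23, 24, 1, 0, 0, 0]   # ends with <eos>=1, then padding
pred = [21, 22, 99, 24, 1, 5, 5, 5]     # one wrong token; padding positions are ignored
print(masked_accuracy(pred, target))
```

Because the decoder is fed the true previous tokens during training and evaluation, this number measures next-token accuracy, not whether a whole generated sentence matches the label.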

Use the trained model to predict on the data_val set:

#%%
# validation
model.eval()
progressbar = tqdm(val_loader)#batch_size must be 1 here
preds=[]
with torch.no_grad():
    for index,((source,target_input,source_pad_mask,target_input_pad_mask),target_output) in enumerate(progressbar):
        memory=model.encode(source.to(torch.int64).to(device),source_pad_mask.to(device))
        pred=torch.tensor([[tokenizer.pad_id()]])#<bos>; pad_id doubles as <bos>, as in training
        while True:
            pred_mask=torch.zeros_like(pred).to(torch.bool)
            decode = model.decode(pred.to(torch.int64).to(device),memory,target_pad_mask=pred_mask.to(device))
            output=model.linear(decode)
            _,ids = output.max(dim=-1)
            if ids[0,-1]==tokenizer.eos_id():#stop at <eos>
                break
            if pred.size(1)>seq_len_upper-1:#stop at the maximum length
                break
            pred=torch.cat([pred.to(device),ids[:,-1:]],dim=-1)
        preds+=pred.tolist()

data_val['target_output_pred']=preds
data_val['target_pred']=data_val['target_output_pred'].progress_apply(lambda x:tokenizer.decode(x))
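
The prediction loop above is greedy decoding: start from <bos>, repeatedly append the most likely next token, and stop at <eos> or at the length limit. The sketch below isolates that control flow from the model. The next_token callback and the scripted toy "model" are hypothetical stand-ins for the model.decode, model.linear, and argmax steps.

```python
def greedy_decode(next_token, bos_id, eos_id, max_len):
    """Generate token ids one at a time until <eos> or max_len is reached."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        token = next_token(tokens)   # most likely next id given the prefix so far
        if token == eos_id:
            break
        tokens.append(token)
    return tokens

# Toy stand-in for the model: always emits the same scripted sequence.
script = [21, 22, 23, 1]

def toy_next_token(prefix):
    return script[len(prefix) - 1]

print(greedy_decode(toy_next_token, bos_id=0, eos_id=1, max_len=32))
print(greedy_decode(toy_next_token, bos_id=0, eos_id=1, max_len=3))
```

As in the loop above, the whole prefix is passed to the model at every step, so the cost of generating a sentence grows with the square of its length.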

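Prediction results

[Figure: table of prediction results with the columns input, label, and target_pred]

input is the model input, label is the expected output, and target_pred is what the model actually produced. The model output largely matches the expected output.
The first row is worth noting. The model rewrote
机构认为,随着经济数据及上市公司财报的披露,预计市场主线将逐渐清晰,中大盘成长股有望成为下一阶段的资金偏好。
("Institutions believe that as economic data and listed companies' financial reports are released, the main market theme should gradually become clear, and mid- and large-cap growth stocks are likely to be favored by capital in the next stage.")
as
他们说,第二季度的收益报告将是给予投资者这种指导的关键。
("They said that second-quarter earnings reports will be key to giving investors this kind of guidance.")
Although this differs considerably from the expected output, the meaning of the model's output is still entirely correct. Could this be creativity emerging from the model?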
