scient is a Python package implementing algorithms for scientific computing, with modules for natural language processing, image processing, neural networks, optimization algorithms, machine learning, graph computation, and more.
The source code and binary installers for the latest released version are available at the Python Package Index:
https://pypi.org/project/scient
You can install scient with pip:
pip install scient
Or, in the scient source directory, execute:
python setup.py install
The neuralnet module provides neural-network algorithms, including attention, transformer, bert, lstm, resnet, crf, dataset, fit, and others.
Several Transformer models are implemented: Transformer, T5Transformer, ViTransformer, DecodeTransformer, Encoder, and Decoder.
scient.neuralnet.transformer.DecodeTransformer(vocab_size: int, seq_len: int = 512, embed_size: int = 512,
                                               n_head: int = 8, n_layer: int = 6, **kwargs)
GPT stands for Generative Pretrained Transformer.
Current Transformer models fall into four broad architectures: encoder-decoder, encoder-only, causal decoder (decoder-only), and prefix decoder. DecodeTransformer follows the decoder-only (causal) design, as sketched below.
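In a causal decoder, each position may attend only to itself and earlier positions. A minimal sketch of such a mask in plain PyTorch (independent of scient's internals, which presumably build this mask themselves):

import torch

# True marks future positions that must not be attended to.
seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])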
DecodeTransformer model structure:
DecodeTransformer(
  (position): SinePosition()
  (embedding): Embedding(32128, 512)
  (decoder): ModuleList(
    (0-5): 6 x Encoder(
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=512, out_features=512, bias=True)
        (key): Linear(in_features=512, out_features=512, bias=True)
        (value): Linear(in_features=512, out_features=512, bias=True)
        (linear): Linear(in_features=512, out_features=512, bias=True)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=512, out_features=2048, bias=True)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=2048, out_features=512, bias=True)
      )
      (layernorm1): LayerNorm((512,), eps=1e-09, elementwise_affine=True)
      (layernorm2): LayerNorm((512,), eps=1e-09, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
  (linear): Linear(in_features=512, out_features=32128, bias=True)
)
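The listing above is simply the model's printed representation; assuming the constructor defaults shown earlier, it can be reproduced by instantiating the model and printing it:

from scient.neuralnet import transformer

# vocab_size=32128 matches the SentencePiece tokenizer used in the training example below;
# dropout=0.1 is passed explicitly, as in the training code.
model = transformer.DecodeTransformer(vocab_size=32128, dropout=0.1)
print(model)  # prints the module tree shown above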
The following example trains a model that continues a short prefix into a complete sentence: for instance, given the input "鹿" (deer), the model outputs "鹿跳过篱笆。" (the deer jumps over the fence).
import torch
from scient.neuralnet import transformer, fit
from scient.neuralnet import dataset
import sentencepiece
import pandas
from tqdm import tqdm

tqdm.pandas()

data_path = 'd:\\rewrite_train3.xlsx'
tokenizer_path = 'd:\\spiece.model'

#%% model
vocab_size = 32128
seq_len_upper = 32
tokenizer = sentencepiece.SentencePieceProcessor(tokenizer_path)
model = transformer.DecodeTransformer(vocab_size=vocab_size, dropout=0.1)

#%% data
data = pandas.read_excel(data_path)
# clean
data = data.dropna(how='any')
data['label'] = data['label'].progress_apply(lambda x: x.replace("\n", "\\n").replace("\t", "\\t"))
# tokenize
data['target_token'] = data['label'].progress_apply(lambda x: tokenizer.encode(x))
# drop samples that are too long
data = data[(data['target_token'].apply(len) < seq_len_upper)]
# truncate: the decoder input starts with a pad token, the target ends with eos
data['target_input_token'] = data['target_token'].progress_apply(lambda x: [tokenizer.pad_id()] + x[:seq_len_upper])
data['target_output_token'] = data['target_token'].progress_apply(lambda x: x[:seq_len_upper] + [tokenizer.eos_id()])
# padding mask
data['target_pad_mask'] = data['target_input_token'].progress_apply(lambda x: [False]*len(x) + [True]*(seq_len_upper-len(x)))
# pad to fixed length
data['target_input_token'] = data['target_input_token'].progress_apply(lambda x: x + [tokenizer.pad_id()]*(seq_len_upper-len(x)))
data['target_output_token'] = data['target_output_token'].progress_apply(lambda x: x + [tokenizer.pad_id()]*(seq_len_upper-len(x)))

batch_size = 8
# data loaders
data_train = data.sample(frac=0.7)
data_eval = data.drop(data_train.index).sample(frac=0.7)
data_val = data.drop(data_train.index).drop(data_eval.index)
train_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_train, tensor_vars=['target_input_token', 'target_pad_mask'], target_var='target_output_token'), batch_size=batch_size, shuffle=True)
eval_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_eval, tensor_vars=['target_input_token', 'target_pad_mask'], target_var='target_output_token'), batch_size=batch_size, shuffle=False)
val_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_val, tensor_vars=['target_input_token', 'target_pad_mask'], target_var='target_output_token'), batch_size=1, shuffle=False)

#%% training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

# loss function
loss_func_ = torch.nn.CrossEntropyLoss(ignore_index=0)
def loss_func(y_hat, y):
    return loss_func_(y_hat.reshape(-1, vocab_size), y.reshape(-1).to(torch.int64).to(device))  # compute the loss

# optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

def perform_func(y_hat, y):  # perform_func receives the predictions y_hat and the targets y
    # y_hat and y are computed and collected batch by batch from the loader, so each is a list of
    # batch-sized objects; concat them first
    y_hat, y = torch.concat(y_hat).reshape(-1, vocab_size).numpy(), torch.concat(y).reshape(-1).numpy()
    y_hat = y_hat.argmax(axis=1)
    y_hat = y_hat[y != 0]
    y = y[y != 0]
    return round((y_hat == y).sum()/len(y), 4)  # accuracy, rounded to 4 decimal places

model = fit.set(model, optimizer=optimizer, loss_func=loss_func, perform_func=perform_func, device=device, n_iter=5)
model.fit(train_loader, eval_loader, mode=('inputs', 'target'))
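The key preprocessing step is the one-position shift between decoder input and target: the input is the token sequence with pad_id prepended, the target is the same sequence with eos_id appended, so at every position the model learns to predict the next token. A toy illustration (the id values 0 and 1 are assumed purely for the example):

# Hypothetical ids, for illustration only: pad_id=0, eos_id=1
pad_id, eos_id = 0, 1
target_token = [17, 254, 9]                    # tokenized sentence

target_input_token = [pad_id] + target_token   # [0, 17, 254, 9]
target_output_token = target_token + [eos_id]  # [17, 254, 9, 1]

# Position by position, the model is trained to map:
#   0   -> 17   (start-of-sequence predicts the first token)
#   17  -> 254
#   254 -> 9
#   9   -> 1    (last token predicts end-of-sequence)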
The tokenizer model spiece.model and the training data rewrite_train3.xlsx used in the code can be downloaded here:
Link: https://pan.baidu.com/s/12vEZBYldXvPrJTiFUEKGUw?pwd=DTFM
Access code: DTFM
After 5 training epochs, the model's accuracy already exceeds 85%:
train iter 0: avg_batch_loss=4.81741 perform=0.3631: 100%|██████████| 140/140 [00:42<00:00, 3.31it/s]
eval iter 0: avg_batch_loss=1.68011 perform=0.7351: 100%|██████████| 42/42 [00:03<00:00, 10.60it/s]
train iter 1: avg_batch_loss=1.14753 perform=0.7976: 100%|██████████| 140/140 [00:37<00:00, 3.78it/s]
eval iter 1: avg_batch_loss=0.77367 perform=0.8287: 100%|██████████| 42/42 [00:03<00:00, 10.95it/s]
train iter 2: avg_batch_loss=0.74214 perform=0.8348: 100%|██████████| 140/140 [00:37<00:00, 3.76it/s]
eval iter 2: avg_batch_loss=0.64296 perform=0.8475: 100%|██████████| 42/42 [00:03<00:00, 11.36it/s]
train iter 3: avg_batch_loss=0.64953 perform=0.8459: 100%|██████████| 140/140 [00:37<00:00, 3.74it/s]
eval iter 3: avg_batch_loss=0.60818 perform=0.8495: 100%|██████████| 42/42 [00:03<00:00, 11.05it/s]
train iter 4: avg_batch_loss=0.60935 perform=0.8478: 100%|██████████| 140/140 [00:37<00:00, 3.74it/s]
eval iter 4: avg_batch_loss=0.57814 perform=0.8553: 100%|██████████| 42/42 [00:03<00:00, 10.93it/s]
Use the trained model to generate predictions on the data_val set:
#%% validation
model.eval()
progressbar = tqdm(val_loader)  # batch_size must be 1 here
preds = []
with torch.no_grad():
    for index, ((target_input, target_input_pad_mask), target_output) in enumerate(progressbar):
        # break
        pred = target_input[:, :3]  # use only the first three tokens as the prompt
        while True:
            pred_mask = torch.zeros_like(pred).to(torch.bool)
            output = model(pred.to(device), pred_mask.to(device))
            _, ids = output.max(dim=-1)
            if ids[0, -1] == tokenizer.eos_id():  # stop at end-of-sequence
                break
            if pred.size(1) > seq_len_upper - 1:  # stop at the maximum length
                break
            pred = torch.cat([pred.to(device), ids[:, -1:]], dim=-1)
        preds += pred.tolist()
data_val['target_output_pred'] = preds
data_val['target_pred'] = data_val['target_output_pred'].progress_apply(lambda x: tokenizer.decode(x))
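The same greedy loop can be applied to an arbitrary prompt instead of rows from val_loader. A minimal sketch, assuming the tokenizer and trained model from above and the decoder-input format used in training (pad_id prepended); the prompt string is only an example:

# Hypothetical prompt, for illustration; any short prefix works.
prompt = '鹿'
pred = torch.tensor([[tokenizer.pad_id()] + tokenizer.encode(prompt)]).to(device)

model.eval()
with torch.no_grad():
    while pred.size(1) <= seq_len_upper - 1:
        pred_mask = torch.zeros_like(pred).to(torch.bool)
        output = model(pred, pred_mask)
        next_id = output[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id[0, 0] == tokenizer.eos_id():  # stop at end-of-sequence
            break
        pred = torch.cat([pred, next_id], dim=-1)

print(tokenizer.decode(pred[0].tolist()))  # ideally something like "鹿跳过篱笆。"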
Prediction results:
[screenshot: data_val with the label and target_pred columns]
label is the content the model is expected to output; only the first three characters of each sentence are fed to the model, and target_pred is what the model generates. The generated outputs largely match the expected ones.
The first row is worth noting. Given only the three characters 当泰国 ("When Thailand"), the model outputs:
当泰国抗议活动升温时,岩石、催泪瓦斯将被指控,有7700万信徒。
Roughly, this reads "As protests in Thailand escalate, rocks and tear gas will be accused, with 77 million believers." Although it is far from the expected output, and the model clearly does not grasp that rocks and tear gas cannot be "accused", the generated sentence is nonetheless fully grammatical.