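scient is a Python package that implements algorithms for scientific computing, with modules for natural language, images, neural networks, optimization algorithms, machine learning, and graph computation.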
The source code and binary installers for the latest released version are available at the Python Package Index:
https://pypi.org/project/scient
You can install scient with pip:
pip install scient
Alternatively, run setup.py from the scient source directory:
python setup.py install
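The neural network module (scient.neuralnet) includes attention, transformer, bert, lstm, resnet, crf, dataset, fit, and others.
The transformer submodule implements several Transformer models: Transformer, T5Transformer, ViTransformer, DecodeTransformer, Encoder, and Decoder.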
scient.neuralnet.transformer.T5Transformer(vocab_size: int, seq_len: int = 512, embed_size: int = 768,
n_head: int = 12, n_encode_layer: int = 12, n_decode_layer: int = 12, n_bucket: int = 32,
max_dist: int = 128, norm_first: bool = True, bias: bool = False, attn_scale: bool = False,
**kwargs)
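T5 handles position encoding with relative position bucketing (relative_position_bucket).
The Encoder, which uses bidirectional attention, and the Decoder, which uses unidirectional attention, each have their own bucketing formula.
In both, $n_b$ is the number of relative position buckets (n_bucket) and $max\_distance$ is the maximum distance covered by the relative position encoding (max_dist).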
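As a minimal sketch of the bucketing idea, the function below assumes that scient follows the same scheme as the publicly available T5 implementations: small offsets get one bucket each, larger offsets share logarithmically sized buckets up to max_dist, and everything beyond that falls into the last bucket. The function name and the sign convention (rel_pos = key position minus query position) are illustrative assumptions, not scient's API.

```python
import math


def relative_position_bucket(rel_pos, bidirectional=True, n_bucket=32, max_dist=128):
    """Map one relative position (key index - query index) to a bucket id.

    Hypothetical helper for illustration; not part of scient.
    """
    bucket = 0
    if bidirectional:
        # Encoder: half of the buckets for keys after the query, half for keys before.
        n_bucket //= 2
        if rel_pos > 0:
            bucket += n_bucket
        rel_pos = abs(rel_pos)
    else:
        # Decoder: only keys at or before the query matter; future keys map to 0.
        rel_pos = max(-rel_pos, 0)

    max_exact = n_bucket // 2
    if rel_pos < max_exact:
        # Small distances: one bucket per exact offset.
        return bucket + rel_pos

    # Larger distances: logarithmically spaced buckets up to max_dist.
    large = max_exact + int(
        math.log(rel_pos / max_exact) / math.log(max_dist / max_exact) * (n_bucket - max_exact)
    )
    return bucket + min(large, n_bucket - 1)


if __name__ == "__main__":
    for offset in (-200, -3, -1, 0, 1, 1000):
        print(offset,
              relative_position_bucket(offset, bidirectional=True),
              relative_position_bucket(offset, bidirectional=False))
```

With n_bucket=32 the bucket ids stay in the range 0 to 31, which matches the Embedding(32, 12) projection (32 buckets, one bias per attention head) in the printed model structure.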
T5 model structure
T5Transformer(
  (encoder_position): BucketPosition(
    (projection): Embedding(32, 12)
  )
  (decoder_position): BucketPosition(
    (projection): Embedding(32, 12)
  )
  (embedding): Embedding(32128, 768)
  (encoder): ModuleList(
    (0-11): 12 x Encoder(
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=False)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=3072, out_features=768, bias=False)
      )
      (layernorm1): T5LayerNorm()
      (layernorm2): T5LayerNorm()
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
  (decoder): ModuleList(
    (0-11): 12 x Decoder(
      (mask_multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (multi_head_attn): MultiHead(
        (dropout): Dropout(p=0.1, inplace=False)
        (query): Linear(in_features=768, out_features=768, bias=False)
        (key): Linear(in_features=768, out_features=768, bias=False)
        (value): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
      )
      (feedforward): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=False)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=3072, out_features=768, bias=False)
      )
      (layernorm1): T5LayerNorm()
      (layernorm2): T5LayerNorm()
      (layernorm3): T5LayerNorm()
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
      (dropout3): Dropout(p=0.1, inplace=False)
    )
  )
  (encoder_layernorm): T5LayerNorm()
  (decoder_layernorm): T5LayerNorm()
  (linear): Linear(in_features=768, out_features=32128, bias=False)
)
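The code example below trains a model that rewrites a sentence without changing its meaning. For example, “鹿跳过篱笆。” ("The deer jumped over the fence.") can be rewritten as “一只鹿跳过篱笆。” ("A deer jumped over the fence.").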
import torch
from scient.neuralnet import transformer, fit
from scient.neuralnet import dataset
import sentencepiece
import pandas
from tqdm import tqdm
tqdm.pandas()

data_path = 'd:\\rewrite_train3.xlsx'
tokenizer_path = 'd:\\spiece.model'

#%% model
vocab_size = 32128
seq_len_upper = 32
tokenizer = sentencepiece.SentencePieceProcessor(tokenizer_path)
model = transformer.T5Transformer(vocab_size=vocab_size, dropout=0.1, ffn_size=3072)

#%% data
data = pandas.read_excel(data_path)
# tokenize
data['source_token'] = data['input'].progress_apply(lambda x: tokenizer.encode(x))
data['target_token'] = data['label'].progress_apply(lambda x: tokenizer.encode(x))
# filter: keep only pairs shorter than seq_len_upper, so one marker token still fits;
# .copy() avoids pandas SettingWithCopyWarning on the assignments below
data = data[(data['source_token'].apply(len) < seq_len_upper) & (data['target_token'].apply(len) < seq_len_upper)].copy()
# truncate and add markers
data['source_token'] = data['source_token'].progress_apply(lambda x: x[:seq_len_upper] + [tokenizer.eos_id()])  # append <eos>
data['target_input_token'] = data['target_token'].progress_apply(lambda x: [tokenizer.pad_id()] + x[:seq_len_upper])  # prepend <bos>; pad_id is used as <bos>
data['target_output_token'] = data['target_token'].progress_apply(lambda x: x[:seq_len_upper] + [tokenizer.eos_id()])  # append <eos>
# pad masks: True marks padding positions
data['source_pad_mask'] = data['source_token'].progress_apply(lambda x: [False] * len(x) + [True] * (seq_len_upper - len(x)))
data['target_pad_mask'] = data['target_input_token'].progress_apply(lambda x: [False] * len(x) + [True] * (seq_len_upper - len(x)))
# pad every sequence to seq_len_upper
data['source_token'] = data['source_token'].progress_apply(lambda x: x + [tokenizer.pad_id()] * (seq_len_upper - len(x)))
data['target_input_token'] = data['target_input_token'].progress_apply(lambda x: x + [tokenizer.pad_id()] * (seq_len_upper - len(x)))
data['target_output_token'] = data['target_output_token'].progress_apply(lambda x: x + [tokenizer.pad_id()] * (seq_len_upper - len(x)))

batch_size = 8

# data loaders: 70% train, then 70% of the remainder for eval, the rest for validation
data_train = data.sample(frac=0.7)
data_eval = data.drop(data_train.index).sample(frac=0.7)
data_val = data.drop(data_train.index).drop(data_eval.index)

tensor_vars = ['source_token', 'target_input_token', 'source_pad_mask', 'target_pad_mask']
train_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_train, tensor_vars=tensor_vars, target_var='target_output_token'), batch_size=batch_size, shuffle=True)
eval_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_eval, tensor_vars=tensor_vars, target_var='target_output_token'), batch_size=batch_size, shuffle=False)
val_loader = torch.utils.data.DataLoader(dataset=dataset.DataFrame(frame=data_val, tensor_vars=tensor_vars, target_var='target_output_token'), batch_size=1, shuffle=False)

#%% training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

# loss function: ignore_index=0 skips padding positions (pad id 0)
loss_func_ = torch.nn.CrossEntropyLoss(ignore_index=0)
def loss_func(y_hat, y):
    return loss_func_(y_hat.reshape(-1, vocab_size), y.reshape(-1).to(torch.int64).to(device))  # compute the loss

# optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

def perform_func(y_hat, y):  # inputs are the predictions y_hat and the true values y
    # y_hat and y are collected batch by batch from the loader, so each is a list of
    # batch-sized objects; concatenate them first
    y_hat, y = torch.concat(y_hat).reshape(-1, vocab_size).numpy(), torch.concat(y).reshape(-1).numpy()
    y_hat = y_hat.argmax(axis=1)
    y_hat = y_hat[y != 0]  # drop padding positions
    y = y[y != 0]
    return round((y_hat == y).sum() / len(y), 4)  # accuracy, rounded to 4 decimal places

model = fit.set(model, optimizer=optimizer, loss_func=loss_func, perform_func=perform_func, device=device, n_iter=5)
model.fit(train_loader, eval_loader, mode=('inputs', 'target'))
The tokenizer model spiece.model and the training data rewrite_train3.xlsx used in the code can be downloaded here:
Link: https://pan.baidu.com/s/12vEZBYldXvPrJTiFUEKGUw?pwd=DTFM
Extraction code: DTFM
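The token preparation in the training script (append eos, shift the decoder input by one position, pad, and build the pad masks) can be sketched for a single sentence pair as follows. The helper name prepare_pair and the default ids pad_id=0 and eos_id=1 are assumptions for illustration; in the script these ids come from tokenizer.pad_id() and tokenizer.eos_id().

```python
def prepare_pair(source_ids, target_ids, seq_len, pad_id=0, eos_id=1):
    """Build padded encoder/decoder sequences and pad masks for one example.

    Hypothetical helper mirroring the pandas steps in the training script.
    """
    source = source_ids + [eos_id]           # encoder input ends with <eos>
    target_input = [pad_id] + target_ids     # decoder input starts with <bos> (pad_id reused)
    target_output = target_ids + [eos_id]    # decoder target is the input shifted left by one

    def pad(ids):
        return ids + [pad_id] * (seq_len - len(ids))

    def pad_mask(ids):
        # True marks padding positions, as in the training script
        return [False] * len(ids) + [True] * (seq_len - len(ids))

    return {
        "source_token": pad(source),
        "source_pad_mask": pad_mask(source),
        "target_input_token": pad(target_input),
        "target_pad_mask": pad_mask(target_input),
        "target_output_token": pad(target_output),
    }


if __name__ == "__main__":
    example = prepare_pair([5, 6, 7], [8, 9], seq_len=6)
    for name, value in example.items():
        print(name, value)
```

At every step the decoder sees target_input_token up to position i and is trained to predict target_output_token at position i, which is why the two sequences differ only by the one-position shift.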
After 5 training iterations, the model's accuracy on both the training set and the evaluation set exceeds 99%.
train iter 0: avg_batch_loss=3.88477 perform=0.5023: 100%|██████████| 140/140 [06:43<00:00, 2.88s/it]
eval iter 0: avg_batch_loss=0.56695 perform=0.8973: 100%|██████████| 42/42 [00:28<00:00, 1.47it/s]
train iter 1: avg_batch_loss=0.27674 perform=0.9539: 100%|██████████| 140/140 [08:02<00:00, 3.45s/it]
eval iter 1: avg_batch_loss=0.08557 perform=0.9808: 100%|██████████| 42/42 [00:46<00:00, 1.10s/it]
train iter 2: avg_batch_loss=0.05592 perform=0.9897: 100%|██████████| 140/140 [09:33<00:00, 4.10s/it]
eval iter 2: avg_batch_loss=0.01999 perform=0.9957: 100%|██████████| 42/42 [00:28<00:00, 1.45it/s]
train iter 3: avg_batch_loss=0.02244 perform=0.9964: 100%|██████████| 140/140 [07:58<00:00, 3.42s/it]
eval iter 3: avg_batch_loss=0.01343 perform=0.996: 100%|██████████| 42/42 [00:32<00:00, 1.31it/s]
train iter 4: avg_batch_loss=0.01273 perform=0.9981: 100%|██████████| 140/140 [07:44<00:00, 3.32s/it]
eval iter 4: avg_batch_loss=0.01047 perform=0.9977: 100%|██████████| 42/42 [00:29<00:00, 1.41it/s]
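The perform value in the log is the token-level accuracy returned by perform_func, computed only over non-padding positions. A minimal pure-Python sketch of the same calculation, assuming pad id 0 and using a hypothetical function name, looks like this:

```python
def masked_accuracy(scores, targets, pad_id=0):
    """Token-level accuracy that skips padding targets.

    scores:  list of per-token score lists (one score per vocabulary entry)
    targets: list of true token ids, same length as scores
    Hypothetical helper for illustration; not part of scient.
    """
    correct = 0
    total = 0
    for row, target in zip(scores, targets):
        if target == pad_id:
            continue  # padding positions do not count toward accuracy
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        correct += int(predicted == target)
        total += 1
    return round(correct / total, 4) if total else 0.0


if __name__ == "__main__":
    scores = [[0.1, 0.9, 0.0], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05], [0.3, 0.6, 0.1]]
    targets = [1, 2, 0, 2]
    print(masked_accuracy(scores, targets))
```

Excluding padding matters here because most sequences are much shorter than seq_len_upper, so counting pad positions would inflate the score.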
Use the trained model to predict on the data_val dataset:
#%% validation
model.eval()
progressbar = tqdm(val_loader)  # batch_size must be 1 here
preds = []
with torch.no_grad():
    for index, ((source, target_input, source_pad_mask, target_input_pad_mask), target_output) in enumerate(progressbar):
        # encode the source once, then decode token by token
        memory = model.encode(source.to(torch.int64).to(device), source_pad_mask.to(device))
        pred = torch.tensor([[tokenizer.pad_id()]])  # <bos>
        while True:
            pred_mask = torch.zeros_like(pred).to(torch.bool)  # nothing is padded in the generated prefix
            decode = model.decode(pred.to(torch.int64).to(device), memory, target_pad_mask=pred_mask.to(device))
            output = model.linear(decode)
            _, ids = output.max(dim=-1)  # greedy choice at every position
            if ids[0, -1] == tokenizer.eos_id():  # stop at <eos>
                break
            if pred.size(1) > seq_len_upper - 1:  # stop at the maximum length
                break
            pred = torch.cat([pred.to(device), ids[:, -1:]], dim=-1)
        preds += pred.tolist()
data_val['target_output_pred'] = preds
data_val['target_pred'] = data_val['target_output_pred'].progress_apply(lambda x: tokenizer.decode(x))
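The loop above is greedy decoding: start from the bos token, repeatedly pick the highest-scoring next token, and stop at eos or at the length limit. The sketch below isolates that control flow with a toy scoring function standing in for model.decode followed by model.linear; greedy_decode and toy_scores are hypothetical names, not part of scient.

```python
def greedy_decode(next_token_scores, bos_id, eos_id, max_len):
    """Generate token ids greedily until <eos> or until max_len tokens are held.

    next_token_scores: callable taking the tokens so far and returning one
    score per vocabulary entry for the next position (stand-in for the model).
    """
    tokens = [bos_id]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == eos_id:
            break  # <eos> is not appended, as in the validation loop
        tokens.append(next_id)
    return tokens


def toy_scores(tokens):
    """Toy model: always emits 5, then 7, then <eos> (id 1)."""
    script = [5, 7, 1]
    step = len(tokens) - 1
    next_id = script[step] if step < len(script) else 1
    scores = [0.0] * 10
    scores[next_id] = 1.0
    return scores


if __name__ == "__main__":
    print(greedy_decode(toy_scores, bos_id=0, eos_id=1, max_len=32))
```

Because every step re-runs the decoder on the whole prefix, the cost grows with the square of the output length, which is acceptable here since seq_len_upper is only 32.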
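Prediction results
input is the input text, label is the output expected from the model, and target_pred is the model's actual output. The model's output is largely consistent with the expected output.
The first record is worth noting. The model rewrote
机构认为,随着经济数据及上市公司财报的披露,预计市场主线将逐渐清晰,中大盘成长股有望成为下一阶段的资金偏好。 ("Institutions believe that as economic data and listed companies' financial reports are released, the market's main theme should gradually become clear, and mid- and large-cap growth stocks are likely to be favored by capital in the next stage.")
as
他们说,第二季度的收益报告将是给予投资者这种指导的关键。 ("They say second-quarter earnings reports will be key to giving investors this kind of guidance.")
Although this differs considerably from the expected output, the meaning of the model's output is entirely correct. Could this be creativity emerging from the model?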