This article was written on 2023-03-20; there is no guarantee that the code and steps below will still work in the future.
This article focuses on Chinese short-text classification, though other scenarios only require swapping in a different pretrained model.
The overall training procedure: split the data into training, validation, and test sets. Train on the training set for 16 epochs, evaluating on the validation set after each epoch. Finally, take the checkpoint from the epoch with the best validation metric, run it on the test set, and compare its predictions against the true labels to obtain the model's final metrics.
I won't cover how this particular dataset was preprocessed; for that, see the article Fine-tuning pretrained models on text classification tasks (single- and multi-task settings) with huggingface.transformers. Here my main goal is to get the dataset into a clear, readable format so that readers can substitute their own data.
The datasets are CSV files: the `label` column holds the label (numeric here; if your labels are text, map them to integers first), and the `review` column holds the text to be classified.
Example:
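(The original sample screenshot isn't reproduced here; the rows below are made up purely to illustrate the format.)

```
label,review
1,房间很干净,前台服务也很热情,下次还会入住。
0,设施陈旧,隔音太差,不推荐住。
```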
Linux is recommended, but Windows works too if you insist on using it.
If you work on an integrated platform such as Google Colab, you can skip the installation steps below.
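Assuming a pip-based Python environment (the exact setup will vary by machine), the dependencies that the script below imports can be installed with:

```bash
pip install torch transformers scikit-learn tqdm
```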
Create a new text file in VSCode, name it `chn_run.py`, and paste in the following code (if you copy from the CSDN page, remember to delete the extra text CSDN appends):
```python
import csv
from copy import deepcopy

import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification

"""Hyperparameter settings"""
pretrained_path = '/data/pretrained_model/bert-base-chinese'
dropout_rate = 0.1  # not actually used below
max_epoch_num = 16
cuda_device = 'cuda:2'
output_dim = 2  # number of classes

"""Load the datasets"""
# Each CSV row is (label, review); the first row is the header
# training set
with open('chn_train.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)  # header row
    train_data = [[int(row[0]), row[1]] for row in reader]
# validation set
with open('chn_valid.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)
    valid_data = [[int(row[0]), row[1]] for row in reader]
# test set
with open('chn_test.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)
    test_data = [[int(row[0]), row[1]] for row in reader]

tokenizer = AutoTokenizer.from_pretrained(pretrained_path)

def collate_fn(batch):
    # Tokenize each batch on the fly: pad to the longest sequence in the
    # batch and truncate to BERT's 512-token limit
    pt_batch = tokenizer([x[1] for x in batch], padding=True, truncation=True,
                         max_length=512, return_tensors='pt')
    return {'input_ids': pt_batch['input_ids'],
            'token_type_ids': pt_batch['token_type_ids'],
            'attention_mask': pt_batch['attention_mask'],
            'label': torch.tensor([x[0] for x in batch])}

train_dataloader = DataLoader(train_data, batch_size=16, shuffle=True, collate_fn=collate_fn)
valid_dataloader = DataLoader(valid_data, batch_size=128, shuffle=False, collate_fn=collate_fn)
test_dataloader = DataLoader(test_data, batch_size=128, shuffle=False, collate_fn=collate_fn)

"""Build the model"""
# API docs: https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(pretrained_path, num_labels=output_dim)
model.to(cuda_device)

"""Optimizer, loss function, etc."""
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-5)
loss_func = nn.CrossEntropyLoss()  # not used: the model returns the loss itself when labels are passed
max_valid_f1 = 0
best_model = {}

"""Training and validation"""
for e in tqdm(range(max_epoch_num)):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(cuda_device)
        token_type_ids = batch['token_type_ids'].to(cuda_device)
        attention_mask = batch['attention_mask'].to(cuda_device)
        labels = batch['label'].to(cuda_device)
        # Pass the tensors by keyword: forward()'s second positional
        # parameter is attention_mask, not token_type_ids, so positional
        # calls can silently swap the two masks
        outputs = model(input_ids=input_ids, token_type_ids=token_type_ids,
                        attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()
    # validation: keep the checkpoint with the highest macro-F1
    with torch.no_grad():
        model.eval()
        labels = []
        predicts = []
        for batch in valid_dataloader:
            input_ids = batch['input_ids'].to(cuda_device)
            token_type_ids = batch['token_type_ids'].to(cuda_device)
            attention_mask = batch['attention_mask'].to(cuda_device)
            outputs = model(input_ids=input_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask)
            labels.extend([i.item() for i in batch['label']])
            predicts.extend([i.item() for i in torch.argmax(outputs.logits, 1)])
        f1 = f1_score(labels, predicts, average='macro')
        if f1 > max_valid_f1:
            best_model = deepcopy(model.state_dict())
            max_valid_f1 = f1

"""Testing"""
model.load_state_dict(best_model)
with torch.no_grad():
    model.eval()
    labels = []
    predicts = []
    for batch in test_dataloader:
        input_ids = batch['input_ids'].to(cuda_device)
        token_type_ids = batch['token_type_ids'].to(cuda_device)
        attention_mask = batch['attention_mask'].to(cuda_device)
        outputs = model(input_ids=input_ids, token_type_ids=token_type_ids,
                        attention_mask=attention_mask)
        labels.extend([i.item() for i in batch['label']])
        predicts.extend([i.item() for i in torch.argmax(outputs.logits, 1)])
print(accuracy_score(labels, predicts))                    # accuracy
print(precision_score(labels, predicts, average='macro'))  # macro precision
print(recall_score(labels, predicts, average='macro'))     # macro recall
print(f1_score(labels, predicts, average='macro'))         # macro F1
```
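As noted earlier, the labels must already be integers. If yours are text, one way to handle it (with hypothetical label names, not from the original post) is to map them while loading:

```python
import csv

label2id = {'负面': 0, '正面': 1}  # hypothetical text labels -> integer ids

with open('chn_train.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    train_data = [[label2id[row[0]], row[1]] for row in reader]
```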
Things you need to modify:

- `pretrained_path`: change it to wherever you keep the pretrained model (or load it straight from the Hugging Face Hub; see the sketch after this list);
- `cuda_device`: if you don't have a GPU, change it to `cpu`;
- the `chn_*.csv` paths: change them to wherever you placed your CSV files.
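If you don't have the model on disk, `from_pretrained` also accepts a Hub model ID and downloads it automatically, e.g.:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 'bert-base-chinese' is the model ID on the Hugging Face Hub;
# transformers downloads and caches it on first use
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)
```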
The output will look something like:

```
env_path/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Some weights of the model checkpoint at /data/pretrained_model/bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data/pretrained_model/bert-base-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [1:39:37<00:00, 373.62s/it]
0.8088803088803089
0.7998017541751772
0.7586588921282799
0.7722809550288214
```

(The last four numbers are, in the order printed, accuracy, macro precision, macro recall, and macro F1.)
You can ignore the warnings at the beginning.
(Note: don't put too much weight on the exact numbers. I ran this twice in a row and got results 10 points apart... A similar alternative implementation of mine gets above 90%. In short, this metric is not very stable; see Fine-tuning pretrained models on text classification tasks (single- and multi-task settings) with huggingface.transformers.)
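One common way to make runs more comparable (an addition of mine, not something the script above does) is to fix the random seeds before building the dataloaders and the model; a minimal sketch:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix Python's, NumPy's, and PyTorch's RNGs so that data shuffling,
    # classifier-head initialization, and dropout are repeatable
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```

Even with fixed seeds, some GPU kernels are nondeterministic, so small run-to-run differences can remain.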
If your files use a different CSV dialect (for example a different delimiter), add the corresponding arguments in the `reader()` call.
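For instance, reading a hypothetical tab-separated variant of the training file:

```python
import csv

# chn_train.tsv is a hypothetical tab-separated version of the data
with open('chn_train.tsv', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    train_data = [[int(row[0]), row[1]] for row in reader]
```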