Running the experiment on Colab
Upload the data via the file panel on the upper left into the current session.
Set the runtime to GPU via the settings on the upper right.
!nvidia-smi
Install the required libraries
!pip install datasets
!pip install transformers[torch]
!pip install torchkeras
import pandas as pd
data = pd.read_csv("/content/news.csv")
data
The label column holds Chinese strings; for training they need to be converted to integers, like so:
{'教育': 0,
'体育': 1,
'科技': 2,
'时尚': 3,
'房产': 4,
'家居': 5,
'财经': 6,
'时政': 7,
'娱乐': 8,
'游戏': 9}
A simple loop over the unique labels is enough to build this mapping; then we convert the full data into a DataFrame.
# Build the label-to-id mapping
def label_dic(data, label):
    d = {}
    labels = data[label].unique()
    for i, v in enumerate(labels):
        d[v] = i
    return d

# Collect the texts and their numeric labels
def get_train_data(data, col_x, col_y, label_dic):
    content = data[col_x]
    label = []
    for i in data[col_y]:
        label.append(label_dic.get(i))
    return content, label

label_dic = label_dic(data, "label")
content, label = get_train_data(data, "text", "label", label_dic)
Assemble the data into a form we can train on:
from sklearn.utils import shuffle
data = pd.DataFrame({"content": content, "label": label})
data = shuffle(data)
from transformers import AutoTokenizer #BertTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
tokenizer
BertTokenizerFast(name_or_path='bert-base-chinese', vocab_size=21128, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
Take 80% of the data as the training set.
train_len = round(len(data)*0.8)
train_data = tokenizer(data.content.to_list()[:train_len], padding = "max_length", max_length = 128, truncation=True ,return_tensors = "pt")
train_label = data.label.to_list()[:train_len]
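What `padding="max_length"` with `truncation=True` does per sequence can be sketched in pure Python (the helper name `pad_or_truncate` is ours, not a transformers API; pad id 0 matches [PAD] in the tokenizer output above):

```python
# Truncate overlong inputs to max_length, then right-pad shorter ones;
# the attention mask marks real tokens with 1 and padding with 0.
def pad_or_truncate(ids, max_length, pad_id=0):
    ids = ids[:max_length]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad
```

For example, `pad_or_truncate([101, 2769, 102], 5)` returns `([101, 2769, 102, 0, 0], [1, 1, 1, 0, 0])`, which is the shape of the `input_ids` / `attention_mask` pair the tokenizer produces for every text.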
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=10)
In AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=10), transformers already defines the loss function for you, namely the cross-entropy loss over the 10 classes, so below we only need to define the optimizer and learning rate ourselves.
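The per-example loss the model computes internally is standard softmax cross-entropy over the 10 logits; a minimal pure-Python sketch (the function name is ours):

```python
import math

# Softmax cross-entropy for one example: -log(softmax(logits)[target]),
# written with the max-subtraction trick for numerical stability.
def cross_entropy(logits, target):
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]
```

With two equal logits the loss is log 2 ≈ 0.693, i.e. the model is maximally unsure between two classes; the loss shrinks toward 0 as the target logit dominates.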
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 16
train = TensorDataset(train_data["input_ids"], train_data["attention_mask"], torch.tensor(train_label))
train_sampler = RandomSampler(train)
train_dataloader = DataLoader(train, sampler=train_sampler, batch_size=batch_size)
train_data is a dictionary of model inputs: "input_ids" holds the token IDs fed to the model, and "attention_mask" marks which positions in the input sequence are real tokens rather than padding; the labels for the classification task are kept in the separate train_label list. You can print the training data defined above to inspect it.
TensorDataset wraps the data as a PyTorch tensor dataset in which each sample is a tuple of input_ids, attention_mask, and label.
RandomSampler draws samples from the dataset in random order, which helps avoid overfitting and yields more representative batches.
DataLoader splits the dataset into batches and provides the iterator used during training.
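Stripped of random sampling and tensor collation, the batching that DataLoader performs can be sketched as:

```python
# Yield consecutive slices of at most batch_size items; the last batch
# may be smaller, matching DataLoader's default drop_last=False.
def batches(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

With batch_size = 16, a training set of N samples yields ceil(N / 16) batches per epoch, which is the len(train_dataloader) used for num_training_steps below.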
# Define the optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-4)
# Define the learning-rate schedule and number of epochs
num_epochs = 1
from transformers import get_scheduler
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
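With name="linear" and num_warmup_steps=0, the learning rate decays linearly from the base lr down to 0 over num_training_steps. A pure-Python sketch of that schedule (the helper name linear_lr is ours; this mirrors the behavior of get_scheduler, not its API):

```python
# Linear decay with optional linear warmup: ramp up over the warmup steps,
# then decay to zero by the end of training.
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
```

So halfway through training the learning rate is half of 1e-4, and it reaches 0 on the final step.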
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
Check whether a GPU is available: if so, device is set to cuda, otherwise to cpu, and the model is moved onto the selected device.
The training loop:
for epoch in range(num_epochs):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        if step % 10 == 0 and step != 0:
            print("step: ", step, " loss:", total_loss / (step * batch_size))
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        lr_scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)
    print("avg_loss:", avg_train_loss)
- b_input_ids, b_input_mask, b_labels: the input IDs, attention masks, and labels extracted from the batch and moved onto the device.
inp = "专家指导:参加SSAT考试读美国优质高中(图)SSAT考试的全称是Secondary SchoolAdmission Test),是美国(微博)中学入学测试,相当于中国的中考,近年来,越来越多的中国学生通过参加SSAT申请美国高中,然后一步步进入世界一流大学。"
import numpy as np
test = tokenizer(inp,return_tensors="pt",padding="max_length",max_length=128)
model.eval()
with torch.no_grad():
test["input_ids"] = test["input_ids"].to(device)
test["attention_mask"] = test["attention_mask"].to(device)
outputs = model(test["input_ids"],
token_type_ids=None,
attention_mask=test["attention_mask"])
pred_flat = np.argmax(outputs["logits"].cpu(),axis=1).numpy().squeeze()
pred_flat.tolist()
#0
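np.argmax simply returns the index of the largest logit, which is the predicted class id; the same selection in pure Python:

```python
# Index of the maximum value, i.e. what np.argmax does along the class axis.
def argmax(logits):
    return max(range(len(logits)), key=lambda i: logits[i])
```

Here the prediction is 0, which under our mapping above is the id assigned to 教育.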
We can also map the predicted id back to its label:
id2label_dic={}
for k,v in label_dic.items():
id2label_dic[v] = k
id2label_dic[pred_flat.tolist()]
#教育
model.config.id2label = id2label_dic
model.save_pretrained("./bert0207")
tokenizer.save_pretrained("./bert0207")
Loading the model for prediction
from transformers import pipeline
classifier = pipeline("text-classification",model="./bert0207")
classifier(inp)
#[{'label': '教育', 'score': 0.9345001578330994}]
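The score the pipeline reports is the softmax probability of the predicted class computed from the 10 logits; a minimal sketch:

```python
import math

# Numerically stable softmax: probabilities are positive, sum to 1,
# and preserve the ordering of the logits.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

So a score of about 0.93 means the 教育 logit dominates the other nine classes for this input.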