1. Dataset preparation:
data (a DataFrame) is the dataset; it has two columns: ['train_texts', 'train_labels'].
train_texts is the final form of x before it is fed into the model:
train_texts = data['train_texts'], for example:
train_texts = ['这部电影很好看', '这部电影剧情拖沓,没演技', '非常好看的一部电影,很催泪']
train_labels is the final form of the labels before they are fed into the model:
train_labels = data['train_labels'], for example:
train_labels = ['好评', '差评', '好评']  (positive, negative, positive)
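For reference, a minimal sketch of producing these two lists from the DataFrame; the file name data.csv is an assumption, not from the original:

import pandas as pd

data = pd.read_csv('data.csv')               # columns: 'train_texts', 'train_labels'
train_texts = data['train_texts'].tolist()   # list of raw text strings
train_labels = data['train_labels'].tolist() # list of label strings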
2. Loading the Chinese BERT model (bert-base-chinese)
Download the pretrained model
Download it from https://huggingface.co/models (currently requires a VPN from mainland China). A Baidu Cloud mirror is provided here: https://pan.baidu.com/s/1iDY4ANbAgOR6OCOr7QAwwA?pwd=8pu6
Extraction code: 8pu6
Unzip the downloaded archive and place it in your text-classification project directory, e.g.:
The chinese_L-12_H-768_A-12 folder is the downloaded pretrained model. It contains 3 files: vocab.txt, config.json, and pytorch_model.bin. vocab.txt is the vocabulary (it maps tokens to indices, turning text into numbers); config.json configures the pretrained model, covering its architecture, hyperparameters, and other settings; pytorch_model.bin holds the model weights.
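The screenshot referred to below as Figure 1 is not reproduced here; the intended layout is roughly the following (train.py is a hypothetical script name):

your_project/
├── chinese_L-12_H-768_A-12/
│   ├── vocab.txt
│   ├── config.json
│   └── pytorch_model.bin
└── train.py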
Import the BERT-related packages (torch and the Dataset/DataLoader utilities are also imported here, since the later steps need them):
import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('./chinese_L-12_H-768_A-12')
Note: this is a relative path; as long as you put the chinese_L-12_H-768_A-12 folder in the location shown in Figure 1 (i.e. the project layout above), the load will succeed.
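An optional sanity check, not in the original: bert-base-chinese tokenizes Chinese text roughly character by character, which you can confirm on one sample sentence:

print(tokenizer.tokenize('这部电影很好看'))   # ['这', '部', '电', '影', '很', '好', '看']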
Encode train_texts. max_length=64 is the maximum length of a single text: anything longer is truncated, anything shorter is padded with 0. Pick a value based on the length distribution of your texts:
train_input_ids = []
train_attention_masks = []
for text in train_texts:
    encoded = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    train_input_ids.append(encoded['input_ids'])
    train_attention_masks.append(encoded['attention_mask'])
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)
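A quick check (my addition): after torch.cat, both tensors should have shape (number of texts, 64):

print(train_input_ids.shape)         # e.g. torch.Size([3, 64]) for the 3 example texts
print(train_attention_masks.shape)   # same shape as train_input_ids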
Encode train_labels = ['好评', '差评', '中评'] (positive/negative/neutral) as integer class ids, e.g. [0, 1, 2]. Note that BertForSequenceClassification expects class indices here, not one-hot vectors such as [[1,0,0],[0,1,0],[0,0,1]]:
label2id = {'好评': 0, '差评': 1, '中评': 2}   # map each label string to an integer id
train_labels = torch.tensor([label2id[l] for l in train_labels])
Create batches; the batch size can be 32, 64, 128, 256, etc.:
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=32)
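A small variant, not in the original: for training it usually helps to shuffle the samples each epoch:

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)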
Define the model and optimizer:
model = BertForSequenceClassification.from_pretrained('./chinese_L-12_H-768_A-12', num_labels=3)  # num_labels must match the number of classes (3 here: 好评/差评/中评)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
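Optional, and not in the original: BERT fine-tuning often uses a linear warmup/decay learning-rate schedule, for which transformers ships a helper. The warmup step count below is an assumption to tune:

from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                            # no warmup; adjust as needed
    num_training_steps=len(train_dataloader) * 5   # 5 epochs, matching the loop below
)
# if used, call scheduler.step() right after optimizer.step() in the training loop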
Train the model:
print("++++++++++++++++++++++++++++开始训练+++++++++++++++++++++++++++++++++") model.train() for epoch in range(5): print(f'Epoch {epoch + 1}') total_loss = 0 for step, batch in enumerate(train_dataloader): batch_input_ids = batch[0] batch_attention_masks = batch[1] batch_labels = batch[2] optimizer.zero_grad() outputs = model(batch_input_ids, attention_mask=batch_attention_masks, labels=batch_labels) loss = outputs.loss total_loss += loss.item() loss.backward() optimizer.step() if step%10==0: print(f'step {step}, avg_loss: {total_loss:.4f}')
Save the model:
torch.save(model.state_dict(), './bert_model.pth')
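To use the saved weights later, one possible way (a sketch; the sample sentence and the id2label mapping are assumptions):

model = BertForSequenceClassification.from_pretrained('./chinese_L-12_H-768_A-12', num_labels=3)
model.load_state_dict(torch.load('./bert_model.pth'))
model.eval()
encoded = tokenizer('这部电影很好看', max_length=64, padding='max_length',
                    truncation=True, return_tensors='pt')
with torch.no_grad():
    logits = model(**encoded).logits
id2label = {0: '好评', 1: '差评', 2: '中评'}   # inverse of label2id above
print(id2label[logits.argmax(dim=-1).item()])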