The overall framework of this experiment follows this blog post:
https://lizhiyang.blog.csdn.net/article/details/132394853
Based on the Ernie-3.0-medium-zh model, that post runs sentiment analysis on 600 reviews of the film 孤注一掷 (No More Bets) and displays a word cloud. It cleans the data with regular expressions and processes it with the DatasetBuilder class from paddlenlp.datasets, turning the data into the format [{'text_a': 'data', 'label': label}, ...].
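The referenced post does not show its exact cleaning rules; as a minimal sketch, a regex cleanup for weibo-style text might look like this (the patterns below are my assumptions, not the post's):
- import re
-
- def clean_text(text):
-     """Minimal weibo text cleanup: drop @mentions and URLs, remove whitespace."""
-     text = re.sub(r"@[\w\-]+", "", text)       # remove @user mentions
-     text = re.sub(r"https?://\S+", "", text)   # remove URLs
-     return re.sub(r"\s+", "", text)            # remove all whitespace
-
- print(clean_text("转发微博 @某人 https://t.cn/xxx 今天真开心[嘻嘻]"))
- # -> 转发微博今天真开心[嘻嘻]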
1.1 Splitting the dataset
- import random
-
- # Read the contents of the custom .txt file
- with open('weibo_senti_100k.txt', 'r', encoding='utf-8') as file:
-     lines = file.readlines()
-
- # Shuffle the data randomly
- random.shuffle(lines)
-
- # Compute the split indices: 70% train, 20% dev, 10% test
- total_lines = len(lines)
- train_end = int(total_lines * 0.7)
- dev_end = int(total_lines * 0.9)
-
- # Split the data
- train_data = lines[:train_end]
- dev_data = lines[train_end:dev_end]
- test_data = lines[dev_end:]
-
- # Write the training data to train.txt
- with open('train.txt', 'w', encoding='utf-8') as file:
-     file.writelines(train_data)
-
- # Write the validation data to dev.txt
- with open('dev.txt', 'w', encoding='utf-8') as file:
-     file.writelines(dev_data)
-
- # Write the test data to test.txt
- with open('test.txt', 'w', encoding='utf-8') as file:
-     file.writelines(test_data)
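As a quick sanity check (an addition of mine, not in the original post), printing the split sizes confirms the 7:2:1 ratio:
- print(len(train_data), len(dev_data), len(test_data))
- # weibo_senti_100k has roughly 120k lines, so expect about 84k / 24k / 12k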
The dataset used here is the weibo_senti_100k dataset: ChineseNlpCorpus/datasets/weibo_senti_100k/intro.ipynb at master · SophonPlus/ChineseNlpCorpus · GitHub
The weibo dataset is split three ways: 70% training set, 20% validation set, and 10% test set. The file is UTF-8 encoded but has no column headers. To edit it: open it in Notepad++ and convert it to ANSI, rename the file to .csv so it opens normally, make the edits, save it as a Unicode .txt, then use Notepad++ to convert it back to UTF-8.
Notepad++ netdisk resource:
https://pan.baidu.com/s/14cRU0EjD0BiPl5doYj6pMA
Extraction code: kwii
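If you prefer to skip the manual Notepad++ round-trip, the re-encoding can also be done in Python; this is a sketch of mine, assuming the "ANSI" file is GBK-encoded (the usual case on Chinese-locale Windows):
- # Re-encode a GBK ("ANSI") file back to UTF-8 in one step
- with open('weibo_senti_100k.csv', 'r', encoding='gbk', errors='ignore') as src:
-     content = src.read()
- with open('weibo_senti_100k.txt', 'w', encoding='utf-8') as dst:
-     dst.write(content)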
1.2 Loading the data
- # Import DatasetBuilder
- from paddlenlp.datasets import DatasetBuilder
-
-
- class NewsData(DatasetBuilder):
-     SPLITS = {
-         'train': r'train.txt',  # training set
-         'dev': r'dev.txt',      # validation set
-         'test': r'test.txt'     # test set
-     }
-
-     def _get_data(self, mode, **kwargs):
-         filename = self.SPLITS[mode]
-         return filename
-
-     def _read(self, filename):
-         """Read the data file line by line."""
-         with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
-             for line in f:
-                 if line == '\n':
-                     continue
-                 data = line.strip().split("\t")  # columns are separated by '\t'
-                 label, text_a = data
-                 text_a = text_a.replace(" ", "")
-                 if label in ['0', '1']:
-                     yield {"text_a": text_a, "label": label}  # sample format: text_a, label; adjust to your data as needed
-
-     def get_labels(self):
-         return label_list  # class labels (the global label_list is defined below, before loading)
-
- from paddlenlp.datasets import load_dataset
-
- -------------------------------------------
- D:******lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
- warnings.warn("Setuptools is replacing distutils.")
[This is only a warning; it does not affect execution.]
- # Define the dataset loading function (this shadows the load_dataset imported above)
- def load_dataset(name=None,
-                  data_files=None,
-                  splits=None,
-                  lazy=None,
-                  **kwargs):
-
-     reader_cls = NewsData  # use the custom dataset class defined above
-     print(reader_cls)
-     if not name:
-         reader_instance = reader_cls(lazy=lazy, **kwargs)
-     else:
-         reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)
-     datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
-     return datasets
-
- # Load the train, dev, and test sets
- label_list = ['0', '1']
- train_ds, dev_ds, test_ds = load_dataset(splits=['train', 'dev', 'test'])
-
-
- -------------------------------------
- <class '__main__.NewsData'>
1.3 Previewing the data
- # Show the first five samples
- train_ds[:5]
-
- --------------------------------------
- [{'text_a': '[抓狂][抓狂][抓狂]起晚了[泪]', 'label': 0},
- {'text_a': '分享图片,不要啊~~~虽然我很喜欢周迅,可是八阿哥,你一定要等晴川啊~~[泪]', 'label': 0},
- {'text_a': '想shi的就上飞机吧???只见君去不见君还[泪]//@玉翠文章:不错很丰满,赛过杨贵妃,我喜欢,带着你超过盖茨4倍家财的嫁妆来吧[酷]请各路大仙作媒',
- 'label': 0},
- {'text_a': '多谢支持!//@洪三水:[嘻嘻]画面不错,其他继续研究中。。//@曹欣Dyson:抢怪的那叫个多。-。-//@刘波BOB:钢铁侠,有没有?有没有?亮了!',
- 'label': 1},
- {'text_a': '#周末节奏#美好的一天从早餐开始,黄金蛋炒饭,番茄牛尾汤[嘻嘻]', 'label': 1}]
2.1 Loading the model
- import os
- import paddle
- import paddlenlp
- from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
- model_name = "ernie-3.0-medium-zh"
- model = AutoModelForSequenceClassification.from_pretrained(model_name, num_classes=2)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- ---------------------------------------
- [2024-01-26 13:15:04,015] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-medium-zh'.
- [2024-01-26 13:15:04,016] [ INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
- [2024-01-26 13:15:04,017] [ INFO] - Loading weights file model_state.pdparams from cache at *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
- [2024-01-26 13:15:04,276] [ INFO] - Loaded weights file from disk, setting weights to model.
- [2024-01-26 13:15:09,982] [ WARNING] - Some weights of the model checkpoint at ernie-3.0-medium-zh were not used when initializing ErnieForSequenceClassification: ['ernie.encoder.layers.6.self_attn.k_proj.weight', 'ernie.encoder.layers.6.self_attn.q_proj.bias', 'ernie.encoder.layers.6.linear1.weight', 'ernie.encoder.layers.6.norm2.bias', 'ernie.encoder.layers.6.self_attn.k_proj.bias', 'ernie.encoder.layers.6.self_attn.v_proj.bias', 'ernie.encoder.layers.6.self_attn.out_proj.weight', 'ernie.encoder.layers.6.self_attn.v_proj.weight', 'ernie.encoder.layers.6.norm1.bias', 'ernie.encoder.layers.6.norm1.weight', 'ernie.encoder.layers.6.linear2.bias', 'ernie.encoder.layers.6.linear1.bias', 'ernie.encoder.layers.6.self_attn.q_proj.weight', 'ernie.encoder.layers.6.linear2.weight', 'ernie.encoder.layers.6.self_attn.out_proj.bias', 'ernie.encoder.layers.6.norm2.weight']
- - This IS expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- - This IS NOT expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
- [2024-01-26 13:15:09,982] [ WARNING] - Some weights of ErnieForSequenceClassification were not initialized from the model checkpoint at ernie-3.0-medium-zh and are newly initialized: ['classifier.bias', 'ernie.pooler.dense.bias', 'classifier.weight', 'ernie.pooler.dense.weight']
- You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
- [2024-01-26 13:15:10,009] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'ernie-3.0-medium-zh'.
- [2024-01-26 13:15:10,010] [ INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\ernie_3.0_medium_zh_vocab.txt
- [2024-01-26 13:15:10,028] [    INFO] - tokenizer config file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\tokenizer_config.json
- [2024-01-26 13:15:10,030] [ INFO] - Special tokens file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\special_tokens_map.json
You first need the utils package. Running conda install utils fails with: PackagesNotFoundError: The following packages are not available from current channels. Following the blog post below, copy its suggested command into the Anaconda prompt to install it.
"PackagesNotFoundError: The following packages are not available from current channels" solutions - CSDN blog
- from functools import partial
- from paddlenlp.data import Stack, Tuple, Pad
- from utils import convert_example, create_dataloader
-
- # Batch size and maximum sequence length for the model
- batch_size = 32
- max_seq_length = 128
-
- trans_func = partial(
-     convert_example,
-     tokenizer=tokenizer,
-     max_seq_length=max_seq_length)
- batchify_fn = lambda samples, fn=Tuple(
-     Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids, padded to batch max length
-     Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids (segment ids), padded
-     Stack(dtype="int64")                               # labels, stacked into one tensor
- ): [data for data in fn(samples)]
- train_data_loader = create_dataloader(
-     train_ds,
-     mode='train',
-     batch_size=batch_size,
-     batchify_fn=batchify_fn,
-     trans_fn=trans_func)
- dev_data_loader = create_dataloader(
-     dev_ds,
-     mode='dev',
-     batch_size=batch_size,
-     batchify_fn=batchify_fn,
-     trans_fn=trans_func)
Even with utils installed, the third line above (from utils import convert_example, create_dataloader) still fails: ImportError: cannot import name 'convert_example' from 'utils' (D:\gpu\anaconda\in\envs\py38\lib\site-packages\utils\__init__.py)
Following that path shows the package's __init__.py is empty.
Copying the contents of utils.py from the open-source AI Studio project below into that local __init__.py makes the import succeed:
『NLP经典项目集』02:使用预训练模型ERNIE优化情感分析 - 飞桨AI Studio星河社区 (baidu.com)
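For reference, the two helpers imported from utils look roughly like this in that project (paraphrased from PaddleNLP's example code; a sketch rather than the exact file contents, and example["text_a"] matches this experiment's data format):
- import numpy as np
- import paddle
-
- def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
-     """Tokenize one sample into input_ids / token_type_ids (+ label when training)."""
-     encoded_inputs = tokenizer(text=example["text_a"], max_seq_len=max_seq_length)
-     input_ids = encoded_inputs["input_ids"]
-     token_type_ids = encoded_inputs["token_type_ids"]
-     if not is_test:
-         label = np.array([example["label"]], dtype="int64")
-         return input_ids, token_type_ids, label
-     return input_ids, token_type_ids
-
- def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
-     """Map samples through trans_fn and wrap them in a DataLoader (shuffled for training)."""
-     if trans_fn:
-         dataset = dataset.map(trans_fn)
-     shuffle = (mode == 'train')
-     if mode == 'train':
-         batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
-     else:
-         batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
-     return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)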
2.2 Model training
- import paddlenlp as ppnlp
- import paddle
- from paddlenlp.transformers import LinearDecayWithWarmup
-
- # Peak learning rate during training
- learning_rate = 5e-6
- # Number of training epochs
- epochs = 20  # the referenced post used 3
- # Proportion of steps used for learning-rate warmup
- warmup_proportion = 0.3
- # Weight decay coefficient, a regularization strategy to help avoid overfitting
- weight_decay = 0.1
-
- num_training_steps = len(train_data_loader) * epochs
- lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
- optimizer = paddle.optimizer.AdamW(
-     learning_rate=lr_scheduler,
-     parameters=model.parameters(),
-     weight_decay=weight_decay,
-     apply_decay_param_fun=lambda x: x in [
-         p.name for n, p in model.named_parameters()
-         if not any(nd in n for nd in ["bias", "norm"])
-     ])
-
- criterion = paddle.nn.loss.CrossEntropyLoss()
- metric = paddle.metric.Accuracy()
- import paddle.nn.functional as F
- from utils import evaluate
- all_train_loss = []
- all_train_accs = []
- Batch = 0
- Batchs = []
- global_step = 0
- for epoch in range(1, epochs + 1):
-     for step, batch in enumerate(train_data_loader, start=1):
-         input_ids, segment_ids, labels = batch
-         logits = model(input_ids, segment_ids)
-         loss = criterion(logits, labels)
-         probs = F.softmax(logits, axis=1)
-         correct = metric.compute(probs, labels)
-         metric.update(correct)
-         acc = metric.accumulate()
-         global_step += 1
-         if global_step % 10 == 0:
-             print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f"
-                   % (global_step, epoch, step, float(loss), acc))
-             Batch += 10
-             Batchs.append(Batch)
-             all_train_loss.append(float(loss))  # store as float so matplotlib can plot it
-             all_train_accs.append(acc)
-         loss.backward()
-         optimizer.step()
-         lr_scheduler.step()
-         optimizer.clear_grad()
-     evaluate(model, criterion, metric, dev_data_loader)
- model.save_pretrained('/home/aistudio/checkpoint')
- tokenizer.save_pretrained('/home/aistudio/checkpoint')
-
- -------------------------------------
- [2024-01-26 12:48:05,721] [ INFO] - Configuration saved in /home/aistudio/checkpoint\config.json
- [2024-01-26 12:48:06,184] [ INFO] - Model weights saved in /home/aistudio/checkpoint\model_state.pdparams
- [2024-01-26 12:48:06,186] [ INFO] - tokenizer config file saved in /home/aistudio/checkpoint\tokenizer_config.json
- [2024-01-26 12:48:06,187] [ INFO] - Special tokens file saved in /home/aistudio/checkpoint\special_tokens_map.json
- ('/home/aistudio/checkpoint\\tokenizer_config.json',
- '/home/aistudio/checkpoint\\special_tokens_map.json',
- '/home/aistudio/checkpoint\\added_tokens.json')
2.3 Plotting training curves
- import matplotlib.pyplot as plt
-
- def draw_train_acc(Batchs, train_accs, train_loss):
-     title = "training accs"
-     plt.title(title, fontsize=24)
-     plt.xlabel("batch", fontsize=14)
-     plt.ylabel("acc", fontsize=14)
-     plt.plot(Batchs, train_accs, color='green', label='training accs')
-     plt.plot(Batchs, train_loss, color='red', label='training loss')
-     plt.legend()
-     plt.grid()
-     plt.show()
-
- draw_train_acc(Batchs, all_train_accs, all_train_loss)
2.4 Model prediction
- # Load the trained model parameters
- import os
- import paddle
-
- params_path = 'checkpoint/model_state.pdparams'
- if params_path and os.path.isfile(params_path):
-     state_dict = paddle.load(params_path)
-     model.set_dict(state_dict)
-     print("Loaded parameters successfully!")
-
- from utils import predict
-
- batch_size = 32
- data = test_ds
- label_map = {0: '0', 1: '1'}  # two classes only
- results = predict(
-     model, data, tokenizer, label_map, batch_size=batch_size)
- for idx, text in enumerate(data):
-     print('Data: {} \t Label: {}'.format(text, results[idx]))
The code above outputs only the predicted labels, without probabilities. The Baidu AI Studio open-source project below also reports probabilities; it requires bash, or you can run it directly on AI Studio.
飞桨AI Studio星河社区-人工智能学习与实训社区 (baidu.com)
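If you only need a probability next to each label without switching projects, a minimal sketch (my addition; the sample text and variable names are illustrative) is to call the fine-tuned model directly and apply softmax:
- import paddle
- import paddle.nn.functional as F
-
- # Score a single sample text and report its label plus confidence
- encoded = tokenizer(text="今天心情很好!", max_seq_len=128)
- input_ids = paddle.to_tensor([encoded["input_ids"]])
- token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])
- logits = model(input_ids, token_type_ids)
- probs = F.softmax(logits, axis=1)
- label = paddle.argmax(probs, axis=1).item()
- print("label:", label, "prob: %.4f" % float(probs[0][label]))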