
ERNIE 3.0 Practice Notes: Reproducing the ERNIE 3.0 Code

I. Code Practice

(1) Basic Version

The overall framework of this experiment follows this blog post:

https://lizhiyang.blog.csdn.net/article/details/132394853

Based on the Ernie-3.0-medium-zh model, that post performs sentiment analysis on 600 movie reviews of "No More Bets" (孤注一掷) and shows a word cloud of the data. It cleans the data with regular expressions, then processes it with the DatasetBuilder class from paddlenlp.datasets, turning the data into the format [{'text_a': 'data', 'label': label}, ...].
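
The original post does not show its exact cleaning patterns, so the rules below are illustrative assumptions only; a minimal sketch of that regex-based cleanup:

```python
import re

def clean_text(text):
    # Illustrative cleanup rules; the original post's patterns may differ.
    text = re.sub(r'https?://\S+', '', text)    # drop URLs
    text = re.sub(r'@\S+', '', text)            # drop @mentions
    text = re.sub(r'\[[^\]]{1,8}\]', '', text)  # drop emoticon tags like [泪]
    text = re.sub(r'\s+', '', text)             # drop whitespace
    return text

print(clean_text('分享图片 http://t.cn/xyz @某人 不要啊~~~[泪]'))
```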

1 Data Processing

1.1 Splitting the Dataset

```python
import random

# Read the contents of the custom .txt file
with open('weibo_senti_100k.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Shuffle the data randomly
random.shuffle(lines)

# Compute the split indices
total_lines = len(lines)
train_end = int(total_lines * 0.7)
dev_end = int(total_lines * 0.9)

# Split the data
train_data = lines[:train_end]
dev_data = lines[train_end:dev_end]
test_data = lines[dev_end:]

# Write the training data to train.txt
with open('train.txt', 'w', encoding='utf-8') as file:
    file.writelines(train_data)

# Write the validation data to dev.txt
with open('dev.txt', 'w', encoding='utf-8') as file:
    file.writelines(dev_data)

# Write the test data to test.txt
with open('test.txt', 'w', encoding='utf-8') as file:
    file.writelines(test_data)
```

The dataset used here is weibo_senti_100k: ChineseNlpCorpus/datasets/weibo_senti_100k/intro.ipynb at master · SophonPlus/ChineseNlpCorpus · GitHub

The Weibo dataset is split into three parts: 70% training set, 20% validation set, and 10% test set (matching the indices in the code above). The file is UTF-8 encoded but has no column names. I opened it in Notepad++ and converted it to ANSI, renamed it to .csv so it would open normally, and after editing saved it as a Unicode .txt and converted it back to UTF-8 with Notepad++.

Notepad++ download (Baidu netdisk):

https://pan.baidu.com/s/14cRU0EjD0BiPl5doYj6pMA
[Extraction code]: kwii
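
If you would rather skip the manual Notepad++ round trip, the re-encoding can also be scripted. A minimal sketch, assuming the intermediate file is GBK-encoded (what "ANSI" usually means on a Chinese-locale Windows); adjust the file names and source encoding to your setup:

```python
# Re-encode a text file in one step instead of converting manually in Notepad++.
src, dst = 'weibo_senti_100k.csv', 'weibo_senti_100k.txt'
with open(src, 'r', encoding='gbk', errors='ignore') as fin:  # assumed source encoding
    content = fin.read()
with open(dst, 'w', encoding='utf-8') as fout:
    fout.write(content)
```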

1.2 Loading the Data

```python
# Import DatasetBuilder
from paddlenlp.datasets import DatasetBuilder

class NewsData(DatasetBuilder):
    SPLITS = {
        'train': r'train.txt',  # training set
        'dev': r'dev.txt',      # validation set
        'test': r'test.txt'     # test set
    }

    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    def _read(self, filename):
        """Read the data."""
        with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                if line == '\n':
                    continue
                data = line.strip().split("\t")  # columns are tab-separated
                label, text_a = data
                text_a = text_a.replace(" ", "")
                if label in ['0', '1']:
                    # Each record has the form {text_a, label}; adapt to your own data as needed
                    yield {"text_a": text_a, "label": label}

    def get_labels(self):
        return label_list  # class labels

from paddlenlp.datasets import load_dataset
```

Running this prints a warning:

```
D:******lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
```

[This looks like an error but is harmless; it does not affect execution.]

```python
# Define the dataset loading function
def load_dataset(name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):
    reader_cls = NewsData  # use the dataset format defined above
    print(reader_cls)
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)
    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets
```

```python
# Load the training, validation, and test sets
label_list = ['0', '1']
train_ds, dev_ds, text_t = load_dataset(splits=['train', 'dev', 'test'])
```

Output:

```
<class '__main__.NewsData'>
```

1.3 Inspecting the Data

```python
# Show the first five records
train_ds[:5]
```

Output:

```
[{'text_a': '[抓狂][抓狂][抓狂]起晚了[泪]', 'label': 0},
 {'text_a': '分享图片,不要啊~~~虽然我很喜欢周迅,可是八阿哥,你一定要等晴川啊~~[泪]', 'label': 0},
 {'text_a': '想shi的就上飞机吧???只见君去不见君还[泪]//@玉翠文章:不错很丰满,赛过杨贵妃,我喜欢,带着你超过盖茨4倍家财的嫁妆来吧[酷]请各路大仙作媒',
  'label': 0},
 {'text_a': '多谢支持!//@洪三水:[嘻嘻]画面不错,其他继续研究中。。//@曹欣Dyson:抢怪的那叫个多。-。-//@刘波BOB:钢铁侠,有没有?有没有?亮了!',
  'label': 1},
 {'text_a': '#周末节奏#美好的一天从早餐开始,黄金蛋炒饭,番茄牛尾汤[嘻嘻]', 'label': 1}]
```

2 The ERNIE 3.0 Model

2.1 Importing the Model

```python
import os
import paddle
import paddlenlp
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "ernie-3.0-medium-zh"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_classes=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Output:

```
[2024-01-26 13:15:04,015] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-medium-zh'.
[2024-01-26 13:15:04,016] [ INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
[2024-01-26 13:15:04,017] [ INFO] - Loading weights file model_state.pdparams from cache at *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
[2024-01-26 13:15:04,276] [ INFO] - Loaded weights file from disk, setting weights to model.
[2024-01-26 13:15:09,982] [ WARNING] - Some weights of the model checkpoint at ernie-3.0-medium-zh were not used when initializing ErnieForSequenceClassification: ['ernie.encoder.layers.6.self_attn.k_proj.weight', 'ernie.encoder.layers.6.self_attn.q_proj.bias', 'ernie.encoder.layers.6.linear1.weight', 'ernie.encoder.layers.6.norm2.bias', 'ernie.encoder.layers.6.self_attn.k_proj.bias', 'ernie.encoder.layers.6.self_attn.v_proj.bias', 'ernie.encoder.layers.6.self_attn.out_proj.weight', 'ernie.encoder.layers.6.self_attn.v_proj.weight', 'ernie.encoder.layers.6.norm1.bias', 'ernie.encoder.layers.6.norm1.weight', 'ernie.encoder.layers.6.linear2.bias', 'ernie.encoder.layers.6.linear1.bias', 'ernie.encoder.layers.6.self_attn.q_proj.weight', 'ernie.encoder.layers.6.linear2.weight', 'ernie.encoder.layers.6.self_attn.out_proj.bias', 'ernie.encoder.layers.6.norm2.weight']
- This IS expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2024-01-26 13:15:09,982] [ WARNING] - Some weights of ErnieForSequenceClassification were not initialized from the model checkpoint at ernie-3.0-medium-zh and are newly initialized: ['classifier.bias', 'ernie.pooler.dense.bias', 'classifier.weight', 'ernie.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2024-01-26 13:15:10,009] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'ernie-3.0-medium-zh'.
[2024-01-26 13:15:10,010] [ INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\ernie_3.0_medium_zh_vocab.txt
[2024-01-26 13:15:10,028] [ INFO] - tokenizer config file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\tokenizer_config.json
[2024-01-26 13:15:10,030] [ INFO] - Special tokens file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\special_tokens_map.json
```
The utils package needs to be installed first. `conda install utils` fails with `PackagesNotFoundError: The following packages are not available from current channels`; I followed the blog post below and copied its command into Anaconda to install the package.

Fixing "PackagesNotFoundError: The following packages are not available from current channels" (CSDN blog)

```python
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader

# Batch size and maximum sequence length for the model
batch_size = 32
max_seq_length = 128

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    Stack(dtype="int64")                               # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
```

Even after installing utils, the third line above still fails: ImportError: cannot import name 'convert_example' from 'utils' (D:\gpu\anaconda\in\envs\py38\lib\site-packages\utils\__init__.py)

Following that path shows that the package's __init__.py is empty. The problem is similar to the one in this post:

ImportError: cannot import name SVOInfo from utils (D:\Develop_Tool\Anaconda\lib\site-packages\u_baidubce importerror: cannot import name 'expando'-CSDN博客

Copying the contents of the utils.py file from the open-source AI Studio project below into the local __init__.py made everything run (see the sketch after the link):

『NLP经典项目集』02:使用预训练模型ERNIE优化情感分析 - 飞桨AI Studio星河社区 (baidu.com)
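
For reference, the two imported helpers in that project's utils.py look roughly like the following. This is a sketch based on the standard PaddleNLP example code; the project's actual file may differ in detail:

```python
import numpy as np
import paddle

def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    # Tokenize one record into input ids and segment ids (plus the label when training).
    encoded_inputs = tokenizer(text=example["text_a"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    return input_ids, token_type_ids

def create_dataloader(dataset, mode='train', batch_size=1,
                      batchify_fn=None, trans_fn=None):
    # Map each record through trans_fn, then batch and collate with batchify_fn.
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
```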

2.2 Model Training

```python
import paddlenlp as ppnlp
import paddle
from paddlenlp.transformers import LinearDecayWithWarmup

# Maximum learning rate during training
learning_rate = 5e-6
# Number of training epochs
epochs = 20  # 3
# Warmup proportion for the learning rate schedule
warmup_proportion = 0.3
# Weight decay coefficient, a regularization strategy to reduce overfitting
weight_decay = 0.1

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])
criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
```

```python
import paddle.nn.functional as F
from utils import evaluate

all_train_loss = []
all_train_accs = []
Batch = 0
Batchs = []
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f"
                  % (global_step, epoch, step, loss, acc))
            Batch += 10
            Batchs.append(Batch)
            all_train_loss.append(float(loss))  # keep plain floats so matplotlib can plot them
            all_train_accs.append(acc)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    evaluate(model, criterion, metric, dev_data_loader)
```
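
The evaluate imported above also comes from that utils.py. It looks roughly like this (again a sketch following the standard PaddleNLP example code):

```python
import numpy as np
import paddle

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    # One pass over the validation set, reporting mean loss and accuracy.
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(float(loss))
        correct = metric.compute(logits, labels)
        metric.update(correct)
    accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
```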
```python
model.save_pretrained('/home/aistudio/checkpoint')
tokenizer.save_pretrained('/home/aistudio/checkpoint')
```

Output:

```
[2024-01-26 12:48:05,721] [ INFO] - Configuration saved in /home/aistudio/checkpoint\config.json
[2024-01-26 12:48:06,184] [ INFO] - Model weights saved in /home/aistudio/checkpoint\model_state.pdparams
[2024-01-26 12:48:06,186] [ INFO] - tokenizer config file saved in /home/aistudio/checkpoint\tokenizer_config.json
[2024-01-26 12:48:06,187] [ INFO] - Special tokens file saved in /home/aistudio/checkpoint\special_tokens_map.json
('/home/aistudio/checkpoint\\tokenizer_config.json',
 '/home/aistudio/checkpoint\\special_tokens_map.json',
 '/home/aistudio/checkpoint\\added_tokens.json')
```

2.3 Plotting the Training Curves

```python
import matplotlib.pyplot as plt

def draw_train_acc(Batchs, train_accs, train_loss):
    title = "training accs"
    plt.title(title, fontsize=24)
    plt.xlabel("batch", fontsize=14)
    plt.ylabel("acc", fontsize=14)
    plt.plot(Batchs, train_accs, color='green', label='training accs')
    plt.plot(Batchs, train_loss, color='red', label='training loss')
    plt.legend()
    plt.grid()
    plt.show()

draw_train_acc(Batchs, all_train_accs, all_train_loss)
```

2.4 Model Prediction

```python
# Load the trained model parameters
import os
import paddle

params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Successfully loaded the parameters!")
```

```python
from utils import predict

batch_size = 32
data = text_t
label_map = {0: '0', 1: '1'}  # two classes, matching num_classes=2 above
results = predict(
    model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
```

(2) Enhanced Version

The version above only outputs labels, without probabilities. The open-source Baidu AI Studio project below also reports probabilities; it requires bash, or you can run it directly on AI Studio (a local alternative is sketched after the link).

飞桨AI Studio星河社区-人工智能学习与实训社区 (baidu.com)
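
If you only need probabilities locally, here is a minimal sketch using the model and tokenizer trained above (the example sentence is made up):

```python
import paddle
import paddle.nn.functional as F

# Hypothetical example sentence; substitute any text.
text = "今天心情超好[嘻嘻]"
encoded = tokenizer(text, max_seq_len=128)
input_ids = paddle.to_tensor([encoded["input_ids"]])
token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])

model.eval()
with paddle.no_grad():
    logits = model(input_ids, token_type_ids)
probs = F.softmax(logits, axis=1).numpy()[0]  # per-class probabilities
print({'negative (0)': float(probs[0]), 'positive (1)': float(probs[1])})
```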

Other ways to define a custom dataset:

如何自定义数据集 — PaddleNLP 文档 (How to build a custom dataset — PaddleNLP documentation)

A fine-tuning variant:

百度PaddleHub-ERNIE微调中文情感分析(文本分类)_paddle-ernie-CSDN博客 (Baidu PaddleHub: fine-tuning ERNIE for Chinese sentiment analysis / text classification — CSDN blog)
