
Fine-Tuning a Pretrained ERNIE Model for Sentence Classification in Practice

1. Preparing the dataset

First, prepare your own dataset. I had ChatGPT generate some examples here:

{
  "title": "尊嘟假嘟",
  "data": [
    {"text": "我爱黑丝美女", "labels": 2},
    {"text": "我爱白丝美女", "labels": 1},
    {"text": "黑丝美女真性感", "labels": 2},
    {"text": "白丝美女也很迷人", "labels": 1},
    {"text": "网袜让美腿更加迷人", "labels": 0},
    {"text": "黑丝和白丝都很好看", "labels": 3},
    {"text": "黑丝美女让我心动", "labels": 3},
    {"text": "白丝美女让我忍不住多看几眼", "labels": 1},
    {"text": "黑丝和白丝哪个更好看呢?", "labels": 3},
    {"text": "我喜欢穿黑丝的女孩", "labels": 2},
    {"text": "我觉得白丝更适合我", "labels": 1},
    {"text": "黑丝和白丝都有不同的魅力", "labels": 3}
  ]
}

The file is named dummydata. Note that despite the .jsonl extension used below, the content is a single JSON object with a "data" field, which is what the field argument of load_dataset('json', ...) expects. The labels mean:

0: 网袜 (fishnet stockings)
1: 白丝 (white stockings)
2: 黑丝 (black stockings)
3: 白丝 + 黑丝 (both)
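
For later use at prediction time, it helps to keep this label mapping in code as well. A minimal sketch (the ID2LABEL name is my own addition, not part of the original project):

ID2LABEL = {
    0: "网袜",       # fishnet stockings
    1: "白丝",       # white stockings
    2: "黑丝",       # black stockings
    3: "白丝+黑丝",  # both
}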

1. Import the required packages

import evaluate
import torch.utils.data
from datasets import load_dataset, DatasetDict, Dataset
from transformers import DataCollatorWithPadding, AutoTokenizer
from torch.utils.data import DataLoader

2. Load the dataset with the Datasets library

1. Read the custom JSON dataset file

This is the location of my JSON file: F:\bert意图识别\data\dummydata.jsonl

def load_datasets(test_size: float = 0.2) -> DatasetDict:
    assert 0 < test_size < 1, 'test_size must be in the range (0, 1)'
    data = load_dataset('json', data_files='../data/dummydata.jsonl', field='data')
    train_test_valid = data['train'].train_test_split(test_size=test_size)
    dataset = DatasetDict({
        "train": train_test_valid["train"],
        "test": train_test_valid["test"],
        "valid": train_test_valid["train"],  # the validation split reuses the training split here
    })
    return dataset

Here train_test_split automatically splits the data into a training set and a test set according to the test_size ratio.

The returned dataset is a DatasetDict containing a training set, a test set, and a validation set (where the validation set is identical to the training set). If you want a genuinely held-out validation set instead, see the sketch below.
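
A minimal sketch of a proper three-way split, under the same file layout as above (the function name and valid_size parameter are my own):

from datasets import load_dataset, DatasetDict

def load_datasets_three_way(test_size: float = 0.2, valid_size: float = 0.1) -> DatasetDict:
    # First carve off the test set, then carve a validation set out of the remainder.
    data = load_dataset('json', data_files='../data/dummydata.jsonl', field='data')
    train_test = data['train'].train_test_split(test_size=test_size)
    train_valid = train_test['train'].train_test_split(test_size=valid_size)
    return DatasetDict({
        "train": train_valid["train"],
        "valid": train_valid["test"],
        "test": train_test["test"],
    })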

2. Process the data and return dataloaders

def get_dataloaders(tokenizer, batch_size) -> dict[str, torch.utils.data.DataLoader]:
    # Use the tokenizer to convert the text into token ids, with padding and truncation;
    # the output format is PyTorch tensors
    tokenize_func = lambda x: tokenizer(x["text"], padding=True, truncation=True, return_tensors="pt")
    # load_datasets is the function defined above
    dataset = load_datasets()
    # Apply the tokenizer over the whole dataset with DatasetDict.map
    tokenized_datasets = dataset.map(tokenize_func, batched=True)
    # Drop the raw text column, keeping only the fields a BERT-style model accepts
    tokenized_datasets = tokenized_datasets.remove_columns(["text"])
    # We are using PyTorch
    tokenized_datasets.set_format("torch")
    # Build dataloaders with a padding collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["valid"], batch_size=batch_size, collate_fn=data_collator
    )
    test_dataloader = DataLoader(
        tokenized_datasets["test"], batch_size=batch_size, collate_fn=data_collator
    )
    return {
        "train": train_dataloader,
        "valid": eval_dataloader,
        "test": test_dataloader,
    }

This function returns three iterable PyTorch DataLoaders that can be fed directly into training and evaluation; a quick sanity check is sketched below.
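
A minimal sketch, assuming the functions above live in dataset.py and that the checkpoint path below (a placeholder) points at your local ERNIE model:

from transformers import AutoTokenizer
from dataset import get_dataloaders

tokenizer = AutoTokenizer.from_pretrained("../model/embedding_model")  # placeholder path
loaders = get_dataloaders(tokenizer, batch_size=4)
batch = next(iter(loaders["train"]))
# Expect input_ids, token_type_ids, attention_mask, and labels, all torch tensors
for k, v in batch.items():
    print(k, tuple(v.shape))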

2. Training: train.py

1. Import the required packages

import os
import torch
import warnings
import evaluate
from tqdm.auto import tqdm
from progressbar import ProgressBar
from transformers import DataCollatorWithPadding, AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from dataset import load_datasets, get_dataloaders
from utils import save_model

2. Set environment variables

# OS HYPER GLOBAL PARAMETERS
warnings.filterwarnings("ignore")
torch.backends.cudnn.enabled = True
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # select which GPU to use; I only have one card
DEVICE = 'cuda'  # the device used later for the model and batches
# The remaining settings make CUDA bugs easier to debug, plus the explicit cuDNN switch
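
If the script might also run on a machine without a GPU, a defensive variant of the device selection (my own addition, not in the original script):

import torch

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'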

3. Model hyperparameters and save paths

# - - - - - - - - - - - - - - - - - - - - - - -
# SAVE MODEL SETTINGS
CHECK_POINT_PATH = "../model/embedding_model"       # the pretrained ernie-3.0-base-zh checkpoint
BEST_MODEL_SAVE_DIR = "../output/model/best_model"  # the best model seen during training
LAST_MODEL_SAVE_DIR = "../output/model/last_model"  # the model saved after the final epoch
# - - - - - - - - - - - - - - - - - - - - - - -
# MODEL HYPER PARAMETERS
LEARN_RATE = 5e-5  # learning rate
NUM_EPOCHS = 12    # number of epochs
BATCH_SIZE = 16
NUM_LABELS = 4     # four-way classification
# - - - - - - - - - - - - - - - - - - - - - - -

4. Train, evaluate, and save

1. Pre-training setup (all of the following code runs under the if __name__ == "__main__": guard)

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(CHECK_POINT_PATH)
    # Build the dataloaders once; calling get_dataloaders twice would re-split the data
    # randomly, so train and valid would come from different splits
    dataloaders = get_dataloaders(tokenizer, BATCH_SIZE)
    train_dataloader, eval_dataloader = dataloaders['train'], dataloaders['valid']
    # load checkpoint
    model = AutoModelForSequenceClassification.from_pretrained(CHECK_POINT_PATH, num_labels=NUM_LABELS)
    # define an optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARN_RATE)
    num_training_steps = len(train_dataloader) * NUM_EPOCHS
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
    # evaluation settings
    metric = evaluate.load("metric.py", module_type="metric")
    best_accuracy = 0.0
    progress_bar = tqdm(range(num_training_steps))
    progress_bar.set_description('training')
    # move the model to the device
    model.to(DEVICE)

If you want to use num_workers in the DataLoader for multi-process data loading, it must be placed under if __name__ == "__main__":!! A sketch of that pattern follows below.
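
A minimal sketch, reusing tokenized_datasets and data_collator from the get_dataloaders function above (the num_workers value is illustrative):

from torch.utils.data import DataLoader

if __name__ == "__main__":
    train_dataloader = DataLoader(
        tokenized_datasets["train"],
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
        num_workers=4,  # worker processes re-import this module, hence the guard (especially on Windows)
    )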

Here we first load the tokenizer and the model, then fetch the training and validation dataloaders from the get_dataloaders function defined earlier.

AdamW is used as the optimizer; a common optional refinement is sketched below.
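
When fine-tuning BERT-style models, a common refinement (my own addition, not in the original script) is to exclude biases and LayerNorm weights from weight decay:

no_decay = ["bias", "LayerNorm.weight"]
grouped_params = [
    # decay everything except biases and LayerNorm weights
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(grouped_params, lr=LEARN_RATE)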

Define the evaluation metric:

metric = evaluate.load("metric.py", module_type="metric")

Here I load a local metric script. You could also use evaluate.load("accuracy") to fetch the metric from the Hugging Face Hub, but my network connection is poor, so I simply copied it locally.
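
A small convenience pattern (my own sketch) if you want to try the Hub first and fall back to the local copy:

import evaluate

try:
    metric = evaluate.load("accuracy")  # from the Hugging Face Hub
except Exception:
    metric = evaluate.load("metric.py", module_type="metric")  # local copy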

Here is metric.py for reference:

# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Accuracy metric."""

import datasets
from sklearn.metrics import accuracy_score

import evaluate

_DESCRIPTION = """
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
"""

_KWARGS_DESCRIPTION = """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights. Defaults to None.
Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `False`. A higher score means higher accuracy.
Examples:
    Example 1 - A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}
    Example 2 - The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
        >>> print(results)
        {'accuracy': 3.0}
    Example 3 - The same as Example 1, except with `sample_weight` set.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
        >>> print(results)
        {'accuracy': 0.8778625954198473}
"""

_CITATION = """
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}
"""

@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Accuracy(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("int32")),
                    "references": datasets.Sequence(datasets.Value("int32")),
                }
                if self.config_name == "multilabel"
                else {
                    "predictions": datasets.Value("int32"),
                    "references": datasets.Value("int32"),
                }
            ),
            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
        )

    def _compute(self, predictions, references, normalize=True, sample_weight=None):
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }

2. Train, evaluate, and save

1) Training

for epoch in range(NUM_EPOCHS):
    pbar = ProgressBar().start()  # start the progress bar
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        total_loss += loss.item()
        progress_bar.update(1)  # one tqdm step per training batch, matching num_training_steps
    average_loss = total_loss / len(train_dataloader)  # average loss over the epoch
2) Evaluation

# noinspection DuplicatedCode
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
result = metric.compute()
accuracy = result['accuracy']
print(f"epoch: {epoch}, average_loss: {average_loss:.4f}, accuracy: {accuracy:.4f}")
3) Saving the model

1. First, define a model-saving helper (this is the save_model imported from utils):

import logging

def save_model(tokenizer, model, save_dir):
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)
    logging.info('save done')
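
Because save_pretrained writes both the tokenizer and the model into the same directory, reloading later is symmetric. A minimal sketch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("../output/model/best_model")
model = AutoModelForSequenceClassification.from_pretrained("../output/model/best_model")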

2. The saving logic inside the training loop:

if accuracy > best_accuracy:
    best_accuracy = accuracy
    save_model(tokenizer, model, BEST_MODEL_SAVE_DIR)
save_model(tokenizer, model, LAST_MODEL_SAVE_DIR)
pbar.finish()
4) The complete train.py
import os
import torch
import warnings
import evaluate
from tqdm.auto import tqdm
from progressbar import ProgressBar
from transformers import DataCollatorWithPadding, AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from dataset import load_datasets, get_dataloaders
from utils import save_model

# OS HYPER GLOBAL PARAMETERS
warnings.filterwarnings("ignore")
torch.backends.cudnn.enabled = True
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
DEVICE = 'cuda'
# - - - - - - - - - - - - - - - - - - - - - - -
# SAVE MODEL SETTINGS
CHECK_POINT_PATH = "../model/embedding_model"
BEST_MODEL_SAVE_DIR = "../output/model/best_model"
LAST_MODEL_SAVE_DIR = "../output/model/last_model"
# - - - - - - - - - - - - - - - - - - - - - - -
# MODEL HYPER PARAMETERS
LEARN_RATE = 5e-5
NUM_EPOCHS = 12
BATCH_SIZE = 16
NUM_LABELS = 4
# - - - - - - - - - - - - - - - - - - - - - - -

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(CHECK_POINT_PATH)
    # build dataloaders once so train and valid come from the same split
    dataloaders = get_dataloaders(tokenizer, BATCH_SIZE)
    train_dataloader, eval_dataloader = dataloaders['train'], dataloaders['valid']
    # load checkpoint
    model = AutoModelForSequenceClassification.from_pretrained(CHECK_POINT_PATH, num_labels=NUM_LABELS)
    # define an optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARN_RATE)
    num_training_steps = len(train_dataloader) * NUM_EPOCHS
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
    # evaluation settings
    metric = evaluate.load("metric.py", module_type="metric")
    best_accuracy = 0.0
    progress_bar = tqdm(range(num_training_steps))
    progress_bar.set_description('training')
    # move the model to the device
    model.to(DEVICE)
    # train & eval & save
    for epoch in range(NUM_EPOCHS):
        pbar = ProgressBar().start()
        model.train()
        total_loss = 0
        for batch in train_dataloader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            total_loss += loss.item()
            progress_bar.update(1)
        average_loss = total_loss / len(train_dataloader)
        # noinspection DuplicatedCode
        model.eval()
        for batch in eval_dataloader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            metric.add_batch(predictions=predictions, references=batch["labels"])
        result = metric.compute()
        accuracy = result['accuracy']
        print(f"epoch: {epoch}, average_loss: {average_loss:.4f}, accuracy: {accuracy:.4f}")
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            save_model(tokenizer, model, BEST_MODEL_SAVE_DIR)
        save_model(tokenizer, model, LAST_MODEL_SAVE_DIR)
        pbar.finish()

3. Prediction: predict.py

This is largely the same as the training loop. I went straight to evaluation here and did not actually print the predicted labels; a sketch for that follows after the script.

import evaluate
import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from dataset import get_dataloaders
from progressbar import ProgressBar

# DEFINE MODEL PATH
# ------------------------------------------------------
CHECK_POINT_PATH = '../output/model/best_model'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
DEVICE = 'cuda'
BATCH = 8
NUM_LABELS = 4
# ------------------------------------------------------
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(CHECK_POINT_PATH)
# load test data
test_dataloader = get_dataloaders(tokenizer, batch_size=BATCH)['test']
# init evaluate
accuracy = evaluate.load('metric.py', module_type='metric')
model = AutoModelForSequenceClassification.from_pretrained(CHECK_POINT_PATH, num_labels=NUM_LABELS)

if __name__ == '__main__':
    pbar = ProgressBar().start()
    model.to(DEVICE)
    # noinspection DuplicatedCode
    model.eval()
    for batch in test_dataloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        print(outputs.logits)  # raw logits, for inspection
        predictions = torch.argmax(logits, dim=-1)
        accuracy.add_batch(predictions=predictions, references=batch["labels"])
    result = accuracy.compute()
    print(f"Accuracy: {result['accuracy'] * 100:.2f}%")
    pbar.finish()
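
If you do want human-readable predictions for a single sentence, here is a minimal sketch reusing tokenizer, model, and DEVICE from the script above (the ID2LABEL mapping is my own, matching the label legend from the dataset section):

ID2LABEL = {0: "网袜", 1: "白丝", 2: "黑丝", 3: "白丝+黑丝"}

sentence = "黑丝美女真性感"
inputs = tokenizer(sentence, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(torch.argmax(logits, dim=-1))
print(sentence, "->", ID2LABEL[pred_id])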

That's all for this article. Remember to follow!
