LLM: The Transformers Library
You also need to install:
accelerate >= 0.12.0
datasets >= 1.8.0
sentencepiece != 0.1.92
scipy
scikit-learn
protobuf
torch >= 1.3
evaluate
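For example, in a single pip command (version pins as above; package names assumed to match PyPI):
pip install "accelerate>=0.12.0" "datasets>=1.8.0" "sentencepiece!=0.1.92" scipy scikit-learn protobuf "torch>=1.3" evaluate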
To drive run_glue.py programmatically, inject the command-line arguments via sys.argv before calling it:

import sys

# Train and evaluate for 8 epochs
sys.argv = ('run_glue.py --model_name_or_path {} --dataset_name {} --do_train --do_eval '
            '--per_device_train_batch_size 32 --per_device_eval_batch_size 32 '
            '--num_train_epochs 8 --output_dir {}').format(
                pretrained_model, dataset_name, output_model).split()

# Evaluate and predict only
sys.argv = ('run_glue.py --model_name_or_path {} --dataset_name {} --do_eval --do_predict '
            '--per_device_train_batch_size 32 --per_device_eval_batch_size 32 '
            '--num_train_epochs 1 --output_dir {}').format(
                pretrained_model, dataset_name, output_model).split()
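For the injected arguments to take effect, run_glue.py's main() must then be called in the same interpreter. A minimal sketch, assuming run_glue.py sits in the working directory:

import run_glue  # the example script from the transformers repo
run_glue.main()  # HfArgumentParser inside main() parses the arguments injected via sys.argv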
[transformers/examples/pytorch/text-classification/run_glue.py]
[transformers/examples/pytorch/text-classification at main · huggingface/transformers · GitHub]
Note: a few parameters may need to be changed:
max_seq_length=256
pad_to_max_length=False
and the metric scripts should be downloaded in advance and the loading path modified (see the evaluate section below).
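These can also be passed as extra flags instead of editing the script; a sketch, assuming HfArgumentParser's boolean handling accepts an explicit False:

sys.argv += ['--max_seq_length', '256', '--pad_to_max_length', 'False']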
The classification model is structured like any other BERT downstream task: on top of the [BERT base model] it adds a (dropout): Dropout layer and a (classifier): Linear head. The model class used is AutoModelForSequenceClassification, which here resolves to BertForSequenceClassification.
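A minimal sketch to reproduce the module tree below (the 21128-token vocabulary suggests a Chinese checkpoint such as bert-base-chinese; that checkpoint name is an assumption):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
print(model)  # prints the module hierarchy shown below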
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
In PyTorch, the collate function is what aggregates individual samples into a batch; Trainer exposes it as its data_collator argument. The default collate function converts the fetched samples into PyTorch tensors and concatenates them (this works for lists, tuples, and dictionaries), which requires every example in the batch to have the same size. Rather than padding the whole dataset to one global length, we do dynamic padding inside the collate function.
Note: on TPUs you should still pad to the model's max length, because fixed shapes are more efficient on TPUs.
To implement dynamic padding, we need a collate function that pads each batch to that batch's own maximum length. transformers provides one, DataCollatorWithPadding: it is constructed with the tokenizer, so it knows the tokenizer's padding token and padding side (left or right). A hand-rolled equivalent is sketched below.
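A hypothetical minimal version of what such a collator does (the real DataCollatorWithPadding additionally handles the padding side, return tensor types, and optional keys):

import torch

def dynamic_padding_collate(features, pad_token_id=0):
    # Pad every example to the longest sequence in *this* batch only.
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {}
    for key in ("input_ids", "attention_mask", "token_type_ids"):
        pad_value = pad_token_id if key == "input_ids" else 0
        batch[key] = torch.tensor(
            [f[key] + [pad_value] * (max_len - len(f[key])) for f in features])
    batch["labels"] = torch.tensor([f["label"] for f in features])
    return batch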
Example 1: using it directly
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}  # drop string columns
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})  # every tensor is padded to this batch's max length
Example 2: passing it as a Trainer argument
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset,
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer, data_collator=data_collator)
The data_collator parameter defaults to [`default_data_collator`] if no `tokenizer` is provided, and to an instance of [`DataCollatorWithPadding`] otherwise. In the Trainer source this reads:
default_collator = default_data_collator if tokenizer is None else DataCollatorWithPadding(tokenizer)
self.data_collator = data_collator if data_collator is not None else default_collator
Trainer then applies it when building the DataLoader:
def get_train_dataloader(self) -> DataLoader:
    ...
    return DataLoader(train_dataset, batch_size=self._train_batch_size,
                      collate_fn=data_collator, drop_last=self.args.dataloader_drop_last, ...)
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
import numpy as np
import evaluate
from transformers import EvalPrediction

metric = evaluate.combine(["./metrics/accuracy", "./metrics/precision", "./metrics/recall", "./metrics/f1"])

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)  # is_regression comes from the task setup, as in run_glue.py
    result = metric.compute(predictions=preds, references=p.label_ids)
    return result
Note: if compute_metrics is not configured, the Trainer has no way to score predictions and will only report loss. See the evaluate library section below.
Transformers provides the Trainer class to help with fine-tuning. Once data preprocessing is done, you can train the model with Trainer:
trainer = Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset, eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer, data_collator=data_collator,
)
trainer.train()
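After training, evaluation and prediction go through the same object; a brief sketch:

eval_metrics = trainer.evaluate()             # runs compute_metrics over eval_dataset
predictions = trainer.predict(eval_dataset)   # returns predictions, label_ids and metrics
print(eval_metrics, predictions.metrics)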
Note:
1 Trainer supports multi-GPU/TPU out of the box, as well as mixed-precision training: set fp16=True in the TrainingArguments.
2 Trainer's default optimizer is AdamW [最优化方法:深度学习最优化方法_深度学习 权重优化].
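For note 1, enabling mixed precision is a single flag (a sketch; fp16 assumes a CUDA-capable GPU):

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch", fp16=True)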
train_result = trainer.train(resume_from_checkpoint=checkpoint)
Level 1
site-packages/transformers/trainer.py:1645
return inner_training_loop(args=args, resume_from_checkpoint=...)
    train_dataloader = self.get_train_dataloader()
    for epoch in range(epochs_trained, num_train_epochs):
        for step, inputs in enumerate(epoch_iterator):
            # site-packages/transformers/trainer.py:1938
            tr_loss_step = self.training_step(model, inputs)
Level 2
site-packages/transformers/trainer.py:2733
transformers.trainer.Trainer.training_step
model.train()
compute_loss(model, inputs)
outputs = BertForSequenceClassification.forward(**inputs)
Level 3
anaconda3/Lib/site-packages/transformers/models/bert/modeling_bert.py:1595
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
self.accelerator.backward(loss)
self.optimizer.step()
self.lr_scheduler.step()  # delay optimizer scheduling until metrics are generated
model.zero_grad()
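Flattened out, each training step is roughly the following manual loop (a sketch of what Trainer does, not its actual code; model, train_dataloader and num_train_epochs are assumed from the snippets above):

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)  # forward pass; loss is computed inside when labels are present
        loss = outputs.loss
        loss.backward()           # Trainer wraps this as self.accelerator.backward(loss)
        optimizer.step()
        model.zero_grad()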
For a fully custom, lower-level training loop, and for distributed training with accelerate, see the references.
Evaluate solves this by computing the final metric only on the first node. Predictions and references are computed on each node separately and fed to the metric, where they are stored temporarily in Apache Arrow tables, keeping them off GPU and CPU memory. When you are ready to compute() the final metric, the first node gets access to the predictions and references stored on all other nodes; once it has gathered them all, compute() performs the final evaluation. This lets Evaluate run distributed prediction, which matters for evaluation speed in distributed setups, and also makes it possible to use complex, non-additive metrics without wasting precious GPU or CPU memory. [HuggingFace evaluate库总结 - 知乎]
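In code, the process topology is passed at load time; a sketch, assuming an 8-process job where rank is this process's index and prediction_batches is a hypothetical per-node stream of (predictions, references) pairs:

import evaluate

metric = evaluate.load("accuracy", num_process=8, process_id=rank)
for preds, refs in prediction_batches:
    metric.add_batch(predictions=preds, references=refs)
result = metric.compute()  # returns the metric on the first process, None elsewhere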
pip install evaluate
[Python的包管理工具pip_-柚子皮-的博客-CSDN博客]
[https://huggingface.co/docs/evaluate/package_reference/main_classes#main-classes]
1 Loading from a custom metric script:
It is recommended to download the metric's .py file from [evaluate-metric (Evaluate Metric)] and load it locally,
e.g. download accuracy.py from evaluate-metric/accuracy at main into a ./metrics directory next to the training script:
metric = evaluate.load("./metrics/accuracy")
Multiple metric scripts can be loaded at once [evaluate.combine][Evaluate multiple metrics]:
metric = evaluate.combine(["./metrics/accuracy", "./metrics/precision", "./metrics/recall", "./metrics/f1"])
Once a local run has succeeded the scripts are cached, so the following also works, but it is not recommended:
metric = evaluate.combine(["accuracy", "precision", "recall", "f1"])
2 Loading from the HuggingFace Hub:
metric = evaluate.load("accuracy")
If the connection fails, this may raise: FileNotFoundError: Couldn't find a module script at **run_dir**\accuracy\accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.
3 The older way, which may be deprecated, is datasets.load_metric:
from datasets import load_metric
metric = load_metric('glue', 'mrpc')
[Loading a Metric — datasets 1.0.1 documentation]
[python 3.x - Calculate precision, recall, f1 score for custom dataset for Huggingface library]
if is_regression:
    metric = evaluate.load("metrics/mse")
else:
    metric = evaluate.combine(["metrics/accuracy", "metrics/precision", "metrics/recall", "metrics/f1"])

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    result = metric.compute(predictions=preds, references=p.label_ids)
    return result
Alternatively, MSE can also be computed directly (it is unclear whether this interacts with distributed evaluation):
def compute_metrics(p: EvalPrediction):
    ...
    if is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return metric.compute(predictions=preds, references=p.label_ids)
During training:
trainer = Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset, eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer, data_collator=data_collator,
)
During evaluation or prediction (ORTModel and evaluation_loop appear to come from the optimum ONNX Runtime examples):
ort_model = ORTModel(
    os.path.join(model_args.model_name_or_path, model_args.model_name),
    execution_provider=optim_args.execution_provider,
    compute_metrics=compute_metrics, label_names=["label"])
outputs = evaluation_loop(ort_model, eval_dataset)

ort_model = ORTModel(
    os.path.join(model_args.model_name_or_path, model_args.model_name),
    execution_provider=optim_args.execution_provider, label_names=["label"])
outputs = evaluation_loop(ort_model, predict_dataset)