pipeline() runs inference with a pretrained model.
Supported task types
The default pretrained model downloaded for each task type is defined in SUPPORTED_TASKS in the Transformers source code [transformers/SUPPORTED_TASKS].
Commonly used tasks include "text-classification", "question-answering", "summarization", "translation", "text-generation", and "conversational".
Note: task aliases are defined as TASK_ALIASES = {"sentiment-analysis": "text-classification", "ner": "token-classification", "vqa": "visual-question-answering"}
Batching is usually unnecessary at inference time. By default, pipelines will not batch inference, for reasons explained in detail in the docs: batching is not necessarily faster, and can actually be quite a bit slower in some cases.
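If you do want batching anyway, pipeline calls accept a batch_size argument. A minimal sketch, assuming the model_path and sentences placeholders from the example below; batch_size=8 is an arbitrary choice:
- from transformers import pipeline
-
- classifier = pipeline(task="text-classification", model=model_path)
- # batch_size groups inputs per forward pass; benchmark it, since it can also slow things down
- predictions = classifier(sentences, batch_size=8)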
Example 1: text classification
- sentences = ['...',...,'...']
- model_path = '...'
-
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
-
- model = AutoModelForSequenceClassification.from_pretrained(model_path)
- tokenizer = AutoTokenizer.from_pretrained(model_path)
-
- classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
- prediction = classifier(sentences)
- print(prediction)
- # [{'label': 0, 'score': 0.9995700716972351}...{'label': 1, 'score': 0.6509987115859985}]
Note:
1 In the text-classification example, classifier = pipeline(task=task, model=model, tokenizer=tokenizer) returns an object of type <class 'transformers.pipelines.text_classification.TextClassificationPipeline'>.
2 prediction = classifier(sentences)
When TextClassificationPipeline is called (__call__), it returns results of the form [... {'label': 1, 'score': 0.6509...}].
The raw model output is actually logits, e.g. tensor([[-0.2518, 0.3716]]), which are converted into probabilities via torch.softmax(torch.tensor([[-0.2518, 0.3716]]), dim=1), giving tensor([[0.3490, 0.6510]]). [The logits-to-score conversion is implemented in TextClassificationPipeline.postprocess]
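A small worked example of that logits-to-probability step, using the logit values quoted above:
- import torch
-
- logits = torch.tensor([[-0.2518, 0.3716]])   # raw model output (logits)
- probs = torch.softmax(logits, dim=1)         # tensor([[0.3490, 0.6510]])
- label_id = int(torch.argmax(probs, dim=1))   # 1
- score = float(probs[0, label_id])            # ~0.6510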
[Official docs: Pipelines for inference] [Pipeline usage]
[Using pipelines on a dataset]
ONNX: Fine tune your model for size, accuracy, resource utilization, and performance. [Learn more about ONNX]
HuggingFace Optimum is an extension of Transformers that provides a unified API for performance-optimization tools, enabling maximum efficiency when training and running models on accelerated hardware, including toolkits for optimized performance on Graphcore IPUs and Habana Gaudi. Through its exporters module, Optimum can export models from PyTorch or TensorFlow to serialized formats such as ONNX and TFLite. It also provides a set of performance-optimization tools for training and running models at maximum efficiency on the target hardware: accelerated training, quantization, graph optimization, and now also inference, with support for Transformers pipelines. Deploy your ONNX model using runtimes designed to accelerate inferencing. [Learn more about Optimum]
Note: ONNX is not a runtime itself; ONNX is only a representation, which is used together with runtimes such as ONNX Runtime (ORT). [List of supported accelerators]
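To make that distinction concrete, here is a minimal sketch of running an exported ONNX graph directly with ONNX Runtime; the "onnx/" directory and model.onnx file name follow the export example below, and the input handling is an assumption:
- import numpy as np
- import onnxruntime as ort
- from transformers import AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained("onnx/")
- session = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
-
- enc = tokenizer("some text", return_tensors="np")
- input_names = {i.name for i in session.get_inputs()}
- # keep only the inputs the graph expects, cast to int64
- onnx_inputs = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}
- logits = session.run(None, onnx_inputs)[0]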
When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an intermediate representation) which represents the flow of data through the neural network.
Pseudo ONNX graph, visualized with NETRON (figure).
A typical developer journey for leveraging Optimum with ONNX:
$ pip install optimum[onnxruntime]
Note: alternatively, you can install the required packages individually:
pip install optimum
pip install onnx
pip install onnxruntime # CPU build
pip install onnxruntime-gpu # GPU build
The optimum[onnxruntime] install above brings in optimum, onnx, and onnxruntime, and additionally pulls in packages such as evaluate and responses.
- model_path = '...'
-
- from optimum.onnxruntime import ORTModelForSequenceClassification
- from transformers import AutoTokenizer
-
- onnx_path = "onnx/"
-
- # Load a model from transformers and export it to ONNX
- ort_model = ORTModelForSequenceClassification.from_pretrained(model_path, export=True)
- tokenizer = AutoTokenizer.from_pretrained(model_path)
-
- # save onnx checkpoint and tokenizer
- ort_model.save_pretrained(onnx_path)
- tokenizer.save_pretrained(onnx_path)
You only need to replace the AutoModelForXxx class with the corresponding ORTModelForXxx class.
ORTModelForXxx.from_pretrained parameters:
from_transformers (bool, defaults to False) — Defines whether the provided model_id contains a vanilla Transformers checkpoint. This argument is deprecated, however: "The argument `from_transformers` is deprecated, and will be removed in optimum 2.0. Use `export` instead." In the code, export = from_transformers.
export=True — To load a PyTorch checkpoint and convert the format on the fly, set export=True when loading your model; this exports it to the ONNX format. If the model comes from the Transformers library, you need to pass export=True; in the code, from_pretrained_method = cls._from_transformers if export else cls._from_pretrained.
Note: in the output directory of the exported ONNX model, apart from the original important files, pytorch_model.bin is simply replaced by model.onnx. [quicktour#onnx-runtime] [Export to ONNX with-optimumonnxruntime] For other export approaches see [Export to ONNX].
Example 1: classification task
The output is the same as with the original Transformers model.
- sentences = ['...',...,'...']
-
- from optimum.onnxruntime import ORTModelForSequenceClassification
- from transformers import AutoTokenizer, pipeline
-
- onnx_path = "onnx/"
- ort_model = ORTModelForSequenceClassification.from_pretrained(onnx_path)
- tokenizer = AutoTokenizer.from_pretrained(onnx_path)
-
- # test the model using the transformers pipeline
- ort_classifier = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer)
- prediction = ort_classifier(sentences)
- print(prediction)
- # [{'label': 0, 'score': 0.9995700716972351}...{'label': 1, 'score': 0.6509921550750732}]
Example 2: question answering (QA)
- task = "question-answering"
- # model/tokenizer here are a QA ORTModel and its tokenizer (loaded as in Example 3)
- # test the model using the transformers pipeline, with handle_impossible_answer for squad_v2
- optimum_qa = pipeline(task, model=model, tokenizer=tokenizer, handle_impossible_answer=True)
- prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
- print(prediction)
- # {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}
Example 3: QA task: running inference with the ONNX model directly
- from transformers import AutoTokenizer
- from optimum.onnxruntime import ORTModelForQuestionAnswering
-
- tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
- model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")
-
- inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
- outputs = model(**inputs)
After saving the ONNX checkpoint, we can apply graph optimizations with ORTOptimizer, such as operator fusion (merging several operators into one larger operator to reduce the number of operator nodes and the amount of computation in the graph; an operator here is a function that performs some operation, e.g. convolution, pooling, or an activation function) and constant folding (evaluating constant values ahead of time and embedding them directly in the graph, reducing the number of constant nodes and the computation required). These optimizations reduce the complexity of the computational graph, the model's memory footprint, and its compute time, speeding up latency and inference. The optimizer can also automatically identify and resolve performance bottlenecks in the model to further improve performance.
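As a rough conceptual illustration of constant folding (plain Python, not Optimum's actual implementation): an expression that depends only on constants is evaluated once at optimization time instead of on every forward pass.
- import numpy as np
-
- # before folding: the constant subexpression is recomputed on every call
- def forward_unfolded(x, w):
-     scale = np.float32(1.0 / 255.0) * np.float32(2.0)  # constants only
-     return (x * scale) @ w
-
- # after folding: the constant has already been evaluated and embedded
- FOLDED_SCALE = np.float32(1.0 / 255.0) * np.float32(2.0)
- def forward_folded(x, w):
-     return (x * FOLDED_SCALE) @ w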
- model_path = '...'
-
- from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
- from optimum.onnxruntime.configuration import OptimizationConfig
-
- optimized_onnx_path = "optimized_onnx/"
- # Export the model
- ort_model = ORTModelForSequenceClassification.from_pretrained(model_path, export=True)
-
- # Create the optimizer
- optimizer = ORTOptimizer.from_pretrained(ort_model)
- # Create the optimization configuration containing all the optimization parameters
- optimization_config = OptimizationConfig(optimization_level=1, optimize_for_gpu=False,
- enable_transformers_specific_optimizations=True)
- # Optimize the model
- optimizer.optimize(optimization_config=optimization_config, save_dir=optimized_onnx_path,
- use_external_data_format=False, one_external_file=True)
Note: for the OptimizationArguments and OnnxExportArguments parameters, see [OptimizationArguments parameters]; note that optimize_with_onnxruntime_only will be deprecated soon, use enable_transformers_specific_optimizations instead (here enable_transformers_specific_optimizations is set to True).
Besides the model being saved as model_optimized.onnx, an additional ort_config.json file is produced.
The output is the same as with the original Transformers model.
- sentences = ['...',...,'...']
-
- from optimum.onnxruntime import ORTModelForSequenceClassification
- from transformers import AutoTokenizer, pipeline
-
- optimized_onnx_path = "optimized_onnx/"
- optimized_model = ORTModelForSequenceClassification.from_pretrained(optimized_onnx_path)
- tokenizer = AutoTokenizer.from_pretrained(optimized_onnx_path)
-
- # test the model using the transformers pipeline
- optimized_classifier = pipeline(task="text-classification", model=optimized_model, tokenizer=tokenizer)
- prediction = optimized_classifier(sentences)
- print(prediction)
- # [{'label': 0, 'score': 0.9995700716972351}...{'label': 1, 'score': 0.6509921550750732}]
[examples/onnxruntime/optimization/text-classification/run_glue.py#L265]
ORTQuantizer
After optimizing the model, we can further accelerate it by quantizing it with ORTQuantizer. Concretely, it uses quantization techniques such as dynamic and static quantization [Quantization] to convert the floating-point numbers in the model into integers or low-precision floats, shrinking the model and speeding up latency and inference. Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32). Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic.
Some of the principles behind quantization: first find the maximum absolute value in the tensor and compute the scaling factor that maps this maximum to 127, then scale the whole tensor by that factor and round to integers. Outlier and non-outlier parts of the matrix can be computed separately and the results merged afterwards. [8bit Quantization]
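A minimal numpy sketch of that absmax scheme (illustrative only; this is not what ORTQuantizer does internally, and it ignores the separate outlier handling):
- import numpy as np
-
- def absmax_quantize(x):
-     # scale so that the largest absolute value maps to 127 (int8 range)
-     scale = 127.0 / np.max(np.abs(x))
-     return np.round(x * scale).astype(np.int8), scale
-
- x = np.array([0.1, -0.8, 1.2, -2.4], dtype=np.float32)
- x_q, scale = absmax_quantize(x)
- # x_q ≈ [5, -42, 64, -127]; x_q.astype(np.float32) / scale recovers x up to rounding error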
- from optimum.onnxruntime.configuration import AutoQuantizationConfig, QuantizationConfig
- from optimum.onnxruntime.configuration import QuantFormat, QuantizationMode, QuantType
- from optimum.onnxruntime.preprocessors import QuantizationPreprocessor
- from optimum.onnxruntime import ORTQuantizer
-
- # onnx_path = "onnx/"
- # quantized_path = "quantized_onnx/"
- onnx_path = "optimized_onnx/"
- quantized_path = "quantized_opt_onnx/"
-
- # Define the quantization methodology
- # Create the quantization configuration containing all the quantization parameters
- # qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
- qconfig = QuantizationConfig(
- is_static=False,
- format=QuantFormat.QOperator, mode=QuantizationMode.IntegerOps,
- activations_dtype=QuantType.QUInt8, weights_dtype=QuantType.QInt8,
- per_channel=False, reduce_range=False, operators_to_quantize=["MatMul", "Add"])
-
- quantizer = ORTQuantizer.from_pretrained(onnx_path)
-
- # Apply quantization on the model
- # quantizer.quantize(save_dir=quantized_onnx_path, quantization_config=qconfig)
- quantizer.quantize(
- save_dir=quantized_path, calibration_tensors_range=None,
- quantization_config=qconfig, preprocessor=QuantizationPreprocessor(),
- use_external_data_format=False)
Note: 1 ORTQuantizer.from_pretrained() can only read an ONNX model loaded via ORT, or an ONNX model directory; it cannot read the original Transformers model directly.
2 On an Intel Cascade Lake CPU with avx512 support, you can use avx512_vnni, i.e. qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True).
The file is saved as model_optimized_quantized.onnx. The model size roughly halves.
You can also skip the optimization step and quantize the original ONNX model directly; the resulting model is slightly smaller, but it is unclear how much the two differ in actual predictions (not tested).
Example: classification
The output changes slightly compared with the original Transformers model.
- sentences = ['...',...,'...']
-
- from optimum.onnxruntime import ORTModelForSequenceClassification
- from transformers import pipeline, AutoTokenizer
-
- # quantized_path = "quantized_onnx/"
- quantized_path = "quantized_opt_onnx/"
- model = ORTModelForSequenceClassification.from_pretrained(quantized_path) # , file_name="model_quantized.onnx"
- tokenizer = AutoTokenizer.from_pretrained(quantized_path)
- quantized_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
- prediction = quantized_classifier(sentences)
- print(prediction)
- # quantized: [{'label': 0, 'score': 0.9995672106742859}, ..., {'label': 1, 'score': 0.6428356766700745}]
- # quantized_opt: [{'label': 0, 'score': 0.999567449092865},..., {'label': 1, 'score': 0.6610999703407288}]
[examples/onnxruntime/quantization/text-classification/run_glue.py#L395]
For inference, you only need to install transformers and optimum[onnxruntime] (or transformers, optimum, onnxruntime, and onnx).
Note: the parameters of ORTModelForSequenceClassification.from_pretrained differ across optimum versions; for example, 1.4.1 requires file_name to be specified while 1.9.0 does not, otherwise you get: huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'... Use `repo_type` argument if needed.
Example: QA
- # load quantized model
- quantized_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-quantized.onnx")
-
- # test the quantized model using the transformers pipeline
- quantized_optimum_qa = pipeline(task, model=quantized_model, tokenizer=tokenizer, handle_impossible_answer=True)
- prediction = quantized_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
- print(prediction)
- # {'score': 0.9246969819068909, 'start': 11, 'end': 18, 'answer': 'Philipp'}
For example, in the QA model test in [Accelerated Inference with Optimum and Transformers Pipelines], the optimized & quantized model reduced latency from 117.61 ms to 64.94 ms, roughly a 2x speedup, while keeping 99.61% of the original model's F1 score.
Reference code: [examples/onnxruntime/quantization/text-classification/run_glue.py]
You need to adapt it to your own data; the quantization part can also be removed, using the previously saved quantized model directly.
One bug also needs fixing: change
outputs = ort_model.evaluation_loop(eval_dataset) >>
copy ort_model.evaluation_loop out into a standalone evaluation_loop, and inside it change onnx_inputs = {key: np.array([inputs[key]]) for key in self.onnx_input_names if key in inputs} >> onnx_inputs = {key: np.array([inputs[key]], dtype=np.int64) for key in self.onnx_input_names if key in inputs}, then call
outputs = evaluation_loop(ort_model, eval_dataset)
Otherwise you get: onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(int32)) , expected: (tensor(int64))
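For orientation, here is a hypothetical standalone sketch of such a copied-out loop; it bypasses the example's ORTModel wrapper and reads the .onnx file directly, so it is not the run_glue.py code itself, and it returns only logits:
- import numpy as np
- import onnxruntime
-
- def evaluation_loop(onnx_file, eval_dataset):
-     session = onnxruntime.InferenceSession(onnx_file)
-     onnx_input_names = [inp.name for inp in session.get_inputs()]
-     all_logits = []
-     for inputs in eval_dataset:
-         # the fix: cast to int64 so ONNX Runtime does not receive int32 tensors
-         onnx_inputs = {key: np.array([inputs[key]], dtype=np.int64)
-                        for key in onnx_input_names if key in inputs}
-         all_logits.append(session.run(None, onnx_inputs)[0])
-     return np.concatenate(all_logits, axis=0)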
Note:
1 This does not work either:
def trans(examples):
examples["input_ids"] = np.array([examples["input_ids"]], dtype=np.int64)
return examples
eval_dataset = eval_dataset.map(trans)
2 Nor does this:
dataset = dataset.cast_column("input_ids", Sequence(feature=Value(dtype='int64', id=None)))
At inference time, the tokenizer used here does not automatically clip the text length, because during training the text was processed before tokenization and the max_length parameter was added after the tokenizer was initialized. So you may need to clip the text length yourself as a preprocessing step, or pass an extra parameter to the tokenizer.
Otherwise you get the following error (stderr):
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'/model/embeddings/Add_1' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:560 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 512 by 1319
Fix: text = text[:510]
Note: the model's maximum length is 512, so use 510 plus the special tokens (e.g. <sep>) added at the beginning and end.
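Alternatively, a minimal sketch of handling this at tokenization time with truncation, reusing the quantized_path, sentences, and quantized_classifier names from the quantized classification example above (max_length=512 is assumed to match the model; recent transformers versions also forward such kwargs from the pipeline call to the tokenizer):
- from transformers import AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained(quantized_path)
- enc = tokenizer(sentences, truncation=True, max_length=512, padding=True, return_tensors="pt")
- # or pass the same kwargs through the pipeline call (forwarded to the tokenizer):
- prediction = quantized_classifier(sentences, truncation=True, max_length=512)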
from: LLM:Transformers模型推理和加速 - 柚子皮的博客 - CSDN blog
ref: [Optimum official documentation]