This article first builds the basic Triton setup and gets ChatGLM3-6B running as a service, then adds dynamic batching and model warmup to improve the service's performance and efficiency. It covers the following parts.
Pull nvcr.io/nvidia/tritonserver:21.02-py3 from the Docker registry as the base image and install Python dependencies such as torch, transformers, and sentencepiece on top of it to build a new image, referred to below as triton_chatglm3_6b:v1. Readers who have questions about building the base environment can refer to my earlier articles; that part is skipped here.
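As a point of reference, a build along the following lines would produce such an image. This is only a sketch: the dependency list mirrors what is named above, but the versions and the build context are assumptions that should be adapted to your own environment.

# Hypothetical Dockerfile sketch; dependency versions are assumptions, pin them to your environment
FROM nvcr.io/nvidia/tritonserver:21.02-py3

# Python dependencies required by the ChatGLM3-6B Python backend
RUN pip install --no-cache-dir torch transformers sentencepiece

Build and tag it with, for example, docker build -t triton_chatglm3_6b:v1 . from the directory containing this Dockerfile.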
Let's first go over the directory layout of the model repository. Under the model_repository directory required by Triton, create a chatglm3-6b folder with the following structure:
.
├── 1
│   ├── chatglm3-6b
│   │   ├── config.json
│   │   ├── configuration_chatglm.py
│   │   ├── gitattributes
│   │   ├── modeling_chatglm.py
│   │   ├── MODEL_LICENSE
│   │   ├── pytorch_model-00001-of-00007.bin
│   │   ├── pytorch_model-00002-of-00007.bin
│   │   ├── pytorch_model-00003-of-00007.bin
│   │   ├── pytorch_model-00004-of-00007.bin
│   │   ├── pytorch_model-00005-of-00007.bin
│   │   ├── pytorch_model-00006-of-00007.bin
│   │   ├── pytorch_model-00007-of-00007.bin
│   │   ├── pytorch_model.bin.index.json
│   │   ├── quantization.py
│   │   ├── README.md
│   │   ├── tokenization_chatglm.py
│   │   ├── tokenizer_config.json
│   │   └── tokenizer.model
│   └── model.py
├── config.pbtxt
└── warmup
    └── raw_data
Here the folder 1 is the model version number; it contains the model files together with the custom backend script model.py. config.pbtxt holds the Triton configuration, and the warmup folder stores the data files needed for model warmup.
First, set up config.pbtxt, which mainly specifies the input and output elements and their data types, as follows:
name: "chatglm3-6b" backend: "python" max_batch_size: 0 input [ { name: "prompt" data_type: TYPE_STRING dims: [ -1 ] }, { name: "history" data_type: TYPE_STRING dims: [ -1 ] }, { name: "temperature" data_type: TYPE_STRING dims: [ -1 ] }, { name: "max_token" data_type: TYPE_INT16 dims: [ 1 ] }, { name: "history_len" data_type: TYPE_INT16 dims: [ 1 ] } ] output [ { name: "response" data_type: TYPE_STRING dims: [ -1 ] }, { name: "history" data_type: TYPE_STRING dims: [ -1 ] } ] instance_group [ { count: 1 kind: KIND_GPU gpus: [ 2 ] } ]
A brief explanation of the elements in this file:
- name: the model name, which must match the model folder name chatglm3-6b
- backend: python, meaning inference runs in Triton's Python backend through the custom model.py below
- max_batch_size: 0, which leaves Triton-side batching disabled in this basic version (dynamic batching is added later)
- input: the request carries prompt, history, and temperature as variable-length strings, plus max_token and history_len as INT16 values
- output: the service returns response and history, both as strings
- instance_group: a single model instance (count: 1) running on GPU, pinned to GPU 2
config.pbtxt builds the bridge between the client and the server. The next step is to write the custom backend script model.py, which pulls out the corresponding data according to the conventions in config.pbtxt and implements the inference logic. The content of model.py is as follows:
import os

os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"

import json
import triton_python_backend_utils as pb_utils
import sys
import gc
import time
import logging
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np

gc.collect()
torch.cuda.empty_cache()

logging.basicConfig(format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s',
                    level=logging.INFO)


class TritonPythonModel:
    def initialize(self, args):
        device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        device_id = args["model_instance_device_id"]
        self.device = f"{device}:{device_id}"
        self.model_config = json.loads(args['model_config'])
        output_response_config = pb_utils.get_output_config_by_name(self.model_config, "response")
        output_history_config = pb_utils.get_output_config_by_name(self.model_config, "history")
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_response_config['data_type'])
        self.output_history_dtype = pb_utils.triton_string_to_numpy(output_history_config['data_type'])
        ChatGLM_path = os.path.dirname(os.path.abspath(__file__)) + "/chatglm3-6b"
        self.tokenizer = AutoTokenizer.from_pretrained(ChatGLM_path, trust_remote_code=True)
        model = AutoModel.from_pretrained(ChatGLM_path,
                                          torch_dtype=torch.bfloat16,
                                          trust_remote_code=True).half().to(self.device)
        self.model = model.eval()
        logging.info("model init success")

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0].decode('utf-8')
            history_origin = pb_utils.get_input_tensor_by_name(request, "history").as_numpy()[0].decode('utf-8')
            if history_origin:
                history = eval(history_origin)
            else:
                history = []
            temperature = float(pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()[0].decode("utf-8"))
            max_token = int(pb_utils.get_input_tensor_by_name(request, "max_token").as_numpy()[0])
            history_len = int(pb_utils.get_input_tensor_by_name(request, "history_len").as_numpy()[0])
            # Log the incoming request parameters
            in_log_info = {
                "in_prompt": prompt,
                "in_history": history,
                "in_temperature": temperature,
                "in_max_token": max_token,
                "in_history_len": history_len
            }
            logging.info(in_log_info)
            response, history = self.model.chat(self.tokenizer,
                                                prompt,
                                                # history stores questions and answers as separate entries, hence *2
                                                history=history[-history_len * 2:] if history_len > 0 else [],
                                                max_length=max_token,
                                                temperature=temperature)
            # Log the generated output
            out_log_info = {
                "out_response": response,
                "out_history": history
            }
            logging.info(out_log_info)
            response = np.char.encode(np.array([response]))
            history = np.char.encode(np.array([str(history)]))
            response_output_tensor = pb_utils.Tensor("response", response.astype(self.output_response_dtype))
            history_output_tensor = pb_utils.Tensor("history", history.astype(self.output_history_dtype))
            final_inference_response = pb_utils.InferenceResponse(
                output_tensors=[response_output_tensor, history_output_tensor])
            responses.append(final_inference_response)
        return responses

    def finalize(self):
        print('Cleaning up...')
In initialize, the device is derived from model_instance_kind and model_instance_device_id, and the model is loaded through Hugging Face and moved onto the GPU. The inference logic sits in execute: the prompt, history, temperature, and other parameters are parsed from the incoming requests, inference is done simply by calling the chat method that ships with ChatGLM3, and the result is finally wrapped into the type format that Triton expects for the outputs.
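With the model repository and model.py in place, the service can be brought up by running Triton inside the image built earlier. The command below is only a sketch: the host path /data/model_repository is an assumption, so point the mount at wherever your model_repository actually lives.

# Hypothetical launch command; the host path /data/model_repository is an assumption
docker run --rm --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /data/model_repository:/models \
    triton_chatglm3_6b:v1 \
    tritonserver --model-repository=/models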
A quick note on history: in ChatGLM3, history is a list of dictionaries, each containing a role and its content, for example:
>>> history
[{'role': 'user', 'content': '你好'}, {'role': 'assistant', 'metadata': '', 'content': '你好...'}]
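To show how the fields agreed on in config.pbtxt look from the caller's side, here is a minimal client sketch using tritonclient over HTTP. The server address, the sample prompt, and the parameter values are assumptions for illustration only.

# Minimal client sketch; the server address localhost:8000 and the sample values are assumptions
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def build_input(name, values, np_dtype, triton_dtype):
    # Wrap a Python list into a Triton InferInput of the given type
    data = np.array(values, dtype=np_dtype)
    tensor = httpclient.InferInput(name, list(data.shape), triton_dtype)
    tensor.set_data_from_numpy(data)
    return tensor

inputs = [
    build_input("prompt", ["你好"], object, "BYTES"),        # TYPE_STRING maps to BYTES on the client
    build_input("history", [""], object, "BYTES"),           # empty string means no previous turns
    build_input("temperature", ["0.8"], object, "BYTES"),    # temperature travels as a string per config.pbtxt
    build_input("max_token", [1024], np.int16, "INT16"),
    build_input("history_len", [0], np.int16, "INT16"),
]
outputs = [httpclient.InferRequestedOutput("response"),
           httpclient.InferRequestedOutput("history")]

result = client.infer(model_name="chatglm3-6b", inputs=inputs, outputs=outputs)
print(result.as_numpy("response")[0].decode("utf-8"))  # model reply
print(result.as_numpy("history")[0].decode("utf-8"))   # updated history string

The history string returned here is exactly the str(history) produced in model.py, so the caller can feed it back into the history input of the next request to continue a multi-turn conversation.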