I've recently been working on a RAG project. After trying a number of models, I found that chatglm3-6b-32k clearly outperforms the others on Chinese. Having validated it with the transformers library in a test environment, the next step was production deployment, which is where NVIDIA's Triton Inference Server comes in.
Our production server has eight Tesla T4 GPUs. With the non-quantized model, each 16 GB card can host one instance (a single instance uses roughly 12 GB of VRAM); with the 4-bit quantized version, each card can host at least two instances.
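For reference, switching to the 4-bit variant is mostly a change at load time. A minimal sketch, assuming the quantize() helper shipped with the ChatGLM remote code (the model path below is hypothetical; transformers' bitsandbytes load_in_4bit option is an alternative route):
- from transformers import AutoModel, AutoTokenizer
-
- MODEL_PATH = "/home/server/chatglm3-6b-32k"  # hypothetical local path to the Hugging Face files
- tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
- # quantize(4) comes from the ChatGLM modeling code pulled in via trust_remote_code;
- # the reduced VRAM footprint is what allows at least two instances per 16 GB T4
- model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).quantize(4).cuda()
- model = model.eval()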
1. Pull the Triton image:
docker pull nvcr.io/nvidia/tritonserver:23.12-py3
2. Create the container (two options: start Triton directly, or start the container detached and then exec into it to launch Triton). Note that because --net=host is used, the -p port mappings are effectively redundant.
Direct start:
docker run -it --name chatglmtest --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --net=host -v /home/server/model_repository:/models --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models
Detached mode:
docker run -itd --name chatglmtest --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --net=host -v /home/server/model_repository:/models --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:23.12-py3
3. Enter the container and pip-install the model dependencies; the torch CUDA build must match the host's CUDA version.
- docker exec -it chatglmtest bash
-
- # the CUDA build must match the host's CUDA version (cu121 here)
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
-
- pip install sentence_transformers transformers tiktoken accelerate packaging ninja transformers_stream_generator einops optimum bitsandbytes
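After installation, a quick sanity check inside the container (a minimal snippet, run with python) confirms that the torch build matches the host CUDA setup and sees the GPUs:
- import torch
-
- print(torch.__version__)          # should show a +cu121 build if the cu121 index was used
- print(torch.version.cuda)         # CUDA version the wheel was built against
- print(torch.cuda.is_available())  # should be True
- print(torch.cuda.device_count())  # should report all visible T4s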
4. Set up the model. The model lives in the directory mapped when the container was created, /home/server/model_repository.
The layout of /home/server/model_repository is sketched below. I only placed one model there; the __pycache__ and work directories can be ignored, since they are generated automatically once Triton runs.
The directory named 1 is the model version; it holds the model downloaded from Hugging Face together with model.py (the serving script).
At the same level as directory 1 sits the configuration file config.pbtxt, which declares the input/output contract and the GPU assignment for each instance.
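A rough sketch of the layout (the exact files inside the Hugging Face snapshot will vary):
- model_repository/
- └── chatglm3-6b-32k/
-     ├── config.pbtxt
-     └── 1/
-         ├── model.py
-         ├── chatglm3-6b-32k/   # model files downloaded from Hugging Face
-         ├── work/              # generated at runtime
-         └── __pycache__/       # generated at runtime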
下面开始写配置文件config.pbtxt和model.py。
config.pbtxt
- name: "chatglm3-6b-32k" // 模型名,与模型的文件夹名字相同
- backend: "python" // 模型所使用的后端引擎
-
- max_batch_size: 0
- input [ // 输入定义
- {
- name: "prompt" //名称
- data_type: TYPE_STRING //类型
- dims: [ -1 ] //数据维度,-1 表示可变维度
- },
- {
- name: "history"
- data_type: TYPE_STRING
- dims: [ -1 ]
- },
- {
- name: "temperature"
- data_type: TYPE_STRING
- dims: [ -1 ]
- },
- {
- name: "max_token"
- data_type: TYPE_STRING
- dims: [ -1 ]
- },
- {
- name: "history_len"
- data_type: TYPE_STRING
- dims: [ -1 ]
- }
- ]
- output [ //输出定义
- {
- name: "response"
- data_type: TYPE_STRING
- dims: [ -1 ]
- },
- {
- name: "history"
- data_type: TYPE_STRING
- dims: [ -1 ]
- }
- ]
- //实例配置,我使用了3个显卡,每个显卡配置了一个实例
- instance_group [
- {
- count: 1
- kind: KIND_GPU
- gpus: [ 0 ]
- },
- {
- count: 1
- kind: KIND_GPU
- gpus: [ 1 ]
- },
- {
- count: 1
- kind: KIND_GPU
- gpus: [ 2 ]
- }
- ]
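As an aside, my understanding of Triton's instance_group semantics is that count applies per listed GPU, so the three groups above could be collapsed into one equivalent group (shown for reference only, not what I deployed):
- instance_group [
-   {
-     count: 1
-     kind: KIND_GPU
-     gpus: [ 0, 1, 2 ]
-   }
- ]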

model.py
- import os
- # Cap the size of free blocks that the CUDA caching allocator may split (helps limit fragmentation)
- os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
- # Point the Hugging Face caches at a local work directory
-
- os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__))+"/work/"
- os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__))+"/work/"
-
- import json
-
- # triton_python_backend_utils is available in every Triton Python model. You
- # need to use this module to create inference requests and responses. It also
- # contains some utility functions for extracting information from model_config
- # and converting Triton input/output types to numpy types.
- import triton_python_backend_utils as pb_utils
- import sys
- import gc
- import time
- import logging
- import torch
- from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
- import numpy as np
-
- gc.collect()
- torch.cuda.empty_cache()
-
- logging.basicConfig(format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s',
-                     level=logging.INFO)
-
- class TritonPythonModel:
-     """Your Python model must use the same class name. Every Python model
-     that is created must have "TritonPythonModel" as the class name.
-     """
-
-     def initialize(self, args):
-         """`initialize` is called only once when the model is being loaded.
-         Implementing `initialize` function is optional. This function allows
-         the model to initialize any state associated with this model.
-         Parameters
-         ----------
-         args : dict
-           Both keys and values are strings. The dictionary keys and values are:
-           * model_config: A JSON string containing the model configuration
-           * model_instance_kind: A string containing model instance kind
-           * model_instance_device_id: A string containing model instance device ID
-           * model_repository: Model repository path
-           * model_version: Model version
-           * model_name: Model name
-         """
-         # You must parse model_config. JSON string is not parsed here
-         self.model_config = json.loads(args['model_config'])
-
-         output_response_config = pb_utils.get_output_config_by_name(self.model_config, "response")
-         output_history_config = pb_utils.get_output_config_by_name(self.model_config, "history")
-
-         # Convert Triton types to numpy types
-         self.output_response_dtype = pb_utils.triton_string_to_numpy(output_response_config['data_type'])
-         self.output_history_dtype = pb_utils.triton_string_to_numpy(output_history_config['data_type'])
-
-         ChatGLM_path = os.path.dirname(os.path.abspath(__file__))+"/chatglm3-6b-32k"
-         self.tokenizer = AutoTokenizer.from_pretrained(ChatGLM_path, trust_remote_code=True)
-         # The .to('cuda:'+args['model_instance_device_id']) below is essential: it pins this instance
-         # to the GPU assigned in config.pbtxt. Without it the model may be spread across all GPUs or
-         # piled onto a single one, and either case causes problems.
-         model = AutoModelForCausalLM.from_pretrained(ChatGLM_path,
-                                                      torch_dtype=torch.float16, trust_remote_code=True).half().to('cuda:'+args['model_instance_device_id'])
-         self.model = model.eval()
-         logging.info("model init success")
-
-     def execute(self, requests):
-         """`execute` MUST be implemented in every Python model. `execute`
-         function receives a list of pb_utils.InferenceRequest as the only
-         argument. This function is called when an inference request is made
-         for this model. Depending on the batching configuration (e.g. Dynamic
-         Batching) used, `requests` may contain multiple requests. Every
-         Python model must create one pb_utils.InferenceResponse for every
-         pb_utils.InferenceRequest in `requests`. If there is an error, you can
-         set the error argument when creating a pb_utils.InferenceResponse.
-         Parameters
-         ----------
-         requests : list
-           A list of pb_utils.InferenceRequest
-         Returns
-         -------
-         list
-           A list of pb_utils.InferenceResponse. The length of this list must
-           be the same as `requests`
-
-         """
-         output_response_dtype = self.output_response_dtype
-         output_history_dtype = self.output_history_dtype
-
-         # output_dtype = self.output_dtype
-         responses = []
-         # Every Python backend must iterate over every one of the requests
-         # and create a pb_utils.InferenceResponse for each of them.
-         for request in requests:
-             prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0]
-             prompt = prompt.decode('utf-8')
-             history_origin = pb_utils.get_input_tensor_by_name(request, "history").as_numpy()
-             if len(history_origin) > 0:
-                 history = np.array([item.decode('utf-8') for item in history_origin]).reshape((-1,2)).tolist()
-             else:
-                 history = []
-             temperature = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()[0]
-             temperature = float(temperature.decode('utf-8'))
-             max_token = pb_utils.get_input_tensor_by_name(request, "max_token").as_numpy()[0]
-             max_token = int(max_token.decode('utf-8'))
-             history_len = pb_utils.get_input_tensor_by_name(request, "history_len").as_numpy()[0]
-             history_len = int(history_len.decode('utf-8'))
-
-             # Log the incoming request
-             in_log_info = {
-                 "in_prompt":prompt,
-                 "in_history":history,
-                 "in_temperature":temperature,
-                 "in_max_token":max_token,
-                 "in_history_len":history_len
-             }
-             logging.info(in_log_info)
-             response,history = self.model.chat(self.tokenizer,
-                                                prompt,
-                                                history=history[-history_len:] if history_len > 0 else [],
-                                                max_length=max_token,
-                                                temperature=temperature)
-             # Log the generated result
-             out_log_info = {
-                 "out_response":response,
-                 "out_history":history
-             }
-             logging.info(out_log_info)
-             response = np.array(response)
-             history = np.array(history)
-
-             response_output_tensor = pb_utils.Tensor("response",response.astype(self.output_response_dtype))
-             history_output_tensor = pb_utils.Tensor("history",history.astype(self.output_history_dtype))
-
-             final_inference_response = pb_utils.InferenceResponse(output_tensors=[response_output_tensor,history_output_tensor])
-             responses.append(final_inference_response)
-             # Create InferenceResponse. You can set an error here in case
-             # there was a problem with handling this inference request.
-             # Below is an example of how you can set errors in inference
-             # response:
-             #
-             # pb_utils.InferenceResponse(
-             #     output_tensors=..., TritonError("An error occurred"))
-
-         # You should return a list of pb_utils.InferenceResponse. Length
-         # of this list must match the length of `requests` list.
-         return responses
-
-     def finalize(self):
-         """`finalize` is called only once when the model is being unloaded.
-         Implementing `finalize` function is OPTIONAL. This function allows
-         the model to perform any necessary clean ups before exit.
-         """
-         print('Cleaning up...')

5. Start the Triton server
- # detached mode (container created with -itd): exec into the container and run
- tritonserver --model-repository=/models
-
- # non-detached mode (container created with -it): run on the host; the container launches tritonserver itself
- docker start chatglmtest
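Once it is up, the standard Triton HTTP health endpoints are a quick way to confirm that the server and the model are ready (assuming the default ports from the docker run above):
- curl -v localhost:8000/v2/health/ready
- curl -v localhost:8000/v2/models/chatglm3-6b-32k/ready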
6. Verify
- curl -X POST localhost:8000/v2/models/chatglm3-6b-32k/generate \
- -d '{"prompt": "你好,请问你叫什么?", "history":[], "temperature":"0.3","max_token":"100","history_len":"0"}'
Response:
{"history":["{'role': 'user', 'content': '你好,请问你叫什么?'}","{'role': 'assistant', 'metadata': '', 'content': '你好!我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。'}"],"model_name":"chatglm3-6b-32k","model_version":"1","response":"你好!我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。"}