
Deploying chatglm3-6b-32k with the Triton Server Python Backend

I've recently been working on a RAG project. After trying a number of models, I found that chatglm3-6b-32k clearly outperforms the others on Chinese text. Having validated it in the test environment with transformers, the next step was a production deployment, which is where NVIDIA Triton Server comes in.

Our production server has eight Tesla T4 GPUs. With the non-quantized model, each 16 GB card can host one instance (a single instance uses roughly 12 GB of VRAM); with the 4-bit quantized version, each card can host at least two instances.

1. Pull the Triton Server image:

docker pull nvcr.io/nvidia/tritonserver:23.12-py3

2. Create the container. There are two options: start Triton directly, or start the container in detached mode and launch Triton from inside it afterwards. (Note that with --net=host the -p port mappings are effectively ignored; the ports are exposed directly on the host.)

Direct start:

docker run -it --name chatglmtest --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --net=host -v /home/server/model_repository:/models --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models

Detached mode:

docker run -itd --name chatglmtest --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --net=host -v /home/server/model_repository:/models --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:23.12-py3

3. Enter the container and pip-install the model's dependencies. Pick the torch build whose CUDA version matches the host's CUDA version; a quick sanity check follows the commands below.

docker exec -it chatglmtest bash
# the CUDA version must match the host's CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install sentence_transformers transformers tiktoken accelerate packaging ninja transformers_stream_generator einops optimum bitsandbytes
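
Before writing any model code, it is worth confirming that the torch wheel just installed can actually see the T4s from inside the container. A minimal sanity check, using nothing beyond the packages installed above:

import torch

# Expect True, and 8 devices on the production box described above
print(torch.cuda.is_available())
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))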

4. Configure the model. The model files live in /home/server/model_repository, the directory mapped into the container when it was created.

The structure of /home/server/model_repository is sketched below; I placed just one model in it. Ignore the __pycache__ and work directories, which are generated automatically once Triton runs.

The directory named 1 is the model version; it holds the model downloaded from Hugging Face together with model.py (the execution script).

At the same level as directory 1 there must be a config.pbtxt describing the input/output contract and which GPU each instance runs on.
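
Pieced together from the description above, the repository looks roughly like this (a sketch, not an exact listing: the weights folder name matches the ChatGLM_path used in model.py, and __pycache__/work sit next to model.py because of the cache paths it sets):

/home/server/model_repository
└── chatglm3-6b-32k
    ├── config.pbtxt
    └── 1
        ├── model.py
        ├── chatglm3-6b-32k/    <- model files downloaded from Hugging Face
        ├── __pycache__/        <- generated by Triton at runtime
        └── work/               <- generated by Triton at runtime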

Now for config.pbtxt and model.py themselves.

config.pbtxt

name: "chatglm3-6b-32k"   # model name, must match the model's folder name
backend: "python"         # backend engine used by this model
max_batch_size: 0
input [   # input definitions
  {
    name: "prompt"          # name
    data_type: TYPE_STRING  # type
    dims: [ -1 ]            # dimensions, -1 means variable length
  },
  {
    name: "history"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "temperature"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "max_token"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "history_len"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [  # output definitions
  {
    name: "response"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "history"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
# instance groups: I used three GPUs, one instance per GPU
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 2 ]
  }
]

model.py

import os
# Limit the maximum split size of free GPU memory blocks (reduces fragmentation)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
# Point the transformers cache / work directory next to this file
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
import json
# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model_config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils
import sys
import gc
import time
import logging
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import numpy as np

gc.collect()
torch.cuda.empty_cache()

logging.basicConfig(format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s',
                    level=logging.INFO)


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])

        output_response_config = pb_utils.get_output_config_by_name(self.model_config, "response")
        output_history_config = pb_utils.get_output_config_by_name(self.model_config, "history")

        # Convert Triton types to numpy types
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_response_config['data_type'])
        self.output_history_dtype = pb_utils.triton_string_to_numpy(output_history_config['data_type'])

        ChatGLM_path = os.path.dirname(os.path.abspath(__file__)) + "/chatglm3-6b-32k"
        self.tokenizer = AutoTokenizer.from_pretrained(ChatGLM_path, trust_remote_code=True)
        # The .to('cuda:' + args['model_instance_device_id']) call below matters: it pins this
        # instance to its assigned GPU. Omit it and the model either gets spread across all GPUs
        # or piled onto a single one, and either way that causes problems.
        model = AutoModelForCausalLM.from_pretrained(ChatGLM_path,
                                                     torch_dtype=torch.float16,
                                                     trust_remote_code=True).half().to('cuda:' + args['model_instance_device_id'])
        self.model = model.eval()
        logging.info("model init success")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output_response_dtype = self.output_response_dtype
        output_history_dtype = self.output_history_dtype

        responses = []
        # Every Python backend must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0]
            prompt = prompt.decode('utf-8')
            history_origin = pb_utils.get_input_tensor_by_name(request, "history").as_numpy()
            if len(history_origin) > 0:
                history = np.array([item.decode('utf-8') for item in history_origin]).reshape((-1, 2)).tolist()
            else:
                history = []
            temperature = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()[0]
            temperature = float(temperature.decode('utf-8'))
            max_token = pb_utils.get_input_tensor_by_name(request, "max_token").as_numpy()[0]
            max_token = int(max_token.decode('utf-8'))
            history_len = pb_utils.get_input_tensor_by_name(request, "history_len").as_numpy()[0]
            history_len = int(history_len.decode('utf-8'))

            # Log the incoming request parameters
            in_log_info = {
                "in_prompt": prompt,
                "in_history": history,
                "in_temperature": temperature,
                "in_max_token": max_token,
                "in_history_len": history_len
            }
            logging.info(in_log_info)

            response, history = self.model.chat(self.tokenizer,
                                                prompt,
                                                history=history[-history_len:] if history_len > 0 else [],
                                                max_length=max_token,
                                                temperature=temperature)

            # Log the generated result
            out_log_info = {
                "out_response": response,
                "out_history": history
            }
            logging.info(out_log_info)

            response = np.array(response)
            history = np.array(history)

            response_output_tensor = pb_utils.Tensor("response", response.astype(self.output_response_dtype))
            history_output_tensor = pb_utils.Tensor("history", history.astype(self.output_history_dtype))

            final_inference_response = pb_utils.InferenceResponse(
                output_tensors=[response_output_tensor, history_output_tensor])
            responses.append(final_inference_response)
            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #    output_tensors=..., TritonError("An error occurred"))

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

5. Start Triton Server

# Detached mode (container created with -itd): run this inside the container
tritonserver --model-repository=/models
# Direct mode (container created with -it): run this on the host
docker start chatglmtest
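
After launching, Triton serves the standard KServe v2 HTTP endpoints on port 8000, so readiness can be polled before sending real traffic. A small sketch, assuming the requests package is available wherever you run it:

import requests

BASE = "http://localhost:8000"

# Server-level readiness: returns 200 once Triton itself is up
print(requests.get(f"{BASE}/v2/health/ready").status_code)

# Model-level readiness: returns 200 once chatglm3-6b-32k has finished loading on its GPUs
print(requests.get(f"{BASE}/v2/models/chatglm3-6b-32k/ready").status_code)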

6. Verification

curl -X POST localhost:8000/v2/models/chatglm3-6b-32k/generate \
  -d '{"prompt": "你好,请问你叫什么?", "history":[], "temperature":"0.3","max_token":"100","history_len":"0"}'

Response:

{"history":["{'role': 'user', 'content': '你好,请问你叫什么?'}","{'role': 'assistant', 'metadata': '', 'content': '你好!我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。'}"],"model_name":"chatglm3-6b-32k","model_version":"1","response":"你好!我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。"}
