
Model Engineering Practice Based on NVIDIA Triton


What is Triton Inference Server?

Its predecessor is NVIDIA's TensorRT Inference Server. On top of its TensorRT support, Triton adds inference deployment support for mainstream frameworks such as TensorFlow, PyTorch and ONNX.

It is an excellent serving solution for deploying inference models.

For details, see NVIDIA Triton Inference Server | NVIDIA Developer: https://developer.nvidia.com/nvidia-triton-inference-server

Model deployment and optimization practice

PyTorch model deployment

A PyTorch model must be provided in TorchScript form, i.e. exported with torch.jit.
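A minimal export sketch (for illustration only: it assumes a Hugging Face BERT-style model is available; the chitchat model in this article is actually a GPT model, but the tracing steps are the same):

import os
import torch
from transformers import BertForSequenceClassification  # assumption: HF transformers is installed

model = BertForSequenceClassification.from_pretrained("bert-base-chinese", torchscript=True)
model.eval()

# example inputs with the dtypes the served model expects (INT64 here)
input_ids = torch.ones(1, 16, dtype=torch.int64)
attention_mask = torch.ones(1, 16, dtype=torch.int64)
token_type_ids = torch.zeros(1, 16, dtype=torch.int64)

traced = torch.jit.trace(model, (input_ids, attention_mask, token_type_ids))
os.makedirs("model_name/1", exist_ok=True)
traced.save("model_name/1/model.pt")  # "1" is the version folder; the file must be named model.pt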

The folder layout is:

model_name/
        1/model.pt
        config.pbtxt

Simply copy this folder into the models directory of the Triton server and it takes effect (Triton can be configured to watch the model repository and reload automatically when it changes).

config.pbtxt is the focus of this article and the part that most needs to be understood when deploying.

Here is a concrete example:

# this MUST be the same name as the enclosing folder
name: "ibuddha_chitchat"
# PyTorch backend
platform: "pytorch_libtorch"
# limit this, or the GPU may run out of memory
max_batch_size: 64
input [
  {
    # TorchScript models expose these positional INPUT__0/1/2 names by default
    name: "INPUT__0"
    # INT64 or INT32, must match the data type the model was exported with
    data_type: TYPE_INT64
    # dynamic sequence length: the input text length may vary (typically 1 to 510);
    # otherwise put a fixed value here
    dims: [-1]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "INPUT__2"
    data_type: TYPE_INT64
    dims: [-1]
  }
]
output [
  {
    # default TorchScript output name
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [13088]
  }
]
# serve only the single highest-numbered version
version_policy: { latest { num_versions: 1 } }
#version_policy: { all {} }
# enabling dynamic batching greatly improves performance
dynamic_batching {
}
# enabling inference mode makes inference faster
parameters: {
  key: "INFERENCE_MODE"
  value: {
    string_value: "true"
  }
}
# disabled: NvFuser was slower than the default in the author's tests
#parameters: {
#  key: "ENABLE_NVFUSER"
#  value: {
#    string_value: "true"
#  }
#}
# PyTorch models run only on GPU 0 by default
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

The directory 1 is the version number (use 1...N; 0 is not valid).

model.pt is the conventional file name.

name is the model name and must match the outer folder name, so the folder has to be renamed to ibuddha_chitchat:

ibuddha_chitchat/
        1/model.pt
        config.pbtxt

The platform for a PyTorch model is pytorch_libtorch.

This example uses dynamic batching, which is also the officially recommended optimization:

dynamic_batching {}

Enabling dynamic batching improves the overall efficiency of the inference service considerably.

max_batch_size must be set to a sensible value: if it is too large the GPU can run out of memory (and when Triton runs out of GPU memory it may crash and fail to restart automatically). Note that this option works together with dynamic_batching.
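dynamic_batching also accepts optional tuning fields, for example preferred batch sizes and a maximum queue delay (the values below are purely illustrative, not a recommendation):

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}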

input describes the model's inputs.

For a PyTorch BERT model the typical names are INPUT__0 .. INPUT__2.

Whether the data type is TYPE_INT64 or TYPE_INT32 depends on the data type the model was trained and exported with; among BERT models some use INT64 and some use INT32, but all three inputs always share the same type (the author has not found a definitive rule).

dims: [-1]

means a dynamic sequence length, i.e. the input text length does not have to be a fixed value.

Note that because batching is enabled here, the leading batch dimension is implicit and is left out of dims.

(Without dynamic batching you would spell out the shape, e.g. dims: [N, -1].)

output has the same format as input.

Since the example is a GPT model, it returns the probabilities (floats) over the 13088-token vocabulary for every position in the sentence; post-processing then picks the token with the highest probability as the output (the actual logic is a bit more involved).

version_policy controls which versions are served.

As written in the example, only one version is served and Triton automatically picks the highest version number.

If all versions should be served, write it as follows:

version_policy: { all {} }

instance_group

count: 1 means a single instance.

KIND_GPU means, as the name suggests, running on the GPU (running on the CPU can also be configured).

gpus: [0] means it runs only on GPU 0.

Note: PyTorch models currently have a limitation: they are pinned to a single GPU, GPU 0 by default (if you know how to avoid pinning to GPU 0 and run across multiple GPUs, please let the author know).
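Once the model is loaded, it can be called with the official tritonclient package. A minimal HTTP client sketch (the URL and token IDs are placeholders; real requests would use the chitchat tokenizer on the client side):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # default HTTP port

# placeholder token IDs; in practice they come from the tokenizer
input_ids = np.array([[101, 2769, 4263, 872, 102]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)
token_type_ids = np.zeros_like(input_ids)

inputs = []
for name, data in (("INPUT__0", input_ids),
                   ("INPUT__1", attention_mask),
                   ("INPUT__2", token_type_ids)):
    tensor = httpclient.InferInput(name, list(data.shape), "INT64")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(
    "ibuddha_chitchat",
    inputs,
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
logits = result.as_numpy("OUTPUT__0")  # logits over the 13088-token vocabulary
print(logits.shape)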

ONNX model deployment

The whole process is very similar to PyTorch, so only the differences are described here.

The model file is named model.onnx by convention.

In config.pbtxt:

platform: "onnxruntime_onnx"

Because input_names can be set when exporting PyTorch to ONNX, it is recommended to agree on team-wide names to ease maintenance:

input_ids, attention_mask, token_type_ids

The output in this example is the average sentence embedding, so it is simply a float array of length 768.

An ONNX model can also be converted to TensorRT on the fly; whether that is actually faster has to be measured for your own model.
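For reference, a minimal export sketch that produces such named inputs and a 768-dimensional "vector" output (the mean-pooling wrapper and model name are illustrative assumptions, not the author's actual sentence model):

import os
import torch
from transformers import BertModel  # assumption: HF transformers is installed

class SentenceEncoder(torch.nn.Module):
    # illustrative wrapper: mean-pool token embeddings into one 768-dim vector
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese", torchscript=True)

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden = self.bert(input_ids, attention_mask, token_type_ids)[0]  # [batch, seq, 768]
        return hidden.mean(dim=1)                                         # [batch, 768]

model = SentenceEncoder().eval()
input_ids = torch.ones(1, 16, dtype=torch.int64)
attention_mask = torch.ones(1, 16, dtype=torch.int64)
token_type_ids = torch.zeros(1, 16, dtype=torch.int64)

os.makedirs("sps_sbert_onnx/1", exist_ok=True)
torch.onnx.export(
    model,
    (input_ids, attention_mask, token_type_ids),
    "sps_sbert_onnx/1/model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["vector"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "token_type_ids": {0: "batch", 1: "seq"},
        "vector": {0: "batch"},
    },
    opset_version=12,
)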

name: "sps_sbert_onnx"
# ONNX model
platform: "onnxruntime_onnx"
max_batch_size: 32
# recommended: use the same names across the team: input_ids, attention_mask, token_type_ids
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [-1]
  }
]
output [
  {
    # recommended: use a meaningful name
    name: "vector"
    data_type: TYPE_FP32
    dims: [768]
  }
]
#version_policy: { all {} }
version_policy: { latest { num_versions: 1 } }
dynamic_batching { }
# test whether this is actually faster for your model:
# convert the ONNX model to TensorRT on the fly
optimization { execution_accelerators {
  gpu_execution_accelerator : [ { name : "tensorrt" } ]
}}

TensorFlow model deployment

For TensorFlow models the saved_model format is recommended.

Copy the saved_model directory into the version folder and name it model.savedmodel:

1/model.savedmodel
        assets
        saved_model.pb
        variables
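If the model is a TF2/Keras model, a minimal sketch for producing that directory (the stand-in model and the path are illustrative only):

import tensorflow as tf

# assumption: a trained Keras model; a trivial stand-in is used here
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(128,))])

# write straight into the version folder of the Triton model repository;
# this creates saved_model.pb plus the variables/ and assets/ subfolders
tf.saved_model.save(model, "shansou_rank/1/model.savedmodel")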

config.pbtxt:

name: "shansou_rank"
platform: "tensorflow_savedmodel"
max_batch_size: 128
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    # fixed input length: inputs must be padded to the max length,
    # and text longer than the max length must be truncated
    dims: [128]
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1]
  }
]
dynamic_batching { }
# this uses the mixed-precision (Tensor Core) units on V100/T4 or better GPUs;
# it was consistently faster than TensorRT in the author's tests
optimization { execution_accelerators {
  gpu_execution_accelerator : [
    { name : "auto_mixed_precision" }
  ]
}}
version_policy: { latest { num_versions: 1 } }

NVIDIA has polished its TensorFlow support the longest, and it is the backend with the most features.

For example, TensorRT can be enabled directly in the configuration, converting the TensorFlow model to TensorRT on the fly.

This turns the formerly tedious TensorRT conversion into something that takes effect with an extremely simple configuration change (recommended).

If the parameters line is omitted, the default is lossless FP32 precision:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    #parameters { key: "precision_mode" value: "FP16" }
  } ]
}}

In practice, the author ultimately chose the mixed-precision mode:

optimization { execution_accelerators {
  gpu_execution_accelerator : [
    { name : "auto_mixed_precision" }
  ]
}}

With mixed precision enabled, a TensorFlow model can use the mixed-precision (Tensor Core) units available on GPUs of compute capability 7.0 and above (V100, T4 and newer).

A GPU effectively has two engines: the ordinary FP32 units (a street-car engine) and the mixed-precision units (a racing engine).

Converting a TensorFlow model to TensorRT is like pushing the street-car engine to its limit: a software optimization.

Running a TensorFlow model in mixed-precision mode is like running on the racing engine: a hardware boost.

In the author's tests, mixed precision clearly outperformed TensorRT (roughly 2x).

At the moment TensorRT and mixed precision cannot be enabled together (which would be the ideal optimization); hopefully that will be supported in the future.

Python code deployment

Python code can be deployed just like a model; in essence it is still input -> handler -> output.

models
└── ibuddha_chitchat_bls
    ├── 1
    │   └── model.py
    └── config.pbtxt

This section covers BLS (Business Logic Scripting), a feature available since release 21.08.

A typical chitchat service uses a GPT model: each inference step produces only one token, so the model has to be called in a loop, and each call returns a very large tensor (expensive to send over the network). Moving this loop into Triton's BLS so that it runs in-process is therefore a very good fit.

For details, see:

GitHub - triton-inference-server/python_backend: Triton backend that enables pre-processing, post-processing and other logic to be implemented in Python.

name: "ibuddha_chitchat_bls"
backend: "python"
max_batch_size: 64
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
input [
  {
    name: "INPUT__1"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
input [
  {
    name: "INPUT__2"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__1"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [{ kind: KIND_CPU }]
dynamic_batching {
}

Since this is Python code, third-party libraries come into play: they have to be added on top of the stock Triton image, so an additional custom image build is required.

One point deserves particular attention:

the Python backend here is configured with instance_group [{ kind: KIND_CPU }],

while the model it actually calls runs on the GPU.

Therefore, after

infer_response = infer_request.exec()

the result of the model inference lives on the GPU and cannot be used directly.

You must use the output tensor's to_dlpack() together with PyTorch's from_dlpack to turn the GPU data into a PyTorch tensor:

logits = from_dlpack(output0.to_dlpack())

There are two ways to convert a Triton tensor into a PyTorch tensor:

input_ids = from_dlpack(in_0.to_dlpack())

input_ids = torch.from_numpy(in_0.as_numpy())

The to_dlpack/from_dlpack route has lower overhead.

This is the model.py without further code optimization:

import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack, to_dlpack
import torch.nn.functional as F
import torch
import json
import numpy as np


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        input0_config = pb_utils.get_input_config_by_name(
            self.model_config, "INPUT__0")
        input1_config = pb_utils.get_input_config_by_name(
            self.model_config, "INPUT__1")
        input2_config = pb_utils.get_input_config_by_name(
            self.model_config, "INPUT__2")
        output0_config = pb_utils.get_output_config_by_name(
            self.model_config, "OUTPUT__0")
        output1_config = pb_utils.get_output_config_by_name(
            self.model_config, "OUTPUT__1")
        # Convert Triton types to numpy types
        self.input0_dtype = pb_utils.triton_string_to_numpy(
            input0_config['data_type'])
        self.input1_dtype = pb_utils.triton_string_to_numpy(
            input1_config['data_type'])
        self.input2_dtype = pb_utils.triton_string_to_numpy(
            input2_config['data_type'])
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])
        self.output1_dtype = pb_utils.triton_string_to_numpy(
            output1_config['data_type'])
        #self.cls, self.sep, self.pad, self.speaker1, self.speaker2 = self.tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", "[PAD]", "[speaker1]", "[speaker2]"])
        #self.special_tokens_ids = [self.cls, self.sep, self.pad, self.speaker1, self.speaker2]
        self.special_tokens_ids = [0, 2, 1, 13086, 13087]
        self.output_min_length = 1
        self.output_max_length = 64  # TODO: change
        self.temperature = 0.7
        self.top_p = 0.7
        self.round = 1

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        responses = []
        # Every Python backend must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get the input tensors
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT__1")
            in_2 = pb_utils.get_input_tensor_by_name(request, "INPUT__2")
            #pytorch_tensor = from_dlpack(in_0.to_dlpack())
            #print(pytorch_tensor)
            # Get Model Name
            #model_name = pb_utils.get_input_tensor_by_name(
            #    request, "MODEL_NAME")
            # Model Name string
            #model_name_string = model_name.as_numpy()[0]
            model_name_string = "ibuddha_chitchat"
            # Create an inference request object and perform a synchronous
            # blocking inference request for every generated token.
            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #     output_tensors=..., TritonError("An error occurred"))
            #
            # Because the infer_response of the models contains the final
            # outputs with correct output names, we can just pass the list
            # of outputs to the InferenceResponse object.
            output_ids = []
            output_confidences = []
            for i in range(self.output_max_length):
                infer_request = pb_utils.InferenceRequest(
                    model_name=model_name_string,
                    requested_output_names=["OUTPUT__0"],
                    inputs=[in_0, in_1, in_2])
                infer_response = infer_request.exec()
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message())
                output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT__0')
                #_logits = output0.as_numpy()
                #logits = torch.from_numpy(np.array(_logits))
                logits = from_dlpack(output0.to_dlpack())
                logits = logits[0, :] / self.temperature
                top_logits = self.top_filtering(logits, self.top_p)
                probs = F.softmax(top_logits, dim=-1)
                prev = torch.multinomial(probs, num_samples=1)
                if i < self.output_min_length and prev.item() in self.special_tokens_ids:
                    while prev.item() in self.special_tokens_ids:
                        prev = torch.multinomial(probs, num_samples=1)
                output_id = prev.item()
                if output_id in self.special_tokens_ids:
                    break
                output_ids.append(output_id)
                output_confidences.append(probs[output_id].item())
                # append the new token and rebuild the inputs for the next step
                input_ids = torch.from_numpy(in_0.as_numpy())
                attention_mask = torch.from_numpy(in_1.as_numpy())
                token_type_ids = torch.from_numpy(in_2.as_numpy())
                #input_ids = from_dlpack(in_0.to_dlpack())
                #attention_mask = from_dlpack(in_1.to_dlpack())
                #token_type_ids = from_dlpack(in_2.to_dlpack())
                input_ids = torch.cat((input_ids, torch.LongTensor([[output_id]])), 1)
                attention_mask = torch.cat((attention_mask, torch.LongTensor([[1]])), 1)
                token_type_ids = torch.cat((token_type_ids, torch.LongTensor([[output_id]])), 1)
                in_0 = pb_utils.Tensor("INPUT__0", input_ids.numpy().astype(self.input0_dtype))
                in_1 = pb_utils.Tensor("INPUT__1", attention_mask.numpy().astype(self.input1_dtype))
                in_2 = pb_utils.Tensor("INPUT__2", token_type_ids.numpy().astype(self.input2_dtype))
                #in_0 = pb_utils.Tensor.from_dlpack("INPUT__0", to_dlpack(input_ids))
                #in_1 = pb_utils.Tensor.from_dlpack("INPUT__1", to_dlpack(attention_mask))
                #in_2 = pb_utils.Tensor.from_dlpack("INPUT__2", to_dlpack(token_type_ids))
            #print(infer_response.output_tensors())
            output_ids = torch.tensor(output_ids)
            output_confidences = torch.tensor(output_confidences)
            output_0 = pb_utils.Tensor("OUTPUT__0", output_ids.numpy().astype(self.output0_dtype))
            output_1 = pb_utils.Tensor("OUTPUT__1", output_confidences.numpy().astype(self.output1_dtype))
            #output_0 = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(output_ids))
            #output_1 = pb_utils.Tensor.from_dlpack("OUTPUT__1", to_dlpack(output_confidences))
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[output_0, output_1])
            responses.append(inference_response)
        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def top_filtering(self, logits, top_p=0.0, threshold=-float('Inf'), filter_value=-float('Inf')):
        # nucleus (top-p) filtering; only works for batch size 1 for now
        if top_p > 0.0:
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probabilities = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            sorted_indices_to_remove = cumulative_probabilities > top_p
            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
            sorted_indices_to_remove[..., 0] = 0
            indices_to_remove = sorted_indices[sorted_indices_to_remove]
            logits[indices_to_remove] = filter_value
        indices_to_remove = logits < threshold
        logits[indices_to_remove] = filter_value
        return logits

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

For more details, refer to the examples in the python_backend repository.
