
TensorRT Installation and Usage (Python)


Official tutorials

TensorRT installation: Installation Guide :: NVIDIA Deep Learning TensorRT Documentation

Video tutorial: TensorRT 教程 | 基于 8.6.1 版本 | 第一部分 (bilibili)

Code samples: trt-samples-for-hackathon-cn/cookbook at master · NVIDIA/trt-samples-for-hackathon-cn (github.com)

Installing TensorRT

Official guide:

Installation Guide :: NVIDIA Deep Learning TensorRT Documentation

The main ways to install TensorRT are:

1. Install with pip;

2. Download a tar, zip, or deb package and install from it;

3. Use a Docker container: TensorRT Container Release Notes

Windows

First, pick a TensorRT release that matches your local NVIDIA driver, CUDA version, and cuDNN version.

My setup: CUDA 11.4, with the cuDNN build for CUDA 11.4.

On Windows, downloading the zip package is recommended; a reference tutorial:

windows安装tensorrt - 知乎 (zhihu.com)

Ubuntu

First, pick a TensorRT release that matches your local NVIDIA driver, CUDA version, and cuDNN version.

My setup: CUDA 11.7; cuDNN 8.9.0.

1. Install with pip:

pip install tensorrt==8.6.1

This failed on my machine.

2. Install from the deb package

os="ubuntuxx04"
tag="8.x.x-cuda-x.x"
sudo dpkg -i nv-tensorrt-local-repo-${os}-${tag}_1.0-1_amd64.deb
sudo cp /var/nv-tensorrt-local-repo-${os}-${tag}/*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install tensorrt

This did not work for me either.

3. Install from the tar archive (recommended)

This is the recommended method; it has the highest success rate.

Download the matching version from developer.nvidia.com/tensorrt-download

After downloading:

tar -xzvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz  # extract the archive
# Add the lib directory to LD_LIBRARY_PATH (preferably the absolute path of the extracted directory)
vim ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./TensorRT-8.6.1.6/lib
source ~/.bashrc
# Or copy the contents of TensorRT-8.6.1.6/lib directly into cuda/lib64
sudo cp -r ./lib/* /usr/local/cuda/lib64/
# Install the Python package
cd TensorRT-8.6.1.6/python
pip install tensorrt-xxx-none-linux_x86_64.whl

After installation, verify it:

# Verify the installation:
python
>>> import tensorrt
>>> print(tensorrt.__version__)
>>> assert tensorrt.Builder(tensorrt.Logger())

If no errors are raised, the installation succeeded.
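Beyond the basic import check, here is a minimal sketch of my own (not from the original guide, assuming the TensorRT 8.6 Python API) that also reports the installed version and whether the current GPU has fast FP16/INT8 support:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

print("TensorRT version:", trt.__version__)
# Builder properties that report hardware capabilities of the current GPU
print("fast FP16 support:", builder.platform_has_fast_fp16)
print("fast INT8 support:", builder.platform_has_fast_int8)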

Usage

My workflow is: PyTorch -> ONNX -> TensorRT.

ResNet-18 is used as the example model for the conversion.

PyTorch to ONNX

Install onnx, plus either onnxruntime or onnxruntime-gpu (one of the two is enough); a quick check of the available execution providers follows the commands below.

pip install onnx
pip install onnxruntime
pip install onnxruntime-gpu  # GPU version
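To confirm that the GPU build is actually usable, a minimal check of my own (not from the original guide) using ONNX Runtime's provider query:

import onnxruntime as ort

# CUDAExecutionProvider should appear here if onnxruntime-gpu is installed correctly
print(ort.get_available_providers())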

Convert the PyTorch model to an ONNX model:

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dummy_input = torch.randn(1, 3, 224, 224, device=device)
model.to(device)
model.eval()
output = model(dummy_input)
print("pytorch result:", torch.argmax(output))

import torch.onnx
torch.onnx.export(model, dummy_input, './model.onnx', input_names=["input"], output_names=["output"], do_constant_folding=True, verbose=True, keep_initializers_as_inputs=True, opset_version=14, dynamic_axes={"input": {0: "nBatchSize"}, "output": {0: "nBatchSize"}})
# General pattern:
# torch.onnx.export(model, torch.randn(1, c, nHeight, nWidth, device="cuda"), './model.onnx', input_names=["x"], output_names=["y", "z"], do_constant_folding=True, verbose=True, keep_initializers_as_inputs=True, opset_version=14, dynamic_axes={"x": {0: "nBatchSize"}, "z": {0: "nBatchSize"}})

import onnx
import numpy as np
import onnxruntime as ort

model_onnx_path = './model.onnx'
# Check that the exported model is well formed
onnx_model = onnx.load(model_onnx_path)
onnx.checker.check_model(onnx_model)
# Create an ONNX Runtime inference session
ort_session = ort.InferenceSession(model_onnx_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
# Prepare the input data
input_data = {
    'input': dummy_input.cpu().numpy()
}
# Run inference
y_pred_onnx = ort_session.run(None, input_data)
print("onnx result:", np.argmax(y_pred_onnx[0]))
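Comparing only the argmax of the two outputs can hide small numerical differences. A short follow-up sketch of my own (reusing output and y_pred_onnx from the script above; the tolerance is an assumption, typical for FP32 inference):

import numpy as np

# Maximum absolute difference between the PyTorch and ONNX Runtime outputs
diff = np.abs(output.detach().cpu().numpy() - y_pred_onnx[0]).max()
print("max abs diff (pytorch vs onnx):", diff)
assert diff < 1e-3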

ONNX to TensorRT

On Windows (zip install), use TensorRT-8.6.1.6/bin/trtexec.exe to build the TensorRT engine file.

On Ubuntu (tar install), use TensorRT-8.6.1.6/bin/trtexec to build the TensorRT engine file.

./trtexec --onnx=model.onnx --saveEngine=model.trt --fp16 --workspace=16 --shapes=input:2x3x224x224

Key options:

--fp16: enable FP16 precision

--shapes: the input dimensions. TensorRT also supports dynamic batch sizes, which are worth experimenting with; see the Python build sketch below.
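As an alternative to trtexec, the engine can also be built from the ONNX file with TensorRT's Python builder API. Below is a minimal sketch of my own (based on the TensorRT 8.6 API, not part of the original post; the 1–8 batch range is an assumption) that registers an optimization profile so the engine accepts a dynamic batch size:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("./model.onnx"):
    raise RuntimeError("failed to parse model.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # same effect as trtexec --fp16
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace

# Optimization profile: min / optimal / max shapes for the dynamic batch dimension
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (2, 3, 224, 224), (8, 3, 224, 224))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("./model.trt", "wb") as f:
    f.write(engine_bytes)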

Using the TensorRT engine

NVIDIA's official usage examples:

trt-samples-for-hackathon-cn/cookbook at master · NVIDIA/trt-samples-for-hackathon-cn (github.com)

Print information about the converted TensorRT engine:

import tensorrt as trt

# Load the TensorRT engine
logger = trt.Logger(trt.Logger.INFO)
with open('./model.trt', "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    for idx in range(engine.num_io_tensors):  # num_io_tensors replaces the deprecated num_bindings
        name = engine.get_tensor_name(idx)
        is_input = engine.get_tensor_mode(name)
        op_type = engine.get_tensor_dtype(name)
        shape = engine.get_tensor_shape(name)
        print('tensor id:', idx, '\tmode:', is_input, '\tname:', name, '\tshape:', shape, '\ttype:', op_type)

Benchmark the converted TensorRT engine (adapted from NVIDIA's cookbook/08-Advance/MultiStream/main.py):

from time import time

import numpy as np
import tensorrt as trt
from cuda import cudart  # install with: pip install cuda-python

np.random.seed(31193)
nWarmUp = 10
nTest = 30
nB, nC, nH, nW = 1, 3, 224, 224
data = dummy_input.cpu().numpy()  # reuse dummy_input from the PyTorch export script above

def run1(engine):
    input_name = engine.get_tensor_name(0)
    output_name = engine.get_tensor_name(1)
    output_type = engine.get_tensor_dtype(output_name)
    output_shape = engine.get_tensor_shape(output_name)
    context = engine.create_execution_context()
    context.set_input_shape(input_name, [nB, nC, nH, nW])
    _, stream = cudart.cudaStreamCreate()

    inputH0 = np.ascontiguousarray(data.reshape(-1))
    outputH0 = np.empty(output_shape, dtype=trt.nptype(output_type))
    _, inputD0 = cudart.cudaMallocAsync(inputH0.nbytes, stream)
    _, outputD0 = cudart.cudaMallocAsync(outputH0.nbytes, stream)

    # Do one complete inference
    cudart.cudaMemcpyAsync(inputD0, inputH0.ctypes.data, inputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
    context.execute_async_v2([int(inputD0), int(outputD0)], stream)
    cudart.cudaMemcpyAsync(outputH0.ctypes.data, outputD0, outputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
    cudart.cudaStreamSynchronize(stream)

    # Count time of memory copy from host to device
    for i in range(nWarmUp):
        cudart.cudaMemcpyAsync(inputD0, inputH0.ctypes.data, inputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
    trtTimeStart = time()
    for i in range(nTest):
        cudart.cudaMemcpyAsync(inputD0, inputH0.ctypes.data, inputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
    cudart.cudaStreamSynchronize(stream)
    trtTimeEnd = time()
    print("%6.3fms - 1 stream, DataCopyHtoD" % ((trtTimeEnd - trtTimeStart) / nTest * 1000))

    # Count time of inference
    for i in range(nWarmUp):
        context.execute_async_v2([int(inputD0), int(outputD0)], stream)
    trtTimeStart = time()
    for i in range(nTest):
        context.execute_async_v2([int(inputD0), int(outputD0)], stream)
    cudart.cudaStreamSynchronize(stream)
    trtTimeEnd = time()
    print("%6.3fms - 1 stream, Inference" % ((trtTimeEnd - trtTimeStart) / nTest * 1000))

    # Count time of memory copy from device to host
    for i in range(nWarmUp):
        cudart.cudaMemcpyAsync(outputH0.ctypes.data, outputD0, outputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
    trtTimeStart = time()
    for i in range(nTest):
        cudart.cudaMemcpyAsync(outputH0.ctypes.data, outputD0, outputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
    cudart.cudaStreamSynchronize(stream)
    trtTimeEnd = time()
    print("%6.3fms - 1 stream, DataCopyDtoH" % ((trtTimeEnd - trtTimeStart) / nTest * 1000))

    # Count time of end to end
    for i in range(nWarmUp):
        context.execute_async_v2([int(inputD0), int(outputD0)], stream)
    trtTimeStart = time()
    for i in range(nTest):
        cudart.cudaMemcpyAsync(inputD0, inputH0.ctypes.data, inputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
        context.execute_async_v2([int(inputD0), int(outputD0)], stream)
        cudart.cudaMemcpyAsync(outputH0.ctypes.data, outputD0, outputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
    cudart.cudaStreamSynchronize(stream)
    trtTimeEnd = time()
    print("%6.3fms - 1 stream, DataCopy + Inference" % ((trtTimeEnd - trtTimeStart) / nTest * 1000))

    cudart.cudaStreamDestroy(stream)
    cudart.cudaFree(inputD0)
    cudart.cudaFree(outputD0)
    print("tensorrt result:", np.argmax(outputH0))

if __name__ == "__main__":
    cudart.cudaDeviceSynchronize()
    with open("./model.trt", "rb") as f:  # read the serialized TensorRT engine
        engine_bytes = f.read()
    runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))  # create a Runtime (with a Logger)
    engine = runtime.deserialize_cuda_engine(engine_bytes)  # deserialize the engine from the file
    run1(engine)  # do inference with a single stream
    print(dummy_input.shape, dummy_input.dtype)
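The script above uses the binding-list form execute_async_v2. TensorRT 8.5+ also exposes a name-based launch API; a minimal alternative sketch of my own (assuming the same context, tensor names, device buffers, and stream as inside run1 above):

# Bind device pointers by tensor name, then launch with execute_async_v3
context.set_tensor_address(input_name, int(inputD0))
context.set_tensor_address(output_name, int(outputD0))
context.execute_async_v3(stream)
cudart.cudaStreamSynchronize(stream)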
