
[TensorRT] Converting a model to an engine with trtexec: exporting a TensorRT engine file


The current official conversion tool is ONNX-TensorRT: https://github.com/onnx/onnx-tensorrt

For usage notes on trtexec, see https://blog.csdn.net/qq_29007291/article/details/116135737

trtexec has two main uses:

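        Benchmarking a network - If your model is saved as a UFF file or an ONNX file, or you have a network description in Caffe prototxt format, you can use trtexec to measure inference performance. Note that if only a Caffe prototxt file is given and no model is provided, random weights are generated. trtexec has many options for specifying inputs and outputs, the number of iterations used for performance timing, the allowed precisions, and so on.

        Generating a serialized engine - Models in UFF, ONNX, or Caffe format can be built into an engine.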

1. Caffe -> engine

Generating an engine

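    # Build an engine from a Caffe model.
    # (Comments sit above the command: text after a trailing "\" breaks line continuation.)
    #   --deploy      network definition (prototxt), Caffe only
    #   --model       weights file
    #   --output      name of an output node (can be given multiple times)
    #   --batch       batch size for an implicit-batch engine (default = 1)
    #   --saveEngine  path of the engine file to write
    ./trtexec --deploy=/path/to/mnist.prototxt \
              --model=/path/to/mnist.caffemodel \
              --output=prob \
              --batch=16 \
              --saveEngine=mnist16.trt

    # Build an engine with INT8 precision enabled.
    #   --int8        enable INT8 in addition to FP32 (default = disabled)
    #   --buildOnly   skip the performance measurement
    ./trtexec --deploy=GoogleNet_N2.prototxt \
              --output=prob \
              --batch=1 \
              --saveEngine=g1.trt \
              --int8 \
              --buildOnly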

Benchmarking a network

    # Benchmark a previously built engine.
    ./trtexec --loadEngine=mnist16.trt --batch=16

    # Run AlexNet on the NVIDIA DLA (Deep Learning Accelerator) in FP16 mode.
    #   --deploy            network definition (prototxt), Caffe only
    #   --output            name of an output node (can be given multiple times)
    #   --useDLACore        DLA core to use
    #   --fp16              enable FP16 in addition to FP32 (default = disabled)
    #   --allowGPUFallback  with DLA enabled, let unsupported layers fall back to the GPU (default = disabled)
    ./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt \
              --output=prob \
              --useDLACore=1 \
              --fp16 \
              --allowGPUFallback

    # Run AlexNet on the DLA in INT8 mode.
    #   --int8              enable INT8 in addition to FP32 (default = disabled)
    ./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt \
              --output=prob \
              --useDLACore=1 \
              --int8 \
              --allowGPUFallback

    # Benchmark the model, print the measured performance, and write the timings to a JSON file.
    #   --exportTimes       JSON file for the timing results (default = disabled)
    ./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt \
              --output=prob \
              --exportTimes=trace.json

    # Tune throughput with multiple streams.
    ./trtexec --loadEngine=g1.trt --batch=1 --streams=2
    ./trtexec --loadEngine=g1.trt --batch=1 --streams=3
    ./trtexec --loadEngine=g1.trt --batch=1 --streams=4
    ./trtexec --loadEngine=g2.trt --batch=2 --streams=2
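
The stream-count sweep above can also be generated by a small loop. The following is only a sketch: the helper name `streams_sweep` and the `TRTEXEC` variable are made up for illustration, and the function only prints the commands (a dry run) so they can be reviewed before anything is executed.

```shell
# Path to the trtexec binary; an assumption, adjust to your installation.
TRTEXEC="${TRTEXEC:-./trtexec}"

# Print one benchmark command per stream count (nothing is executed).
# Usage: streams_sweep <engine_file> <batch> <streams>...
streams_sweep() {
    local engine="$1"
    local batch="$2"
    shift 2
    local n
    for n in "$@"; do
        echo "$TRTEXEC --loadEngine=$engine --batch=$batch --streams=$n"
    done
}

streams_sweep g1.trt 1 2 3 4
```

Piping the output to `sh` would run the printed commands one after another.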

2. ONNX -> engine

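    # Build an engine with a static batch size.
    # (Comments sit above the command: text after a trailing "\" breaks line continuation.)
    #   --onnx           ONNX model file
    #   --explicitBatch  build with an explicit batch dimension (default = implicit)
    #   --saveEngine     path of the engine file to write
    #   --workspace      workspace size in MB (default = 16)
    #   --fp16           enable FP16 in addition to FP32 (default = disabled)
    ./trtexec --onnx=<onnx_file> \
              --explicitBatch \
              --saveEngine=<tensorRT_engine_file> \
              --workspace=<size_in_megabytes> \
              --fp16

    # Build an engine with a dynamic batch size.
    # "input" is the name of the model's input tensor; use the name from your own ONNX file.
    #   --minShapes  smallest input shape: batch x channels x height x width
    #   --optShapes  shape to optimize for; setting it equal to maxShapes is fine
    #   --maxShapes  largest input shape
    ./trtexec --onnx=<onnx_file> \
              --minShapes=input:<shape_of_min_batch> \
              --optShapes=input:<shape_of_opt_batch> \
              --maxShapes=input:<shape_of_max_batch> \
              --workspace=<size_in_megabytes> \
              --saveEngine=<engine_file> \
              --fp16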
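
When only the batch dimension is dynamic, the three shape options differ in just one number, so they can be produced by a helper. This is a minimal sketch, assuming a single input tensor; `shape_flags` is a hypothetical helper, not part of trtexec, and the tensor name `input` has to match the name in the ONNX model.

```shell
# Print --minShapes/--optShapes/--maxShapes for a single input whose batch is dynamic.
# Usage: shape_flags <tensor_name> <min_batch> <opt_batch> <max_batch> <CxHxW>
shape_flags() {
    local name="$1"
    local min_b="$2"
    local opt_b="$3"
    local max_b="$4"
    local chw="$5"
    # An optimization profile needs min <= opt <= max in every dimension.
    if [ "$min_b" -gt "$opt_b" ] || [ "$opt_b" -gt "$max_b" ]; then
        echo "expected min <= opt <= max, got $min_b $opt_b $max_b" >&2
        return 1
    fi
    echo "--minShapes=$name:${min_b}x$chw --optShapes=$name:${opt_b}x$chw --maxShapes=$name:${max_b}x$chw"
}

shape_flags input 1 8 8 3x416x416
```

The printed flags can be pasted into a trtexec command line next to `--onnx`, `--workspace` and `--saveEngine`.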

Example:

    # Smaller images leave room for a larger batch size, here 8x3x416x416.
    /home/zxl/TensorRT-7.2.3.4/bin/trtexec  --onnx=yolov4_-1_3_416_416_dynamic.onnx \
                                            --minShapes=input:1x3x416x416 \
                                            --optShapes=input:8x3x416x416 \
                                            --maxShapes=input:8x3x416x416 \
                                            --workspace=4096 \
                                            --saveEngine=yolov4_-1_3_416_416_dynamic_b8_fp16.engine \
                                            --fp16

    # Memory ran out at 608x608, so the maximum shape is reduced to 4x3x608x608.
    /home/zxl/TensorRT-7.2.3.4/bin/trtexec  --onnx=yolov4_-1_3_608_608_dynamic.onnx \
                                            --minShapes=input:1x3x608x608 \
                                            --optShapes=input:4x3x608x608 \
                                            --maxShapes=input:4x3x608x608 \
                                            --workspace=4096 \
                                            --saveEngine=yolov4_-1_3_608_608_dynamic_b4_fp16.engine \
                                            --fp16
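
A quick count of input elements suggests why the batch size was halved for the larger resolution. This is only a rough proxy: the memory TensorRT actually needs depends on the network and the selected tactics, not just on the input tensor. `shape_elems` is a hypothetical helper written for this illustration.

```shell
# Multiply out the dimensions of a shape string such as 8x3x416x416.
shape_elems() {
    local total=1
    local d
    local IFS=x          # split the shape string on "x"
    for d in $1; do
        total=$((total * d))
    done
    echo "$total"
}

shape_elems 8x3x416x416   # 4153344
shape_elems 4x3x608x608   # 4435968
shape_elems 8x3x608x608   # 8871936
```

Batch 4 at 608x608 carries about as many input elements as batch 8 at 416x416, while batch 8 at 608x608 would carry roughly twice as many.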

Unless --buildOnly is given, the run that builds the engine also reports performance measurements.

3. trtexec command-line options

    (base) zxl@R7000P:~/TensorRT-7.2.3.4/bin$ ./trtexec --help
    &&&& RUNNING TensorRT.trtexec # ./trtexec --help
    === Model Options ===
      --uff=<file>                UFF model
      --onnx=<file>               ONNX model
      --model=<file>              Caffe model (default = no model, random weights used)
      --deploy=<file>             Caffe prototxt file
      --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
      --uffInput=<name>,X,Y,Z     Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
      --uffNHWC                   Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
    === Build Options ===
      --maxBatch                  Set max batch size and build an implicit batch engine (default = 1)
      --explicitBatch             Use explicit batch sizes when building the engine (default = implicit)
      --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
      --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
      --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided
      --minShapesCalib=spec       Calibrate with dynamic shapes using a profile with the min shapes provided
      --optShapesCalib=spec       Calibrate with dynamic shapes using a profile with the opt shapes provided
      --maxShapesCalib=spec       Calibrate with dynamic shapes using a profile with the max shapes provided
                                  Note: All three of min, opt and max shapes must be supplied.
                                        However, if only opt shapes is supplied then it will be expanded so
                                        that min shapes and max shapes are set to the same values as opt shapes.
                                        In addition, use of dynamic shapes implies explicit batch.
                                        Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
      --inputIOFormats=spec       Type and format of each of the input tensors (default = all inputs in fp32:chw)
                                  See --outputIOFormats help for the grammar of type and format list.
                                  Note: If this option is specified, please set comma-separated types and formats for all
                                        inputs following the same order as network inputs ID (even if only one input
                                        needs specifying IO format) or set the type and format once for broadcasting.
      --outputIOFormats=spec      Type and format of each of the output tensors (default = all outputs in fp32:chw)
                                  Note: If this option is specified, please set comma-separated types and formats for all
                                        outputs following the same order as network outputs ID (even if only one output
                                        needs specifying IO format) or set the type and format once for broadcasting.
                                  IO Formats: spec  ::= IOfmt[","spec]
                                              IOfmt ::= type:fmt
                                              type  ::= "fp32"|"fp16"|"int32"|"int8"
                                              fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8")["+"fmt]
      --workspace=N               Set workspace size in megabytes (default = 16)
      --noBuilderCache            Disable timing cache in builder (default is to enable timing cache)
      --nvtxMode=mode             Specify NVTX annotation verbosity. mode ::= default|verbose|none
      --minTiming=M               Set the minimum number of iterations used in kernel selection (default = 1)
      --avgTiming=M               Set the number of times averaged in each iteration for kernel selection (default = 8)
      --noTF32                    Disable tf32 precision (default is to enable tf32, in addition to fp32)
      --refit                     Mark the engine as refittable. This will allow the inspection of refittable layers
                                  and weights within the engine.
      --fp16                      Enable fp16 precision, in addition to fp32 (default = disabled)
      --int8                      Enable int8 precision, in addition to fp32 (default = disabled)
      --best                      Enable all precisions to achieve the best performance (default = disabled)
      --calib=<file>              Read INT8 calibration cache file
      --safe                      Only test the functionality available in safety restricted flows
      --saveEngine=<file>         Save the serialized engine
      --loadEngine=<file>         Load a serialized engine
      --tacticSources=tactics     Specify the tactics to be used by adding (+) or removing (-) tactics from the default
                                  tactic sources (default = all available tactics).
                                  Note: Currently only cuBLAS and cuBLAS LT are listed as optional tactics.
                                  Tactic Sources: tactics ::= [","tactic]
                                                  tactic  ::= (+|-)lib
                                                  lib     ::= "cublas"|"cublasLt"
    === Inference Options ===
      --batch=N                   Set batch size for implicit batch engines (default = 1)
      --shapes=spec               Set input shapes for dynamic shapes inference inputs.
                                  Note: Use of dynamic shapes implies explicit batch.
                                        Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
      --loadInputs=spec           Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
                                  Input values spec ::= Ival[","spec]
                                               Ival ::= name":"file
      --iterations=N              Run at least N inference iterations (default = 10)
      --warmUp=N                  Run for N milliseconds to warmup before measuring performance (default = 200)
      --duration=N                Run performance measurements for at least N seconds wallclock time (default = 3)
      --sleepTime=N               Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
      --streams=N                 Instantiate N engines to use concurrently (default = 1)
      --exposeDMA                 Serialize DMA transfers to and from device. (default = disabled)
      --noDataTransfers           Do not transfer data to and from the device during inference. (default = disabled)
      --useSpinWait               Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
      --threads                   Enable multithreading to drive engines with independent threads (default = disabled)
      --useCudaGraph              Use cuda graph to capture engine execution and then launch inference (default = disabled)
      --separateProfileRun        Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
      --buildOnly                 Skip inference perf measurement (default = disabled)
    === Build and Inference Batch Options ===
                                  When using implicit batch, the max batch size of the engine, if not given,
                                  is set to the inference batch size;
                                  when using explicit batch, if shapes are specified only for inference, they
                                  will be used also as min/opt/max in the build profile; if shapes are
                                  specified only for the build, the opt shapes will be used also for inference;
                                  if both are specified, they must be compatible; and if explicit batch is
                                  enabled but neither is specified, the model must provide complete static
                                  dimensions, including batch size, for all inputs
    === Reporting Options ===
      --verbose                   Use verbose logging (default = false)
      --avgRuns=N                 Report performance measurements averaged over N consecutive iterations (default = 10)
      --percentile=P              Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
      --dumpRefit                 Print the refittable layers and weights from a refittable engine
      --dumpOutput                Print the output tensor(s) of the last inference iteration (default = disabled)
      --dumpProfile               Print profile information per layer (default = disabled)
      --exportTimes=<file>        Write the timing results in a json file (default = disabled)
      --exportOutput=<file>       Write the output tensors to a json file (default = disabled)
      --exportProfile=<file>      Write the profile information per layer in a json file (default = disabled)
    === System Options ===
      --device=N                  Select cuda device N (default = 0)
      --useDLACore=N              Select DLA core N for layers that support DLA (default = none)
      --allowGPUFallback          When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
      --plugins                   Plugin library (.so) to load (can be specified multiple times)
    === Help ===
      --help, -h                  Print this message
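
The `--shapes` option listed under Inference Options is what selects the input shape when benchmarking an engine built with a min/opt/max profile. The sketch below prints such a command after checking that the requested batch lies inside the profile range. `bench_cmd` is a hypothetical helper, and the engine name, tensor name and batch range are taken from the 416x416 example earlier in this article.

```shell
# Print a benchmark command for a dynamic-batch engine (dry run, nothing is executed).
# Usage: bench_cmd <engine_file> <tensor_name> <batch> <CxHxW> <min_batch> <max_batch>
bench_cmd() {
    local engine="$1"
    local name="$2"
    local batch="$3"
    local chw="$4"
    local min_b="$5"
    local max_b="$6"
    # The inference shape has to fall within the profile the engine was built with.
    if [ "$batch" -lt "$min_b" ] || [ "$batch" -gt "$max_b" ]; then
        echo "batch $batch is outside the profile range [$min_b, $max_b]" >&2
        return 1
    fi
    echo "./trtexec --loadEngine=$engine --shapes=$name:${batch}x$chw"
}

bench_cmd yolov4_-1_3_416_416_dynamic_b8_fp16.engine input 4 3x416x416 1 8
```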


————————————————
Thanks to:https://blog.csdn.net/weixin_41562691/article/details/118277574
