What TensorRT is used for will not be covered here.
Before TensorRT can be used to speed up inference, a model has to be trained first;
the trained model is then converted to ONNX, and the ONNX model is converted to a TensorRT engine.
1. Converting the trained model to ONNX
Only the PyTorch-to-ONNX conversion is shown here; there are plenty of tutorials online for other frameworks.
import torch
import onnx

# 'best.pt' is assumed to contain the full model object (not just a state_dict);
# otherwise build the model first and call load_state_dict().
model = torch.load('best.pt')
model.eval()

input_names = ['input']
output_names = ['output']
# Dummy input; its shape must match the input size the model expects.
x = torch.randn(1, 3, 32, 32, requires_grad=True)

torch.onnx.export(model, x, 'flame.onnx',
                  input_names=input_names,
                  output_names=output_names,
                  verbose=True)
Run the script and flame.onnx is written out.
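Before handing the file to TensorRT it can be worth checking that the export actually produced a valid graph. A minimal sketch with the onnx package, assuming the file name flame.onnx from above:

import onnx

onnx_model = onnx.load('flame.onnx')                    # load the exported graph
onnx.checker.check_model(onnx_model)                    # raises if the model is malformed
print(onnx.helper.printable_graph(onnx_model.graph))    # readable summary of the graph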
2. Converting ONNX to an engine
You can convert the ONNX model to an engine directly with trtexec, which ships with TensorRT.
Go into the bin directory under the TensorRT installation directory and you will find trtexec. On Ubuntu it is usually located at:
/usr/src/tensorrt/bin
Run trtexec -h to see the help:
=== Model Options ===
  --uff=<file>              UFF model
  --onnx=<file>             ONNX model
  --model=<file>            Caffe model (default = no model, random weights used)
  --deploy=<file>           Caffe prototxt file
  --output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
  --uffInput=<name>,X,Y,Z   Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
  --uffNHWC                 Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)

=== Build Options ===
  --maxBatch                Set max batch size and build an implicit batch engine (default = 1)
  --explicitBatch           Use explicit batch sizes when building the engine (default = implicit)
  --minShapes=spec          Build with dynamic shapes using a profile with the min shapes provided
  --optShapes=spec          Build with dynamic shapes using a profile with the opt shapes provided
  --maxShapes=spec          Build with dynamic shapes using a profile with the max shapes provided
  --minShapesCalib=spec     Calibrate with dynamic shapes using a profile with the min shapes provided
  --optShapesCalib=spec     Calibrate with dynamic shapes using a profile with the opt shapes provided
  --maxShapesCalib=spec     Calibrate with dynamic shapes using a profile with the max shapes provided
                            Note: All three of min, opt and max shapes must be supplied. However, if only opt shapes
                                  is supplied then it will be expanded so that min shapes and max shapes are set to
                                  the same values as opt shapes. In addition, use of dynamic shapes implies explicit
                                  batch. Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
  --inputIOFormats=spec     Type and format of each of the input tensors (default = all inputs in fp32:chw)
                            See --outputIOFormats help for the grammar of type and format list.
                            Note: If this option is specified, please set comma-separated types and formats for all
                                  inputs following the same order as network inputs ID (even if only one input needs
                                  specifying IO format) or set the type and format once for broadcasting.
  --outputIOFormats=spec    Type and format of each of the output tensors (default = all outputs in fp32:chw)
                            Note: If this option is specified, please set comma-separated types and formats for all
                                  outputs following the same order as network outputs ID (even if only one output
                                  needs specifying IO format) or set the type and format once for broadcasting.
                            IO Formats: spec  ::= IOfmt[","spec]
                                        IOfmt ::= type:fmt
                                        type  ::= "fp32"|"fp16"|"int32"|"int8"
                                        fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8")["+"fmt]
  --workspace=N             Set workspace size in megabytes (default = 16)
  --noBuilderCache          Disable timing cache in builder (default is to enable timing cache)
  --nvtxMode=mode           Specify NVTX annotation verbosity. mode ::= default|verbose|none
  --minTiming=M             Set the minimum number of iterations used in kernel selection (default = 1)
  --avgTiming=M             Set the number of times averaged in each iteration for kernel selection (default = 8)
  --noTF32                  Disable tf32 precision (default is to enable tf32, in addition to fp32)
  --refit                   Mark the engine as refittable. This will allow the inspection of refittable layers and
                            weights within the engine.
  --fp16                    Enable fp16 precision, in addition to fp32 (default = disabled)
  --int8                    Enable int8 precision, in addition to fp32 (default = disabled)
  --best                    Enable all precisions to achieve the best performance (default = disabled)
  --calib=<file>            Read INT8 calibration cache file
  --safe                    Only test the functionality available in safety restricted flows
  --saveEngine=<file>       Save the serialized engine
  --loadEngine=<file>       Load a serialized engine
  --tacticSources=tactics   Specify the tactics to be used by adding (+) or removing (-) tactics from the default
                            tactic sources (default = all available tactics).
                            Note: Currently only cuBLAS and cuBLAS LT are listed as optional tactics.
                            Tactic Sources: tactics ::= [","tactic]
                                            tactic  ::= (+|-)lib
                                            lib     ::= "cublas"|"cublasLt"

=== Inference Options ===
  --batch=N                 Set batch size for implicit batch engines (default = 1)
  --shapes=spec             Set input shapes for dynamic shapes inference inputs.
                            Note: Use of dynamic shapes implies explicit batch.
                                  Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
  --loadInputs=spec         Load input values from files (default = generate random inputs). Input names can be
                            wrapped with single quotes (ex: 'Input:0')
                            Input values spec ::= Ival[","spec]
                                         Ival ::= name":"file
  --iterations=N            Run at least N inference iterations (default = 10)
  --warmUp=N                Run for N milliseconds to warmup before measuring performance (default = 200)
  --duration=N              Run performance measurements for at least N seconds wallclock time (default = 3)
  --sleepTime=N             Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
  --streams=N               Instantiate N engines to use concurrently (default = 1)
  --exposeDMA               Serialize DMA transfers to and from device. (default = disabled)
  --noDataTransfers         Do not transfer data to and from the device during inference. (default = disabled)
  --useSpinWait             Actively synchronize on GPU events. This option may decrease synchronization time but
                            increase CPU usage and power (default = disabled)
  --threads                 Enable multithreading to drive engines with independent threads (default = disabled)
  --useCudaGraph            Use cuda graph to capture engine execution and then launch inference (default = disabled)
  --separateProfileRun      Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile
                            run will be executed (default = disabled)
  --buildOnly               Skip inference perf measurement (default = disabled)

=== Build and Inference Batch Options ===
                            When using implicit batch, the max batch size of the engine, if not given, is set to the
                            inference batch size; when using explicit batch, if shapes are specified only for
                            inference, they will be used also as min/opt/max in the build profile; if shapes are
                            specified only for the build, the opt shapes will be used also for inference; if both are
                            specified, they must be compatible; and if explicit batch is enabled but neither is
                            specified, the model must provide complete static dimensions, including batch size, for
                            all inputs

=== Reporting Options ===
  --verbose                 Use verbose logging (default = false)
  --avgRuns=N               Report performance measurements averaged over N consecutive iterations (default = 10)
  --percentile=P            Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100
                            representing min perf; (default = 99%)
  --dumpRefit               Print the refittable layers and weights from a refittable engine
  --dumpOutput              Print the output tensor(s) of the last inference iteration (default = disabled)
  --dumpProfile             Print profile information per layer (default = disabled)
  --exportTimes=<file>      Write the timing results in a json file (default = disabled)
  --exportOutput=<file>     Write the output tensors to a json file (default = disabled)
  --exportProfile=<file>    Write the profile information per layer in a json file (default = disabled)

=== System Options ===
  --device=N                Select cuda device N (default = 0)
  --useDLACore=N            Select DLA core N for layers that support DLA (default = none)
  --allowGPUFallback        When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
  --plugins                 Plugin library (.so) to load (can be specified multiple times)
Then just run:
trtexec --onnx=flame_sim.onnx --saveEngine=flame_sim.engine --best --workspace=1024 --minShapes=input:1x3x224x224 --optShapes=input:1x3x224x224 --maxShapes=input:1x3x224x224
Remember to set the workspace size; the default is only 16 (in MB). --minShapes, --optShapes and --maxShapes set the minimum, optimum and maximum input shapes, and the name before the colon must match the input name of the ONNX model ('input' in the export script above). Also note: do not save the engine inside the bin directory, otherwise saving the engine may fail with an error.
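Once flame_sim.engine has been saved, it can be loaded and run from Python with the TensorRT runtime plus pycuda for the device buffers. This is only a minimal sketch under the assumption that the engine has a single fp32 input at binding 0 (1x3x224x224) and a single fp32 output at binding 1; adapt the file name, shapes and dtypes to your own model.

import numpy as np
import tensorrt as trt
import pycuda.autoinit   # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine saved by trtexec
with open('flame_sim.engine', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Binding 0 is assumed to be the input, binding 1 the output.
# If the engine was built with dynamic shapes, set the actual input shape first:
# context.set_binding_shape(0, (1, 3, 224, 224))
h_input = np.random.random((1, 3, 224, 224)).astype(np.float32)
h_output = np.empty(tuple(context.get_binding_shape(1)), dtype=np.float32)

d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

cuda.memcpy_htod(d_input, h_input)                   # host -> device
context.execute_v2([int(d_input), int(d_output)])    # synchronous inference
cuda.memcpy_dtoh(h_output, d_output)                 # device -> host
print(h_output)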
Sometimes converting to an engine prints an error like this:
onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64
weights, while TensorRT does not natively support INT64. Attempting to
cast down to INT32.
This happens because your ONNX model was generated with INT64 weights, while TensorRT only natively supports INT32, so the ONNX model should be converted to a simpler one. This is done with onnx-simplifier, which can be installed directly with:
pip install onnx-simplifier
Once it is installed, run the conversion:
python -m onnxsim .\flame.onnx .\flame_sim.onnx
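onnx-simplifier can also be called from Python instead of the command line. A small sketch, assuming the same file names as above:

import onnx
from onnxsim import simplify

model = onnx.load('flame.onnx')
model_sim, ok = simplify(model)      # returns the simplified model and a validation flag
assert ok, 'simplified ONNX model could not be validated'
onnx.save(model_sim, 'flame_sim.onnx')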