1. The official code converts to ONNX with Python 2, e.g. this repo: https://github.com/Cw-zero/TensorRT_yolo3_module
There is also a Python 3 conversion here: https://github.com/jkjung-avt/tensorrt_demos. I could not get a low-enough ONNX version installed under Python 3, so I have not tested that code.
2. Error: ValueError: not enough values to unpack (expected 2, got 1)
I used the official yolov3.cfg. Two things to watch: (a) there must be at least one blank line between sections; (b) the .cfg file must end with two blank lines. With only one trailing blank line I reliably hit this error, which cost me a lot of time; some posts online claim one trailing blank line is enough.
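Since the trailing blank lines are easy to miss, a tiny helper (my own sketch, not part of either repo) can pad the cfg before conversion:

```python
def ensure_trailing_blank_lines(path, count=2):
    """Pad a darknet .cfg file so it ends with `count` blank lines,
    which the converter's cfg parser requires."""
    with open(path) as f:
        text = f.read().rstrip('\n')
    with open(path, 'w') as f:
        # one '\n' ends the last section line, plus `count` empty lines
        f.write(text + '\n' * (count + 1))
```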
3. Errors: onnx.onnx_cpp2py_export.checker.ValidationError: Op registered for Upsample is deprecated in domain_version of 12
and onnx.onnx_cpp2py_export.checker.ValidationError: Node (086_upsample) has input size 1 not in range [min=2, max=2]
Both mean the installed ONNX version is too new. Downgrade to 1.2.1 (1.4.1 still raises the error):
pip2 install onnx==1.2.1
4. Converting yolov3 to ONNX with Python 2 succeeded, but under Python 3 the ONNX downgrade itself kept failing:
(base) lgy@lgy:~$ pip install onnx==1.2.1
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting onnx==1.2.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0c/ce/e66db6ac8462eeca295b30749ec3497c8d607d822de03288531577c725ce/onnx-1.2.1.tar.gz (2.6 MB)
     |████████████████████████████████| 2.6 MB 48 kB/s
    ERROR: Command errored out with exit status 1:
     command: /home/lgy/anaconda3/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jz71wfj8/onnx/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jz71wfj8/onnx/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-xwrptm41
         cwd: /tmp/pip-install-jz71wfj8/onnx/
    Complete output (6 lines):
    fatal: Not a git repository (or any of the parent directories): .git
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-jz71wfj8/onnx/setup.py", line 71, in <module>
        assert CMAKE, 'Could not find "cmake" executable!'
    AssertionError: Could not find "cmake" executable!
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
I tried various fixes without success, so for now just do the conversion with Python 2.
https://github.com/Cw-zero/TensorRT_yolo3_module runs inference on 608×608 images by default. With the official weights from https://pjreddie.com/media/files/yolov3.weights, following the README works as expected.
To try a 416×416 input instead, change the following:
yolov3-608.cfg, lines 8-9: change width=608, height=608 to 416.
weight_to_onnx.py: comment out lines 640-642 and uncomment lines 644-646:
#yolo-v3(608*608)
# output_tensor_dims['082_convolutional'] = [255, 19, 19]
# output_tensor_dims['094_convolutional'] = [255, 38, 38]
# output_tensor_dims['106_convolutional'] = [255, 76, 76]
#yolo-v3(416*416)
output_tensor_dims['082_convolutional'] = [255, 13, 13]
output_tensor_dims['094_convolutional'] = [255, 26, 26]
output_tensor_dims['106_convolutional'] = [255, 52, 52]
trt_yolo3_module_1batch.py: change lines 55 and 57:
self.inp_dim = 416  # was 608
self.num_classes = 80
# self.output_shapes = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)] # yolov3-608
self.output_shapes = [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)] # yolov3-416
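The hand-edited shapes above follow directly from the input size: each output grid is the input size divided by the stride (32, 16 or 8). A small sketch of that rule:

```python
def yolo_output_shapes(inp_dim, num_classes=80, num_anchors=3, batch=1):
    """Output shapes of the three YOLO heads for a square input.
    Grid size = input size / stride, with strides 32, 16, 8."""
    channels = num_anchors * (5 + num_classes)  # 255 for COCO's 80 classes
    return [(batch, channels, inp_dim // s, inp_dim // s) for s in (32, 16, 8)]
```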
This runs without problems, but the inference time is about the same as my own yolov3 (PyTorch) code.
1. That project ships two kinds of yolov3 weights: yolov3.weights (the official darknet weights) and .pt files (PyTorch weights trained by the repo author). Converting the official yolov3.weights to ONNX and TensorRT works as above, but converting a .pt file fails; the two formats are stored differently, so the PyTorch .pt weights must first be converted to darknet-style .weights.
The version of the code I use is from last year; it has since been heavily updated, and the latest version adds a feature to convert between the project's PyTorch weights and darknet weights. Converting .pt to .weights and then on to ONNX and TensorRT works fine.
Because my code differs from the latest version, the saved .pt weights differ too, and the conversion script fails on them. On closer analysis: the first five values of a darknet .weights file are header information, and the author stores extra information when building the model. Comparing against the conversion file in the latest code, a few small changes suffice:
import torch
import numpy as np
from models import Darknet


def save_weights(self, path='model.weights', cutoff=-1):
    fp = open(path, 'wb')
    version = np.array([0, 2, 5], dtype=np.int32)  # (int32) version info: major, minor, revision
    seen = np.array([0], dtype=np.int64)  # (int64) number of images seen during training
    version.tofile(fp)
    seen.tofile(fp)
    # Iterate through layers
    for i, (module_def, module) in enumerate(zip(self.module_defs[:cutoff], self.module_list[:cutoff])):
        if module_def['type'] == 'convolutional':
            conv_layer = module[0]
            # If batch norm, save bn params first
            if module_def['batch_normalize']:
                bn_layer = module[1]
                bn_layer.bias.data.cpu().numpy().tofile(fp)
                bn_layer.weight.data.cpu().numpy().tofile(fp)
                bn_layer.running_mean.data.cpu().numpy().tofile(fp)
                bn_layer.running_var.data.cpu().numpy().tofile(fp)
            # Otherwise save conv bias
            else:
                conv_layer.bias.data.cpu().numpy().tofile(fp)
            # Save conv weights
            conv_layer.weight.data.cpu().numpy().tofile(fp)
    fp.close()


def convert(cfg='cfg/yolov3.cfg', weights='weights/yolov3.pt'):
    # Converts between PyTorch and Darknet format per extension
    # (i.e. *.weights convert to *.pt and vice versa)
    # from models import *; convert('cfg/yolov3-spp.cfg', 'weights/yolov3-spp.weights')

    # Initialize model
    model = Darknet(cfg)

    # Load weights and save
    if weights.endswith('.pt'):  # if PyTorch format
        model.load_state_dict(torch.load(weights, map_location='cpu')['model'])
        save_weights(model, path='weights/converted.weights', cutoff=-1)
        print("Success: converted '%s' to 'converted.weights'" % weights)
    else:
        print('Error: extension not supported.')


if __name__ == '__main__':
    convert(cfg='cfg/my_yolov3.cfg', weights='weights/best.pt')
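As a side note on the header the script writes: it is 3 int32 values (major, minor, revision) followed by one int64 (images seen), 20 bytes in total. A minimal reader (my own sketch) for sanity-checking a converted file:

```python
import numpy as np

def read_darknet_header(path):
    """Read a darknet .weights header: 3 x int32 (major, minor, revision)
    followed by 1 x int64 (images seen during training), 20 bytes total."""
    with open(path, 'rb') as f:
        major, minor, revision = np.fromfile(f, dtype=np.int32, count=3)
        seen = int(np.fromfile(f, dtype=np.int64, count=1)[0])
    return (int(major), int(minor), int(revision)), seen
```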
1. Replace the cfg file directly with your own.
2. trt_yolo3_module_1batch.py: change num_classes on line 56 and the anchors on line 59:
self.inp_dim = 416  # was 608
self.num_classes = 2
# self.output_shapes = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)] # yolov3-608
self.output_shapes = [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)] # yolov3-416
self.yolo_anchors = [[(116, 90), (156, 198), (373, 326)],
[(30, 61), (62, 45), (59, 119)],
[(10, 13), (16, 30), (33, 23)]]
Running it then fails with:
File "/home/lgy/PycharmProjects/TensorRT_yolo3_module/trt_yolo3_module_1batch.py", line 101, in detection
output = output.reshape(shape)
ValueError: cannot reshape array of size 3549 into shape (1,255,13,13)
Printing output.shape and the target shape just before that line gives (3549,) and (1, 255, 13, 13).
Analysis: 3549 = 13 × 13 × 21, while the target is 13 × 13 × 255. I trained with 2 classes, so filters = 21 before each yolo layer in the cfg; clearly the 255 is wrong. It comes from the previous edit, self.output_shapes = [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)]; change 255 to 21:
self.output_shapes = [(1, 21, 13, 13), (1, 21, 26, 26), (1, 21, 52, 52)]
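Both 255 and 21 come from the same formula, filters = num_anchors × (5 + num_classes): 4 box coordinates plus 1 objectness score plus one score per class, for each of the 3 anchors. As a quick check:

```python
def yolo_filters(num_classes, num_anchors=3):
    """Channel count of each YOLO output map:
    per anchor, 4 box coords + 1 objectness + one score per class."""
    return num_anchors * (5 + num_classes)
```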
It now runs without errors, but no boxes are drawn on the test image.
Found another mistake: in weights_to_onnx.py around line 646, output_tensor_dims must also be changed to your own model's output dimensions:
# yolo-v3(416*416)
# output_tensor_dims['082_convolutional'] = [255, 13, 13]
# output_tensor_dims['094_convolutional'] = [255, 26, 26]
# output_tensor_dims['106_convolutional'] = [255, 52, 52]
# my_yolo-v3(416*416)
output_tensor_dims['082_convolutional'] = [21, 13, 13]
output_tensor_dims['094_convolutional'] = [21, 26, 26]
output_tensor_dims['106_convolutional'] = [21, 52, 52]
After this change it still shows nothing. Printing intermediate values: TensorRT itself does produce output, but after dets = dynamic_write_results(detections, 0.5, self.num_classes, nms=True, nms_conf=0.3), i.e. after NMS, nothing survives. So the NMS is suspect. I stared at the original author's code for a long time and it is hard to work with, but after this loop:
for output, shape, anchors in zip(trt_outputs, self.output_shapes, self.yolo_anchors):
    output = output.reshape(shape)
    trt_output = torch.from_numpy(output).cuda().data
    print(trt_output.shape)
    # trt_output = trt_output.data
    # cuda_time1 = time.time()
    trt_output = predict_transform(trt_output, self.inp_dim, anchors, self.num_classes, self.use_cuda)
    print(trt_output.shape)
    # cuda_time2 = time.time()
    # print('CUDA time : %f' % (cuda_time2 - cuda_time1))
    if type(trt_output) == int:
        continue
    if not write:
        detections = trt_output
        write = 1
    else:
        detections = torch.cat((detections, trt_output), 1)
the resulting detections has shape torch.Size([1, 10647, 85]) (for 80 classes), and from that point on the NMS from my own yolov3 inference code can be substituted.
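The 10647 rows are just the total number of anchor boxes over the three grids, which makes a handy sanity check; a sketch:

```python
def total_boxes(inp_dim, num_anchors=3):
    """Total anchor boxes over the three YOLO grids (strides 32, 16, 8)."""
    return sum(num_anchors * (inp_dim // s) ** 2 for s in (32, 16, 8))
```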
After the substitution the box values come out and are drawn correctly; testing again with the darknet .weights also works. The original author's NMS had been the problem all along.
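For reference, the NMS I substituted behaves like standard greedy IoU suppression; a minimal NumPy sketch (not the repo's dynamic_write_results):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS. boxes: (N, 4) as x1, y1, x2, y2; returns kept indices."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the top box with the remaining ones
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop overlapping lower-score boxes
    return keep
```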
The modified TensorRT pipeline was actually slower than my PyTorch inference. Profiling showed image preprocessing took nearly 10 ms (on a GTX 1060): my own preprocessing code was the slow part, while the conversion repo's preprocessing takes roughly half that time. After swapping it in, total time did drop a bit, and the v3 backbone's inference time fell by roughly 10%, far from the large speedups reported online. More tuning later.
Use the official cfg and weights files; remember the cfg must end with two blank lines, and change the input size to 416×416.
yolov3-spp has maxpool layers that v3 lacks, so maxpool support must be added to weights_to_onnx.py.
Add to class GraphBuilderONNX(object):
def _make_maxpool_node(self, layer_name, layer_dict):
    stride = layer_dict['stride']  # stride = 1
    kernel_size = layer_dict['size']
    previous_node_specs = self._get_previous_node_specs()
    inputs = [previous_node_specs.name]
    channels = previous_node_specs.channels
    kernel_shape = [kernel_size, kernel_size]
    strides = [stride, stride]
    assert channels > 0
    maxpool_node = helper.make_node(
        'MaxPool',
        inputs=inputs,
        outputs=[layer_name],
        kernel_shape=kernel_shape,
        strides=strides,
        auto_pad='SAME_UPPER',
        name=layer_name,
    )
    self._nodes.append(maxpool_node)
    return layer_name, channels
and in that class's def _make_onnx_node(self, layer_name, layer_dict), register it:
node_creators['maxpool'] = self._make_maxpool_node
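Note that with auto_pad='SAME_UPPER' and stride 1, the SPP maxpools (kernels 5, 9 and 13) keep the 13×13 spatial size, so no other shape bookkeeping changes; the ONNX SAME_UPPER output size is simply:

```python
import math

def same_upper_out_size(in_size, stride):
    """Spatial output size of an ONNX pool with auto_pad='SAME_UPPER':
    padding is chosen so the output is ceil(input / stride), for any kernel."""
    return math.ceil(in_size / stride)
```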
Which layers are taken as outputs in the main function must also change (yolov3-spp has more layers); printing every layer's weight names makes them easy to find:
# yolo-v3-spp(416*416)
output_tensor_dims['089_convolutional'] = [255, 13, 13]
output_tensor_dims['101_convolutional'] = [255, 26, 26]
output_tensor_dims['113_convolutional'] = [255, 52, 52]
This keeps failing with:
File "/home/lgy/PycharmProjects/TensorRT_yolo3_module/spp_weight_to_onnx.py", line 296, in _load_one_param_type
buffer=self.weights_file.read(param_size * 4))
TypeError: buffer is too small for requested array
The error means the requested array shape needs more bytes than remain in spp.weights. I first suspected mismatched cfg and weights files, but they run inference fine in the PyTorch project, which rules that out.
Printing the relevant shapes just before the failing code:
def _load_one_param_type(self, conv_params, param_category, suffix):
    """Deserializes the weights from a file stream in the DarkNet order.

    Keyword arguments:
    conv_params -- a ConvParams object
    param_category -- the category of parameters to be created ('bn' or 'conv')
    suffix -- a string determining the sub-type of above param_category
              (e.g., 'weights' or 'bias')
    """
    param_name = conv_params.generate_param_name(param_category, suffix)
    channels_out, channels_in, filter_h, filter_w = conv_params.conv_weight_dims
    if param_category == 'bn':
        param_shape = [channels_out]
    elif param_category == 'conv':
        if suffix == 'weights':
            param_shape = [channels_out, channels_in, filter_h, filter_w]
        elif suffix == 'bias':
            param_shape = [channels_out]
    param_size = np.product(np.array(param_shape))
    print(param_name, param_shape)
    print(param_size)
    param_data = np.ndarray(
        shape=param_shape,
        dtype='float32',
        buffer=self.weights_file.read(param_size * 4))
    print(param_data.shape)
    print('----')
    param_data = param_data.flatten().astype(float)
    return param_name, param_data, param_shape
The output is:
----
109_convolutional_conv_weights [128, 256, 1, 1]
32768
(128, 256, 1, 1)
----
110_convolutional_bn_bias [256]
256
(256,)
----
110_convolutional_bn_scale [256]
256
(256,)
----
110_convolutional_bn_mean [256]
256
(256,)
----
110_convolutional_bn_var [256]
256
(256,)
----
110_convolutional_conv_weights [256, 128, 3, 3]
294912
Traceback (most recent call last):
  File "/home/lgy/PycharmProjects/TensorRT_yolo3_module/spp_weight_to_onnx.py", line 721, in <module>
    main(cfg='weights_spp_80/yolov3-spp.cfg', weights_file='weights_spp_80/yolov3-spp.weights', onnx_file='weights_spp_80/yolov3-spp.onnx')
Clearly layer 110's shape does not match. I tried all sorts of things here without success. Oddly, changing dtype='float32' to float16 lets layer 110 load, but then layer 111 gets stuck, whereas in the plain v3 conversion even an int dtype worked.
The fix suggested online is to re-download a matching cfg/weights pair; re-downloading did not help either.
Finally I printed the shape of every layer in yolov3-spp.weights (or the .pt) and compared them against the shapes in the conversion code, finding:
weights['module_list.109.conv_109.weight'].shape = torch.Size([256, 128, 3, 3])
weights['module_list.110.conv_110.weight'].shape = torch.Size([128, 256, 1, 1])
whereas in weights_to_onnx.py:
109_convolutional_conv_weights [128, 256, 1, 1]
32768
(128, 256, 1, 1)
110_convolutional_conv_weights [256, 128, 3, 3]
294912
Apparently one layer's position was shifted; since v3-spp only adds the SPP module on top of v3, I guessed the maxpool or route inside the SPP module was being converted to ONNX incorrectly.
On careful comparison, however, yolov3-spp.weights counts layers from module_list.0.conv_0.weight while the ONNX side counts from 001_convolutional_conv_weights, so 110_convolutional_conv_weights [256, 128, 3, 3] should indeed correspond to weights['module_list.109.conv_109.weight'].shape = torch.Size([256, 128, 3, 3]). That is consistent; printing every layer of yolov3.weights against its ONNX confirms the same off-by-one numbering.
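One diagnostic that would have caught the "buffer is too small" error earlier (my own sketch, not from the repo) is to compare the float count implied by all the parameter shapes with what the .weights file actually contains after its 20-byte header:

```python
import os
from math import prod

def weight_float_counts(weights_path, param_shapes, header_bytes=20):
    """Return (expected, actual) float32 counts for a darknet .weights file,
    given every parameter shape in file order; a mismatch reproduces the
    'buffer is too small' failure before conversion even starts."""
    expected = sum(prod(shape) for shape in param_shapes)
    actual = (os.path.getsize(weights_path) - header_bytes) // 4
    return expected, actual
```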
To be continued.