赞
踩
如果对Tensorflow实现最新的Yolo v7算法感兴趣的朋友,可以参见我最新发布的文章,Yolo v7的最简TensorFlow实现_gzroy的博客-CSDN博客
YOLO V3版本是一个强大和快速的物体检测模型,同时原理上也相对简单。我之前的博客中已经介绍了如何用Tensorflow来实现YOLO V1版本,之后我自己也用Tensorflow 1.X版本实现了YOLO V3,现在Tensorflow演进到了2.0版本,相比较1.X版本做了很大的改进,也更加易用了,因此我记录一下如何用Tensorflow 2.0版本来实现YOLO V3。网上能找到的很多Tensorflow YOLO V3的代码都没有完整的一个训练过程,基本上都是转换和加载YOLO的作者在Darknet上发布的训练好的权重数据,直接进行检测的。我的这个代码实现了完整的训练流程,包括了搭建基础架构网络Darknet53进行Imagenet预训练,以及增加YOLO V3网络模块进行物体检测训练,对模型的训练效果进行评测,以及用训练好的模型进行物体检测的过程。
需要准备两份训练数据,一个是Imagenent的物体分类数据,包括了1000种类别的物体的数据,共128万张图片。数据集需要预先处理为TFRECORD格式,具体过程可以参见我之前的博客基于Tensorflow的Imagenet数据集的完整处理过程(包括物体标识框BBOX的处理)_valid_classes_gzroy的博客-CSDN博客。 第二个训练数据是物体检测的数据,目前有很多个数据集可以采用,例如COCO数据集(包括80种物体的检测框),OpenImage,Pascal VOC等等,比较流行的是COCO数据集,大部分物体检测的论文都会基于这个数据集来提供性能指标。我也采用COCO数据集,同样也是预处理为TFRECORD格式,具体过程可以参见我的另一篇博客基于Tensorflow对COCO目标检测数据进行预处理_gzroy的博客-CSDN博客
按照YOLO V3论文的描述,基础网络架构是一个叫做Darknet53的网络模型,共有53个卷积层,其网络架构如下:
用Tensorflow可以很方便的构建一个Darknet53模型,代码如下:
- import tensorflow as tf
- from tensorflow.keras import Model
- l=tf.keras.layers
-
- def _conv(inputs, filters, kernel_size, strides, padding, bias=False, normalize=True, activation='relu', last=False):
- output = inputs
- padding_str = 'same'
- if padding>0:
- output = l.ZeroPadding2D(padding=padding, data_format='channels_first')(output)
- padding_str = 'valid'
- output = l.Conv2D(filters, kernel_size, strides, padding_str, \
- 'channels_first', use_bias=bias, \
- kernel_initializer='he_normal', \
- kernel_regularizer=tf.keras.regularizers.l2(l=5e-4))(output)
- if normalize:
- if not last:
- output = l.BatchNormalization(axis=1)(output)
- else:
- output = l.BatchNormalization(axis=1, gamma_initializer='zeros')(output)
- if activation=='relu':
- output = l.ReLU()(output)
- if activation=='relu6':
- output = l.ReLU(max_value=6)(output)
- if activation=='leaky_relu':
- output = l.LeakyReLU(alpha=0.1)(output)
- return output
-
- def _residual(inputs, out_channels, activation='relu', name=None):
- output1 = _conv(inputs, out_channels//2, 1, 1, 0, False, True, 'leaky_relu', False)
- output2 = _conv(output1, out_channels, 3, 1, 1, False, True, 'leaky_relu', True)
- output = l.Add(name=name)([inputs, output2])
- return output
-
- def darknet53_base():
- image = tf.keras.Input(shape=(3,None,None))
- net = _conv(image, 32, 3, 1, 1, False, True, 'leaky_relu') #32*H*W
- net = _conv(net, 64, 3, 2, 1, False, True, 'leaky_relu') #64*H/2*W/2
- net = _residual(net, 64, 'leaky_relu') #64*H/2*W/2
- net = _conv(net, 128, 3, 2, 1, False, True, 'leaky_relu') #128*H/4*W/4
- net = _residual(net, 128, 'leaky_relu') #128*H/4*W/4
- net = _residual(net, 128, 'leaky_relu') #128*H/4*W/4
- net = _conv(net, 256, 3, 2, 1, False, True, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- net = _residual(net, 256, 'leaky_relu') #256*H/8*W/8
- route1 = l.Activation('linear', dtype='float32', name='route1')(net)
- net = _conv(net, 512, 3, 2, 1, False, True, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- net = _residual(net, 512, 'leaky_relu') #512*H/16*W/16
- route2 = l.Activation('linear', dtype='float32', name='route2')(net)
- net = _conv(net, 1024, 3, 2, 1, False, True, 'leaky_relu') #1024*H/32*W/32
- net = _residual(net, 1024, 'leaky_relu') #1024*H/32*W/32
- net = _residual(net, 1024, 'leaky_relu') #1024*H/32*W/32
- net = _residual(net, 1024, 'leaky_relu') #1024*H/32*W/32
- net = _residual(net, 1024, 'leaky_relu') #1024*H/32*W/32
- route3 = l.Activation('linear', dtype='float32', name='route3')(net)
- net = tf.reduce_mean(net, axis=[2,3], keepdims=True)
- net = _conv(net, 1000, 1, 1, 0, True, False, 'linear') #1000
- net = l.Flatten(data_format='channels_first', name='logits')(net)
- net = l.Activation('linear', dtype='float32', name='output')(net)
- model = tf.keras.Model(inputs=image, outputs=[net, route1, route2, route3])
- return model
我们需要先基于这个骨干网络架构来进行Imagenet的预训练,以提取有效的图片内容的特征数据。我用这个网络训练了30个EPOCH,最终到达Top-1 71%,Top-5 91%的准确率。具体的训练过程可以见我的博客Imagenet图像分类训练总结(基于Tensorflow 2.0实现)_keras .image_gzroy的博客-CSDN博客
训练好了骨干网络之后,我们就可以在这个网络的基础上再增加相应的卷积层,实现图像特征金字塔(FPN)的架构,这里我们会用到骨干网络输出的route1, route2, route3这几个不同图像分辨率的特征值,最终构建一个可以对图片进行下采样8倍,16倍和32倍的基于网格的检测系统,例如训练图片的分辨率为416*416,那么将输出52*52, 26*26, 13*13这三个不同维度的检测结果。具体的原理可以参见网上的一些文章,例如:我这里参照Darknet的源代码来搭建了一个YOLO V3的网络,代码如下:
- category_num = 80
- vector_size = 3*(1+4+category_num)
- def darknet53_yolov3():
- route1 = tf.keras.Input(shape=(256,None,None), name='input1') #256*H/8*W/8
- route2 = tf.keras.Input(shape=(512,None,None), name='input2') #256*H/16*W/16
- route3 = tf.keras.Input(shape=(1024,None,None), name='input3') #256*H/32*W/32
- net = _conv(route3, 512, 1, 1, 0, False, True, 'leaky_relu') #512*H/32*W/32
- net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu') #1024*H/32*W/32
- net = _conv(net, 512, 1, 1, 0, False, True, 'leaky_relu') #512*H/32*W/32
- net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu') #1024*H/32*W/32
- net = _conv(net, 512, 1, 1, 0, False, True, 'leaky_relu') #512*H/32*W/32
- route4 = tf.identity(net, 'route4')
- net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu') #1024*H/32*W/32
- predict1 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear') #vector_size*H/32*W/32
- predict1 = l.Activation('linear', dtype='float32')(predict1)
- predict1 = l.Reshape((vector_size, imageHeight//32*imageWidth//32))(predict1)
- net = _conv(route4, 256, 1, 1, 0, False, True, 'leaky_relu') #256*H/32*W/32
- net = l.UpSampling2D((2,2),"channels_first",'nearest')(net) #256*H/16*W/16
- net = l.Concatenate(axis=1)([route2, net]) #768*H/16*W/16
- net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu') #256*H/16*W/16
- net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu') #512*H/16*W/16
- net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu') #256*H/16*W/16
- net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu') #512*H/16*W/16
- net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu') #256*H/16*W/16
- route5 = tf.identity(net, 'route5')
- net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu') #512*H/16*W/16
- predict2 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear') #vector_size*H/16*W/16
- predict2 = l.Activation('linear', dtype='float32')(predict2)
- predict2 = l.Reshape((vector_size, imageHeight//16*imageWidth//16))(predict2)
- net = _conv(route5, 128, 1, 1, 0, False, True, 'leaky_relu') #128*H/16*W/16
- net = l.UpSampling2D((2,2),"channels_first",'nearest')(net) #128*H/8*W/8
- net = l.Concatenate(axis=1)([route1, net]) #384*H/8*W/8
- net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu') #128*H/8*W/8
- net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu') #256*H/8*W/8
- net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu') #128*H/8*W/8
- net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu') #256*H/8*W/8
- net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu') #128*H/8*W/8
- net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu') #256*H/8*W/8
- predict3 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear') #vector_size*H/8*W/8
- predict3 = l.Activation('linear', dtype='float32')(predict3)
- predict3 = l.Reshape((vector_size, imageHeight//8*imageWidth//8))(predict3)
- predict = l.Concatenate()([predict3, predict2, predict1])
- predict = tf.transpose(predict, perm=[0, 2, 1], name='predict')
- model = tf.keras.Model(inputs=[route1, route2, route3], outputs=predict, name='darknet53_yolo')
- return model
可以看到,这个网络模型是以骨干网络的三个输出route1, route2, route3作为输入的,最终输出的Predict是三个不同维度的预测结果。
有了训练数据和搭建好网络模型之后,我们就可以开始训练了。整个训练过程分为如下几步:
1. 对骨干网络进行更高分辨率的训练
因为我们的骨干网络是基于224*224这个分辨率来进行训练和提取图片特征的,但是在物体检测中,这个分辨率太低,不利于检测小物体,因此我们需要基于更加高的分辨率,例如416*416来进行训练。我们可以把骨干网络基于这个高的分辨率再多训练一些次数,让网络适应高分辨率,这样可以最终提升物体检测的性能。为此我们可以重新加载之前训练好的骨干网络来进行训练。
2. 骨干网络和检测网络组合
把预训练完成后的骨干网络和检测网络组合起来,构成一个YOLO V3的网络模型。训练图片先通过骨干网络进行特征提取,输出route1,route2,route3这三个不同维度的图像特征数据,然后作为输入进到检测网络中进行训练,最终得到三个维度的预测结果。这个组合网络中,需要设置骨干网络的参数为不可训练,只训练检测网络的参数。代码如下:
- #Load the pretrained backbone model
- model_base = tf.keras.models.load_model('darknet53/epoch_60.h5')
- model_base.trainable = False
- image = tf.keras.Input(shape=(3,image_height,image_width))
- _, route1, route2, route3 = model_base(image, training=False)
-
- #The detect model will accept the backmodel output as input
- predict = darknet53_yolov3(image_height,image_width)([route1, route2, route3])
-
- #Construct the combined yolo model
- model_yolo = tf.keras.Model(inputs=image, outputs=predict, name='model_yolo')
3. 读取训练数据并进行预处理
读取COCO训练集的图片和检测框的数据,并进行数据增广,生成数据标签等预处理。这里的数据增广除了按照Darknet源码的处理方式之外,还参照论文https://arxiv.org/pdf/1902.04103.pdf中提出的数据增广处理流程,增加了Mixup的处理,即一次取两张图片,通过随机的透明度的处理之后,同时叠加在一起。
因为涉及到检测框的位置,在做数据增广时,需要相应调整检测框的位置。包括以下几个步骤:
首先是定义模型的一些参数,代码如下:
- mixup_flag = True
- #Parameters for PCA noice
- eigvec = tf.constant(
- [
- [-0.5675, 0.7192, 0.4009],
- [-0.5808, -0.0045, -0.8140],
- [-0.5836, -0.6948, 0.4203]
- ],
- shape=[3,3],
- dtype=tf.float32
- )
- eigval = tf.constant([55.46, 4.794, 1.148], shape=[3,1], dtype=tf.float32)
- #Parameters for normalization
- mean_RGB = tf.constant([123.68, 116.779, 109.939], dtype=tf.float32)
- std_RGB = tf.constant([58.393, 57.12, 57.375], dtype=tf.float32)
- #Train and valid batch size
- batch_size = 16
- val_batch_size = 10
- epoch_size = 118287
- epoch_batch = int(epoch_size/batch_size)
- #Parameters for yolo loss scale
- no_object_scale = 1.0
- iou_threshold = 0.7
- object_scale=3.0
- class_scale=1.0
- jitter = 0.3
- #Label and prediction vector size
- category_num = 80
- label_vector_size = 1+1+4+1+category_num #index 0:grid_id,1:obj_conf,2-5:(x,y,w,h),6:mixup weight,7-86:category
- vector_size = 1+4+category_num #index 0:obj_conf,1-4:(x,y,w,h),5-84:category
- #Images parameter
- image_size_list = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
- image_size = image_size_list[random.randint(0,9)]
- val_image_size = 608
- #Grids parameter
- grid_wh_array = np.array([[8.,8.],[16.,16.],[32.,32.]])
- grid_size = [8.,16.,32.]
- #The Anchor size for image_size 416*416
- anchors_base = [10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326]
定义读取训练文件的函数:
- def _parse_function(example_proto):
- features = {
- "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
- "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
- "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
- "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
- "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
- "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
- "label": tf.io.VarLenFeature(tf.int64),
- "bbox_xmin": tf.io.VarLenFeature(tf.int64),
- "bbox_xmax": tf.io.VarLenFeature(tf.int64),
- "bbox_ymin": tf.io.VarLenFeature(tf.int64),
- "bbox_ymax": tf.io.VarLenFeature(tf.int64),
- "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
- }
- parsed_features = tf.io.parse_single_example(example_proto, features)
- label = tf.expand_dims(parsed_features["label"].values, 0)
- label = tf.cast(label, tf.float32)
- image_raw = tf.image.decode_jpeg(parsed_features["image"], channels=3)
- image_decoded = tf.cast(image_raw, dtype=tf.float32)
- filename = parsed_features["filename"]
- #Get the coco image id as we need to use COCO API to evaluate
- image_id = tf.strings.to_number(tf.strings.substr(filename, 0, 12), tf.int32)
- image_id = tf.expand_dims(image_id, 0)
- #Get the bbox
- xmin = tf.cast(tf.expand_dims(parsed_features["bbox_xmin"].values, 0), tf.float32)
- xmax = tf.cast(tf.expand_dims(parsed_features["bbox_xmax"].values, 0), tf.float32)
- ymin = tf.cast(tf.expand_dims(parsed_features["bbox_ymin"].values, 0), tf.float32)
- ymax = tf.cast(tf.expand_dims(parsed_features["bbox_ymax"].values, 0), tf.float32)
- mixup_w = tf.ones_like(xmin)
- boxes = tf.concat([xmin,ymin,xmax,ymax,label,mixup_w], axis=0)
- boxes = tf.transpose(boxes, [1, 0])
- return {'image':image_decoded, 'bbox':boxes, 'imageid':image_id}
定义一个Flatmap函数,每次读取两张图片,通过Flatmap函数来把这两张图片组合在一起
- def _flatmap_function(feature):
- dataset_image = feature['image'].padded_batch(2, [-1,-1,3])
- dataset_bbox = feature['bbox'].padded_batch(2, [-1,6])
- dataset_combined = tf.data.Dataset.zip({'image':dataset_image, 'bbox':dataset_bbox})
- return dataset_combined
Mixup函数把组合后的两张图片进行数据增广处理,同时生成训练的Label
- def _label_fn(bbox):
- global image_size,grid_wh_array,anchors_base
- grids_list = [image_size//8, image_size//16, image_size//32]
- image_ratio = image_size/416
- anchors = [round(a*image_ratio) for a in anchors_base]
- labels_list = [np.zeros([a**2,label_vector_size]) for a in grids_list]
- for i in range(3):
- labels_list[i][:,0] = np.arange(grids_list[i]**2)
- labels_list = [np.tile(a,3) for a in labels_list]
- box_num, _ = bbox.shape
- for i in range(box_num):
- center_x = (bbox[i,0]+bbox[i,2])/2
- center_y = (bbox[i,1]+bbox[i,3])/2
- if (center_x==0 and center_y==0):
- continue
- box_width = bbox[i,2]-bbox[i,0]
- box_height = bbox[i,3]-bbox[i,1]
- label = np.int(bbox[i,4].numpy())
- anchor_id = np.int(bbox[i,5].numpy())
- featuremap_id = anchor_id//3
- anchorid_offset = anchor_id%3
- g_h = grid_wh_array[featuremap_id,1]
- g_w = grid_wh_array[featuremap_id,0]
- grid_id = np.int((center_y//g_h*grids_list[featuremap_id] + center_x//g_w).numpy())
- index = anchorid_offset*label_vector_size
- #set the object exist flag
- labels_list[featuremap_id][grid_id, index+1] = 1.
- #set the center_x_offset
- labels_list[featuremap_id][grid_id, index+2]=(center_x%g_w)/g_w
- #set the center_y_offset
- labels_list[featuremap_id][grid_id, index+3]=(center_y%g_h)/g_h
- #set the width
- labels_list[featuremap_id][grid_id, index+4]=math.log(box_width/anchors[2*anchor_id])
- #set the height
- labels_list[featuremap_id][grid_id, index+5]=math.log(box_height/anchors[2*anchor_id+1])
- #set the mixup weight
- labels_list[featuremap_id][grid_id, index+6]=bbox[i,6]
- #set the class label, using label smoothing
- labels_list[featuremap_id][grid_id, (index+7):(index+label_vector_size)]=0.1/(category_num-1)
- labels_list[featuremap_id][grid_id, index+7+label]=0.9
- #labels_list[featuremap_id][grid_id, index+7+label]=1.0
- return tf.concat(labels_list, axis=0)
-
- def _mixup_function(features):
- global anchors_base,image_size,mixup_flag,grid_size
- image_ratio = image_size/416
- anchors = [round(a*image_ratio) for a in anchors_base]
- image_height = image_size
- image_width = image_size
- images = features['image']
- bboxes = features['bbox']
- #imageid = features['imageid']
- if mixup_flag:
- lam = np.random.beta(1.5,1.5,1)
- lam_all = np.vstack([lam,1.-lam])
- lam_all = np.expand_dims(lam_all, 1)
- #bboxes = tf.cast(bboxes, tf.float32)
- mixup_w = bboxes[...,-1:] + lam_all
- bboxes_mixup = tf.concat([bboxes[...,:-1], mixup_w], axis=-1)
- bboxes_mixup = tf.reshape(bboxes_mixup, [-1,6])
- true_box_mask = tf.logical_or(
- bboxes_mixup[:,1]>0,
- bboxes_mixup[:,1]>0
- )
- bboxes_all = tf.boolean_mask(bboxes_mixup, true_box_mask)
- image_mix = (images[0]*lam[0] + images[1]*(1.-lam[0]))
- else:
- image_mix = images
- bboxes_all = bboxes
- #Random jitter and resize the image
- height = tf.shape(image_mix)[0]
- width = tf.shape(image_mix)[1]
- dw = jitter*tf.cast(width, tf.float32)
- dh = jitter*tf.cast(height, tf.float32)
- new_ar = tf.truediv(
- tf.add(
- tf.cast(width, tf.float32),
- tf.random.uniform([1], minval=tf.math.negative(dw), maxval=dw)),
- tf.add(
- tf.cast(height, tf.float32),
- tf.random.uniform([1], minval=tf.math.negative(dh), maxval=dh)))
- nh, nw = tf.cond(
- tf.less(new_ar[0],1), \
- lambda:(image_height, tf.cast(tf.cast(image_height, tf.float32)*new_ar[0], tf.int32)), \
- lambda:(tf.cast(tf.cast(image_width, tf.float32)/new_ar[0], tf.int32), image_width)
- )
- dx = tf.cond(
- tf.equal(image_width, nw), \
- lambda:tf.constant([0]), \
- lambda:tf.random.uniform([1], minval=0, maxval=(image_width-nw), dtype=tf.int32)
- )
- dy = tf.cond(
- tf.equal(image_height, nh), \
- lambda:tf.constant([0]), \
- lambda:tf.random.uniform([1], minval=0, maxval=(image_height-nh), dtype=tf.int32)
- )
- image_resize = tf.image.resize(image_mix, [nh, nw])
- image_padded = tf.image.pad_to_bounding_box(image_resize, dy[0], dx[0], image_height, image_width)
- #Adjust the boxes
- xmin_new = tf.cast(tf.truediv(nw, width) * tf.cast(bboxes_all[:,0:1],tf.float64), tf.int32) + dx
- xmax_new = tf.cast(tf.truediv(nw, width) * tf.cast(bboxes_all[:,2:3],tf.float64), tf.int32) + dx
- ymin_new = tf.cast(tf.truediv(nh, height) * tf.cast(bboxes_all[:,1:2],tf.float64), tf.int32) + dy
- ymax_new = tf.cast(tf.truediv(nh, height) * tf.cast(bboxes_all[:,3:4],tf.float64), tf.int32) + dy
- # Random flip flag
- random_flip_flag = tf.random.uniform([1], minval=0, maxval=1, dtype=tf.float32)
- def flip_box():
- xmax_flip = image_width - xmin_new
- xmin_flip = image_width - xmax_new
- image_flip = tf.image.flip_left_right(image_padded)
- return xmin_flip, xmax_flip, image_flip
- def notflip():
- return xmin_new, xmax_new, image_padded
- xmin_flip, xmax_flip, image_flip = tf.cond(tf.less(random_flip_flag[0], 0.5), notflip, flip_box)
- boxes_width = xmax_flip-xmin_flip
- boxes_height = ymax_new-ymin_new
- boxes_area = boxes_width*boxes_height
- # Determine the anchor
- iou_list = []
- for i in range(9):
- intersect_area = tf.minimum(boxes_width, anchors[2*i])*tf.minimum(boxes_height, anchors[2*i+1])
- union_area = boxes_area+anchors[2*i]*anchors[2*i+1]-intersect_area
- iou_list.append(intersect_area/union_area)
- iou = tf.concat(iou_list, axis=1)
- anchor_id = tf.reshape(tf.argmax(iou, axis=1), [-1,1])
- # Random distort the image
- distorted = tf.image.random_hue(image_flip, max_delta=0.3)
- distorted = tf.image.random_saturation(distorted, lower=0.6, upper=1.4)
- distorted = tf.image.random_brightness(distorted, max_delta=0.3)
- # Add PCA noice
- alpha = tf.random.normal([3], mean=0.0, stddev=0.1)
- pca_noice = tf.reshape(tf.matmul(tf.multiply(eigvec,alpha), eigval), [3])
- distorted = tf.add(distorted, pca_noice)
- # Normalize RGB
- distorted = tf.subtract(distorted, mean_RGB)
- distorted = tf.divide(distorted, std_RGB)
- # Get the adjusted boxes
- xmin_flip = tf.cast(xmin_flip, tf.float32)
- xmax_flip = tf.cast(xmax_flip, tf.float32)
- ymin_new = tf.cast(ymin_new, tf.float32)
- ymax_new = tf.cast(ymax_new, tf.float32)
- anchor_id = tf.cast(anchor_id, tf.float32)
- boxes_new = tf.concat([xmin_flip,ymin_new,xmax_flip,ymax_new,bboxes_all[:,4:5],anchor_id,bboxes_all[:,-1:]], axis=1)
- # Remove the boxes that height or width less than 5 pixels
- boxes_mask = tf.math.logical_and(
- tf.math.greater((boxes_new[:,2]-boxes_new[:,0]), 5),
- tf.math.greater((boxes_new[:,3]-boxes_new[:,1]), 5))
- boxes_new = tf.boolean_mask(boxes_new, boxes_mask)
- boxes_new = tf.cast(boxes_new, tf.float32)
- # Generate the labels
- labels = tf.py_function(_label_fn, [boxes_new], [tf.float64])
- labels = tf.cast(labels, tf.float32)
-
- image_train = tf.transpose(distorted, perm=[2, 0, 1])
- #features = {'images':image_train, 'bboxes':boxes_new, 'images_flip':image_flip, 'image_id':imageid}
- features = {'images':image_train, 'bboxes':boxes_new}
- return features, labels[0]
然后就可以构造训练的数据集了
- def train_input_fn():
- global image_size
- train_files = tf.data.Dataset.list_files("../dataset/coco/train2017_tf/*.tfrecord")
- dataset_train = train_files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.experimental.AUTOTUNE)
- dataset_train = dataset_train.shuffle(buffer_size=1000, reshuffle_each_iteration=True)
- dataset_train = dataset_train.repeat(8)
- dataset_train = dataset_train.map(_parse_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
- if mixup_flag:
- dataset_train = dataset_train.window(2)
- dataset_train = dataset_train.flat_map(_flatmap_function)
- dataset_train = dataset_train.map(_mixup_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
- dataset_train = dataset_train.padded_batch(batch_size, \
- padded_shapes=(
- {
- 'images':[3,image_size,image_size],
- 'bboxes':[None,7]
- },
- [None, label_vector_size*3]
- )
- )
- dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)
- return dataset_train
数据增广后的处理效果可见下图:
4. 定义损失函数
这是整个训练中最具挑战性的部分。因为按照YOLO V3的源代码,损失函数由两部分组成:
具体的代码如下:
- # Predicts, combination of three dimention, [batch, 52*52+26*26+13*13, 85*3]
- # Labels, combination of three dimention, [batch, 52*52+26*26+13*13, 87*3]
- def new_loss_func(predict, label, gt_box, grids_property):
- global image_size
- predict = tf.reshape(predict, [batch_size,-1,vector_size]) #[batch, (52*52+26*26+13*13)*3, 85]
- label = tf.reshape(label, [batch_size,-1,label_vector_size]) #[batch, (52*52+26*26+13*13)*3, 87]
- noobj_mask = tf.cast(label[...,1:2]==0.0, tf.float32)
- obj_mask = tf.cast(label[...,1:2]==1.0, tf.float32)
- #Get the predict box center xy
- predict_xy = (grids_property[...,0:2]+tf.nn.sigmoid(predict[...,1:3]))*grids_property[...,-2:]
- #Get the predict box wh, only caluculate the noobj wh
- predict_half_wh = tf.exp(predict[...,3:5])*grids_property[...,2:4]/2
- predict_xmin = tf.clip_by_value((predict_xy[...,0:1]-predict_half_wh[...,0:1]), 0, image_size)
- predict_xmax = tf.clip_by_value((predict_xy[...,0:1]+predict_half_wh[...,0:1]), 0, image_size)
- predict_ymin = tf.clip_by_value((predict_xy[...,1:2]-predict_half_wh[...,1:2]), 0, image_size)
- predict_ymax = tf.clip_by_value((predict_xy[...,1:2]+predict_half_wh[...,1:2]), 0, image_size)
- predict_boxes_area = (predict_xmax-predict_xmin)*(predict_ymax-predict_ymin) #[-batch, (52*52+26*26+13*13)*3, 1]
- #Assemble the predict box coords and expand dim, shape: [batch, (52*52+26*26+13*13)*3, 1, 4]
- predict_boxes = tf.concat([predict_xmin,predict_ymin,predict_xmax,predict_ymax], axis=-1)
- predict_boxes = tf.expand_dims(predict_boxes, 2)
- #Expand ground boxes dim for broadcast, shape: [batch, 1, V, 4]
- gt_box = tf.expand_dims(gt_box, 1)
- gt_box = tf.cast(gt_box, tf.float32)
- #gt_box_area = (gt_box[...,2:3]-gt_box[...,0:1])*(gt_box[...,3:4]-gt_box[...,1:2]) #[batch, 1, V, 1]
- gt_box_area = (gt_box[...,2]-gt_box[...,0])*(gt_box[...,3]-gt_box[...,1]) #[batch, 1, V]
- #Broadcast calculation, intersect_boxes_width shape [batch, noobjs_num, V, 1]
- intersect_boxes_width = tf.minimum(predict_boxes[...,2:3], gt_box[...,2:3])-tf.maximum(predict_boxes[...,0:1], gt_box[...,0:1])
- intersect_boxes_width = tf.clip_by_value(intersect_boxes_width, clip_value_min=0, clip_value_max=image_size)
- intersect_boxes_height = tf.minimum(predict_boxes[...,3:4], gt_box[...,3:4])-tf.maximum(predict_boxes[...,1:2], gt_box[...,1:2])
- intersect_boxes_height = tf.clip_by_value(intersect_boxes_height, clip_value_min=0, clip_value_max=image_size)
- intersect_boxes_area = intersect_boxes_width * intersect_boxes_height # [batch, (52*52+26*26+13*13)*3, V, 1]
- intersect_boxes_area = tf.squeeze(intersect_boxes_area) # [batch, (52*52+26*26+13*13)*3, V]
- #Calculate the noobj predict box IOU with ground truth boxes, shape:[batch, (52*52+26*26+13*13)*3, V]
- iou_boxes = intersect_boxes_area/(predict_boxes_area+gt_box_area-intersect_boxes_area) #
- iou_max = tf.reduce_max(iou_boxes, axis=2, keepdims=True) #[batch, (52*52+26*26+13*13)*3, 1]
- #iou_max = tf.expand_dims(iou_max, 2)
- #Ignore the noobj loss for the IOU larger than threshold
- no_ignore_mask = tf.cast(iou_max[...,0:1]<iou_threshold, tf.float32)
- noobj_loss_mask = tf.cast(noobj_mask*no_ignore_mask, tf.bool)
- noobj_loss_mask = tf.reshape(noobj_loss_mask, [batch_size, -1])
- noobj_predict = tf.boolean_mask(predict, noobj_loss_mask)[...,0:1]
- #noobj_predict_topk, _ = tf.nn.top_k(noobj_predict[...,0], k=500)
- #Calculate the noobj predict loss
- #loss_noobj = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(tf.zeros_like(noobj_predict_topk), noobj_predict_topk))
- loss_noobj = tf.reduce_sum(
- tf.nn.sigmoid_cross_entropy_with_logits(tf.zeros_like(noobj_predict), noobj_predict))
- loss_noobj = loss_noobj/batch_size
-
- #Calculate the scale for object coord loss
- coord_scale = 2.0-\
- (tf.exp(label[...,4:5])*grids_property[...,2:3])*\
- (tf.exp(label[...,5:6])*grids_property[...,3:4])/\
- image_size**2
- loss_xy = tf.reduce_sum(
- 0.5*coord_scale*label[...,6:7]*obj_mask*
- tf.nn.sigmoid_cross_entropy_with_logits(
- labels=label[...,2:4],
- logits=predict[...,1:3]
- )
- )
- loss_wh = tf.reduce_sum(
- 0.5*coord_scale*label[...,6:7]*obj_mask*
- tf.square(
- label[...,4:6]-
- predict[...,3:5]
- )
- )
- #Calculate the conf loss
- loss_conf = tf.reduce_sum(
- object_scale*label[...,6:7]*obj_mask*
- tf.nn.sigmoid_cross_entropy_with_logits(
- tf.ones_like(predict[...,0:1]),
- predict[...,0:1]
- )
- )
- #Calculate the predict class loss
- loss_class = tf.reduce_sum(
- class_scale*label[...,6:7]*obj_mask*
- tf.nn.sigmoid_cross_entropy_with_logits(
- labels=label[...,7:],
- logits=predict[...,5:]
- )
- )
- loss_obj = (loss_xy+loss_wh+loss_conf+loss_class)/batch_size
- final_loss = loss_obj + loss_noobj
- return final_loss
- tf_new_loss_func = tf.function(new_loss_func, experimental_relax_shapes=True)
5. 模型的训练
模型的训练过程,我是采用了自定义训练的方式来做的。YOLO论文提到训练时可以随机采用多种图片尺度,例如416*416, 608*608,352*352等,这样的好处是模型能够更好的适应不同尺寸大小的图片的检测。
随机采用多种图片尺度的代码如下:
- def random_image():
- global image_size_list,image_size
- global grid_wh_array
- global anchors_base
-
- image_size = image_size_list[random.randint(0,9)]
- #image_size = 608
- image_ratio = image_size/416
- grids_list = [image_size//8, image_size//16, image_size//32]
- anchors = [round(a*image_ratio) for a in anchors_base]
- grids_x_list = [np.reshape(np.arange(a**2)%a,[-1,1]) for a in grids_list]
- grids_x = np.vstack(grids_x_list)
- grids_x = np.reshape(np.hstack([grids_x,grids_x,grids_x]),[-1,1])
- grids_y_list = [np.reshape(np.arange(a**2)//a,[-1,1]) for a in grids_list]
- grids_y = np.vstack(grids_y_list)
- grids_y = np.reshape(np.hstack([grids_y,grids_y,grids_y]),[-1,1])
- anchors_all = np.vstack(
- [
- np.reshape(np.tile(np.reshape(np.array(anchors[:6]),[-1,6]),[grids_list[0]**2,1]),[-1,2]),
- np.reshape(np.tile(np.reshape(np.array(anchors[6:12]),[-1,6]),[grids_list[1]**2,1]),[-1,2]),
- np.reshape(np.tile(np.reshape(np.array(anchors[12:]),[-1,6]),[grids_list[2]**2,1]),[-1,2])
- ]
- )
- grid_wh_all = np.vstack(
- [
- np.tile(grid_wh_array[:1,:], (grids_list[0]**2*3,1)),
- np.tile(grid_wh_array[1:2,:], (grids_list[1]**2*3,1)),
- np.tile(grid_wh_array[2:3,:], (grids_list[2]**2*3,1))
- ]
- )
- grids_property = np.concatenate([grids_x, grids_y, anchors_all, grid_wh_all], axis=-1)
- grids_property_all = tf.constant(grids_property, dtype=tf.float32)
- grids_property_all = tf.expand_dims(grids_property_all, 0)
- grids_property_all = tf.tile(grids_property_all, [batch_size,1,1])
- return grids_property_all
自定义训练过程的代码如下:
- model_base = tf.keras.models.load_model('darknet53_20200228/epoch_42.h5')
- model_base.trainable = False
- image = tf.keras.Input(shape=(3,None,None))
- _, route1, route2, route3 = model_base(image, training=False)
- predict = darknet53_yolov3()([route1, route2, route3])
- model_yolo = tf.keras.Model(inputs=image, outputs=predict, name='model_yolo')
-
- START_EPOCH = 0
- NUM_EPOCH = 1
- STEPS_EPOCH = epoch_batch
- STEPS_OFFSET = STEPS_EPOCH*START_EPOCH
- initial_warmup_steps = 1000
- initial_lr = 0.0005
-
- optimizer=tf.keras.optimizers.SGD(learning_rate=0.00001, momentum=0.9)
- mp_opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
- def train_step(images, bbox, labels, grids_property_all):
- with tf.GradientTape() as tape:
- predict = model_yolo(images, training=True)
- regularization_loss = tf.math.add_n(model_yolo.losses)
- pred_loss = tf_new_loss_func(predict, labels, bbox, grids_property_all)
- total_loss = pred_loss + regularization_loss
- gradients = tape.gradient(total_loss, model_yolo.trainable_variables)
- mp_opt.apply_gradients(zip(gradients, model_yolo.trainable_variables))
- return total_loss, predict
- tf_train_step = tf.function(train_step, experimental_relax_shapes=True)
- #Loss rate step decay
- boundaries = [STEPS_EPOCH*4, STEPS_EPOCH*10, STEPS_EPOCH*13, STEPS_EPOCH*16]
- values = [0.0005, 0.0001, 0.00005, 0.00001, 0.00005]
- learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
-
- steps = STEPS_OFFSET
- for epoch in range(NUM_EPOCH):
- loss_sum = 0
- start_time = time.time()
- grids_property_all = new_random_image()
- train_data = iter(train_input_fn())
- #for features, labels in train_data:
- while(True):
- if steps < initial_warmup_steps:
- newlr = (initial_lr/initial_warmup_steps)*steps
- tf.keras.backend.set_value(optimizer.lr, newlr)
- features, data_labels = train_data.next()
- loss_temp, predict_temp = tf_train_step(features['images'], features['bboxes'], data_labels, grids_property_all)
- loss_sum += loss_temp
- steps += 1
- if steps%100 == 0:
- elasp_time = time.time()-start_time
- lr = tf.keras.backend.get_value(optimizer.lr)
- print("Step:{}, Image_size:{:d}, Loss:{:4.2f}, LR:{:5f}, Time:{:3.1f}s".format(steps, image_size, loss_sum/100, lr, elasp_time))
- loss_sum = 0
- if steps > initial_warmup_steps:
- tf.keras.backend.set_value(optimizer.lr, learning_rate_fn(steps))
- start_time = time.time()
- if steps%STEPS_EPOCH == 0:
- START_EPOCH += 1
- model_yolo.save('model_yolov3/yolo_v10_'+str(START_EPOCH)+'.h5')
- break
模型的训练非常耗时,在我的电脑(2080Ti)的配置下,训练一个Epoch大概要花2个小时,我训练了14个EPOCH,mAP .50的准确度大概为32%,和论文提到的57.9%还有比较大的差距。不过Darknet的源码是训练了200多个EPOCH的,可能继续训练会进一步提高准确度。这个有待以后继续验证。
6. 评价模型的性能指标
目标检测一般采用mAP来评价性能,这个指标的计算比较复杂,我是直接采用了COCO API来进行计算,这个也是和论文中的计算方法保持一致。
首先是构造COCO测试集,代码如下:
- def _parse_val_function(example_proto):
- global val_image_size
- features = {
- "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
- "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
- "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
- "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
- "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
- "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
- "label": tf.io.VarLenFeature(tf.int64),
- "bbox_xmin": tf.io.VarLenFeature(tf.int64),
- "bbox_xmax": tf.io.VarLenFeature(tf.int64),
- "bbox_ymin": tf.io.VarLenFeature(tf.int64),
- "bbox_ymax": tf.io.VarLenFeature(tf.int64),
- "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
- }
- parsed_features = tf.io.parse_single_example(example_proto, features)
- label = tf.expand_dims(parsed_features["label"].values, 0)
- label = tf.cast(label, tf.int32)
- channels = parsed_features["channels"]
- filename = parsed_features["filename"]
- #Get the coco image id as we need to use COCO API to evaluate
- image_id = tf.strings.to_number(
- tf.strings.substr(filename, 0, 12),
- tf.int32
- )
- #Decode the image
- image_raw = tf.image.decode_jpeg(parsed_features["image"], channels=3)
- image_decoded = tf.cast(image_raw, dtype=tf.float32)
- image_h = tf.constant(val_image_size)
- image_w = tf.constant(val_image_size)
- height = tf.shape(image_decoded)[0]
- width = tf.shape(image_decoded)[1]
- original_size = tf.stack([height, width], axis=0)
- original_size = tf.cast(original_size, tf.float32)
- ratio = tf.truediv(tf.cast(height, tf.float32), tf.cast(width, tf.float32))
- nh, nw = tf.cond(
- tf.less(ratio,1),
- lambda:(tf.cast(tf.cast(image_h, tf.float32)*ratio, tf.int32), image_w),
- lambda:(image_h, tf.cast(tf.cast(image_w, tf.float32)/ratio, tf.int32)))
- dx = tf.cond(
- tf.equal(image_w, nw), \
- lambda:tf.constant(0), \
- lambda:tf.cast((image_w-nw)/2, tf.int32))
- dy = tf.cond(
- tf.equal(image_h, nh), \
- lambda:0, \
- lambda:tf.cast((image_h-nh)/2, tf.int32))
- image_resize = tf.image.resize(image_decoded, [nh, nw])
- image_padded = tf.image.pad_to_bounding_box(image_resize, dy, dx, image_h, image_w)
- image_normalize = tf.subtract(image_padded, mean_RGB)
- image_normalize = tf.divide(image_normalize, std_RGB)
- image_val = tf.transpose(image_normalize, perm=[2, 0, 1])
- features = {'images':image_val, 'image_id':image_id, 'original_size':original_size}
- return features
-
- def val_input_fn():
- val_files = tf.data.Dataset.list_files("../dataset/coco/val2017_tf/*.tfrecord")
- dataset_val = val_files.interleave(tf.data.TFRecordDataset, cycle_length=12, num_parallel_calls=12)
- dataset_val = dataset_val.map(_parse_val_function, num_parallel_calls=12)
- dataset_val = dataset_val.batch(val_batch_size)
- dataset_val = dataset_val.prefetch(1)
- return dataset_val
解码预测的结果,转换为相应的BBOX,如以下代码:
- def predict_func(predict, image_id, original_size):
- global val_image_size, anchors_base
- val_grids_list = [val_image_size//8, val_image_size//16, val_image_size//32]
- image_ratio = val_image_size/416
- val_anchors = [round(a*image_ratio) for a in anchors_base]
- val_grids_x_list = [np.reshape(np.arange(a**2)%a,[-1,1]) for a in val_grids_list]
- val_grids_x = np.vstack(val_grids_x_list)
- val_grids_y_list = [np.reshape(np.arange(a**2)//a,[-1,1]) for a in val_grids_list]
- val_grids_y = np.vstack(val_grids_y_list)
- val_anchors_all = np.vstack(
- [
- np.tile(np.reshape(np.array(val_anchors[:6]),[-1,6]),[val_grids_list[0]**2,1]),
- np.tile(np.reshape(np.array(val_anchors[6:12]),[-1,6]),[val_grids_list[1]**2,1]),
- np.tile(np.reshape(np.array(val_anchors[12:]),[-1,6]),[val_grids_list[2]**2,1])
- ]
- )
- grid_wh_all = np.vstack(
- [
- np.tile(grid_wh_array[:1,:], (val_grids_list[0]**2,1)),
- np.tile(grid_wh_array[1:2,:], (val_grids_list[1]**2,1)),
- np.tile(grid_wh_array[2:3,:], (val_grids_list[2]**2,1))
- ]
- )
- val_grids_property = np.concatenate([val_grids_x, val_grids_y, val_anchors_all, grid_wh_all], axis=-1)
- val_grids_property_all = tf.constant(val_grids_property, dtype=tf.float32)
- val_grids_property_all = tf.expand_dims(val_grids_property_all, 0)
- val_grids_property_all = tf.tile(val_grids_property_all, [predict.shape[0],1,1])
- result_json = []
- original_height = original_size[...,0]
- original_width = original_size[...,1]
- hw_ratio = original_height/original_width
- hw_ratio_mask = tf.cast(tf.less(hw_ratio, 1.), tf.float32)
- ratio = \
- hw_ratio_mask*(original_width/val_image_size) + \
- (1.-hw_ratio_mask)*(original_height/val_image_size)
- dx = (1.-hw_ratio_mask)*((original_height-original_width)//2)
- dy = hw_ratio_mask*((original_width-original_height)//2)
-
- confidence_threshold = 0.2
- probabilty_threshold = 0.5
- predict_boxes_list = []
- for i in range(3):
- predict_conf = tf.nn.sigmoid(predict[...,i*vector_size:(i*vector_size+1)])
- predict_xy = tf.nn.sigmoid(predict[...,(i*vector_size+1):(i*vector_size+3)])
- predict_xy = predict_xy + val_grids_property_all[...,0:2]
- predict_x = predict_xy[...,0:1] * val_grids_property_all[...,-2:-1]
- predict_y = predict_xy[...,1:] * val_grids_property_all[...,-1:]
- predict_w = tf.exp(predict[...,(i*vector_size+3):(i*vector_size+4)])
- predict_w = predict_w * val_grids_property_all[...,(2+i*2):(2+i*2+1)]
- predict_h = tf.exp(predict[...,(i*vector_size+4):(i*vector_size+5)])
- predict_h = predict_h * val_grids_property_all[...,(2+i*2+1):(2+i*2+2)]
- min_x = tf.clip_by_value((predict_x-predict_w/2), 0, val_image_size)
- max_x = tf.clip_by_value((predict_x + predict_w/2), 0, val_image_size)
- min_y = tf.clip_by_value((predict_y - predict_h/2), 0, val_image_size)
- max_y = tf.clip_by_value((predict_y + predict_h/2), 0, val_image_size)
- predict_class = tf.argmax(predict[...,(i*vector_size+5):((i+1)*vector_size)], axis=-1)
- predict_class = tf.cast(predict_class, tf.float32)
- predict_class = tf.expand_dims(predict_class, 2)
- predict_proba = tf.nn.sigmoid(
- tf.reduce_max(
- predict[...,(i*vector_size+5):((i+1)*vector_size)], axis=-1, keepdims=True
- )
- )
- predict_box = tf.concat([predict_conf, min_x, min_y, max_x, max_y, predict_class, predict_proba], axis=-1)
- predict_boxes_list.append(predict_box)
- predict_boxes = tf.concat(predict_boxes_list, axis=1)
-
- for i in range(predict.shape[0]):
- obj_mask = tf.logical_and(
- predict_boxes[i,:,0]>=confidence_threshold,
- predict_boxes[i,:,-1]>=probabilty_threshold)
- predict_true_box = tf.boolean_mask(predict_boxes[i], obj_mask)
- predict_classes, _ = tf.unique(predict_true_box[:,5])
- predict_classes_list = tf.unstack(predict_classes)
- for class_id in predict_classes_list:
- class_mask = tf.math.equal(predict_true_box[:, 5], class_id)
- predict_true_box_class = tf.boolean_mask(predict_true_box, class_mask)
- predict_true_box_xy = predict_true_box_class[:, 1:5]
- predict_true_box_score = predict_true_box_class[:, 6]*predict_true_box_class[:, 0]
- #predict_true_box_score = predict_true_box_class[:, 0]
- selected_indices = tf.image.non_max_suppression(
- predict_true_box_xy,
- predict_true_box_score,
- 100,
- iou_threshold=0.2
- #score_threshold=confidence_threshold
- )
- #Shape [box_num, 7]
- selected_boxes = tf.gather(predict_true_box_class, selected_indices)
- original_bbox_xmin = tf.clip_by_value(
- selected_boxes[:,1:2]*ratio[i]-dx[i], 0, original_width[i])
- original_bbox_xmax = tf.clip_by_value(
- selected_boxes[:,3:4]*ratio[i]-dx[i], 0, original_width[i])
- original_bbox_ymin = tf.clip_by_value(
- selected_boxes[:,2:3]*ratio[i]-dy[i], 0, original_height[i])
- original_bbox_ymax = tf.clip_by_value(
- selected_boxes[:,4:5]*ratio[i]-dy[i], 0, original_height[i])
- original_bbox_width = original_bbox_xmax - original_bbox_xmin
- original_bbox_height = original_bbox_ymax - original_bbox_ymin
- original_bbox = tf.concat(
- [
- selected_boxes[:,0:1],
- original_bbox_xmin,
- original_bbox_ymin,
- original_bbox_width,
- original_bbox_height,
- selected_boxes[:,5:]
- ], axis=-1
- )
- original_bbox_list = tf.unstack(original_bbox)
- for item in original_bbox_list:
- result = {}
- result['image_id'] = int(image_id.numpy()[i])
- result['category_id'] = cocoid_mapping_labels[int(class_id.numpy())]
- result['bbox'] = item[1:5].numpy().tolist()
- result['bbox'] = [int(a*10)/10 for a in result['bbox']]
- result['score'] = int((item[0]*item[6]).numpy()*1000)/1000
- result['conf'] = str(int(item[0].numpy()*1000)/1000)
- result['prop'] = str(int(item[6].numpy()*1000)/1000)
- result_json.append(result)
- return result_json
利用COCO API来计算mAP:
- START_EPOCH = 14
- val_image_size = 608
- dataset_val = val_input_fn()
- all_result_json = []
- i = 0
- for val_features in dataset_val:
- predict = model_yolo(val_features['images'], training=False)
- result_json = predict_func(
- predict, val_features['image_id'], val_features['original_size']
- )
- all_result_json.extend(result_json)
- i +=1
- all_result_str = ','.join([json.dumps(item) for item in all_result_json])
- all_result_str = '['+all_result_str+']'
- result_filename = 'test_v11_epoch_'+str(START_EPOCH)+'_result.json'
- result_file = open(result_filename, 'w')
- result_file.write(all_result_str)
- result_file.close()
- cocodt = coco.loadRes(result_filename)
- annType = 'bbox'
- imgIds=sorted(coco.getImgIds())
- cocoEval = COCOeval(coco,cocodt,annType)
- cocoEval.params.imgIds = imgIds
- cocoEval.evaluate()
- cocoEval.accumulate()
- cocoEval.summarize()
最后我们来看一下模型的预测效果如何,首先是Kite图片:
再看看Darknet YOLOV3官方模型的预测结果:
看样子我的模型的预测效果还好一些,出乎意料:),官方模型有几个风筝没有检测到。不过我的模型则错误的把一个浪花检测为人。总体好像还是我的模型预测的准确一些。
再看看另外一张图片dog,以下是我的检测效果:
官方模型的检测结果:
总体来看检测结果基本一致,官方模型的IOU检测的更精确一些。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。