
A Complete Implementation of YOLO V3 with TensorFlow 2.0

If you are interested in a TensorFlow implementation of the newer YOLO v7 algorithm, see my latest post: "A Minimal TensorFlow Implementation of YOLO v7" (gzroy's blog on CSDN).

YOLO V3 is a powerful and fast object detection model that is also conceptually fairly simple. In an earlier post I showed how to implement YOLO V1 with TensorFlow, and I later implemented YOLO V3 on TensorFlow 1.x. Now that TensorFlow has evolved to 2.0, which is a big improvement over 1.x and much easier to use, I am documenting how to implement YOLO V3 with TensorFlow 2.0. Most of the TensorFlow YOLO V3 code you can find online does not include a complete training pipeline; it typically converts and loads the pretrained weights the YOLO author published for Darknet and runs detection directly. The code here implements the full training pipeline: building the Darknet53 backbone and pretraining it on ImageNet, adding the YOLO V3 detection modules and training them for object detection, evaluating the trained model, and running detection with it.

Preparing the Training Data

Two training datasets need to be prepared. The first is the ImageNet classification dataset, covering 1,000 object categories with 1.28 million images. It must first be preprocessed into TFRECORD format; the details are in my earlier post on the complete processing of the ImageNet dataset with TensorFlow (including handling of the BBOX object annotations). The second is an object detection dataset. Several are available, such as COCO (detection boxes for 80 object categories), OpenImage, and Pascal VOC. COCO is the most popular, and most object detection papers report their metrics on it, so I use COCO as well, likewise preprocessed into TFRECORD format; see my other post on preprocessing the COCO object detection data with TensorFlow.
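For reference, a minimal sketch of writing one such record with the same feature schema that the parsing functions later in this post expect (the file name and box values are placeholders):

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

#One example with the same keys that _parse_function below reads
with open('000000000009.jpg', 'rb') as f:   #placeholder image file
    img_bytes = f.read()
example = tf.train.Example(features=tf.train.Features(feature={
    'image': _bytes_feature(img_bytes),
    'height': _int64_feature([480]),
    'width': _int64_feature([640]),
    'channels': _int64_feature([3]),
    'colorspace': _bytes_feature(b'RGB'),
    'img_format': _bytes_feature(b'JPEG'),
    'label': _int64_feature([16, 56]),        #class ids of the boxes
    'bbox_xmin': _int64_feature([10, 200]),
    'bbox_xmax': _int64_feature([120, 380]),
    'bbox_ymin': _int64_feature([30, 50]),
    'bbox_ymax': _int64_feature([200, 400]),
    'filename': _bytes_feature(b'000000000009.jpg'),
}))
with tf.io.TFRecordWriter('train_0000.tfrecord') as writer:
    writer.write(example.SerializeToString())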

Building the Network Model

As described in the YOLO V3 paper, the backbone is a network called Darknet53, with 53 convolutional layers arranged as strided convolutions interleaved with residual blocks (the architecture table from the paper is omitted here).

Building a Darknet53 model with TensorFlow is straightforward:

import tensorflow as tf
from tensorflow.keras import Model
l = tf.keras.layers

def _conv(inputs, filters, kernel_size, strides, padding, bias=False, normalize=True, activation='relu', last=False):
    output = inputs
    padding_str = 'same'
    if padding > 0:
        output = l.ZeroPadding2D(padding=padding, data_format='channels_first')(output)
        padding_str = 'valid'
    output = l.Conv2D(filters, kernel_size, strides, padding_str,
                      'channels_first', use_bias=bias,
                      kernel_initializer='he_normal',
                      kernel_regularizer=tf.keras.regularizers.l2(l=5e-4))(output)
    if normalize:
        if not last:
            output = l.BatchNormalization(axis=1)(output)
        else:
            output = l.BatchNormalization(axis=1, gamma_initializer='zeros')(output)
    if activation == 'relu':
        output = l.ReLU()(output)
    if activation == 'relu6':
        output = l.ReLU(max_value=6)(output)
    if activation == 'leaky_relu':
        output = l.LeakyReLU(alpha=0.1)(output)
    return output

def _residual(inputs, out_channels, activation='relu', name=None):
    output1 = _conv(inputs, out_channels//2, 1, 1, 0, False, True, 'leaky_relu', False)
    output2 = _conv(output1, out_channels, 3, 1, 1, False, True, 'leaky_relu', True)
    output = l.Add(name=name)([inputs, output2])
    return output

def darknet53_base():
    image = tf.keras.Input(shape=(3,None,None))
    net = _conv(image, 32, 3, 1, 1, False, True, 'leaky_relu')  #32*H*W
    net = _conv(net, 64, 3, 2, 1, False, True, 'leaky_relu')    #64*H/2*W/2
    net = _residual(net, 64, 'leaky_relu')                      #64*H/2*W/2
    net = _conv(net, 128, 3, 2, 1, False, True, 'leaky_relu')   #128*H/4*W/4
    net = _residual(net, 128, 'leaky_relu')                     #128*H/4*W/4
    net = _residual(net, 128, 'leaky_relu')                     #128*H/4*W/4
    net = _conv(net, 256, 3, 2, 1, False, True, 'leaky_relu')   #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                     #256*H/8*W/8
    route1 = l.Activation('linear', dtype='float32', name='route1')(net)
    net = _conv(net, 512, 3, 2, 1, False, True, 'leaky_relu')   #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                     #512*H/16*W/16
    route2 = l.Activation('linear', dtype='float32', name='route2')(net)
    net = _conv(net, 1024, 3, 2, 1, False, True, 'leaky_relu')  #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                    #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                    #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                    #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                    #1024*H/32*W/32
    route3 = l.Activation('linear', dtype='float32', name='route3')(net)
    net = tf.reduce_mean(net, axis=[2,3], keepdims=True)
    net = _conv(net, 1000, 1, 1, 0, True, False, 'linear')      #1000
    net = l.Flatten(data_format='channels_first', name='logits')(net)
    net = l.Activation('linear', dtype='float32', name='output')(net)
    model = tf.keras.Model(inputs=image, outputs=[net, route1, route2, route3])
    return model

We first pretrain this backbone on ImageNet so that it learns to extract useful image features. I trained it for 30 epochs and reached 71% Top-1 and 91% Top-5 accuracy. The training details are in my post on ImageNet classification training with TensorFlow 2.0.

Once the backbone is trained, we add further convolutional layers on top of it to build a feature pyramid (FPN) style architecture. It uses the backbone outputs route1, route2, and route3, feature maps at three different resolutions, to build a grid-based detector at 8x, 16x, and 32x downsampling. For example, with 416*416 training images the network outputs detections on 52*52, 26*26, and 13*13 grids. Following the Darknet source code, I built the YOLO V3 head as follows:

category_num = 80
vector_size = 3*(1+4+category_num)

def darknet53_yolov3(image_height=416, image_width=416):
    route1 = tf.keras.Input(shape=(256,None,None), name='input1')   #256*H/8*W/8
    route2 = tf.keras.Input(shape=(512,None,None), name='input2')   #512*H/16*W/16
    route3 = tf.keras.Input(shape=(1024,None,None), name='input3')  #1024*H/32*W/32
    net = _conv(route3, 512, 1, 1, 0, False, True, 'leaky_relu')    #512*H/32*W/32
    net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu')      #1024*H/32*W/32
    net = _conv(net, 512, 1, 1, 0, False, True, 'leaky_relu')       #512*H/32*W/32
    net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu')      #1024*H/32*W/32
    net = _conv(net, 512, 1, 1, 0, False, True, 'leaky_relu')       #512*H/32*W/32
    route4 = tf.identity(net, 'route4')
    net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu')      #1024*H/32*W/32
    predict1 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear') #vector_size*H/32*W/32
    predict1 = l.Activation('linear', dtype='float32')(predict1)
    predict1 = l.Reshape((vector_size, image_height//32*image_width//32))(predict1)
    net = _conv(route4, 256, 1, 1, 0, False, True, 'leaky_relu')    #256*H/32*W/32
    net = l.UpSampling2D((2,2), "channels_first", 'nearest')(net)   #256*H/16*W/16
    net = l.Concatenate(axis=1)([route2, net])                      #768*H/16*W/16
    net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu')       #256*H/16*W/16
    net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu')       #512*H/16*W/16
    net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu')       #256*H/16*W/16
    net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu')       #512*H/16*W/16
    net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu')       #256*H/16*W/16
    route5 = tf.identity(net, 'route5')
    net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu')       #512*H/16*W/16
    predict2 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear') #vector_size*H/16*W/16
    predict2 = l.Activation('linear', dtype='float32')(predict2)
    predict2 = l.Reshape((vector_size, image_height//16*image_width//16))(predict2)
    net = _conv(route5, 128, 1, 1, 0, False, True, 'leaky_relu')    #128*H/16*W/16
    net = l.UpSampling2D((2,2), "channels_first", 'nearest')(net)   #128*H/8*W/8
    net = l.Concatenate(axis=1)([route1, net])                      #384*H/8*W/8
    net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu')       #128*H/8*W/8
    net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu')       #256*H/8*W/8
    net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu')       #128*H/8*W/8
    net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu')       #256*H/8*W/8
    net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu')       #128*H/8*W/8
    net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu')       #256*H/8*W/8
    predict3 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear') #vector_size*H/8*W/8
    predict3 = l.Activation('linear', dtype='float32')(predict3)
    predict3 = l.Reshape((vector_size, image_height//8*image_width//8))(predict3)
    predict = l.Concatenate()([predict3, predict2, predict1])
    predict = tf.transpose(predict, perm=[0, 2, 1], name='predict')
    model = tf.keras.Model(inputs=[route1, route2, route3], outputs=predict, name='darknet53_yolo')
    return model

As you can see, this model takes the backbone's three outputs route1, route2, and route3 as inputs, and its final Predict output concatenates the predictions from the three scales.

The YOLO V3 Training Process

With the training data and the network model in place, we can start training. The process consists of the following steps:

1. Train the backbone at a higher resolution

The backbone was trained to extract image features at 224*224. For object detection this resolution is too low, especially for detecting small objects, so we train at a higher resolution such as 416*416. Training the backbone for a few more epochs at the higher resolution lets the network adapt to it and ultimately improves detection performance. To do this, we reload the previously trained backbone and continue training.
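A minimal sketch of this step, assuming a checkpoint name and an imagenet_input_fn input pipeline (both placeholders; the real pipeline is in the ImageNet post linked above):

import tensorflow as tf

#Reload the 224*224-pretrained backbone and continue training at 416*416
model_base = tf.keras.models.load_model('darknet53/epoch_30.h5')  #placeholder checkpoint name
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def finetune_step(images, labels):
    with tf.GradientTape() as tape:
        logits, _, _, _ = model_base(images, training=True)  #ignore the route outputs here
        loss = loss_fn(labels, logits)
    gradients = tape.gradient(loss, model_base.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model_base.trainable_variables))
    return loss

#imagenet_input_fn (assumed) yields (image, label) batches already resized to 416*416
for images, labels in imagenet_input_fn(image_size=416):
    finetune_step(images, labels)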

2. Combine the backbone and the detection network

Combine the pretrained backbone with the detection network to form the full YOLO V3 model. A training image first passes through the backbone, which extracts features and outputs route1, route2, and route3 at three resolutions; these feed into the detection network, which is trained to produce predictions at the three scales. In this combined network the backbone's parameters are set to non-trainable, so only the detection network's parameters are trained. The code is as follows:

#Load the pretrained backbone model
model_base = tf.keras.models.load_model('darknet53/epoch_60.h5')
model_base.trainable = False
image = tf.keras.Input(shape=(3,image_height,image_width))
_, route1, route2, route3 = model_base(image, training=False)
#The detect model will accept the backbone model output as input
predict = darknet53_yolov3(image_height, image_width)([route1, route2, route3])
#Construct the combined yolo model
model_yolo = tf.keras.Model(inputs=image, outputs=predict, name='model_yolo')

3. Read and preprocess the training data

Read the COCO training images and their bounding boxes, then apply data augmentation and generate the training labels. Besides the augmentations from the Darknet source code, I also follow the augmentation pipeline proposed in https://arxiv.org/pdf/1902.04103.pdf and add Mixup: two images are taken at a time, given random transparencies, and overlaid into one image.
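Viewed in isolation, the Mixup step is just a Beta-weighted blend; a minimal sketch (image1 and image2 stand for the two decoded images of one pair):

import numpy as np

lam = np.random.beta(1.5, 1.5)            #blending weight drawn from Beta(1.5, 1.5)
image_mix = lam*image1 + (1.-lam)*image2  #pixel-wise blend of the two images
#Both images' boxes are kept; each box carries its image's weight (lam or 1-lam),
#which later scales that box's terms in the loss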

Since the box positions are affected, the augmentation must adjust the boxes accordingly. The steps are:

  1. Image scaling: randomly scale the width and height (the scaling factor is a random number between 0.7 and 1.3), compute the resulting width/height ratio, resize the longer side to the input dimension 416, and scale the shorter side by the same ratio.
  2. Image padding: after the previous step the longer side is 416 pixels; pad the shorter side so it also reaches 416.
  3. Box adjustment: adjust the box positions according to the above transforms.
  4. Randomly flip the image.
  5. Adjust the boxes again.
  6. Randomly adjust the image's saturation, brightness, and so on.
  7. Add PCA noise.
  8. Normalize the image's RGB channels.
  9. Decide from each box's size which anchor is responsible for predicting it.
  10. Use the image data and the boxes as the Feature.
  11. Generate the Label from the boxes. Its per-cell dimension is (1+1+4+1+80)*3=261: within each 87-value anchor slot, position 1 is the grid id, position 2 indicates whether an object is present, positions 3-6 hold the object's center coordinates and width/height, position 7 holds the mixup weight, and the last 80 values identify the object's class (see the layout sketch after this list).
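As a concrete illustration of that layout, here is one grid cell's 261-element label row with a hypothetical box assigned to its second anchor (all values are made up; _label_fn below fills these slots for real boxes):

import numpy as np

label_vector_size = 1 + 1 + 4 + 1 + 80   #87 values per anchor
row = np.zeros(label_vector_size*3)      #261 values per grid cell (3 anchors)

index = 1*label_vector_size              #slot for anchor offset 1 (indices 87-173)
row[index+0] = 42.                       #grid id (hypothetical cell number)
row[index+1] = 1.                        #an object is assigned to this anchor
row[index+2], row[index+3] = 0.5, 0.5    #x/y offset of the center within the cell
row[index+4], row[index+5] = 0.1, -0.2   #log-scale w/h relative to the anchor
row[index+6] = 0.8                       #mixup weight of this box's source image
row[index+7+16] = 0.9                    #class 16, smoothed as in _label_fn below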

First, define some model parameters:

import math
import random
import numpy as np

mixup_flag = True
#Parameters for PCA noise
eigvec = tf.constant(
    [
        [-0.5675, 0.7192, 0.4009],
        [-0.5808, -0.0045, -0.8140],
        [-0.5836, -0.6948, 0.4203]
    ],
    shape=[3,3],
    dtype=tf.float32
)
eigval = tf.constant([55.46, 4.794, 1.148], shape=[3,1], dtype=tf.float32)
#Parameters for normalization
mean_RGB = tf.constant([123.68, 116.779, 109.939], dtype=tf.float32)
std_RGB = tf.constant([58.393, 57.12, 57.375], dtype=tf.float32)
#Train and valid batch size
batch_size = 16
val_batch_size = 10
epoch_size = 118287
epoch_batch = int(epoch_size/batch_size)
#Parameters for yolo loss scale
no_object_scale = 1.0
iou_threshold = 0.7
object_scale = 3.0
class_scale = 1.0
jitter = 0.3
#Label and prediction vector size
category_num = 80
label_vector_size = 1+1+4+1+category_num #index 0:grid_id,1:obj_conf,2-5:(x,y,w,h),6:mixup weight,7-86:category
vector_size = 1+4+category_num #index 0:obj_conf,1-4:(x,y,w,h),5-84:category
#Images parameter
image_size_list = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
image_size = image_size_list[random.randint(0,9)]
val_image_size = 608
#Grids parameter
grid_wh_array = np.array([[8.,8.],[16.,16.],[32.,32.]])
grid_size = [8.,16.,32.]
#The Anchor size for image_size 416*416
anchors_base = [10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326]

Define the function that parses the training files:

def _parse_function(example_proto):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
        "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.VarLenFeature(tf.int64),
        "bbox_xmin": tf.io.VarLenFeature(tf.int64),
        "bbox_xmax": tf.io.VarLenFeature(tf.int64),
        "bbox_ymin": tf.io.VarLenFeature(tf.int64),
        "bbox_ymax": tf.io.VarLenFeature(tf.int64),
        "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
    }
    parsed_features = tf.io.parse_single_example(example_proto, features)
    label = tf.expand_dims(parsed_features["label"].values, 0)
    label = tf.cast(label, tf.float32)
    image_raw = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    image_decoded = tf.cast(image_raw, dtype=tf.float32)
    filename = parsed_features["filename"]
    #Get the coco image id as we need to use COCO API to evaluate
    image_id = tf.strings.to_number(tf.strings.substr(filename, 0, 12), tf.int32)
    image_id = tf.expand_dims(image_id, 0)
    #Get the bbox
    xmin = tf.cast(tf.expand_dims(parsed_features["bbox_xmin"].values, 0), tf.float32)
    xmax = tf.cast(tf.expand_dims(parsed_features["bbox_xmax"].values, 0), tf.float32)
    ymin = tf.cast(tf.expand_dims(parsed_features["bbox_ymin"].values, 0), tf.float32)
    ymax = tf.cast(tf.expand_dims(parsed_features["bbox_ymax"].values, 0), tf.float32)
    mixup_w = tf.ones_like(xmin)
    boxes = tf.concat([xmin,ymin,xmax,ymax,label,mixup_w], axis=0)
    boxes = tf.transpose(boxes, [1, 0])
    return {'image':image_decoded, 'bbox':boxes, 'imageid':image_id}

Define a flat_map function; the dataset reads two images at a time and this function combines each pair:

def _flatmap_function(feature):
    dataset_image = feature['image'].padded_batch(2, [-1,-1,3])
    dataset_bbox = feature['bbox'].padded_batch(2, [-1,6])
    dataset_combined = tf.data.Dataset.zip({'image':dataset_image, 'bbox':dataset_bbox})
    return dataset_combined

The mixup function augments the combined pair of images and also generates the training labels:

def _label_fn(bbox):
    global image_size, grid_wh_array, anchors_base
    grids_list = [image_size//8, image_size//16, image_size//32]
    image_ratio = image_size/416
    anchors = [round(a*image_ratio) for a in anchors_base]
    labels_list = [np.zeros([a**2, label_vector_size]) for a in grids_list]
    for i in range(3):
        labels_list[i][:,0] = np.arange(grids_list[i]**2)
    labels_list = [np.tile(a, 3) for a in labels_list]
    box_num, _ = bbox.shape
    for i in range(box_num):
        center_x = (bbox[i,0]+bbox[i,2])/2
        center_y = (bbox[i,1]+bbox[i,3])/2
        if (center_x==0 and center_y==0):
            continue
        box_width = bbox[i,2]-bbox[i,0]
        box_height = bbox[i,3]-bbox[i,1]
        label = int(bbox[i,4].numpy())
        anchor_id = int(bbox[i,5].numpy())
        featuremap_id = anchor_id//3
        anchorid_offset = anchor_id%3
        g_h = grid_wh_array[featuremap_id,1]
        g_w = grid_wh_array[featuremap_id,0]
        grid_id = int((center_y//g_h*grids_list[featuremap_id] + center_x//g_w).numpy())
        index = anchorid_offset*label_vector_size
        #set the object exist flag
        labels_list[featuremap_id][grid_id, index+1] = 1.
        #set the center_x_offset
        labels_list[featuremap_id][grid_id, index+2] = (center_x%g_w)/g_w
        #set the center_y_offset
        labels_list[featuremap_id][grid_id, index+3] = (center_y%g_h)/g_h
        #set the width
        labels_list[featuremap_id][grid_id, index+4] = math.log(box_width/anchors[2*anchor_id])
        #set the height
        labels_list[featuremap_id][grid_id, index+5] = math.log(box_height/anchors[2*anchor_id+1])
        #set the mixup weight
        labels_list[featuremap_id][grid_id, index+6] = bbox[i,6]
        #set the class label, using label smoothing
        labels_list[featuremap_id][grid_id, (index+7):(index+label_vector_size)] = 0.1/(category_num-1)
        labels_list[featuremap_id][grid_id, index+7+label] = 0.9
        #labels_list[featuremap_id][grid_id, index+7+label] = 1.0
    return tf.concat(labels_list, axis=0)

def _mixup_function(features):
    global anchors_base, image_size, mixup_flag, grid_size
    image_ratio = image_size/416
    anchors = [round(a*image_ratio) for a in anchors_base]
    image_height = image_size
    image_width = image_size
    images = features['image']
    bboxes = features['bbox']
    #imageid = features['imageid']
    if mixup_flag:
        lam = np.random.beta(1.5, 1.5, 1)
        lam_all = np.vstack([lam, 1.-lam])
        lam_all = np.expand_dims(lam_all, 1)
        #bboxes = tf.cast(bboxes, tf.float32)
        mixup_w = bboxes[...,-1:] + lam_all
        bboxes_mixup = tf.concat([bboxes[...,:-1], mixup_w], axis=-1)
        bboxes_mixup = tf.reshape(bboxes_mixup, [-1,6])
        #Keep only the real boxes (padded boxes are all zeros)
        true_box_mask = tf.logical_or(
            bboxes_mixup[:,2]>0,
            bboxes_mixup[:,3]>0
        )
        bboxes_all = tf.boolean_mask(bboxes_mixup, true_box_mask)
        image_mix = (images[0]*lam[0] + images[1]*(1.-lam[0]))
    else:
        image_mix = images
        bboxes_all = bboxes
    #Random jitter and resize the image
    height = tf.shape(image_mix)[0]
    width = tf.shape(image_mix)[1]
    dw = jitter*tf.cast(width, tf.float32)
    dh = jitter*tf.cast(height, tf.float32)
    new_ar = tf.truediv(
        tf.add(
            tf.cast(width, tf.float32),
            tf.random.uniform([1], minval=tf.math.negative(dw), maxval=dw)),
        tf.add(
            tf.cast(height, tf.float32),
            tf.random.uniform([1], minval=tf.math.negative(dh), maxval=dh)))
    nh, nw = tf.cond(
        tf.less(new_ar[0], 1),
        lambda: (image_height, tf.cast(tf.cast(image_height, tf.float32)*new_ar[0], tf.int32)),
        lambda: (tf.cast(tf.cast(image_width, tf.float32)/new_ar[0], tf.int32), image_width)
    )
    dx = tf.cond(
        tf.equal(image_width, nw),
        lambda: tf.constant([0]),
        lambda: tf.random.uniform([1], minval=0, maxval=(image_width-nw), dtype=tf.int32)
    )
    dy = tf.cond(
        tf.equal(image_height, nh),
        lambda: tf.constant([0]),
        lambda: tf.random.uniform([1], minval=0, maxval=(image_height-nh), dtype=tf.int32)
    )
    image_resize = tf.image.resize(image_mix, [nh, nw])
    image_padded = tf.image.pad_to_bounding_box(image_resize, dy[0], dx[0], image_height, image_width)
    #Adjust the boxes
    xmin_new = tf.cast(tf.truediv(nw, width) * tf.cast(bboxes_all[:,0:1], tf.float64), tf.int32) + dx
    xmax_new = tf.cast(tf.truediv(nw, width) * tf.cast(bboxes_all[:,2:3], tf.float64), tf.int32) + dx
    ymin_new = tf.cast(tf.truediv(nh, height) * tf.cast(bboxes_all[:,1:2], tf.float64), tf.int32) + dy
    ymax_new = tf.cast(tf.truediv(nh, height) * tf.cast(bboxes_all[:,3:4], tf.float64), tf.int32) + dy
    #Random flip flag
    random_flip_flag = tf.random.uniform([1], minval=0, maxval=1, dtype=tf.float32)
    def flip_box():
        xmax_flip = image_width - xmin_new
        xmin_flip = image_width - xmax_new
        image_flip = tf.image.flip_left_right(image_padded)
        return xmin_flip, xmax_flip, image_flip
    def notflip():
        return xmin_new, xmax_new, image_padded
    xmin_flip, xmax_flip, image_flip = tf.cond(tf.less(random_flip_flag[0], 0.5), notflip, flip_box)
    boxes_width = xmax_flip-xmin_flip
    boxes_height = ymax_new-ymin_new
    boxes_area = boxes_width*boxes_height
    #Determine the anchor
    iou_list = []
    for i in range(9):
        intersect_area = tf.minimum(boxes_width, anchors[2*i])*tf.minimum(boxes_height, anchors[2*i+1])
        union_area = boxes_area+anchors[2*i]*anchors[2*i+1]-intersect_area
        iou_list.append(intersect_area/union_area)
    iou = tf.concat(iou_list, axis=1)
    anchor_id = tf.reshape(tf.argmax(iou, axis=1), [-1,1])
    #Random distort the image
    distorted = tf.image.random_hue(image_flip, max_delta=0.3)
    distorted = tf.image.random_saturation(distorted, lower=0.6, upper=1.4)
    distorted = tf.image.random_brightness(distorted, max_delta=0.3)
    #Add PCA noise
    alpha = tf.random.normal([3], mean=0.0, stddev=0.1)
    pca_noise = tf.reshape(tf.matmul(tf.multiply(eigvec, alpha), eigval), [3])
    distorted = tf.add(distorted, pca_noise)
    #Normalize RGB
    distorted = tf.subtract(distorted, mean_RGB)
    distorted = tf.divide(distorted, std_RGB)
    #Get the adjusted boxes
    xmin_flip = tf.cast(xmin_flip, tf.float32)
    xmax_flip = tf.cast(xmax_flip, tf.float32)
    ymin_new = tf.cast(ymin_new, tf.float32)
    ymax_new = tf.cast(ymax_new, tf.float32)
    anchor_id = tf.cast(anchor_id, tf.float32)
    boxes_new = tf.concat([xmin_flip,ymin_new,xmax_flip,ymax_new,bboxes_all[:,4:5],anchor_id,bboxes_all[:,-1:]], axis=1)
    #Remove the boxes whose height or width is less than 5 pixels
    boxes_mask = tf.math.logical_and(
        tf.math.greater((boxes_new[:,2]-boxes_new[:,0]), 5),
        tf.math.greater((boxes_new[:,3]-boxes_new[:,1]), 5))
    boxes_new = tf.boolean_mask(boxes_new, boxes_mask)
    boxes_new = tf.cast(boxes_new, tf.float32)
    #Generate the labels
    labels = tf.py_function(_label_fn, [boxes_new], [tf.float64])
    labels = tf.cast(labels, tf.float32)
    image_train = tf.transpose(distorted, perm=[2, 0, 1])
    #features = {'images':image_train, 'bboxes':boxes_new, 'images_flip':image_flip, 'image_id':imageid}
    features = {'images':image_train, 'bboxes':boxes_new}
    return features, labels[0]

Now the training dataset can be constructed:

def train_input_fn():
    global image_size
    train_files = tf.data.Dataset.list_files("../dataset/coco/train2017_tf/*.tfrecord")
    dataset_train = train_files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset_train = dataset_train.shuffle(buffer_size=1000, reshuffle_each_iteration=True)
    dataset_train = dataset_train.repeat(8)
    dataset_train = dataset_train.map(_parse_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    if mixup_flag:
        dataset_train = dataset_train.window(2)
        dataset_train = dataset_train.flat_map(_flatmap_function)
    dataset_train = dataset_train.map(_mixup_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset_train = dataset_train.padded_batch(batch_size,
        padded_shapes=(
            {
                'images': [3,image_size,image_size],
                'bboxes': [None,7]
            },
            [None, label_vector_size*3]
        )
    )
    dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset_train

An example of the augmented output is shown in the original post's figure (omitted here).

4. Define the loss function

This is the most challenging part of the whole training process. Following the YOLO V3 source code, the loss function consists of two parts:

  1. The loss for grid cells with no assigned object. These cells have no object, so we only penalize the difference between their predicted objectness and the label, and do not penalize their predicted box positions or class probabilities. The paper also notes, however, that even though these cells are not responsible for a prediction, their objectness loss should be ignored when their predicted box has an IoU above a threshold (0.5) with some ground-truth box. Since the number of ground-truth boxes varies per image, computing the IoU between every ground-truth box and every predicted box takes some care; I do it by broadcasting (see the shape sketch after this list). The ground-truth BBOX feature I pass in has shape [batch, V, 4], where V is the variable box count padded up to the batch maximum and 4 stands for xmin, ymin, xmax, ymax. For example, if the image with the most boxes in a batch has 12, then V=12 and every other image's boxes are padded to 12. This tensor is expanded to [batch, 1, V, 4], while the predicted boxes have shape [batch, 52**2+26**2+13**2, 4*3]: the second dimension is the total cell count over the three grid sizes, and the third dimension holds the 3 boxes each cell predicts. The IoU between the predicted and ground-truth boxes is computed by broadcasting, the maximum is taken, and if it exceeds the threshold the cell's objectness is not penalized. The loss used here is cross-entropy.
  2. The loss for grid cells responsible for an object. This part is simpler: for the object's center coordinates, compute the cross-entropy between the prediction (after sigmoid activation) and the label; for the width and height, use the squared error; and for the objectness and class probabilities, use cross-entropy. Each term is further multiplied by its own scale coefficient.
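To make the broadcasting concrete, here is a small shape check (10647 = (52**2+26**2+13**2)*3 predicted boxes at 416*416; the tensors are zeros purely to show the shapes):

import tensorflow as tf

pred_boxes = tf.zeros([2, 10647, 1, 4])   #[batch, all predicted boxes, 1, 4]
gt_boxes = tf.zeros([2, 1, 12, 4])        #[batch, 1, V=12, 4]
inter_w = tf.minimum(pred_boxes[...,2:3], gt_boxes[...,2:3]) - \
          tf.maximum(pred_boxes[...,0:1], gt_boxes[...,0:1])
print(inter_w.shape)  #(2, 10647, 12, 1): every predicted box paired with every ground-truth box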

The full code is as follows:

# Predicts, combination of three dimensions, [batch, 52*52+26*26+13*13, 85*3]
# Labels, combination of three dimensions, [batch, 52*52+26*26+13*13, 87*3]
def new_loss_func(predict, label, gt_box, grids_property):
    global image_size
    predict = tf.reshape(predict, [batch_size,-1,vector_size])      #[batch, (52*52+26*26+13*13)*3, 85]
    label = tf.reshape(label, [batch_size,-1,label_vector_size])    #[batch, (52*52+26*26+13*13)*3, 87]
    noobj_mask = tf.cast(label[...,1:2]==0.0, tf.float32)
    obj_mask = tf.cast(label[...,1:2]==1.0, tf.float32)
    #Get the predict box center xy
    predict_xy = (grids_property[...,0:2]+tf.nn.sigmoid(predict[...,1:3]))*grids_property[...,-2:]
    #Get the predict box wh, only calculate the noobj wh
    predict_half_wh = tf.exp(predict[...,3:5])*grids_property[...,2:4]/2
    predict_xmin = tf.clip_by_value((predict_xy[...,0:1]-predict_half_wh[...,0:1]), 0, image_size)
    predict_xmax = tf.clip_by_value((predict_xy[...,0:1]+predict_half_wh[...,0:1]), 0, image_size)
    predict_ymin = tf.clip_by_value((predict_xy[...,1:2]-predict_half_wh[...,1:2]), 0, image_size)
    predict_ymax = tf.clip_by_value((predict_xy[...,1:2]+predict_half_wh[...,1:2]), 0, image_size)
    predict_boxes_area = (predict_xmax-predict_xmin)*(predict_ymax-predict_ymin)  #[batch, (52*52+26*26+13*13)*3, 1]
    #Assemble the predict box coords and expand dim, shape: [batch, (52*52+26*26+13*13)*3, 1, 4]
    predict_boxes = tf.concat([predict_xmin,predict_ymin,predict_xmax,predict_ymax], axis=-1)
    predict_boxes = tf.expand_dims(predict_boxes, 2)
    #Expand ground truth boxes dim for broadcast, shape: [batch, 1, V, 4]
    gt_box = tf.expand_dims(gt_box, 1)
    gt_box = tf.cast(gt_box, tf.float32)
    #gt_box_area = (gt_box[...,2:3]-gt_box[...,0:1])*(gt_box[...,3:4]-gt_box[...,1:2]) #[batch, 1, V, 1]
    gt_box_area = (gt_box[...,2]-gt_box[...,0])*(gt_box[...,3]-gt_box[...,1])  #[batch, 1, V]
    #Broadcast calculation, intersect_boxes_width shape [batch, noobjs_num, V, 1]
    intersect_boxes_width = tf.minimum(predict_boxes[...,2:3], gt_box[...,2:3])-tf.maximum(predict_boxes[...,0:1], gt_box[...,0:1])
    intersect_boxes_width = tf.clip_by_value(intersect_boxes_width, clip_value_min=0, clip_value_max=image_size)
    intersect_boxes_height = tf.minimum(predict_boxes[...,3:4], gt_box[...,3:4])-tf.maximum(predict_boxes[...,1:2], gt_box[...,1:2])
    intersect_boxes_height = tf.clip_by_value(intersect_boxes_height, clip_value_min=0, clip_value_max=image_size)
    intersect_boxes_area = intersect_boxes_width * intersect_boxes_height  #[batch, (52*52+26*26+13*13)*3, V, 1]
    intersect_boxes_area = tf.squeeze(intersect_boxes_area)               #[batch, (52*52+26*26+13*13)*3, V]
    #Calculate the noobj predict box IOU with ground truth boxes, shape: [batch, (52*52+26*26+13*13)*3, V]
    iou_boxes = intersect_boxes_area/(predict_boxes_area+gt_box_area-intersect_boxes_area)
    iou_max = tf.reduce_max(iou_boxes, axis=2, keepdims=True)  #[batch, (52*52+26*26+13*13)*3, 1]
    #iou_max = tf.expand_dims(iou_max, 2)
    #Ignore the noobj loss for the IOU larger than threshold
    no_ignore_mask = tf.cast(iou_max[...,0:1]<iou_threshold, tf.float32)
    noobj_loss_mask = tf.cast(noobj_mask*no_ignore_mask, tf.bool)
    noobj_loss_mask = tf.reshape(noobj_loss_mask, [batch_size, -1])
    noobj_predict = tf.boolean_mask(predict, noobj_loss_mask)[...,0:1]
    #noobj_predict_topk, _ = tf.nn.top_k(noobj_predict[...,0], k=500)
    #Calculate the noobj predict loss
    #loss_noobj = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(tf.zeros_like(noobj_predict_topk), noobj_predict_topk))
    loss_noobj = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(tf.zeros_like(noobj_predict), noobj_predict))
    loss_noobj = loss_noobj/batch_size
    #Calculate the scale for object coord loss
    coord_scale = 2.0-\
        (tf.exp(label[...,4:5])*grids_property[...,2:3])*\
        (tf.exp(label[...,5:6])*grids_property[...,3:4])/\
        image_size**2
    loss_xy = tf.reduce_sum(
        0.5*coord_scale*label[...,6:7]*obj_mask*
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=label[...,2:4],
            logits=predict[...,1:3]
        )
    )
    loss_wh = tf.reduce_sum(
        0.5*coord_scale*label[...,6:7]*obj_mask*
        tf.square(
            label[...,4:6]-
            predict[...,3:5]
        )
    )
    #Calculate the conf loss
    loss_conf = tf.reduce_sum(
        object_scale*label[...,6:7]*obj_mask*
        tf.nn.sigmoid_cross_entropy_with_logits(
            tf.ones_like(predict[...,0:1]),
            predict[...,0:1]
        )
    )
    #Calculate the predict class loss
    loss_class = tf.reduce_sum(
        class_scale*label[...,6:7]*obj_mask*
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=label[...,7:],
            logits=predict[...,5:]
        )
    )
    loss_obj = (loss_xy+loss_wh+loss_conf+loss_class)/batch_size
    final_loss = loss_obj + loss_noobj
    return final_loss

tf_new_loss_func = tf.function(new_loss_func, experimental_relax_shapes=True)

5. Training the model

I train the model with a custom training loop. The YOLO paper mentions randomly switching among several image sizes during training, e.g. 416*416, 608*608, and 352*352, which helps the model adapt to detecting objects in images of different sizes.

The code that randomly picks an image size is as follows:

def random_image():
    global image_size_list, image_size
    global grid_wh_array
    global anchors_base
    image_size = image_size_list[random.randint(0,9)]
    #image_size = 608
    image_ratio = image_size/416
    grids_list = [image_size//8, image_size//16, image_size//32]
    anchors = [round(a*image_ratio) for a in anchors_base]
    grids_x_list = [np.reshape(np.arange(a**2)%a, [-1,1]) for a in grids_list]
    grids_x = np.vstack(grids_x_list)
    grids_x = np.reshape(np.hstack([grids_x,grids_x,grids_x]), [-1,1])
    grids_y_list = [np.reshape(np.arange(a**2)//a, [-1,1]) for a in grids_list]
    grids_y = np.vstack(grids_y_list)
    grids_y = np.reshape(np.hstack([grids_y,grids_y,grids_y]), [-1,1])
    anchors_all = np.vstack(
        [
            np.reshape(np.tile(np.reshape(np.array(anchors[:6]), [-1,6]), [grids_list[0]**2,1]), [-1,2]),
            np.reshape(np.tile(np.reshape(np.array(anchors[6:12]), [-1,6]), [grids_list[1]**2,1]), [-1,2]),
            np.reshape(np.tile(np.reshape(np.array(anchors[12:]), [-1,6]), [grids_list[2]**2,1]), [-1,2])
        ]
    )
    grid_wh_all = np.vstack(
        [
            np.tile(grid_wh_array[:1,:], (grids_list[0]**2*3,1)),
            np.tile(grid_wh_array[1:2,:], (grids_list[1]**2*3,1)),
            np.tile(grid_wh_array[2:3,:], (grids_list[2]**2*3,1))
        ]
    )
    grids_property = np.concatenate([grids_x, grids_y, anchors_all, grid_wh_all], axis=-1)
    grids_property_all = tf.constant(grids_property, dtype=tf.float32)
    grids_property_all = tf.expand_dims(grids_property_all, 0)
    grids_property_all = tf.tile(grids_property_all, [batch_size,1,1])
    return grids_property_all

The custom training loop is as follows:

import time

model_base = tf.keras.models.load_model('darknet53_20200228/epoch_42.h5')
model_base.trainable = False
image = tf.keras.Input(shape=(3,None,None))
_, route1, route2, route3 = model_base(image, training=False)
predict = darknet53_yolov3()([route1, route2, route3])
model_yolo = tf.keras.Model(inputs=image, outputs=predict, name='model_yolo')

START_EPOCH = 0
NUM_EPOCH = 1
STEPS_EPOCH = epoch_batch
STEPS_OFFSET = STEPS_EPOCH*START_EPOCH
initial_warmup_steps = 1000
initial_lr = 0.0005
optimizer = tf.keras.optimizers.SGD(learning_rate=0.00001, momentum=0.9)
mp_opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

def train_step(images, bbox, labels, grids_property_all):
    with tf.GradientTape() as tape:
        predict = model_yolo(images, training=True)
        regularization_loss = tf.math.add_n(model_yolo.losses)
        pred_loss = tf_new_loss_func(predict, labels, bbox, grids_property_all)
        total_loss = pred_loss + regularization_loss
    gradients = tape.gradient(total_loss, model_yolo.trainable_variables)
    mp_opt.apply_gradients(zip(gradients, model_yolo.trainable_variables))
    return total_loss, predict
tf_train_step = tf.function(train_step, experimental_relax_shapes=True)

#Learning rate step decay
boundaries = [STEPS_EPOCH*4, STEPS_EPOCH*10, STEPS_EPOCH*13, STEPS_EPOCH*16]
values = [0.0005, 0.0001, 0.00005, 0.00001, 0.00005]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
steps = STEPS_OFFSET
for epoch in range(NUM_EPOCH):
    loss_sum = 0
    start_time = time.time()
    grids_property_all = random_image()  #pick a random image size for this epoch
    train_data = iter(train_input_fn())
    #for features, labels in train_data:
    while True:
        if steps < initial_warmup_steps:
            newlr = (initial_lr/initial_warmup_steps)*steps
            tf.keras.backend.set_value(optimizer.lr, newlr)
        features, data_labels = train_data.next()
        loss_temp, predict_temp = tf_train_step(features['images'], features['bboxes'], data_labels, grids_property_all)
        loss_sum += loss_temp
        steps += 1
        if steps%100 == 0:
            elasp_time = time.time()-start_time
            lr = tf.keras.backend.get_value(optimizer.lr)
            print("Step:{}, Image_size:{:d}, Loss:{:4.2f}, LR:{:5f}, Time:{:3.1f}s".format(steps, image_size, loss_sum/100, lr, elasp_time))
            loss_sum = 0
            if steps > initial_warmup_steps:
                tf.keras.backend.set_value(optimizer.lr, learning_rate_fn(steps))
            start_time = time.time()
        if steps%STEPS_EPOCH == 0:
            START_EPOCH += 1
            model_yolo.save('model_yolov3/yolo_v10_'+str(START_EPOCH)+'.h5')
            break

Training is very time-consuming: on my machine (a 2080Ti) one epoch takes about 2 hours. After 14 epochs the mAP@.50 accuracy is about 32%, still a long way from the 57.9% reported in the paper. The Darknet source trains for over 200 epochs, though, so continued training may improve the accuracy further; that remains to be verified.

6. Evaluating the model

Object detection is usually evaluated with mAP. Computing this metric is fairly involved, so I use the COCO API directly, which also keeps the computation consistent with the paper.
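The evaluation code below references a coco ground-truth object and the COCOeval class; a short sketch of setting them up with the pycocotools package (the annotation path is an assumption about the local directory layout):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

#Load the COCO val2017 ground-truth annotations (adjust the path to your setup)
coco = COCO('../dataset/coco/annotations/instances_val2017.json')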

First, construct the COCO validation dataset:

def _parse_val_function(example_proto):
    global val_image_size
    features = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
        "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.VarLenFeature(tf.int64),
        "bbox_xmin": tf.io.VarLenFeature(tf.int64),
        "bbox_xmax": tf.io.VarLenFeature(tf.int64),
        "bbox_ymin": tf.io.VarLenFeature(tf.int64),
        "bbox_ymax": tf.io.VarLenFeature(tf.int64),
        "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
    }
    parsed_features = tf.io.parse_single_example(example_proto, features)
    label = tf.expand_dims(parsed_features["label"].values, 0)
    label = tf.cast(label, tf.int32)
    channels = parsed_features["channels"]
    filename = parsed_features["filename"]
    #Get the coco image id as we need to use COCO API to evaluate
    image_id = tf.strings.to_number(
        tf.strings.substr(filename, 0, 12),
        tf.int32
    )
    #Decode the image
    image_raw = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    image_decoded = tf.cast(image_raw, dtype=tf.float32)
    image_h = tf.constant(val_image_size)
    image_w = tf.constant(val_image_size)
    height = tf.shape(image_decoded)[0]
    width = tf.shape(image_decoded)[1]
    original_size = tf.stack([height, width], axis=0)
    original_size = tf.cast(original_size, tf.float32)
    ratio = tf.truediv(tf.cast(height, tf.float32), tf.cast(width, tf.float32))
    nh, nw = tf.cond(
        tf.less(ratio, 1),
        lambda: (tf.cast(tf.cast(image_h, tf.float32)*ratio, tf.int32), image_w),
        lambda: (image_h, tf.cast(tf.cast(image_w, tf.float32)/ratio, tf.int32)))
    dx = tf.cond(
        tf.equal(image_w, nw),
        lambda: tf.constant(0),
        lambda: tf.cast((image_w-nw)/2, tf.int32))
    dy = tf.cond(
        tf.equal(image_h, nh),
        lambda: tf.constant(0),
        lambda: tf.cast((image_h-nh)/2, tf.int32))
    image_resize = tf.image.resize(image_decoded, [nh, nw])
    image_padded = tf.image.pad_to_bounding_box(image_resize, dy, dx, image_h, image_w)
    image_normalize = tf.subtract(image_padded, mean_RGB)
    image_normalize = tf.divide(image_normalize, std_RGB)
    image_val = tf.transpose(image_normalize, perm=[2, 0, 1])
    features = {'images':image_val, 'image_id':image_id, 'original_size':original_size}
    return features

def val_input_fn():
    val_files = tf.data.Dataset.list_files("../dataset/coco/val2017_tf/*.tfrecord")
    dataset_val = val_files.interleave(tf.data.TFRecordDataset, cycle_length=12, num_parallel_calls=12)
    dataset_val = dataset_val.map(_parse_val_function, num_parallel_calls=12)
    dataset_val = dataset_val.batch(val_batch_size)
    dataset_val = dataset_val.prefetch(1)
    return dataset_val

Decode the predictions and convert them into the corresponding bounding boxes:

def predict_func(predict, image_id, original_size):
    global val_image_size, anchors_base
    val_grids_list = [val_image_size//8, val_image_size//16, val_image_size//32]
    image_ratio = val_image_size/416
    val_anchors = [round(a*image_ratio) for a in anchors_base]
    val_grids_x_list = [np.reshape(np.arange(a**2)%a, [-1,1]) for a in val_grids_list]
    val_grids_x = np.vstack(val_grids_x_list)
    val_grids_y_list = [np.reshape(np.arange(a**2)//a, [-1,1]) for a in val_grids_list]
    val_grids_y = np.vstack(val_grids_y_list)
    val_anchors_all = np.vstack(
        [
            np.tile(np.reshape(np.array(val_anchors[:6]), [-1,6]), [val_grids_list[0]**2,1]),
            np.tile(np.reshape(np.array(val_anchors[6:12]), [-1,6]), [val_grids_list[1]**2,1]),
            np.tile(np.reshape(np.array(val_anchors[12:]), [-1,6]), [val_grids_list[2]**2,1])
        ]
    )
    grid_wh_all = np.vstack(
        [
            np.tile(grid_wh_array[:1,:], (val_grids_list[0]**2,1)),
            np.tile(grid_wh_array[1:2,:], (val_grids_list[1]**2,1)),
            np.tile(grid_wh_array[2:3,:], (val_grids_list[2]**2,1))
        ]
    )
    val_grids_property = np.concatenate([val_grids_x, val_grids_y, val_anchors_all, grid_wh_all], axis=-1)
    val_grids_property_all = tf.constant(val_grids_property, dtype=tf.float32)
    val_grids_property_all = tf.expand_dims(val_grids_property_all, 0)
    val_grids_property_all = tf.tile(val_grids_property_all, [predict.shape[0],1,1])
    result_json = []
    original_height = original_size[...,0]
    original_width = original_size[...,1]
    hw_ratio = original_height/original_width
    hw_ratio_mask = tf.cast(tf.less(hw_ratio, 1.), tf.float32)
    ratio = \
        hw_ratio_mask*(original_width/val_image_size) + \
        (1.-hw_ratio_mask)*(original_height/val_image_size)
    dx = (1.-hw_ratio_mask)*((original_height-original_width)//2)
    dy = hw_ratio_mask*((original_width-original_height)//2)
    confidence_threshold = 0.2
    probabilty_threshold = 0.5
    predict_boxes_list = []
    for i in range(3):
        predict_conf = tf.nn.sigmoid(predict[...,i*vector_size:(i*vector_size+1)])
        predict_xy = tf.nn.sigmoid(predict[...,(i*vector_size+1):(i*vector_size+3)])
        predict_xy = predict_xy + val_grids_property_all[...,0:2]
        predict_x = predict_xy[...,0:1] * val_grids_property_all[...,-2:-1]
        predict_y = predict_xy[...,1:] * val_grids_property_all[...,-1:]
        predict_w = tf.exp(predict[...,(i*vector_size+3):(i*vector_size+4)])
        predict_w = predict_w * val_grids_property_all[...,(2+i*2):(2+i*2+1)]
        predict_h = tf.exp(predict[...,(i*vector_size+4):(i*vector_size+5)])
        predict_h = predict_h * val_grids_property_all[...,(2+i*2+1):(2+i*2+2)]
        min_x = tf.clip_by_value((predict_x-predict_w/2), 0, val_image_size)
        max_x = tf.clip_by_value((predict_x+predict_w/2), 0, val_image_size)
        min_y = tf.clip_by_value((predict_y-predict_h/2), 0, val_image_size)
        max_y = tf.clip_by_value((predict_y+predict_h/2), 0, val_image_size)
        predict_class = tf.argmax(predict[...,(i*vector_size+5):((i+1)*vector_size)], axis=-1)
        predict_class = tf.cast(predict_class, tf.float32)
        predict_class = tf.expand_dims(predict_class, 2)
        predict_proba = tf.nn.sigmoid(
            tf.reduce_max(
                predict[...,(i*vector_size+5):((i+1)*vector_size)], axis=-1, keepdims=True
            )
        )
        predict_box = tf.concat([predict_conf, min_x, min_y, max_x, max_y, predict_class, predict_proba], axis=-1)
        predict_boxes_list.append(predict_box)
    predict_boxes = tf.concat(predict_boxes_list, axis=1)
    for i in range(predict.shape[0]):
        obj_mask = tf.logical_and(
            predict_boxes[i,:,0]>=confidence_threshold,
            predict_boxes[i,:,-1]>=probabilty_threshold)
        predict_true_box = tf.boolean_mask(predict_boxes[i], obj_mask)
        predict_classes, _ = tf.unique(predict_true_box[:,5])
        predict_classes_list = tf.unstack(predict_classes)
        for class_id in predict_classes_list:
            class_mask = tf.math.equal(predict_true_box[:, 5], class_id)
            predict_true_box_class = tf.boolean_mask(predict_true_box, class_mask)
            predict_true_box_xy = predict_true_box_class[:, 1:5]
            predict_true_box_score = predict_true_box_class[:, 6]*predict_true_box_class[:, 0]
            #predict_true_box_score = predict_true_box_class[:, 0]
            selected_indices = tf.image.non_max_suppression(
                predict_true_box_xy,
                predict_true_box_score,
                100,
                iou_threshold=0.2
                #score_threshold=confidence_threshold
            )
            #Shape [box_num, 7]
            selected_boxes = tf.gather(predict_true_box_class, selected_indices)
            original_bbox_xmin = tf.clip_by_value(
                selected_boxes[:,1:2]*ratio[i]-dx[i], 0, original_width[i])
            original_bbox_xmax = tf.clip_by_value(
                selected_boxes[:,3:4]*ratio[i]-dx[i], 0, original_width[i])
            original_bbox_ymin = tf.clip_by_value(
                selected_boxes[:,2:3]*ratio[i]-dy[i], 0, original_height[i])
            original_bbox_ymax = tf.clip_by_value(
                selected_boxes[:,4:5]*ratio[i]-dy[i], 0, original_height[i])
            original_bbox_width = original_bbox_xmax - original_bbox_xmin
            original_bbox_height = original_bbox_ymax - original_bbox_ymin
            original_bbox = tf.concat(
                [
                    selected_boxes[:,0:1],
                    original_bbox_xmin,
                    original_bbox_ymin,
                    original_bbox_width,
                    original_bbox_height,
                    selected_boxes[:,5:]
                ], axis=-1
            )
            original_bbox_list = tf.unstack(original_bbox)
            for item in original_bbox_list:
                result = {}
                result['image_id'] = int(image_id.numpy()[i])
                #cocoid_mapping_labels maps the contiguous 0-79 class index back to the original COCO category id (defined elsewhere)
                result['category_id'] = cocoid_mapping_labels[int(class_id.numpy())]
                result['bbox'] = item[1:5].numpy().tolist()
                result['bbox'] = [int(a*10)/10 for a in result['bbox']]
                result['score'] = int((item[0]*item[6]).numpy()*1000)/1000
                result['conf'] = str(int(item[0].numpy()*1000)/1000)
                result['prop'] = str(int(item[6].numpy()*1000)/1000)
                result_json.append(result)
    return result_json

Use the COCO API to compute mAP:

import json

START_EPOCH = 14
val_image_size = 608
dataset_val = val_input_fn()
all_result_json = []
i = 0
for val_features in dataset_val:
    predict = model_yolo(val_features['images'], training=False)
    result_json = predict_func(
        predict, val_features['image_id'], val_features['original_size']
    )
    all_result_json.extend(result_json)
    i += 1
all_result_str = ','.join([json.dumps(item) for item in all_result_json])
all_result_str = '['+all_result_str+']'
result_filename = 'test_v11_epoch_'+str(START_EPOCH)+'_result.json'
result_file = open(result_filename, 'w')
result_file.write(all_result_str)
result_file.close()
cocodt = coco.loadRes(result_filename)
annType = 'bbox'
imgIds = sorted(coco.getImgIds())
cocoEval = COCOeval(coco, cocodt, annType)
cocoEval.params.imgIds = imgIds
cocoEval.evaluate()
cocoEval.accumulate()
cocoEval.summarize()

Prediction Results

Finally, let's see how well the model predicts. First, the kite image (detection figures are in the original post):

For comparison, the official Darknet YOLOV3 model's prediction on the same image:

Surprisingly, my model's predictions look a bit better here :) The official model misses several kites, while mine mistakenly detects a wave as a person. Overall, my model still seems slightly more accurate on this image.

Now another image, dog. Here is my model's result:

And the official model's result:

Overall the detections are essentially the same, though the official model's boxes are a bit more precise (better IoU).
