Object Detection with OpenCV + yolov2-tiny (C++)



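Object detection algorithms fall into two main families. The first is based on region proposals, such as the R-CNN series (R-CNN, Fast R-CNN, Faster R-CNN). These are two-stage methods: they first generate region proposals with Selective Search or a CNN (the RPN), and then run classification and regression on those proposals. The second family consists of one-stage methods such as YOLO and SSD, which use a single CNN to predict the classes and locations of objects directly. Two-stage methods are generally more accurate but slower; one-stage methods are faster but somewhat less accurate.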

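YOLO is an object detection network that is even faster than SSD; its author states in the paper that its FPS is 100 times that of Fast R-CNN. This post first gives a brief introduction to the basic structure of the YOLO network, and then uses the OpenCV C++ API to load a Darknet model and run object detection. OpenCV has officially supported the Darknet framework since version 3.3.1, including importing and running the YOLOv1, YOLOv2 and Tiny YOLO models. Later testing showed that OpenCV 3.4.2 also supports YOLOv3. In addition, the OpenCV dnn module currently supports deep learning frameworks such as Caffe, TensorFlow, Torch and PyTorch. For "Calling a TensorFlow pre-trained model from OpenCV", see my other blog post: https://blog.csdn.net/guyuealian/article/details/80570120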




"Deep Learning based Object Detection using YOLOv3 with OpenCV (Python / C++)":

"YOLOv3 + OpenCV object detection (Python / C++)": https://blog.csdn.net/haoqimao_hard/article/details/82081285

Darknet YOLO official site: https://pjreddie.com/darknet/yolo/














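YOLO stands for You Only Look Once: a single look at the image is enough to detect and recognize the objects in it. The core idea of YOLO is to take the whole image as the network input and to regress the bounding box positions and their classes directly at the output layer.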


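How it works: the 448×448 input image is first divided into an S×S grid. Each grid cell predicts B bounding boxes, and each bounding box consists of 5 predicted values: x, y, w, h and a confidence score. Each cell additionally predicts probability scores for C classes; these class scores are independent of the boxes' confidence scores. Each cell therefore predicts (B×5+C) values, and for an image divided into S×S cells the final prediction is a tensor of size S×S×(B×5+C). The structure of the model's predictions is shown in the figure below.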


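  • 1. The image is divided into an S×S grid of cells. If the center of an object falls into a grid cell, that cell is responsible for predicting the object.
  • 2. Each grid cell predicts B bounding boxes. Besides regressing its own position (x, y, w, h), each bounding box also predicts a confidence value, which gives 5 values per box. The confidence encodes two things: how likely the predicted box is to contain an object, and how accurate the box is. It is computed as:

$$\text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}}$$

Here the first term is 1 if an object falls into the grid cell and 0 otherwise, and the second term is the IOU between the predicted bounding box and the ground truth box. So when a cell contains an object, the confidence is simply the IOU between the predicted box and the ground truth box.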

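  • 3. Each grid cell also predicts class probabilities for C classes. Together, the class probabilities of all cells form the class probability map.

Note: the class information is per grid cell, whereas the confidence is per bounding box.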
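The IOU term in the confidence definition can be illustrated with a small, self-contained sketch. The `Box` struct and the `iou` function are hypothetical names introduced only for this illustration (they are not part of OpenCV or Darknet), and boxes are assumed to be given as a center plus width and height, matching the (x, y, w, h) convention used here.

```cpp
#include <algorithm>

// Hypothetical box type for illustration: center (x, y), width w, height h.
struct Box {
    float x, y, w, h;
};

// Intersection over union of two boxes; returns 0 when they do not overlap.
float iou(const Box& a, const Box& b) {
    // Convert center/size form to corner coordinates.
    const float aLeft = a.x - a.w / 2, aRight = a.x + a.w / 2;
    const float aTop = a.y - a.h / 2, aBottom = a.y + a.h / 2;
    const float bLeft = b.x - b.w / 2, bRight = b.x + b.w / 2;
    const float bTop = b.y - b.h / 2, bBottom = b.y + b.h / 2;

    // Width and height of the overlap region, clamped at zero.
    const float interW = std::max(0.0f, std::min(aRight, bRight) - std::max(aLeft, bLeft));
    const float interH = std::max(0.0f, std::min(aBottom, bBottom) - std::max(aTop, bTop));
    const float inter = interW * interH;
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}
```

For two 2×2 boxes whose centers are one unit apart horizontally, the intersection area is 2 and the union area is 6, so the IOU is 1/3.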


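For example, on PASCAL VOC the input image is 448×448, with S=7 (the image is divided into 7×7 grid cells), B=2 (each cell predicts 2 bounding boxes) and C=20 classes (PASCAL VOC has 20 classes). The output is then a tensor of size S×S×(5×B+C) = 7×7×30. The overall network structure is shown in the figure below: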
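As a quick check of this arithmetic, here is a minimal sketch. The function names `valuesPerCell` and `outputTensorSize` are made up for this example.

```cpp
// Number of values predicted per grid cell: B boxes with 5 values each,
// plus C class scores.
constexpr int valuesPerCell(int B, int C) { return B * 5 + C; }

// Total number of values in the S x S x (B*5 + C) output tensor.
constexpr int outputTensorSize(int S, int B, int C) {
    return S * S * valuesPerCell(B, C);
}

// PASCAL VOC setting from the text: S = 7, B = 2, C = 20.
static_assert(valuesPerCell(2, 20) == 30, "30 values per grid cell");
static_assert(outputTensorSize(7, 2, 20) == 1470, "7 x 7 x 30 = 1470 values");
```

The two `static_assert` lines are evaluated at compile time, so the 7×7×30 figure is verified without running anything.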

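YOLO uses a convolutional network to extract features and fully connected layers to produce the predictions. The architecture is modeled on GoogLeNet and contains 24 convolutional layers and 2 fully connected layers, as shown in the figure. The convolutional layers mainly use 1×1 convolutions for channel reduction, followed by 3×3 convolutions. The convolutional and fully connected layers use the Leaky ReLU activation max(x, 0.1x), except for the last layer, which uses a linear activation. Besides this architecture, the paper also proposes a lightweight version, Fast YOLO, which uses only 9 convolutional layers with fewer filters in each of them.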
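The Leaky ReLU activation max(x, 0.1x) mentioned here is simple enough to write out directly. The function name `leakyRelu` is chosen for this sketch only.

```cpp
#include <algorithm>

// Leaky ReLU with slope 0.1 on the negative side: max(x, 0.1x).
// Positive inputs pass through unchanged; negative inputs are scaled by 0.1.
inline float leakyRelu(float x) {
    return std::max(x, 0.1f * x);
}
```

For example, an input of 2 stays 2, while an input of -2 becomes approximately -0.2.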

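  • 4. At test time, the class probabilities predicted by each cell are multiplied by the confidence predicted for each bounding box, which gives a class-specific confidence score for each box:

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}} = \Pr(\text{Class}_i) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}}$$

The first term on the left-hand side is the class probability predicted by each grid cell; the second and third terms are the confidence predicted for each bounding box. The product encodes both the probability that the predicted box belongs to a given class and how accurate the box is.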

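  • 5. Once each box has its class-specific confidence score, a threshold is applied to filter out low-scoring boxes, and NMS (non-maximum suppression) is run on the remaining boxes to obtain the final detections.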
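The thresholding and NMS step can be sketched as follows. This is a minimal greedy NMS written for illustration, not the routine used by Darknet or OpenCV. The `Detection` struct, the function names and the choice to suppress only boxes of the same class are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical detection record: box corners plus a class-specific score.
struct Detection {
    float left, top, right, bottom;
    float score;  // class-specific confidence score
    int classId;
};

// Intersection over union of the boxes of two detections.
float boxIou(const Detection& a, const Detection& b) {
    const float iw = std::max(0.0f, std::min(a.right, b.right) - std::max(a.left, b.left));
    const float ih = std::max(0.0f, std::min(a.bottom, b.bottom) - std::max(a.top, b.top));
    const float inter = iw * ih;
    const float uni = (a.right - a.left) * (a.bottom - a.top)
                    + (b.right - b.left) * (b.bottom - b.top) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Drop boxes scoring below scoreThreshold, then visit the rest from highest
// to lowest score, keeping a box only if no already-kept box of the same
// class overlaps it by more than iouThreshold.
std::vector<Detection> filterAndNms(std::vector<Detection> dets,
                                    float scoreThreshold, float iouThreshold) {
    dets.erase(std::remove_if(dets.begin(), dets.end(),
                              [=](const Detection& d) { return d.score < scoreThreshold; }),
               dets.end());
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });

    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool suppressed = false;
        for (const Detection& k : kept) {
            if (k.classId == d.classId && boxIou(k, d) > iouThreshold) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}
```

A call such as `filterAndNms(dets, 0.25f, 0.5f)` would use the same 0.25 score threshold as the code in this post; the 0.5 IOU threshold is only an example value.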

















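  • 4 position values: x, y, w, h
  • 1 confidence score
  • 20 object classes from the VOC dataset

So each box has 5+20=25 parameters, and 5 boxes have 5×25=125 parameters in total. This is why the last convolutional layer of the tiny-YOLO model has a depth of 125.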
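The depth of 125 can be reproduced with a short sketch. The function names are hypothetical, and `boxOffset` assumes that the values of each box are stored contiguously as [x, y, w, h, confidence, class scores...]; that is a conceptual layout for illustration and not necessarily the memory layout Darknet uses internally.

```cpp
// Depth of the last layer: each of numBoxes boxes carries 4 position values,
// 1 confidence score and numClasses class scores.
constexpr int outputDepth(int numBoxes, int numClasses) {
    return numBoxes * (4 + 1 + numClasses);
}

// Offset of the first value of box boxIndex within one cell's predictions,
// under the assumed contiguous per-box layout described above.
constexpr int boxOffset(int boxIndex, int numClasses) {
    return boxIndex * (4 + 1 + numClasses);
}

// tiny-YOLO on VOC: 5 boxes, 20 classes.
static_assert(outputDepth(5, 20) == 125, "5 x (4 + 1 + 20) = 125");
```

Under that assumed layout, the third box (index 2) of a cell would start at offset 50.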


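The code for object detection with YOLO in OpenCV is given below. Note that OpenCV only runs the forward pass: it supports inference only, not training.


Two test functions are provided, image_detection() for images and video_detection() for video: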

    /**
     * @brief Run the YOLO model on a video.
     * @param cfgFile path to the .cfg file with text description of the network architecture.
     * @param weight path to the .weights file with learned network.
     * @param clsNames path to the class label file
     * @param video_path path to the video file
     * @returns void
     */
    void video_detection(string cfgFile, string weight, string clsNames, string video_path);
    /**
     * @brief Run the YOLO model on a single image.
     * @param cfgFile path to the .cfg file with text description of the network architecture.
     * @param weight path to the .weights file with learned network.
     * @param clsNames path to the class label file
     * @param image_path path to the image file
     * @returns void
     */
    void image_detection(string cfgFile, string weight, string clsNames, string image_path);

1. The OpenCV DNN module is required, so the header opencv2/dnn.hpp must be included. The headers and namespaces are as follows:

    #include <opencv2/opencv.hpp>
    #include <opencv2/dnn.hpp>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    using namespace std;
    using namespace cv;
    using namespace cv::dnn;


    float confidenceThreshold = 0.25f; // minimum class score for a detection to be drawn
    string pro_dir = "E:/opencv-learning-tutorials/"; // project root directory
    int main(int argc, char** argv)
    {
        String cfgFile = pro_dir + "data/models/yolov2-tiny-voc/yolov2-tiny-voc.cfg";
        String weight = pro_dir + "data/models/yolov2-tiny-voc/yolov2-tiny-voc.weights";
        string clsNames = pro_dir + "data/models/yolov2-tiny-voc/voc.names";
        string image_path = pro_dir + "data/images/1.jpg";
        image_detection(cfgFile, weight, clsNames, image_path); // image test
        //string video_path = pro_dir + "data/images/lane.avi";
        //video_detection(cfgFile, weight, clsNames, video_path); // video test
        return 0;
    }


    // Load the network model
    dnn::Net net = readNetFromDarknet(cfgFile, weight);
    if (net.empty())
    {
        printf("Could not load net...\n");
        return;
    }


    // Load the class names, one per line
    vector<string> classNamesVec;
    ifstream classNamesFile(clsNames);
    if (classNamesFile.is_open())
    {
        string className = "";
        while (std::getline(classNamesFile, className))
            classNamesVec.push_back(className);
    }
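One thing the loading loop above does not handle is a label file saved with Windows line endings, where every name would keep a trailing '\r' and show up in the drawn labels. The helper below is a hedged sketch of a slightly more defensive reader; `readClassNames` is a hypothetical name, and it takes any `std::istream` so that it works with an `std::ifstream` as well as an in-memory stream.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one class name per line, dropping a trailing '\r' (Windows line
// endings) and skipping empty lines.
std::vector<std::string> readClassNames(std::istream& in) {
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!line.empty()) names.push_back(line);
    }
    return names;
}
```

It could be called with an `std::ifstream` opened on the label file, or with an `std::istringstream` when trying it out without a file.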


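    // Load the image
    Mat frame = imread(image_path);
    if (frame.empty()) { printf("Could not load image...\n"); return; }
    // Scale pixels to [0, 1], resize to 416x416, swap R and B channels, no cropping
    Mat inputBlob = blobFromImage(frame, 1 / 255.F, Size(416, 416), Scalar(), true, false);
    net.setInput(inputBlob, "data");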
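To make the arguments passed to blobFromImage above more concrete, the following sketch shows the arithmetic they imply for an image that has already been resized: each 8-bit value is multiplied by the scale factor, the B and R channels are swapped, and the interleaved pixels are rearranged into separate channel planes. This is only an illustration in plain C++ and not OpenCV's implementation; `bgrToPlanarRgb` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved 8-bit BGR image (height x width x 3) into a planar
// float buffer in R, G, B plane order, multiplying each value by `scale`.
// Resizing to the network input size is assumed to have been done already.
std::vector<float> bgrToPlanarRgb(const std::vector<std::uint8_t>& bgr,
                                  int width, int height, float scale) {
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    std::vector<float> out(3 * plane);
    for (std::size_t p = 0; p < plane; ++p) {
        out[0 * plane + p] = bgr[3 * p + 2] * scale;  // R plane
        out[1 * plane + p] = bgr[3 * p + 1] * scale;  // G plane
        out[2 * plane + p] = bgr[3 * p + 0] * scale;  // B plane
    }
    return out;
}
```

With `scale` set to `1 / 255.f`, a pure blue pixel (B=255, G=0, R=0) ends up as roughly 0 in the R and G planes and roughly 1 in the B plane.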


    // Run object detection
    Mat detectionMat = net.forward("detection_out");
    // getPerfProfile returns the inference time in ticks; convert it to milliseconds
    vector<double> layersTimings;
    double freq = getTickFrequency() / 1000;
    double time = net.getPerfProfile(layersTimings) / freq;
    ostringstream ss;
    ss << "detection time: " << time << " ms";
    putText(frame, ss.str(), Point(20, 20), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));


    // Output the results: each row is [x, y, w, h, objectness, class scores...]
    for (int i = 0; i < detectionMat.rows; i++)
    {
        const int probability_index = 5;
        const int probability_size = detectionMat.cols - probability_index;
        float *prob_array_ptr = &detectionMat.at<float>(i, probability_index);
        // Index of the class with the highest score
        size_t objectClass = max_element(prob_array_ptr, prob_array_ptr + probability_size) - prob_array_ptr;
        float confidence = detectionMat.at<float>(i, (int)objectClass + probability_index);
        if (confidence > confidenceThreshold)
        {
            // Box center and size are normalized to [0, 1]
            float x = detectionMat.at<float>(i, 0);
            float y = detectionMat.at<float>(i, 1);
            float width = detectionMat.at<float>(i, 2);
            float height = detectionMat.at<float>(i, 3);
            // Convert to pixel corners (in image coordinates these are top-left and bottom-right)
            int xLeftBottom = static_cast<int>((x - width / 2) * frame.cols);
            int yLeftBottom = static_cast<int>((y - height / 2) * frame.rows);
            int xRightTop = static_cast<int>((x + width / 2) * frame.cols);
            int yRightTop = static_cast<int>((y + height / 2) * frame.rows);
            Rect object(xLeftBottom, yLeftBottom,
                        xRightTop - xLeftBottom,
                        yRightTop - yLeftBottom);
            rectangle(frame, object, Scalar(0, 0, 255), 2, 8);
            if (objectClass < classNamesVec.size())
            {
                // Draw the "class: score" label on a white background
                ss.str("");
                ss << confidence;
                String conf(ss.str());
                String label = String(classNamesVec[objectClass]) + ": " + conf;
                int baseLine = 0;
                Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
                rectangle(frame, Rect(Point(xLeftBottom, yLeftBottom),
                                      Size(labelSize.width, labelSize.height + baseLine)),
                          Scalar(255, 255, 255), FILLED);
                putText(frame, label, Point(xLeftBottom, yLeftBottom + labelSize.height),
                        FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 0));
            }
        }
    }
    imshow("YOLO-Detections", frame);
    waitKey(0);
    return;
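The decoding done in the loop above can be isolated into two small functions that do not depend on OpenCV. The sketch below assumes a detection row laid out as [x, y, w, h, objectness, class scores...] with at least one class score, and it additionally clamps the rectangle to the image bounds, which the code above does not do. `PixelRect`, `bestClass` and `toPixelRect` are hypothetical names used only here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical pixel rectangle: top-left corner plus size.
struct PixelRect {
    int x, y, width, height;
};

// Index of the highest class score in a row laid out as
// [x, y, w, h, objectness, score_0, score_1, ...]. Requires row.size() > 5.
std::size_t bestClass(const std::vector<float>& row) {
    const auto first = row.begin() + 5;
    return static_cast<std::size_t>(std::max_element(first, row.end()) - first);
}

// Convert the normalized center/size box of a row into a pixel rectangle,
// clamped to the image bounds.
PixelRect toPixelRect(const std::vector<float>& row, int imageWidth, int imageHeight) {
    const float x = row[0], y = row[1], w = row[2], h = row[3];
    const int left   = std::max(0, static_cast<int>((x - w / 2) * imageWidth));
    const int top    = std::max(0, static_cast<int>((y - h / 2) * imageHeight));
    const int right  = std::min(imageWidth - 1,  static_cast<int>((x + w / 2) * imageWidth));
    const int bottom = std::min(imageHeight - 1, static_cast<int>((y + h / 2) * imageHeight));
    return PixelRect{left, top, right - left, bottom - top};
}
```

For a row with center (0.5, 0.5) and size (0.5, 0.5) on a 400×200 image, this yields a rectangle at (100, 50) with size 200×100.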


The complete project code, containing both the image test image_detection() and the video test video_detection(), is as follows:

    #include <opencv2/opencv.hpp>
    #include <opencv2/dnn.hpp>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    using namespace std;
    using namespace cv;
    using namespace cv::dnn;
    float confidenceThreshold = 0.25f; // minimum class score for a detection to be drawn
    string pro_dir = "E:/opencv-learning-tutorials/"; // project root directory
    /**
     * @brief Run the YOLO model on a video.
     * @param cfgFile path to the .cfg file with text description of the network architecture.
     * @param weight path to the .weights file with learned network.
     * @param clsNames path to the class label file
     * @param video_path path to the video file
     * @returns void
     */
    void video_detection(string cfgFile, string weight, string clsNames, string video_path);
    /**
     * @brief Run the YOLO model on a single image.
     * @param cfgFile path to the .cfg file with text description of the network architecture.
     * @param weight path to the .weights file with learned network.
     * @param clsNames path to the class label file
     * @param image_path path to the image file
     * @returns void
     */
    void image_detection(string cfgFile, string weight, string clsNames, string image_path);
    int main(int argc, char** argv)
    {
        String cfgFile = pro_dir + "data/models/yolov2-tiny-voc/yolov2-tiny-voc.cfg";
        String weight = pro_dir + "data/models/yolov2-tiny-voc/yolov2-tiny-voc.weights";
        string clsNames = pro_dir + "data/models/yolov2-tiny-voc/voc.names";
        string image_path = pro_dir + "data/images/1.jpg";
        image_detection(cfgFile, weight, clsNames, image_path); // image test
        string video_path = pro_dir + "data/images/lane.avi";
        video_detection(cfgFile, weight, clsNames, video_path); // video test
        return 0;
    }
    void video_detection(string cfgFile, string weight, string clsNames, string video_path) {
        // Load the network model
        dnn::Net net = readNetFromDarknet(cfgFile, weight);
        if (net.empty())
        {
            printf("Could not load net...\n");
            return;
        }
        // Load the class names, one per line
        vector<string> classNamesVec;
        ifstream classNamesFile(clsNames);
        if (classNamesFile.is_open())
        {
            string className = "";
            while (std::getline(classNamesFile, className))
                classNamesVec.push_back(className);
        }
        // VideoCapture capture(0); // use this instead to read from a camera
        VideoCapture capture;
        capture.open(video_path);
        if (!capture.isOpened()) {
            printf("could not open the video...\n");
            return;
        }
        Mat frame;
        while (capture.read(frame))
        {
            if (frame.empty())
                break;
            if (frame.channels() == 4)
                cvtColor(frame, frame, COLOR_BGRA2BGR);
            // Scale pixels to [0, 1], resize to 416x416, swap R and B channels, no cropping
            Mat inputBlob = blobFromImage(frame, 1 / 255.F, Size(416, 416), Scalar(), true, false);
            net.setInput(inputBlob, "data");
            Mat detectionMat = net.forward("detection_out");
            // getPerfProfile returns the inference time in ticks; convert it to milliseconds
            vector<double> layersTimings;
            double freq = getTickFrequency() / 1000;
            double time = net.getPerfProfile(layersTimings) / freq;
            ostringstream ss;
            ss << "FPS: " << 1000 / time << " ; time: " << time << " ms";
            putText(frame, ss.str(), Point(20, 20), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));
            // Each row is [x, y, w, h, objectness, class scores...]
            for (int i = 0; i < detectionMat.rows; i++)
            {
                const int probability_index = 5;
                const int probability_size = detectionMat.cols - probability_index;
                float *prob_array_ptr = &detectionMat.at<float>(i, probability_index);
                size_t objectClass = max_element(prob_array_ptr, prob_array_ptr + probability_size) - prob_array_ptr;
                float confidence = detectionMat.at<float>(i, (int)objectClass + probability_index);
                if (confidence > confidenceThreshold)
                {
                    float x = detectionMat.at<float>(i, 0);
                    float y = detectionMat.at<float>(i, 1);
                    float width = detectionMat.at<float>(i, 2);
                    float height = detectionMat.at<float>(i, 3);
                    int xLeftBottom = static_cast<int>((x - width / 2) * frame.cols);
                    int yLeftBottom = static_cast<int>((y - height / 2) * frame.rows);
                    int xRightTop = static_cast<int>((x + width / 2) * frame.cols);
                    int yRightTop = static_cast<int>((y + height / 2) * frame.rows);
                    Rect object(xLeftBottom, yLeftBottom,
                                xRightTop - xLeftBottom,
                                yRightTop - yLeftBottom);
                    rectangle(frame, object, Scalar(0, 255, 0));
                    if (objectClass < classNamesVec.size())
                    {
                        ss.str("");
                        ss << confidence;
                        String conf(ss.str());
                        String label = String(classNamesVec[objectClass]) + ": " + conf;
                        int baseLine = 0;
                        Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
                        rectangle(frame, Rect(Point(xLeftBottom, yLeftBottom),
                                              Size(labelSize.width, labelSize.height + baseLine)),
                                  Scalar(255, 255, 255), FILLED);
                        putText(frame, label, Point(xLeftBottom, yLeftBottom + labelSize.height),
                                FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 0));
                    }
                }
            }
            imshow("YOLO-Detections", frame);
            if (waitKey(1) >= 0) break; // stop on any key press
        }
    }
    void image_detection(string cfgFile, string weight, string clsNames, string image_path) {
        // Load the network model
        dnn::Net net = readNetFromDarknet(cfgFile, weight);
        if (net.empty())
        {
            printf("Could not load net...\n");
            return;
        }
        // Load the class names, one per line
        vector<string> classNamesVec;
        ifstream classNamesFile(clsNames);
        if (classNamesFile.is_open())
        {
            string className = "";
            while (std::getline(classNamesFile, className))
                classNamesVec.push_back(className);
        }
        // Load the image
        Mat frame = imread(image_path);
        if (frame.empty()) { printf("Could not load image...\n"); return; }
        // Scale pixels to [0, 1], resize to 416x416, swap R and B channels, no cropping
        Mat inputBlob = blobFromImage(frame, 1 / 255.F, Size(416, 416), Scalar(), true, false);
        net.setInput(inputBlob, "data");
        // Run object detection
        Mat detectionMat = net.forward("detection_out");
        vector<double> layersTimings;
        double freq = getTickFrequency() / 1000;
        double time = net.getPerfProfile(layersTimings) / freq;
        ostringstream ss;
        ss << "detection time: " << time << " ms";
        putText(frame, ss.str(), Point(20, 20), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));
        // Output the results: each row is [x, y, w, h, objectness, class scores...]
        for (int i = 0; i < detectionMat.rows; i++)
        {
            const int probability_index = 5;
            const int probability_size = detectionMat.cols - probability_index;
            float *prob_array_ptr = &detectionMat.at<float>(i, probability_index);
            size_t objectClass = max_element(prob_array_ptr, prob_array_ptr + probability_size) - prob_array_ptr;
            float confidence = detectionMat.at<float>(i, (int)objectClass + probability_index);
            if (confidence > confidenceThreshold)
            {
                float x = detectionMat.at<float>(i, 0);
                float y = detectionMat.at<float>(i, 1);
                float width = detectionMat.at<float>(i, 2);
                float height = detectionMat.at<float>(i, 3);
                int xLeftBottom = static_cast<int>((x - width / 2) * frame.cols);
                int yLeftBottom = static_cast<int>((y - height / 2) * frame.rows);
                int xRightTop = static_cast<int>((x + width / 2) * frame.cols);
                int yRightTop = static_cast<int>((y + height / 2) * frame.rows);
                Rect object(xLeftBottom, yLeftBottom,
                            xRightTop - xLeftBottom,
                            yRightTop - yLeftBottom);
                rectangle(frame, object, Scalar(0, 0, 255), 2, 8);
                if (objectClass < classNamesVec.size())
                {
                    ss.str("");
                    ss << confidence;
                    String conf(ss.str());
                    String label = String(classNamesVec[objectClass]) + ": " + conf;
                    int baseLine = 0;
                    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
                    rectangle(frame, Rect(Point(xLeftBottom, yLeftBottom),
                                          Size(labelSize.width, labelSize.height + baseLine)),
                              Scalar(255, 255, 255), FILLED);
                    putText(frame, label, Point(xLeftBottom, yLeftBottom + labelSize.height),
                            FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 0));
                }
            }
        }
        imshow("YOLO-Detections", frame);
        waitKey(0);
        return;
    }





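  • YOLO performs poorly on objects that are very close to each other and on small objects that appear in groups, because each grid cell predicts only two boxes, and both belong to a single class.
  • It generalizes poorly when objects of a known class appear in test images with new or unusual aspect ratios or configurations.
  • Because of the loss function, localization error is the main factor that limits detection quality. The handling of large versus small objects in particular still needs improvement.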


