You Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].
More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.
Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.
All of our training and testing code is open source. A variety of pretrained models are also available to download.
We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Formally we define confidence as Pr(Object) ∗ IOU_{truth}^{pred}. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell; for example, if an object's center lies exactly at the center of its cell, then (x, y) = (0.5, 0.5). The width and height are predicted relative to the whole image, so an object spanning 3 of the 7 grid cells in each dimension has w = h = 3/7. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
Each grid cell also predicts C conditional class probabilities, Pr(Class_i|Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i|Object) ∗ Pr(Object) ∗ IOU_{truth}^{pred} = Pr(Class_i) ∗ IOU_{truth}^{pred},

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor (B × 5 + C = 30 values per grid cell).
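As a concrete illustration (a minimal NumPy sketch, not the authors' released code), the final tensor can be decoded into class-specific scores as follows; the per-cell channel layout of B boxes of (x, y, w, h, confidence) followed by C class probabilities is an assumption for this sketch:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode(pred):
    """pred: (S, S, B*5 + C) network output.
    Returns (S, S, B, C) class-specific confidence scores,
    i.e. Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU."""
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # x, y, w, h, conf per box
    box_conf = boxes[..., 4]                        # (S, S, B)
    class_probs = pred[..., B * 5:]                 # (S, S, C), conditional on object
    # Multiply each box confidence by the cell's conditional class probabilities.
    return box_conf[..., None] * class_probs[:, :, None, :]

pred = np.random.rand(S, S, B * 5 + C)              # stand-in for a network output
print(decode(pred).shape)                           # (7, 7, 2, 20)
```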
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
Our network architecture is inspired by the GoogLeNet model for image classification [33]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.
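For illustration, one such reduction block might look like this in PyTorch (a hypothetical sketch, not the original darknet implementation):

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """A 1x1 conv shrinks the channel count before the more
    expensive 3x3 conv, following Lin et al.'s design."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1 reduction layer
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolution
        nn.LeakyReLU(0.1),
    )
```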
We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
The final output of our network is the 7 × 7 × 30 tensor of predictions.
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [29]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24].
We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [28]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
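To make the parametrization concrete, here is a small sketch (hypothetical helper, not from the paper's code) that converts an absolute ground-truth box into these normalized targets:

```python
S = 7  # grid size

def encode_box(cx, cy, w, h, img_w, img_h):
    """(cx, cy, w, h): ground-truth box in pixels, center format.
    Returns the (row, col) of the responsible grid cell and the
    (x, y, w, h) regression targets, all bounded in [0, 1]."""
    col = int(cx / img_w * S)        # cell containing the box center
    row = int(cy / img_h * S)
    x = cx / img_w * S - col         # center offset within that cell
    y = cy / img_h * S - row
    return (row, col), (x, y, w / img_w, h / img_h)

# Object centered at (224, 224) in a 448x448 image, 192 pixels square:
print(encode_box(224, 224, 192, 192, 448, 448))
# ((3, 3), (0.5, 0.5, 0.4285..., 0.4285...))
```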
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

φ(x) = x, if x > 0; φ(x) = 0.1x, otherwise.
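In NumPy this activation is a one-liner:

```python
import numpy as np

def leaky_relu(x):
    # phi(x) = x for x > 0, and 0.1 * x otherwise
    return np.where(x > 0, x, 0.1 * x)
```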
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λ_coord and λ_noobj, to accomplish this. We set λ_coord = 5 and λ_noobj = .5.
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
During training we optimize the following, multi-part loss function:

λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
+ λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
+ Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
+ λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
+ Σ_{i=0}^{S²} 1_i^{obj} Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²
where 1_i^{obj} denotes if an object appears in cell i and 1_{ij}^{obj} denotes that the jth bounding box predictor in cell i is "responsible" for that prediction.
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
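Putting the pieces together, here is a simplified NumPy sketch of this loss; the tensor layout and the precomputed responsibility masks are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

lambda_coord, lambda_noobj = 5.0, 0.5

def yolo_loss(pred_boxes, pred_cls, tgt_boxes, tgt_cls, resp, obj):
    """pred_boxes, tgt_boxes: (S, S, B, 5) as (x, y, w, h, conf);
    pred_cls, tgt_cls: (S, S, C) class probabilities;
    resp: (S, S, B), 1 where box j in cell i is responsible for an
          object (highest IOU with the ground truth, precomputed);
    obj:  (S, S), 1 where an object center falls in the cell."""
    d = pred_boxes - tgt_boxes
    # Coordinate loss: responsible predictors only; sqrt on w, h so
    # small deviations in large boxes are penalized less.
    xy = np.sum(resp * (d[..., 0] ** 2 + d[..., 1] ** 2))
    wh = np.sum(resp * ((np.sqrt(pred_boxes[..., 2]) - np.sqrt(tgt_boxes[..., 2])) ** 2
                        + (np.sqrt(pred_boxes[..., 3]) - np.sqrt(tgt_boxes[..., 3])) ** 2))
    # Confidence loss: responsible boxes match the target IOU,
    # all other boxes are pushed toward zero with a smaller weight.
    conf_obj = np.sum(resp * d[..., 4] ** 2)
    conf_noobj = np.sum((1.0 - resp) * pred_boxes[..., 4] ** 2)
    # Classification loss: only for cells containing an object.
    cls = np.sum(obj[..., None] * (pred_cls - tgt_cls) ** 2)
    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj + cls
```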
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
Our learning rate schedule is as follows: for the first epochs we slowly raise the learning rate from 10^{−3} to 10^{−2}. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^{−2} for 75 epochs, then 10^{−3} for 30 epochs, and finally 10^{−4} for 30 epochs.
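As a sketch, the schedule as a function of the epoch index (the exact length of the initial ramp-up is not stated, so a linear warm-up over the first few epochs is assumed here):

```python
def learning_rate(epoch, warmup=5):
    """Warm up from 1e-3 to 1e-2, then 1e-2 for 75 epochs,
    1e-3 for 30 epochs, and 1e-4 for the final 30 epochs."""
    if epoch < warmup:                      # slow linear ramp-up (length assumed)
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup
    if epoch < warmup + 75:
        return 1e-2
    if epoch < warmup + 105:
        return 1e-3
    return 1e-4
```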
To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
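A minimal sketch of these augmentations with OpenCV (illustrative; the sampling of the scale, translation, and HSV factors is an assumption, and box coordinates would need the same geometric transform, omitted here):

```python
import numpy as np
import cv2

def augment(img, rng=np.random):
    h, w = img.shape[:2]
    # Random scaling and translation of up to 20% of the image size.
    scale = 1.0 + rng.uniform(-0.2, 0.2)
    tx = rng.uniform(-0.2, 0.2) * w
    ty = rng.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    img = cv2.warpAffine(img, M, (w, h))
    # Random exposure (V) and saturation (S) by up to a factor of 1.5.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    for ch in (1, 2):  # S and V channels
        hsv[..., ch] = np.clip(hsv[..., ch] * rng.uniform(1 / 1.5, 1.5), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```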
Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
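For reference, a standard greedy non-maximal suppression sketch (generic, not YOLO-specific):

```python
import numpy as np

def iou(box, boxes):
    """box: (4,), boxes: (N, 4) as (x1, y1, x2, y2). Returns (N,) IOUs."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedily keep the highest-scoring box and drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < thresh]
    return keep
```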
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.
Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [35, 21, 13, 10] or localizers [1, 31] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [34, 15, 38]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.
R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [34] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].
YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.
Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14][27]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.
Many research efforts focus on speeding up the DPM pipeline [30][37][5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [30] actually runs in real-time.
Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.
Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [36]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.
Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.
OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [31]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.
MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al. [26]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.
First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.
Many research efforts in object detection focus on making standard detection pipelines fast [5][37][30][14][17][27]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [30]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.
Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.
We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.
Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [37]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.
R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.
Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search, which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.
The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.
To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.
We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:
• Correct: correct class and IOU > .5
• Localization: correct class, .1 < IOU < .5
• Similar: class is similar, IOU > .1
• Other: class is wrong, IOU > .1
• Background: IOU < .1 for any object
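This taxonomy is easy to express directly; a small illustrative helper (hypothetical, the actual analysis uses Hoiem et al.'s tools):

```python
def error_type(pred_class, gt_class, iou, similar_classes):
    """Classify one detection against its best-matching ground truth.
    similar_classes: the set of classes considered similar to pred_class."""
    if pred_class == gt_class and iou > 0.5:
        return "correct"
    if pred_class == gt_class and 0.1 < iou < 0.5:
        return "localization"
    if gt_class in similar_classes and iou > 0.1:
        return "similar"
    if iou > 0.1:
        return "other"
    return "background"
```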
Figure 4 shows the breakdown of each error type averaged across all 20 classes. YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes far fewer localization errors but far more background errors: 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.
YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
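A sketch of this rescoring step (the exact boost formula is an assumption; the paper describes using YOLO's predicted probability and the overlap between the two boxes but not the precise form):

```python
def rescore(rcnn_dets, yolo_dets, iou_thresh=0.5):
    """rcnn_dets, yolo_dets: lists of (box, class_id, score), boxes
    as (x1, y1, x2, y2). Boost Fast R-CNN detections that YOLO
    also predicts, based on YOLO's probability and the overlap."""
    out = []
    for box, cls, score in rcnn_dets:
        boost = 0.0
        for ybox, ycls, yscore in yolo_dets:
            if ycls == cls:
                overlap = box_iou(box, ybox)
                if overlap > iou_thresh:
                    boost = max(boost, yscore * overlap)
        out.append((box, cls, score + boost))  # boosted by YOLO agreement
    return out

def box_iou(a, b):
    """Scalar IOU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```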
The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.
The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.
Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.
On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.
Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.
Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.
Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.
R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.
DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.
YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.
YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.
The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.
We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.