
FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding

https://github.com/MegviiDetection/FSCE

Interest is growing in recognizing previously unseen objects from very few training examples, a task known as few-shot object detection (FSOD). Recent research demonstrates that good feature embedding is the key to favorable few-shot learning performance. We observe that object proposals with different Intersection-over-Union (IoU) scores are analogous to the intra-image augmentation used in contrastive approaches. We exploit this analogy and incorporate supervised contrastive learning to achieve more robust object representations in FSOD.

We present Few-Shot object detection via Contrastive proposal Encoding (FSCE), a simple yet effective approach to learning contrastive-aware object proposal encodings that facilitate the classification of detected objects. We notice that the degradation of average precision (AP) for rare objects mainly comes from misclassifying novel instances as confusable classes. We ease this misclassification by promoting instance-level intra-class compactness and inter-class variance via our contrastive proposal encoding loss (CPE loss). Our design outperforms current state-of-the-art works in any shot and all data splits, with up to +8.8% on the standard PASCAL VOC benchmark and +2.7% on the challenging COCO benchmark.
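The idea of pulling same-class proposal encodings together while pushing different-class encodings apart can be illustrated with a generic supervised contrastive loss. The following is a minimal sketch, not the authors' CPE loss: the function name, the temperature value of 0.2, and the toy two-dimensional features are assumptions made for illustration, and anything specific to the CPE loss is left out because this section does not define it.

```python
import math


def supervised_contrastive_loss(features, labels, temperature=0.2):
    """Generic supervised contrastive loss (illustrative, not the CPE loss).

    features: list of equal-length float lists, one encoding per proposal.
    labels:   list of int class ids, one per encoding.
    Returns the mean loss over anchors that have at least one positive.
    """
    # L2-normalise so that a dot product equals the cosine similarity.
    normed = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0
        normed.append([x / norm for x in f])

    n = len(normed)
    losses = []
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other proposal.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # this anchor has no same-class partner
        # Numerically stable log-sum-exp over all other proposals.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability assigned to the positives.
        losses.append(
            -sum(sims[j] - log_denom for j in positives) / len(positives)
        )
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical toy encodings: two tight clusters in 2-D.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_compact = supervised_contrastive_loss(toy_features, [0, 0, 1, 1])
loss_confused = supervised_contrastive_loss(toy_features, [0, 1, 0, 1])
```

In this toy setup, `loss_compact` is small because each class forms its own cluster, while `loss_confused` is large because the labels cut across the clusters. This is the behaviour a contrastive objective rewards: compact classes that are well separated from each other.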

1. Introduction

The development of modern convolutional neural networks (CNNs) [1, 2, 3] has given rise to great advances in general object detection [4, 5, 6]. Deep detectors demand a large amount of annotated training data to saturate their performance [7, 8]. In few-shot learning scenarios, deep detectors suffer from more severe over-fitting, and the gap between few-shot detection and general object detection is larger than the corresponding gap in few-shot image classification [9, 10, 11]. In contrast, a child can rapidly comprehend new visual concepts and recognize objects from a newly learned category given very few examples. Closing this gap is therefore an important step towards more successful machine perception [12].

Following the lead of few-shot image classification, earlier attempts at few-shot object detection use meta-learning strategies [13, 14, 15]. Meta-learners are trained on episodes of individual tasks: meta-tasks sample common objects (base classes) and pair them with rare objects (novel classes) to simulate few-shot detection tasks. Recently, the two-stage fine-tuning approach (TFA) has revealed more potential for improving few-shot detection. Baseline TFA [16] simply freezes all parameters trained on the base classes and fine-tunes only the box classifier and box regressor with novel data, yet it outperforms previous meta-learners. MPSR [17] improves upon TFA by alleviating the scale bias inherent to few-shot datasets, but its positive refinement branch demands manual selection, which is somewhat less neat. In this work, we observe and address the essential weakness of the fine-tuning based approach, namely that it constantly mislabels novel instances as confusable categories, and we improve few-shot detection performance to a new state of the art (SOTA).

Object detection involves localizing and classifying the objects that appear in an image. In few-shot detection, one might naturally conjecture that the localization of novel objects will under-perform that of base categories, out of concern that rare objects would be treated as background [14, 13, 18]. However, based on our experiments with Faster R-CNN [4], the detector commonly adopted in few-shot detection, the class-agnostic region proposal network (RPN) is able to make foreground proposals for novel instances, and the final box regressor can localize novel instances quite accurately. In comparison, as demonstrated in Figure 2, misclassifying detected novel instances as confusable base classes is the main source of error. We visualize the pairwise cosine similarity between class prototypes [19, 20, 21] of a Faster R-CNN box classifier trained on PASCAL VOC [22, 23]. The cosine similarity between prototypes of similar categories can be 0.39, whereas the similarity between objects and background is on average -0.21. In the few-shot setting, the similarity between cluster centers can go as high as 0.59, e.g., between sheep and cow or between bicycle and motorbike, which makes classification of similar objects error-prone. A calculation on baseline TFA shows that manually correcting misclassified yet accurately localized box predictions can increase novel class average precision (nAP) by over 20 points.
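The pairwise cosine similarity between class prototypes described above can be computed as in the sketch below. It is a minimal illustration under stated assumptions: each prototype is taken to be one weight vector of the box classifier, and the class names and three-dimensional vectors are hypothetical values chosen for the example, not numbers from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero vector
    return dot / (norm_u * norm_v)


def pairwise_similarity(prototypes):
    """Map every ordered pair of class names to their prototype similarity."""
    names = list(prototypes)
    return {
        (a, b): cosine_similarity(prototypes[a], prototypes[b])
        for a in names
        for b in names
    }


# Hypothetical prototypes: "cow" and "horse" point in similar directions,
# "bus" points elsewhere.
toy_prototypes = {
    "cow": [1.0, 0.8, 0.0],
    "horse": [0.9, 1.0, 0.1],
    "bus": [-0.2, 0.1, 1.0],
}
similarity = pairwise_similarity(toy_prototypes)
```

With these toy vectors, `similarity[("cow", "horse")]` is much larger than `similarity[("cow", "bus")]`. Prototype pairs with this kind of high similarity are the ones where a classifier is most likely to confuse the two categories.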

Figure 2. We find that in fine-tuning based few-shot object detectors, classification is more error-prone than localization. In the fine-tuning stage, the RPN is able to make good enough foreground proposals for novel instances, so novel objects are often accurately localized but misclassified as confusable base classes. Shown here are the 20 top-scoring RPN proposals and example detection results from PASCAL VOC Split 1, in which bird, sofa and cow are novel categories. The left panel shows the pairwise cosine similarity between the class prototypes learned in the bounding box classifier. For example, the similarity between bus and bird is -0.10, but the similarity between cow and horse is 0.39. Our goal is to decrease the instance-level similarity between similar objects that come from different categories.

A common approach to learning a well-separated decision boundary is to use a large-margin classifier [24], but in our trials, category-level positive-margin classifiers do not work in this data-hungry setting [20, 25]. For learning instance-level discriminative feature representations, contrastive learning [26, 27] has demonstrated its effectiveness in tasks including recognition [28], identification [29] and the recent successful self-supervised models [30, 31, 32, 33]. In supervised contrastive learning for image classification [34], intra-image augmentations of images from the same class are used to enrich the positive example pairs. We think region proposals with different Intersection-over-Union (IoU) scores for an object are naturally analogous to intra-image augmentation by cropping, as illustrated in Figure 1. Therefore, in this work we explore extending the supervised batch contrastive approach [34] to few-shot object detection. We believe that contrastively learned object representations, which are aware of intra-class compactness and inter-class difference, can ease the misclassification of unseen objects as similar categories.
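The analogy between region proposals and cropping augmentation can be made concrete with a small IoU computation: several proposals that overlap the same object to different degrees act like different crops of it. The sketch below is illustrative only. The `(x1, y1, x2, y2)` box format, the helper name `positive_views`, and the IoU threshold of 0.5 are assumptions, not settings taken from the paper.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positive_views(object_box, proposals, iou_threshold=0.5):
    """Keep the proposals that overlap the object enough to count as views of it.

    The threshold is a hypothetical choice for this example.
    """
    return [p for p in proposals if box_iou(object_box, p) >= iou_threshold]


# Hypothetical object box and four proposals of decreasing overlap.
object_box = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (1, 1, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
views = positive_views(object_box, proposals)
```

Here the first two proposals (IoU 1.0 and 0.81) are kept as two differently cropped views of the same object, while the loosely overlapping and disjoint proposals are dropped. Pairs of such views play the role that augmented image pairs play in contrastive learning for classification.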

Figure 1. Conceptualization of our contrastive object proposals encoding. We introduce a score function which measures the semantic similarity between region proposals. Positive proposals (x+) refer to region proposals from the same category or the same object. Negative proposals (
