We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
This paper investigates omni-supervised learning, a paradigm in which the learner exploits as much well-annotated data as possible (e.g., ImageNet [6], COCO [24]) and is also provided with potentially unlimited unlabeled data (e.g., from internet-scale sources). It is a special regime of semi-supervised learning. However, most research on semi-supervised learning has simulated labeled/unlabeled data by splitting a fully annotated dataset and is therefore likely to be upper-bounded by fully supervised learning with all annotations. In contrast, omni-supervised learning is lower-bounded by the accuracy of training on all annotated data, and its success can be evaluated by how much it surpasses the fully supervised baseline.
To tackle omni-supervised learning, we propose to perform knowledge distillation from data, inspired by [3, 18] which performed knowledge distillation from models. Our idea is to generate annotations on unlabeled data using a model trained on large amounts of labeled data, and then retrain the model using the extra generated annotations. However, training a model on its own predictions often provides no meaningful information. We address this problem by ensembling the results of a single model run on different transformations (e.g., flipping and scaling) of an unlabeled image. Such transformations are widely known to improve single-model accuracy [20] when applied at test time, indicating that they can provide nontrivial knowledge that is not captured by a single prediction. In other words, in comparison with [18], which distills knowledge from the predictions of multiple models, we distill the knowledge of a single model run on multiple transformed copies of unlabeled data (see Figure 1).
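To make the multi-transform ensembling concrete, the sketch below averages one trained model's predictions over horizontally flipped and rescaled copies of an unlabeled image, mapping each prediction back to the original image frame before averaging. This is a minimal illustration under assumed interfaces: `predict_heatmap` (a callable that returns a per-pixel score map with the same spatial size as its input) and the nearest-neighbor `resize_nn` helper are stand-ins, not part of the paper's released code.

```python
import numpy as np


def resize_nn(array, height, width):
    """Nearest-neighbor resize along the first two axes (kept dependency-free)."""
    ys = (np.arange(height) * array.shape[0] // height).astype(np.intp)
    xs = (np.arange(width) * array.shape[1] // width).astype(np.intp)
    return array[ys][:, xs]


def ensemble_predictions(model, predict_heatmap, image, scales=(0.5, 1.0, 2.0)):
    """Average one model's predictions over flipped and rescaled copies of `image`."""
    h, w = image.shape[:2]
    accumulated = np.zeros((h, w), dtype=np.float64)
    count = 0
    for scale in scales:
        scaled = resize_nn(image, int(round(h * scale)), int(round(w * scale)))
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled          # flip along width
            heatmap = predict_heatmap(model, inp)              # same spatial size as `inp`
            if flip:
                heatmap = heatmap[:, ::-1]                     # undo the flip
            accumulated += resize_nn(heatmap, h, w)            # undo the rescale
            count += 1
    return accumulated / count
```

A selection step that keeps only sufficiently confident fused predictions can then turn this output into training annotations; that selection rule is an assumption of the sketch rather than text from the paragraph above.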
Data distillation is a simple and natural approach based on “self-training” (i.e., making predictions on unlabeled data and using them to update the model), related to which there have been continuous efforts [36, 48, 43, 33, 22, 46, 5, 21] dating back to the 1960s, if not earlier. However, our simple data distillation approach can become realistic largely thanks to the rapid improvement of fully-supervised models [20, 39, 41, 16, 12, 11, 30, 28, 25, 15] in the past few years. In particular, we are now equipped with accurate models whose errors may be outnumbered by their correct predictions. This allows us to trust their predictions on unseen data and reduces the need for data cleaning heuristics. As a result, data distillation does not require one to change the underlying recognition model (e.g., no modification to the loss definitions), and is a scalable solution for processing large-scale unlabeled data sources.
To test data distillation for omni-supervised learning, we evaluate it on the human keypoint detection task of the COCO dataset [24]. We demonstrate promising signals on this real-world, large-scale application. Specifically, we train a Mask R-CNN model [15] using data distillation applied to the original labeled COCO set and another large unlabeled set (e.g., static frames from Sports-1M [19]). Using the distilled annotations on the unlabeled set, we observe improved accuracy on the held-out validation set: e.g., up to a 2-point AP improvement over the strong Mask R-CNN baseline. As a reference, this improvement compares favorably to the ∼3-point AP improvement gained from training on a similar amount of extra manually labeled data in [27] (using private annotations). We further explore our method on COCO object detection and show gains over fully supervised baselines.
Ensembling [14] multiple models has been a successful method for improving accuracy. Model compression [3] was proposed to improve the test-time efficiency of ensembling by compressing an ensemble of models into a single student model. This method is extended by knowledge distillation [18], which uses soft predictions as the student’s target.
The idea of distillation has been adopted in various scenarios. FitNet [32] adopts a shallow and wide teacher model to train a deep and thin student model. Cross-modal distillation [13] is proposed to address the problem of limited labels in a certain modality. In [26], distillation is unified with privileged information [44]. To avoid explicitly training multiple models, Laine and Aila [21] exploit multiple checkpoints during training to generate the ensemble predictions. Following the success of these existing works, our approach distills knowledge from a lightweight ensemble formed by multiple data transformations.
There is a great volume of work on semi-supervised learning, and comprehensive surveys can be found in [49, 4, 50]. Among semi-supervised methods, our method is most related to self-training, a strategy in which a model’s predictions on unlabeled data are used to train itself [36, 48, 43, 33, 22, 46, 5, 21]. Closely related to our work on keypoint/object detection, Rosenberg et al. [33] demonstrate that self-training can be used for training object detectors. Compared to prior efforts, our method is substantially simpler. Once the predicted annotations are generated, our method leverages them as if they were true labels; it does not require any modifications to the optimization problem or model structure.
Multiple views or perturbations of the data can provide useful signal for semi-supervised learning. In the co-training framework [2], different views of the data are used to learn two distinct classifiers that are then used to train one another over unlabeled data. Reed et al. [29] use a reconstruction consistency term for training classification and detection models. Bachman et al. [1] employ a pseudo-ensemble regularization term to train models that are robust to input perturbations. Sajjadi et al. [35] enforce consistency between outputs computed for different transformations of input examples. Simon et al. [38] utilize multi-view geometry to generate hand keypoint labels from multiple cameras and retrain the detector. In an auto-encoder scenario, Hinton et al. [17] propose to use multiple “capsules” to model multiple geometric transformations. Our method is also based on multiple geometric transformations, but it does not require modifying network structures or imposing consistency through extra loss terms.
Regarding the large-scale regime, Fergus et al. [9] investigate semi-supervised learning on 80 million tiny images. The Never Ending Image Learner (NEIL) [5] employs self-training to perform semi-supervised learning from web-scale image data. These methods were developed before the recent renaissance of deep learning. In contrast, our method is evaluated with strong deep neural network baselines, and can be applied to structured prediction problems beyond image-level classification (e.g., keypoints and boxes).
We propose data distillation, a general method for omni-supervised learning that distills knowledge from unlabeled data without the requirement of training a large set of models. Data distillation involves four steps: (1) training a model on manually labeled data (just as in normal supervised learning); (2) applying the trained model to multiple transformations of unlabeled data; (3) converting the predictions on the unlabeled data into labels by ensembling the multiple predictions; and (4) retraining the model on the union of the manually labeled data and automatically labeled data. We describe steps 2-4 in more detail below.
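As a rough end-to-end illustration of the four steps, here is a minimal sketch. The `train_model` and `select_confident` callables are hypothetical placeholders for the supervised training loop and for the rule that keeps only confident ensembled predictions, and `ensemble_predictions` refers to the multi-transform inference sketched earlier; none of these names come from the paper's code.

```python
def data_distillation(labeled_set, unlabeled_images, train_model,
                      predict_heatmap, select_confident):
    """Sketch of the four data-distillation steps using placeholder callables."""
    # (1) Train on the manually labeled data, as in normal supervised learning.
    model = train_model(labeled_set)

    # (2) + (3) Run the model on multiple transformations of each unlabeled
    # image and ensemble the predictions into automatically generated labels.
    auto_labeled = []
    for image in unlabeled_images:
        fused = ensemble_predictions(model, predict_heatmap, image)
        annotations = select_confident(fused)
        if annotations:
            auto_labeled.append((image, annotations))

    # (4) Retrain on the union of manual and automatic annotations.
    return train_model(list(labeled_set) + auto_labeled)
```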
We show that it is possible to surpass large-scale supervised learning with omni-supervised learning, i.e., using all available supervised data together with large amounts of unlabeled data. We achieve this by applying data distillation to the challenging problems of COCO object and keypoint detection. We hope our work will attract more attention to this practical, large-scale setting.
Figure 1. Model Distillation [18] vs. Data Distillation. In data distillation, ensembled predictions from a single model applied to multiple transformations of an unlabeled image are used as automatically annotated data for training a student model.