Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Advances in neural information processing systems, 2012, 25: 1097-1105.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art.
The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation.
To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]).
Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current-best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images.
The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have.
Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
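The claim that CNNs have far fewer parameters than a fully-connected network with similarly sized layers can be checked with a little arithmetic. The sketch below compares one convolutional layer with a fully-connected layer that maps the same input volume to the same output volume. The layer sizes (a 32 × 32 × 3 input, 16 output channels, 3 × 3 kernels) are hypothetical values chosen only for illustration; they are not taken from the paper.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel for a 2D convolutional layer."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a fully-connected layer."""
    return in_units * out_units + out_units


# Hypothetical layer sizes, chosen only for illustration (not from the paper).
height, width = 32, 32
in_channels, out_channels = 3, 16

# The convolution shares one small kernel across every spatial position.
conv = conv_params(3, 3, in_channels, out_channels)

# The fully-connected layer connects every input value to every output value.
dense = dense_params(height * width * in_channels, height * width * out_channels)

print(f"convolutional layer:   {conv:,} parameters")
print(f"fully-connected layer: {dense:,} parameters")
print(f"ratio: about {dense // conv:,}x")
```

Weight sharing and local connectivity are what make the convolutional count independent of the image size, which is why the gap widens as resolution grows.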
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.
The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly (see footnote 1).
Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.
In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool.
Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
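The top-k error rate defined above can be sketched in a few lines. This is a minimal illustration in plain Python, not the evaluation code used for the competition: the function name `top_k_error`, the toy scores, and the tie-breaking rule (earlier class index wins when scores are equal) are all assumptions made for the example.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes.

    scores: list of per-class score lists, one list per example.
    labels: list of true class indices, one per example.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one example")
    misses = 0
    for class_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy example with 6 classes and 3 images (made-up scores).
scores = [
    [0.05, 0.10, 0.50, 0.20, 0.10, 0.05],  # true class 2 is ranked first
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 1 is ranked second
    [0.40, 0.30, 0.15, 0.10, 0.04, 0.01],  # true class 5 is ranked last
]
labels = [2, 1, 5]

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
print(f"top-1 error: {top1:.3f}, top-5 error: {top5:.3f}")
```

Top-1 error is the special case k = 1, so the top-5 error can never exceed the top-1 error on the same predictions.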
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image.
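The two steps described above (rescale so the shorter side is 256, then take the central 256 × 256 patch) might look like the following sketch. It works on plain nested lists so that it has no dependencies. The text does not say which interpolation method was used, so the nearest-neighbor sampling here is an assumption, as are the function names.

```python
def resize_shorter_side(image, target=256):
    """Rescale so the shorter side equals `target`, keeping the aspect ratio.

    `image` is a list of rows, each row a list of pixels.
    Nearest-neighbor sampling is an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    scale = target / min(height, width)
    new_h = max(target, round(height * scale))
    new_w = max(target, round(width * scale))
    # For each output row/column, pick the nearest source row/column.
    rows = [min(height - 1, int(y * height / new_h)) for y in range(new_h)]
    cols = [min(width - 1, int(x * width / new_w)) for x in range(new_w)]
    return [[image[r][c] for c in cols] for r in rows]


def center_crop(image, size=256):
    """Cut the central size x size patch out of the image."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def preprocess(image, size=256):
    """Rescale the shorter side to `size`, then crop the central square."""
    return center_crop(resize_shorter_side(image, size), size)


# A made-up 300 x 400 image whose pixels record their own (row, column).
image = [[(r, c) for c in range(400)] for r in range(300)]
patch = preprocess(image)
print(len(patch), len(patch[0]))
```

Because only the shorter side is fixed at 256, the crop discards content along the longer side only, so the aspect ratio of what remains is preserved.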
We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.
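The centering step can be sketched as follows. This reads "the mean activity over the training set" as a mean image, that is, one mean value per pixel position and color channel, computed over all training images; that reading is an assumption, and the function names and toy data are hypothetical.

```python
def mean_image(images):
    """Per-position, per-channel mean over a list of equally sized RGB images."""
    count = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [
        [
            tuple(sum(img[y][x][ch] for img in images) / count for ch in range(3))
            for x in range(width)
        ]
        for y in range(height)
    ]


def subtract_mean(image, mean):
    """Center one image by subtracting the training-set mean image."""
    return [
        [
            tuple(p - m for p, m in zip(pixel, mean_pixel))
            for pixel, mean_pixel in zip(row, mean_row)
        ]
        for row, mean_row in zip(image, mean)
    ]


# Two made-up 1 x 2 RGB "training images".
train = [
    [[(10, 20, 30), (0, 0, 0)]],
    [[(30, 40, 50), (200, 100, 0)]],
]
mean = mean_image(train)
centered = subtract_mean(train[0], mean)
print(centered)
```

The mean is computed once on the training set and then reused unchanged for validation and test images, so that all inputs are centered in the same way.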
1 http://code.google.com/p/cuda-convnet/