As a piece of unsupervised representation learning, MoCo not only approaches the supervised baseline on classification, but also surpasses supervised pre-trained models on many mainstream vision tasks.
MoCo shows that unsupervised learning is viable in vision: we may genuinely not need large-scale labeled data for pre-training.
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning [29] as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT [50, 51] and BERT [12]. But supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind. The reason may stem from differences in their respective signal spaces. Language tasks have discrete signal spaces (words, sub-word units, etc.) for building tokenized dictionaries, on which unsupervised learning can be based. Computer vision, in contrast, further concerns dictionary building [54, 9, 5], as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).
Summary:
In NLP, discrete signals such as words and sub-word units make it easy to build a tokenized dictionary, i.e., to map each token to a feature. In CV, the raw signal lives in a continuous, high-dimensional space that carries little compact semantic structure, which makes dictionary building much harder.
Several recent studies [61, 46, 36, 66, 35, 56, 2] present promising results on unsupervised visual representation learning using approaches related to the contrastive loss [29]. Though driven by various motivations, these methods can be thought of as building dynamic dictionaries. The “keys” (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss [29].
Summary:
All of these contrastive methods can be framed as building a dynamic dictionary: the encoded anchor is the query, and the samples in the dictionary (the matching positive and the negatives) are the keys.
From this perspective, we hypothesize that it is desirable to build dictionaries that are: (i) large and (ii) consistent as they evolve during training. Intuitively, a larger dictionary may better sample the underlying continuous, highdimensional visual space, while the keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent. However, existing methods that use contrastive losses can be limited in one of these two aspects (discussed later in context).
Summary:
The dictionary should be both large and consistent.
We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss (Figure 1). We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The queue decouples the dictionary size from the mini-batch size, allowing it to be large. Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.
The queue keeps memory under control and decouples the dictionary size from the mini-batch size. But then the queue holds keys produced by the encoder at different time steps; doesn't that break consistency? The authors introduce a momentum encoder to solve this.
The momentum (key) encoder is initialized from the query encoder; with a large momentum value, each update is only weakly influenced by the query encoder, so the keys stay as consistent as possible.
MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and can be used with various pretext tasks. In this paper, we follow a simple instance discrimination task [61, 63, 2]: a query matches a key if they are encoded views (e.g., different crops) of the same image. Using this pretext task, MoCo shows competitive results under the common protocol of linear classification in the ImageNet dataset [11].
A main purpose of unsupervised learning is to pre-train representations (i.e., features) that can be transferred to downstream tasks by fine-tuning. We show that in 7 downstream tasks related to detection or segmentation, MoCo unsupervised pre-training can surpass its ImageNet supervised counterpart, in some cases by nontrivial margins. In these experiments, we explore MoCo pre-trained on ImageNet or on a one-billion Instagram image set, demonstrating that MoCo can work well in a more real-world, billion-image scale, and relatively uncurated scenario. These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to ImageNet supervised pre-training in several applications.
Unsupervised/self-supervised learning methods generally involve two aspects: pretext tasks and loss functions. The term “pretext” implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation. Loss functions can often be investigated independently of pretext tasks. MoCo focuses on the loss function aspect. Next we discuss related studies with respect to these two aspects.
A common way of defining a loss function is to measure the difference between a model’s prediction and a fixed target, such as reconstructing the input pixels (e.g., auto-encoders) by L1 or L2 losses, or classifying the input into pre-defined categories (e.g., eight positions [13], color bins [64]) by cross-entropy or margin-based losses. Other alternatives, as described next, are also possible.
Contrastive losses [29] measure the similarities of sample pairs in a representation space. Instead of matching an input to a fixed target, in contrastive loss formulations the target can vary on-the-fly during training and can be defined in terms of the data representation computed by a network [29]. Contrastive learning is at the core of several recent works on unsupervised learning [61, 46, 36, 66, 35, 56, 2], which we elaborate on later in context (Sec. 3.1).
Adversarial losses [24] measure the difference between probability distributions. It is a widely successful technique for unsupervised data generation. Adversarial methods for representation learning are explored in [15, 16]. There are relations (see [24]) between generative adversarial networks and noise-contrastive estimation (NCE) [28].
A wide range of pretext tasks have been proposed. Examples include recovering the input under some corruption, e.g., denoising auto-encoders [58], context autoencoders [48], or cross-channel auto-encoders (colorization) [64, 65]. Some pretext tasks form pseudo-labels by, e.g., transformations of a single (“exemplar”) image [17], patch orderings [13, 45], tracking [59] or segmenting objects [47] in videos, or clustering features [3, 4].
Various pretext tasks can be based on some form of contrastive loss functions. The instance discrimination method [61] is related to the exemplar-based task [17] and NCE [28]. The pretext task in contrastive predictive coding (CPC) [46] is a form of context auto-encoding [48], and in contrastive multiview coding (CMC) [56] it is related to colorization [64].
Consider an encoded query q and a set of encoded samples {k0, k1, k2, ...} that are the keys of a dictionary. Assume that there is a single key (denoted as k+) in the dictionary that q matches. A contrastive loss [29] is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q). With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE [46], is considered in this paper:

Lq = −log [ exp(q·k+/τ) / Σ_{i=0..K} exp(q·ki/τ) ]    (1)
where τ is a temperature hyper-parameter per [61]. The sum is over one positive and K negative samples. Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k+. Contrastive loss functions can also be based on other forms [29, 59, 61, 36], such as margin-based losses and variants of NCE losses.
The contrastive loss serves as an unsupervised objective function for training the encoder networks that represent the queries and keys [29]. In general, the query representation is q = fq(xq) where fq is an encoder network and xq is a query sample (likewise, k = fk(xk)). Their instantiations depend on the specific pretext task. The input xq and xk can be images [29, 61, 63], patches [46], or context consisting a set of patches [46]. The networks fq and fk can be identical [29, 59, 63], partially shared [46, 36, 2], or different [56].
Summary:
When the query matches its positive key and does not match the negative keys, the loss should be low; otherwise it should be high and penalize the model.
InfoNCE is used as the contrastive objective. If you treat the bracketed term as the input, the inner part is a softmax; adding −log gives exactly the cross-entropy used in supervised classification. But in the self-supervised setting the number of "classes" is the number of instances, an enormous number, so a full softmax cannot work.
The NCE loss simplifies this many-class problem into a binary one: data samples vs. noise samples. To make computation cheaper, only a subset of negatives is sampled as an approximation, which is why the dictionary should be as large as possible so that the approximation is good.
InfoNCE instead keeps it as a multi-class problem over the sampled keys. τ controls the shape of the distribution: a smaller τ sharpens it and makes the model focus only on the hardest negatives, while a larger τ makes the contrastive loss treat all negatives more evenly; K is the number of negatives.
As the pseudo-code also shows, the loss is implemented with cross-entropy; a minimal sketch is given below.
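A minimal PyTorch sketch of this (K+1)-way cross-entropy view of InfoNCE, assuming q and the keys are already L2-normalized and the queue of negatives is stored as a (C, K) tensor; the function and argument names are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, tau=0.07):
    """InfoNCE as a (K+1)-way cross-entropy, in the spirit of Eqn.(1).

    q:      (N, C) L2-normalized query features
    k_pos:  (N, C) L2-normalized positive key features
    queue:  (C, K) L2-normalized negative keys (the dictionary)
    """
    # Positive logits: one dot product per query -> (N, 1)
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # Negative logits: dot products against every key in the queue -> (N, K)
    l_neg = torch.einsum("nc,ck->nk", q, queue)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive key sits at index 0 for every query, so the "label" is 0.
    labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```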
From the above perspective, contrastive learning is a way of building a discrete dictionary on high-dimensional continuous inputs such as images. The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training. Our hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution. Based on this motivation, we present Momentum Contrast as described next.
At the core of our approach is maintaining the dictionary as a queue of data samples. This allows us to reuse the encoded keys from the immediate preceding mini-batches. The introduction of a queue decouples the dictionary size from the mini-batch size. Our dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyper-parameter.
The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed. The dictionary always represents a sampled subset of all data, while the extra computation of maintaining this dictionary is manageable. Moreover, removing the oldest mini-batch can be beneficial, because its encoded keys are the most outdated and thus the least consistent with the newest ones.
Summary:
The queue solves the memory problem (and keeps the extra computation cheap), and it also promptly discards the most outdated mini-batches; a minimal sketch of this FIFO dictionary follows.
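A minimal sketch of the queue-as-dictionary idea, assuming a fixed feature dimension and a queue length that is a multiple of the batch size; the class and attribute names are made up for illustration.

```python
import torch
import torch.nn.functional as F

class KeyQueue:
    """The dictionary as a fixed-size FIFO of encoded keys (illustrative sketch)."""

    def __init__(self, dim=128, K=65536):
        self.K = K
        # Start from random unit vectors; they are gradually replaced by real keys.
        self.queue = F.normalize(torch.randn(dim, K), dim=0)  # shape (C, K)
        self.ptr = 0  # position of the oldest mini-batch

    @torch.no_grad()
    def dequeue_and_enqueue(self, keys):
        """keys: (N, C) encoded keys of the current mini-batch."""
        n = keys.shape[0]
        assert self.K % n == 0  # keep the sketch simple
        # Newest keys overwrite the oldest ones; the pointer moves circularly.
        self.queue[:, self.ptr:self.ptr + n] = keys.T
        self.ptr = (self.ptr + n) % self.K
```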
Using a queue can make the dictionary large, but it also makes it intractable to update the key encoder by back-propagation (the gradient should propagate to all samples in the queue). A naïve solution is to copy the key encoder fk from the query encoder fq, ignoring this gradient. But this solution yields poor results in experiments (Sec. 4.1). We hypothesize that such failure is caused by the rapidly changing encoder that reduces the key representations’ consistency. We propose a momentum update to address this issue.
Formally, denoting the parameters of fk as θk and those of fq as θq, we update θk by:
θk ← m θk + (1 − m) θq.    (2)
Here m ∈ [0,1) is a momentum coefficient. Only the parameters θq are updated by back-propagation. The momentum update in Eqn.(2) makes θk evolve more smoothly than θq. As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small. In experiments, a relatively large momentum (e.g., m = 0.999, our default) works much better than a smaller value (e.g., m = 0.9), suggesting that a slowly evolving key encoder is a core to making use of a queue.
Summary:
To keep a rapidly changing encoder from hurting consistency, the key encoder is updated with a momentum rule using a large momentum coefficient, which makes its parameters evolve smoothly; a minimal sketch follows.
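A minimal sketch of the momentum update of Eqn.(2); `encoder_q` and `encoder_k` are assumed to be two networks with identical architecture.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, as in Eqn.(2).

    Only encoder_q receives gradients; encoder_k is moved slowly toward it.
    """
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```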
MoCo is a general mechanism for using contrastive losses. We compare it with two existing general mechanisms in Figure 2. They exhibit different properties on the dictionary size and consistency.
The end-to-end update by back-propagation is a natural mechanism (e.g., [29, 46, 36, 63, 2, 35], Figure 2a). It uses samples in the current mini-batch as the dictionary, so the keys are consistently encoded (by the same set of encoder parameters). But the dictionary size is coupled with the mini-batch size, limited by the GPU memory size. It is also challenged by large mini-batch optimization [25]. Some recent methods [46, 36, 2] are based on pretext tasks driven by local positions, where the dictionary size can be made larger by multiple positions. But these pretext tasks may require special network designs such as patchifying the input [46] or customizing the receptive field size [2], which may complicate the transfer of these networks to downstream tasks.
Another mechanism is the memory bank approach proposed by [61] (Figure 2b). A memory bank consists of the representations of all samples in the dataset. The dictionary for each mini-batch is randomly sampled from the memory bank with no back-propagation, so it can support a large dictionary size. However, the representation of a sample in the memory bank was updated when it was last seen, so the sampled keys are essentially about the encoders at multiple different steps all over the past epoch and thus are less consistent. A momentum update is adopted on the memory bank in [61]. Its momentum update is on the representations of the same sample, not the encoder. This momentum update is irrelevant to our method, because MoCo does not keep track of every sample. Moreover, our method is more memory-efficient and can be trained on billion-scale data, which can be intractable for a memory bank.
Summary:
End-to-end: both encoders are updated by back-propagation, so the keys are encoded consistently and can be refreshed at every step, but the dictionary size is tied to the mini-batch size.
Memory bank: trades some consistency for a larger dictionary. There is no key encoder; negatives are simply sampled from the bank. Because the stored features were produced by encoders from many different training steps, and each feature is only refreshed once per epoch (so consecutive updates of a sample are far apart), consistency is poor.
Contrastive learning can drive a variety of pretext tasks. As the focus of this paper is not on designing a new pretext task, we use a simple one mainly following the instance discrimination task in [61], to which some recent works [63, 2] are related.
Following [61], we consider a query and a key as a positive pair if they originate from the same image, and otherwise as a negative sample pair. Following [63, 2], we take two random “views” of the same image under random data augmentation to form a positive pair. The queries and keys are respectively encoded by their encoders, fq and fk. The encoder can be any convolutional neural network [39].
Algorithm 1 provides the pseudo-code of MoCo for this pretext task. For the current mini-batch, we encode the queries and their corresponding keys, which form the positive sample pairs. The negative samples are from the queue.
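Putting the pieces together, a hedged sketch of one training iteration in the spirit of Algorithm 1, reusing the illustrative `info_nce_loss`, `KeyQueue`, and `momentum_update` from the sketches above; `aug` is assumed to return a randomly augmented view of the batch, and shuffling BN (discussed below) is omitted.

```python
import torch
import torch.nn.functional as F

def train_step(x, aug, encoder_q, encoder_k, queue, optimizer, m=0.999, tau=0.07):
    """One MoCo iteration in the spirit of Algorithm 1 (illustrative sketch)."""
    x_q, x_k = aug(x), aug(x)                     # two random views of the same images
    q = F.normalize(encoder_q(x_q), dim=1)        # queries: gradients flow through encoder_q
    with torch.no_grad():
        momentum_update(encoder_q, encoder_k, m)  # slowly update the key encoder (Eqn.(2))
        k = F.normalize(encoder_k(x_k), dim=1)    # keys: no gradient
    loss = info_nce_loss(q, k, queue.queue, tau)  # (K+1)-way cross-entropy (Eqn.(1))
    optimizer.zero_grad()
    loss.backward()                               # updates encoder_q only
    optimizer.step()
    queue.dequeue_and_enqueue(k)                  # newest keys in, oldest keys out
    return loss.item()
```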
We adopt a ResNet [33] as the encoder, whose last fully-connected layer (after global average pooling) has a fixed-dimensional output (128-D [61]). This output vector is normalized by its L2-norm [61]. This is the representation of the query or key. The temperature τ in Eqn.(1) is set as 0.07 [61]. The data augmentation setting follows [61]: a 224×224-pixel crop is taken from a randomly resized image, and then undergoes random color jittering, random horizontal flip, and random grayscale conversion, all available in PyTorch’s torchvision package.
Summary:
The hyper-parameters largely follow InstDisc; a torchvision sketch of the augmentation pipeline is given below.
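A sketch of this augmentation pipeline with torchvision; the crop size and the set of operations follow the text, while the jitter strengths, grayscale probability, and normalization statistics are assumptions.

```python
from torchvision import transforms

# Augmentation along the lines described above; the parameter values are assumptions.
train_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),           # 224x224 crop from a randomly resized image
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),  # random color jittering (strengths assumed)
    transforms.RandomHorizontalFlip(),           # random horizontal flip
    transforms.RandomGrayscale(p=0.2),           # random grayscale conversion (p assumed)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```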
Our encoders fq and fk both have Batch Normalization (BN) [37] as in the standard ResNet [33]. In experiments, we found that using BN prevents the model from learning good representations, as similarly reported in [35] (which avoids using BN). The model appears to “cheat” the pretext task and easily finds a low-loss solution. This is possibly because the intra-batch communication among samples (caused by BN) leaks information.
We resolve this problem by shuffling BN. We train with multiple GPUs and perform BN on the samples independently for each GPU (as done in common practice). For the key encoder fk, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding); the sample order of the mini-batch for the query encoder fq is not altered. This ensures the batch statistics used to compute a query and its positive key come from two different subsets. This effectively tackles the cheating issue and allows training to benefit from BN.
We use shuffled BN in both our method and its end-to-end ablation counterpart (Figure 2a). It is irrelevant to the memory bank counterpart (Figure 2b), which does not suffer from this issue because the positive keys are from different mini-batches in the past.
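A single-process sketch of the shuffling-BN bookkeeping: the key mini-batch is shuffled before encoding and restored afterwards, so that when the shuffled batch is split across GPUs each GPU's BN statistics come from a different subset than the query's. The real implementation performs the shuffle across GPUs; the function name here is illustrative.

```python
import torch

@torch.no_grad()
def encode_keys_with_shuffled_bn(encoder_k, x_k):
    """Shuffle the key mini-batch before encoding, then restore the original order.

    In the multi-GPU setting the shuffled batch is re-split across GPUs, so each
    GPU's BN statistics mix samples from different original positions; the query
    batch is left unshuffled, so a query and its positive key see different
    BN statistics.
    """
    idx_shuffle = torch.randperm(x_k.shape[0], device=x_k.device)
    idx_unshuffle = torch.argsort(idx_shuffle)
    k = encoder_k(x_k[idx_shuffle])   # BN inside encoder_k sees the shuffled order
    return k[idx_unshuffle]           # put the keys back in the original sample order
```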
For this classifier, we perform a grid search and find the optimal initial learning rate is 30 and weight decay is 0 (similarly reported in [56]). These hyper-parameters perform consistently well for all ablation entries presented in this subsection. These hyper-parameter values imply that the feature distributions (e.g., magnitudes) can be substantially different from those of ImageNet supervised training, an issue we will revisit in Sec. 4.2.
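A sketch of the linear classification protocol implied here: freeze the pre-trained encoder and train only a linear classifier with the grid-searched initial learning rate 30 and weight decay 0. The feature dimension, SGD momentum, and function name are assumptions.

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim=2048, num_classes=1000):
    """Freeze the pre-trained encoder; only a linear classifier on top is trained."""
    for p in backbone.parameters():
        p.requires_grad = False                   # features stay fixed
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(),
                                lr=30.0,           # grid-searched initial learning rate
                                momentum=0.9,      # assumption, not stated in the text
                                weight_decay=0.0)  # no weight decay, per the text
    return classifier, optimizer
```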
In the ablation figure, the x-axis K is the number of negatives.
The end-to-end variant is memory-bound and only reaches K = 1024.
The table below shows ResNet-50 accuracy with different MoCo momentum values (m in Eqn.(2)) used in pre-training (K = 4096 here):
It performs reasonably well when m is in 0.99 ∼ 0.9999, showing that a slowly progressing (i.e., relatively large momentum) key encoder is beneficial. When m is too small (e.g., 0.9), the accuracy drops considerably; at the extreme of no momentum (m is 0), the training loss oscillates and fails to converge. These results support our motivation of building a consistent dictionary.
Results under the linear protocol on ImageNet:
For unsupervised learning, the larger the model, the better the result.
A main goal of unsupervised learning is to learn features that are transferrable. ImageNet supervised pre-training is most influential when serving as the initialization for finetuning in downstream tasks (e.g., [21, 20, 43, 52]). Next we compare MoCo with ImageNet supervised pre-training, transferred to various tasks including PASCAL VOC [18], COCO [42], etc. As prerequisites, we discuss two important issues involved [31]: normalization and schedules.
As noted in Sec. 4.1, features produced by unsupervised pre-training can have different distributions compared with ImageNet supervised pre-training. But a system for a downstream task often has hyper-parameters (e.g., learning rates) selected for supervised pre-training. To relieve this problem, we adopt feature normalization during fine-tuning: we fine-tune with BN that is trained (and synchronized across GPUs [49]), instead of freezing it by an affine layer [33]. We also use BN in the newly initialized layers (e.g., FPN [41]), which helps calibrate magnitudes.
We perform normalization when fine-tuning supervised and unsupervised pre-training models. MoCo uses the same hyper-parameters as the ImageNet supervised counterpart.
Summary:
Under the linear protocol the learning rate is 30, which shows that the feature distributions learned with and without supervision are quite different. When transferring to many downstream tasks we cannot afford a fresh grid search for every task, so we want the distributions to line up, hence feature normalization during fine-tuning; a sketch follows.
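A sketch of this normalization choice using PyTorch's stock SyncBatchNorm conversion; it illustrates the idea (trainable, cross-GPU-synchronized BN instead of frozen BN), not the authors' exact fine-tuning code.

```python
import torch.nn as nn

def prepare_backbone_for_finetune(model):
    """Fine-tune with trainable BN synchronized across GPUs instead of frozen BN."""
    # Replace every BatchNorm layer with SyncBatchNorm so its statistics are
    # computed over the whole multi-GPU batch during fine-tuning.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    # Keep all parameters (including BN affine parameters) trainable, i.e. do not
    # freeze BN into a fixed affine layer as is common with supervised ImageNet weights.
    for p in model.parameters():
        p.requires_grad = True
    return model
```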
If the fine-tuning schedule is long enough, training detectors from random initialization can be strong baselines, and can match the ImageNet supervised counterpart on COCO [31]. Our goal is to investigate transferability of features, so our experiments are on controlled schedules, e.g., the 1× (∼12 epochs) or 2× schedules [22] for COCO, in contrast to 6×∼9× in [31]. On smaller datasets like VOC, training longer may not catch up [31].
Nonetheless, in our fine-tuning, MoCo uses the same schedule as the ImageNet supervised counterpart, and random initialization results are provided as references.
Summary:
With the controlled, shorter schedules MoCo is ahead and delivers better results sooner; when training runs long enough, the different initializations end up roughly on par.
Put together, our fine-tuning uses the same setting as the supervised pre-training counterpart. This may place MoCo at a disadvantage. Even so, MoCo is competitive. Doing so also makes it feasible to present comparisons on multiple datasets/tasks, without extra hyper-parameter search.
MoCo genuinely surpasses the results of the supervised pre-trained model.
Compared with previous work, MoCo also transfers better to downstream tasks.
In sum, MoCo can outperform its ImageNet supervised pre-training counterpart in 7 detection or segmentation tasks. Besides, MoCo is on par on Cityscapes instance segmentation, and lags behind on VOC semantic segmentation; we show another comparable case on iNaturalist [57] in appendix. Overall, MoCo has largely closed the gap between unsupervised and supervised representation learning in multiple vision tasks.
Remarkably, in all these tasks, MoCo pre-trained on IG-1B is consistently better than MoCo pre-trained on IN-1M. This shows that MoCo can perform well on this large-scale, relatively uncurated dataset. This represents a scenario towards real-world unsupervised learning.
Summary:
MoCo transfers well to many downstream tasks with strong results; given more data, its performance improves further.
Our method has shown positive results of unsupervised learning in a variety of computer vision tasks and datasets. A few open questions are worth discussing. MoCo’s improvement from IN-1M to IG-1B is consistently noticeable but relatively small, suggesting that the larger-scale data may not be fully exploited. We hope an advanced pretext task will improve this. Beyond the simple instance discrimination task [61], it is possible to adopt MoCo for pretext tasks like masked auto-encoding, e.g., in language [12] and in vision [46]. We hope MoCo will be useful with other pretext tasks that involve contrastive learning.