Motivation
Approach
Proposed Method
In InvaSpread, the number of negative samples is limited by the batch size; the authors did not have TPUs and only used a batch size of 256, which ultimately limited the model's performance.
CMC was one of the earlier works to explore multi-view contrastive learning. It demonstrated not only the flexibility of contrastive learning but also the feasibility of multi-modal and multi-view setups. The original CMC team even followed up with a model-distillation work that treats the outputs of the student and teacher models as a positive pair for contrastive learning. Soon after, OpenAI released CLIP, which treats an image and the text describing it as a positive pair for multi-modal contrastive learning.
A big and consistent dictionary
SimCLR
ImageNet Top-1 accuracy
The number of negatives sampled in SimCLR is likewise limited by the batch size; the authors' team had the resources to push the batch size to very large values, which is what produces such strong results.
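To make the dependence on batch size concrete, here is a minimal sketch (assumed names and shapes, not the authors' code) of an NT-Xent-style loss in PyTorch, where the negatives for each sample are simply the other embeddings in the same batch:

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: (N, D) embeddings of two augmented views of the same N images
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.T / temperature                          # (2N, 2N) similarity logits
    mask = torch.eye(sim.shape[0], dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                # drop self-similarity
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                 # positive of each view: the other view

With this formulation each sample sees 2N - 2 negatives, so a larger batch directly means more negatives, which is exactly why batch size matters so much for InvaSpread and SimCLR.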
SWAV Method overview
Compared to SimCLR, SWAV pretraining converges faster and is less sensitive to the batch size. Moreover, SWAV is not very sensitive to the number of clusters; typically 3K clusters are used for ImageNet. In general, it is recommended to use roughly one order of magnitude more clusters than the number of real classes. For STL10, which has 10 classes, 512 clusters are enough.
Definitions
SWAV method Steps
Digging into SWAV’s math: approximating Q
# Sinkhorn-Knopp (PyTorch): turn prototype scores into soft cluster assignments Q
import torch

def sinkhorn(scores, eps=0.05, niters=3):
    # scores: (B, K) dot products between B features and K prototypes
    Q = torch.exp(scores / eps).T                    # (K, B)
    Q /= torch.sum(Q)                                # normalize so Q sums to 1
    K, B = Q.shape
    r = torch.ones(K, device=Q.device) / K           # target row marginal (uniform over prototypes)
    c = torch.ones(B, device=Q.device) / B           # target column marginal (uniform over samples)
    for _ in range(niters):
        Q *= (r / torch.sum(Q, dim=1)).unsqueeze(1)  # row normalization
        Q *= (c / torch.sum(Q, dim=0)).unsqueeze(0)  # column normalization
    return (Q / torch.sum(Q, dim=0, keepdim=True)).T # (B, K); each row sums to 1
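A quick usage sketch for the function above (shapes are assumptions, loosely matching SWAV's 3K prototypes and 128-d features):

feats  = torch.nn.functional.normalize(torch.randn(256, 128), dim=1)   # B=256 features
protos = torch.nn.functional.normalize(torch.randn(3000, 128), dim=1)  # K=3000 prototype vectors
codes  = sinkhorn(feats @ protos.T)   # (256, 3000); each row is a soft cluster assignment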
The multi-crop idea: augmenting views with smaller images
Before multi-crop is added, SWAV is actually no better than MoCo v2, which suggests that combining clustering with contrastive learning does not by itself bring much advantage; what really improves the numbers is multi-crop. And because multi-crop is broadly applicable, most follow-up works borrow multi-crop rather than SWAV itself.
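A minimal sketch of multi-crop with torchvision (the 2×224 global + 6×96 local setup and the scale ranges follow common practice; treat the exact numbers as assumptions):

import torchvision.transforms as T

def multi_crop(img, n_global=2, n_local=6):
    # two full-resolution views plus several cheap low-resolution views of the same image
    global_t = T.Compose([T.RandomResizedCrop(224, scale=(0.14, 1.0)), T.RandomHorizontalFlip()])
    local_t  = T.Compose([T.RandomResizedCrop(96,  scale=(0.05, 0.14)), T.RandomHorizontalFlip()])
    return [global_t(img) for _ in range(n_global)] + [local_t(img) for _ in range(n_local)]

Because the extra crops are small, the additional forward-pass cost stays modest while the model sees many more views per image.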
Linear classification on ImageNet
Latent here just means feature. Bootstrap Your Own Latent: the model learns from itself, like pulling yourself into the air by stepping on your own feet!
sg: stop gradient
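For reference, a minimal sketch of the BYOL update (module and function names are assumptions, not the official implementation): the online branch's predictor output is regressed onto the target branch's projection, the target sees only a stop-gradient, and the target weights follow the online weights by exponential moving average.

import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    # negative cosine similarity; .detach() is the 'sg' (stop gradient) in the figure
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)
    return 2 - 2 * (p * z).sum(dim=1).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    # target parameters slowly track the online parameters
    for pt, po in zip(target_net.parameters(), online_net.parameters()):
        pt.data.mul_(tau).add_(po.data, alpha=1 - tau)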
Projection heads in SOTA self-supervised methods
Surprising results
Why batch normalization is critical in BYOL: mode collapse
Why BN is implicit contrastive learning: all examples are compared to the mode
The BYOL authors' response: BYOL works even without batch statistics!
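In that response, the authors show BYOL still trains when batch normalization in the MLP heads is replaced by group normalization combined with weight standardization, i.e. without any batch statistics. A rough sketch of the group-norm part of such a projection head (layer sizes are assumptions; weight standardization is omitted here):

import torch.nn as nn

def projection_head_gn(in_dim=2048, hidden_dim=4096, out_dim=256, groups=32):
    # GroupNorm normalizes within each sample, so no statistics are shared across the batch
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.GroupNorm(groups, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )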
SimSiam
Empirical Study
Hypothesis
The authors' hypothesis mainly explains what optimization problem SimSiam may implicitly be solving; as for why SimSiam does not suffer from model collapse, they offer only a brief remark:
The alternating optimization provides a different trajectory, and the trajectory depends on the initialization. It is unlikely that the initialized η, which is the output of a randomly initialized network, would be a constant. Starting from this initialization, it may be difficult for the alternating optimizer to approach a constant η_x for all x, because the method does not compute the gradients w.r.t. η jointly for all x.
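The alternating optimization described above acts on SimSiam's symmetric loss with a stop-gradient on one branch; a minimal sketch of that loss (variable names are assumptions):

import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # p1, p2: predictor outputs; z1, z2: encoder outputs of the two views
    def D(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # stop-gradient on z
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)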
Comparisons
As the comparison shows, MoCo v2 and SimSiam give the best transfer-learning performance.
MoCo v3
Stability of Self-Supervised ViT Training
However, the authors also note that this trick only alleviates the training instability rather than truly solving it: when the learning rate is too large, the model still becomes unstable.
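The trick referred to here is freezing the randomly initialized patch-projection layer of the ViT. A sketch of how that might be done (the attribute path patch_embed.proj assumes a timm-style ViT and is not taken from the paper):

import timm

vit = timm.create_model('vit_base_patch16_224', num_classes=0)
for p in vit.patch_embed.proj.parameters():
    p.requires_grad = False   # keep the random patch projection fixed during training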
Self-distillation with no labels
Self-attention from a Vision Transformer