Knowledge Distillation: A Survey Through Time
History
In 2012, AlexNet outperformed all the existing models on the ImageNet data. Neural networks were about to see major adoption. By 2015, many state-of-the-art results had been surpassed. The trend was to use neural networks on any use case you could find. The success of VGG Net further affirmed the use of deeper models, or ensembles of models, to get a performance boost.
(Ensemble of models is just a fancy term: it means averaging the outputs of multiple models. For example, if two of three models predict ‘A’ while one predicts ‘B’, the final prediction is ‘A’ (two votes versus one).)
But these deeper models and these ensembles of models are too costly to run during inference. (An ensemble of 3 models uses 3x the computation of a single model.)
Ideation
Geoffrey Hinton, Oriol Vinyals and Jeff Dean came up with a strategy to train shallow models guided by these pre-trained ensembles. They called this knowledge distillation, because you distill knowledge from a pre-trained model into a new model. Since this resembles a teacher guiding a student, it is also called teacher-student learning. https://arxiv.org/abs/1503.02531
In Knowledge Distillation, they used the output probabilities of the pre-trained model as the labels for the new shallow model. This blog walks you through the improvements of this technique over time.
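To make the idea concrete, here is a minimal sketch of such a distillation loss, assuming PyTorch; the temperature softening, the KL-divergence term and the weighting are standard choices from the original paper, but the exact values and function names here are only illustrative.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soften both distributions with a temperature T so the teacher's
    # relative probabilities over wrong classes are visible to the student.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Keep a standard cross-entropy term on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * hard
```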
FitNets
In 2015 came FitNets: Hints for Thin Deep Nets (published at ICLR’15)
FitNets add an additional term along with the KD loss. They take representations from the middle of both networks and add a mean squared error loss between the feature representations at these points.
The trained network provides a learned intermediate representation which the new network mimics. These representations help the student learn efficiently, and were called hints.
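A minimal sketch of a hint loss of this kind, assuming PyTorch; the 1x1-convolution regressor that matches the student’s channel count to the teacher’s is an assumption about the shapes involved, not necessarily the paper’s exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """MSE between an intermediate teacher feature map (the hint) and the
    corresponding student feature map, after a small regressor matches shapes."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution so the student's channels line up with the teacher's.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (N, C_s, H, W), teacher_feat: (N, C_t, H, W)
        return F.mse_loss(self.regressor(student_feat), teacher_feat)
```

During training this term is added to the usual KD loss while the teacher stays frozen.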
Looking back, this choice of using a single point for giving hints is sub-optimal. A lot of subsequent papers try to improve these hints.
Paying More Attention to Attention
Paying more attention to attention: Improving the performance of convolutional neural networks via Attention Transfer was published at ICLR 2017
They have a similar motivation to FitNets, but rather than the representations from a single point in the network, they use attention maps as the hints (MSE over the attention maps of student and teacher). They also use multiple points in the network for giving hints, rather than the one-point hint in FitNets.
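A sketch of how such an activation-based attention map and the matching loss could look, assuming PyTorch; squaring and averaging over channels is one of the pooling choices discussed in the paper, and the helper names here are made up for illustration.

```python
import torch.nn.functional as F

def attention_map(feat):
    # Collapse channels of a (N, C, H, W) feature map into a spatial map,
    # flatten it, and L2-normalize so maps of different scale are comparable.
    a = feat.pow(2).mean(dim=1).flatten(1)          # (N, H*W)
    return F.normalize(a, dim=1)

def attention_transfer_loss(student_feats, teacher_feats):
    # Sum of MSEs between normalized attention maps taken at several
    # matching points of the two networks.
    return sum(
        F.mse_loss(attention_map(s), attention_map(t))
        for s, t in zip(student_feats, teacher_feats)
    )
```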
A Gift from Knowledge Distillation
In the same year, A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning was published at CVPR 2017.
This is also similar to FitNets and the attention transfer paper, but instead of the representations or the attention maps, they give hints using Gram matrices.
They have an analogy for this in the paper:
“In the case of people, the teacher explains the solution process for a problem, and the student learns the flow of the solution procedure. The student DNN does not necessarily have to learn the intermediate output when the specific question is input but can learn the solution method when a specific type of question is encountered. In this manner, we believe that demonstrating the solution process for the problem provides better generalization than teaching the intermediate result.”
To measure this “flow of solution procedure”, they use a Gram matrix between the feature maps of two layers. So instead of the intermediate feature representations used as hints in FitNets, this uses the Gram matrix between feature representations as the hint.
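A sketch of such a Gram-style “flow” matrix and its loss, assuming PyTorch and assuming the two layers share the same spatial size (which the paper arranges by choosing suitable layer pairs); names are illustrative.

```python
import torch
import torch.nn.functional as F

def flow_matrix(feat_a, feat_b):
    # feat_a: (N, C1, H, W), feat_b: (N, C2, H, W), same spatial size assumed.
    # Returns an (N, C1, C2) Gram-style matrix describing how layer-a channels
    # relate to layer-b channels, i.e. the "flow of the solution procedure".
    n, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(n, c1, h * w)
    b = feat_b.reshape(n, c2, h * w)
    return torch.bmm(a, b.transpose(1, 2)) / (h * w)

def flow_loss(student_pair, teacher_pair):
    # MSE between the student's and the teacher's flow matrices.
    return F.mse_loss(flow_matrix(*student_pair), flow_matrix(*teacher_pair))
```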
Paraphrasing Complex Network
Then, in 2018, came Paraphrasing Complex Network: Network Compression via Factor Transfer, published at NeurIPS 2018.
They add another module to the model, which they call the paraphraser. It is basically an auto-encoder which doesn’t reduce dimensions. From the last layer, they fork out another branch which is trained on a reconstruction loss.
The student has another module, called the translator. It embeds the outputs of the student’s last layer into the teacher-paraphraser’s dimensions, and they use this latent paraphrased representation from the teacher as the hint.
tl;dr The student should be able to construct an auto-encoded representation of the input from the teacher network.
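A rough sketch of those two extra modules and the factor-matching loss, assuming PyTorch; the single-convolution paraphraser and translator, and the L1 loss over normalized factors, are simplifications rather than the paper’s exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class Paraphraser(nn.Module):
    """Auto-encoder attached to the teacher's last feature map; its encoded
    'factor' is what the student will be asked to match."""
    def __init__(self, channels):
        super().__init__()
        self.encoder = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.decoder = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, teacher_feat):
        factor = self.encoder(teacher_feat)
        recon = self.decoder(factor)          # trained with a reconstruction loss
        return factor, recon

class Translator(nn.Module):
    """Maps the student's last feature map into the paraphraser's factor space."""
    def __init__(self, student_channels, factor_channels):
        super().__init__()
        self.net = nn.Conv2d(student_channels, factor_channels, kernel_size=3, padding=1)

    def forward(self, student_feat):
        return self.net(student_feat)

def factor_loss(student_factor, teacher_factor):
    # Distance between normalized factors; L1 is one reasonable choice.
    s = F.normalize(student_factor.flatten(1), dim=1)
    t = F.normalize(teacher_factor.flatten(1), dim=1)
    return F.l1_loss(s, t)
```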
A Comprehensive Overhaul of Feature Distillation
In 2019, A Comprehensive Overhaul of Feature Distillation was published at ICCV 2019.
They claim that the position from which we take the hints isn’t optimal. The outputs are refined through ReLU, and some information is lost during that transformation. They propose a margin ReLU activation (a shifted ReLU). “In our margin ReLU, the positive (beneficial) information is used without any transformation while the negative (adverse) information is suppressed. As a result, the proposed method can perform distillation without missing the beneficial information.”
They also employ a partial L2 distance function, designed to skip the distillation of information in the negative region. (There is no loss if the feature values of both the student and the teacher at that location are negative.)
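A sketch of the margin ReLU and the partial L2 distance, assuming PyTorch; the per-channel margin tensor and the “skip where both are negative” condition follow the description above and simplify the paper’s exact formulation.

```python
import torch

def margin_relu(x, margin):
    # Shifted ReLU: positive values pass through unchanged, negative values
    # are clamped to a (negative) per-channel margin instead of zero.
    # margin is a tensor broadcastable to x, e.g. of shape (1, C, 1, 1).
    return torch.max(x, margin)

def partial_l2_loss(student_feat, teacher_feat):
    # Squared error that is skipped wherever both the teacher and the student
    # are negative -- the "adverse" region the method chooses not to distill.
    diff = (teacher_feat - student_feat) ** 2
    skip = (teacher_feat <= 0) & (student_feat <= 0)
    return diff.masked_fill(skip, 0.0).mean()
```

In the paper’s setup, the teacher’s pre-ReLU features are passed through the margin ReLU before this distance is computed.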
Contrastive Representation Distillation was published at ICLR 2020. Here too, the student learns from the teacher’s intermediate representations, but instead of an MSE loss they use a contrastive loss over them.
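A minimal in-batch sketch of such a contrastive objective, assuming PyTorch; the actual paper uses a memory bank of negatives and a specific critic, so this only conveys the flavor of the loss.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.1):
    # Each student embedding should be most similar to the teacher embedding
    # of the same input; the other samples in the batch act as negatives.
    s = F.normalize(student_emb, dim=1)       # (N, D)
    t = F.normalize(teacher_emb, dim=1)       # (N, D)
    logits = s @ t.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)   # diagonal pairs are the positives
```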
In total, these different models have employed different methods to
- Increase the amount of transferred information in distillation (feature representations, Gram matrices, attention maps, paraphrased representations, pre-ReLU features).
- Make the process of distillation efficient by tweaking the loss function (contrastive loss, partial L2 distance).
Another interesting way to look at these ideas is that new ideas are vector sums of old ideas.
- Gram Matrices for KD = Neural Style Transfer + KD
- Attention Maps for KD = Attention is all you need + KD
- Paraphrased representations for KD = Autoencoder + KD
- Contrastive Representation Distillation = InfoNCE + KD
What could be other vector sums?
- GANs for KD (that is, replace the contrastive loss with a GAN loss between feature representations)
- Weak-supervision KD (Self-Training with Noisy Student Improves ImageNet classification)
This blog post is inspired by the tweet-storm on Knowledge Distillation (https://twitter.com/nishantiam/status/1295076936469762048).
Translated from: https://towardsdatascience.com/knowledge-distillation-a-survey-through-time-187de05a278a