That part is easy enough; below is a training loop with the InfoNCE loss:
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

train_dataloader = DataLoader(dataset, ...)
for batch in train_dataloader:
    self.optimizer.zero_grad()
    text_features = text_encoder(batch['text'])     # [bs, dim]
    image_features = image_encoder(batch['image'])  # [bs, dim]
    bs, dim = text_features.shape
    # normalize features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # cosine similarity as logits, scaled by the temperature tau
    logits_per_image = image_features @ text_features.t() / self.tau
    logits_per_text = logits_per_image.t()
    # matched pairs sit on the diagonal, so the targets are just 0..bs-1
    target = torch.arange(bs, device=logits_per_image.device)
    loss = (F.cross_entropy(logits_per_image, target) +
            F.cross_entropy(logits_per_text, target)) / 2
    loss.backward()
    self.optimizer.step()
Two ITC + two MLM + one ITM loss terms. ITM is based on the ground truth: you have to know whether a pair is a ground-truth match, and the ITM loss uses hard negatives, which conflicts with Momentum Distillation. That is why ITM has only a single loss and no Momentum-Distillation version, while the other two each also have a Momentum-Distillation version of their loss.
ITC and ITM are computed on the original image and text, while MLM is computed on the masked text plus the original image, so each batch needs two forward passes.
The image-text pairs used for pre-training are mostly collected from the web and they tend to be noisy. Positive pairs are usually weakly-correlated: the text may contain words that are unrelated to the image, or the image may contain entities that are not described in the text. For ITC learning, negative texts for an image may also match the image’s content. For MLM, there may exist other words different from the annotation that describes the image equally well (or better). However, the one-hot labels for ITC and MLM penalize all negative predictions regardless of their correctness.
To address this, we propose to learn from pseudo-targets generated by the momentum model. The momentum model is a continuously-evolving teacher which consists of exponential-moving-average versions of the unimodal and multimodal encoders. In short, an exponential moving average (most codebases already ship with EMA, e.g. Swin Transformer and DeiT) is used to produce pseudo-labels: the predictions are trained to be close not only to the original one-hot labels but also to the pseudo-targets, so when the one-hot labels are not accurate enough, the pseudo-targets step in.
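A minimal sketch of that mechanism, with stand-in module names and an illustrative mixing weight alpha (this is not ALBEF's actual code): the teacher is an EMA copy of the online encoder, and the loss mixes the usual one-hot cross-entropy with a KL term toward the teacher's softened predictions.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

online = nn.Linear(16, 8)                     # stand-in for a real encoder + head
teacher = copy.deepcopy(online)               # momentum ("EMA") copy, never backpropagated
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(online, teacher, m=0.995):
    # teacher <- m * teacher + (1 - m) * online
    with torch.no_grad():
        for p_t, p_o in zip(teacher.parameters(), online.parameters()):
            p_t.mul_(m).add_(p_o, alpha=1 - m)

x = torch.randn(4, 16)
hard_target = torch.randint(0, 8, (4,))       # the usual index (one-hot) labels
alpha = 0.4                                   # weight of the distillation term (illustrative)

logits = online(x)
with torch.no_grad():
    pseudo_targets = F.softmax(teacher(x), dim=-1)    # soft pseudo-targets from the EMA teacher

loss = (1 - alpha) * F.cross_entropy(logits, hard_target) \
       + alpha * F.kl_div(F.log_softmax(logits, dim=-1), pseudo_targets, reduction='batchmean')
loss.backward()
ema_update(online, teacher)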
In short, because the distillation target is a softmax distribution rather than a one-hot label, it cannot be fed directly into PyTorch's cross-entropy loss, so KL divergence is used instead; a more detailed explanation follows.
Without label smoothing, when the label is strictly one-hot (e.g. with 3 classes, only [0,0,1], [0,1,0] or [1,0,0] are possible), the KL divergence is exactly the same as the cross-entropy; see the earlier post on entropy, conditional entropy, cross-entropy, joint entropy, relative entropy / KL divergence, SCE, MAE and mutual information (information gain) for a refresher. In the PyTorch implementation, however, cross-entropy only takes the index of the target class, whereas PyTorch's KLDivLoss accepts two full distributions, which adds flexibility. For KLDivLoss, the input corresponding to the logits, i.e. q_i, has to be passed through a log first (all of this lines up with CrossEntropyLoss, which is NLLLoss(Softmax)), and note that KLDivLoss's argument order is the reverse of the mathematical KL(P||Q). The following is from
https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html
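A quick sanity check of the points above (a small illustrative snippet, not taken from any paper): with a strictly one-hot target, KLDivLoss applied to log-softmaxed logits reproduces the cross-entropy value; the first argument must already be log-probabilities, and the argument order is the reverse of the mathematical KL(P||Q).

import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)                   # [batch, num_classes]
target_idx = torch.tensor([1, 3, 0])         # class indices, as CrossEntropyLoss expects
target_onehot = F.one_hot(target_idx, num_classes=5).float()

ce = F.cross_entropy(logits, target_idx)

# kl_div wants log-probabilities as its *input* (first arg) and a plain
# probability distribution as its *target* (second arg)
kl = F.kl_div(F.log_softmax(logits, dim=-1), target_onehot, reduction='batchmean')

print(torch.allclose(ce, kl))                # True: identical when the target is one-hot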
A quick summary:
Basically a follow-up to ALBEF, with the losses swapped for a contrastive loss and a captioning loss. The differences from ALBEF:
Why not use ITM? Because with ITM each training step needs three forward passes, whereas here a single one is enough. That said, CoCa is scaled up so large that most people cannot afford to train it anyway.
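A rough sketch of what "contrastive loss + caption loss" means here (shapes, names and the loss weight are placeholders, and random tensors stand in for encoder/decoder outputs, so this is not CoCa's real code):

import torch
import torch.nn.functional as F

bs, seq, dim, vocab = 4, 12, 32, 100
tau, caption_weight = 0.07, 2.0              # placeholder hyper-parameters

img_emb = F.normalize(torch.randn(bs, dim), dim=-1)   # pooled image feature
txt_emb = F.normalize(torch.randn(bs, dim), dim=-1)   # pooled unimodal text feature

# contrastive part: same form as the CLIP/InfoNCE loop above
logits = img_emb @ txt_emb.t() / tau
target = torch.arange(bs)
contrastive = (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2

# captioning part: autoregressive next-token prediction from the multimodal decoder
token_ids = torch.randint(0, vocab, (bs, seq))
decoder_logits = torch.randn(bs, seq, vocab)          # stand-in for the decoder output
caption = F.cross_entropy(decoder_logits[:, :-1].reshape(-1, vocab),
                          token_ids[:, 1:].reshape(-1))

loss = contrastive + caption_weight * caption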
This figure basically lays out how FLAVA compares with earlier work. A few key points:
Stagewise training (image first, then text, then multimodal; doing text first and image second does not work nearly as well)
Can be viewed as the multimodal, scaled-up version of VLMO: only masked-modeling losses are used, images are treated as a foreign language ("Imglish"), and the Transformer is called a Multiway Transformer.
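A highly simplified sketch of the Multiway idea (made-up shapes and module names; the real model also adds a vision-language expert in its top layers): self-attention is shared across modalities, while each modality goes through its own feed-forward expert.

import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Shared self-attention, modality-specific feed-forward experts."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_vision = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, modality):
        h, _ = self.attn(x, x, x)            # attention weights are shared by both modalities
        x = x + h
        ffn = self.ffn_vision if modality == "vision" else self.ffn_text
        return x + ffn(x)

block = MultiwayBlock()
img_tokens = torch.randn(2, 16, 64)          # "Imglish" patch tokens
txt_tokens = torch.randn(2, 10, 64)
out_img = block(img_tokens, "vision")
out_txt = block(txt_tokens, "text")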
Two keywords in the title, Bootstrapping and a Unified framework, and two main contributions:
Essentially, the MED from BLIP is factored out into a Q-Former that talks to a large language model. Note that in the figure below BLIP-2's Q-Former does not actually take text as input, yet ITC and ITM are involved during training; refer to Figure 3 to make sense of this point.
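Roughly what the Q-Former contributes, sketched with made-up shapes (not BLIP-2's actual code): a small set of learned query tokens cross-attends to the frozen image encoder's features, and the resulting query outputs are linearly projected into the frozen LLM's input space.

import torch
import torch.nn as nn

dim, num_queries, llm_dim = 768, 32, 4096            # illustrative sizes
query_tokens = nn.Parameter(torch.randn(1, num_queries, dim))   # learned queries
cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
to_llm = nn.Linear(dim, llm_dim)                     # projection into the LLM's embedding space

image_feats = torch.randn(2, 257, dim)               # frozen ViT patch features
queries = query_tokens.expand(2, -1, -1)
summary, _ = cross_attn(queries, image_feats, image_feats)   # queries "read" the image
soft_prompt = to_llm(summary)                        # 32 visual tokens handed to the frozen LLM
print(soft_prompt.shape)                             # torch.Size([2, 32, 4096])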
A few notable points of comparison between the Q-Former and BLIP-1:
The difference from BLIP-2 is that the Q-Former here also takes the instruction content as input. In the authors' own words: Extending BLIP-2, InstructBLIP proposes an instruction-aware Q-former module, which takes in the instruction text tokens as additional input. The instruction interacts with the query embeddings through self-attention layers of the Q-Former, and encourages the extraction of task-relevant image features. As a result, the LLM receives visual information conducive to instruction following.
Take a look at the Hugging Face Transformers code: in InstructBlipForConditionalGeneration's forward, qformer_input_ids and qformer_attention_mask are the extra forward arguments that InstructBLIP adds on top of BLIP-2, and qformer_input_ids is produced with a different tokenizer from the one used for the LLM later on.
transformers/models/instructblip/modeling_instructblip.py
def forward(
self,
pixel_values: torch.FloatTensor,
qformer_input_ids: torch.FloatTensor,
qformer_attention_mask: Optional[torch.LongTensor] = None,
input_ids: Optional[torch.FloatTensor] = None,
attention_mask: Optional[torch.LongTensor] = None,
decoder_input_ids: Optional[torch.LongTensor] = None,
decoder_attention_mask: Optional[torch.LongTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
labels: Optional[torch.LongTensor] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, InstructBlipForConditionalGenerationModelOutput]:
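To see where those extra arguments come from in practice, here is a short usage sketch assuming the standard Transformers processor API (the checkpoint name is just an example): the processor bundles the LLM tokenizer and a separate Q-Former tokenizer, so the same prompt yields both input_ids and qformer_input_ids.

from PIL import Image
from transformers import InstructBlipProcessor

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")

image = Image.new("RGB", (224, 224))         # dummy image, just to show the output keys
prompt = "What is unusual about this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt")

print(inputs.keys())
# expected (roughly): pixel_values, input_ids, attention_mask,
#                     qformer_input_ids, qformer_attention_mask
print(inputs["input_ids"].shape, inputs["qformer_input_ids"].shape)   # produced by different tokenizers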
Comparing Blip2QFormerModel and InstructBlipQFormerModel directly, the only real addition in InstructBlipQFormerModel is InstructBlipQFormerEmbeddings, and its implementation is fairly simple: the embeddings produced by the extra instruction tokens are concatenated after the query embeddings along the sequence dimension. See the code inside the InstructBlipQFormerEmbeddings class for the details.
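In spirit, that embedding step does something like the following (a simplified restatement, not the actual Hugging Face code): the instruction token embeddings are appended after the learned query embeddings along the sequence dimension, so the Q-Former's self-attention lets the instruction condition the queries.

import torch
import torch.nn as nn

dim, num_queries, vocab = 768, 32, 30522             # illustrative sizes
word_embeddings = nn.Embedding(vocab, dim)           # stand-in for the Q-Former word embeddings

query_tokens = torch.randn(1, num_queries, dim)      # learned query embeddings, [bs, 32, dim]
instruction_ids = torch.randint(0, vocab, (1, 12))   # tokenized instruction (qformer_input_ids)
instruction_embeds = word_embeddings(instruction_ids)           # [bs, 12, dim]

# the InstructBLIP-specific step: queries and instruction form one sequence
embeddings = torch.cat([query_tokens, instruction_embeds], dim=1)   # [bs, 32 + 12, dim]
print(embeddings.shape)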
LLaVA (https://llava-vl.github.io/, from the Wisconsin team) -> Valley (a video version of LLaVA, https://arxiv.org/pdf/2306.07207.pdf, from the ByteDance team). LLaVA's idea is quite simple: it uses only a linear layer to connect the text embeddings and the vision embeddings:
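A minimal sketch of that connection (dimensions and names are placeholders, not LLaVA's actual code): patch features from a frozen vision encoder are mapped by a single linear layer into the LLM's embedding space and prepended to the text token embeddings. (Later LLaVA versions replace the single linear layer with a small MLP.)

import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                  # e.g. CLIP ViT-L features -> LLM hidden size
projector = nn.Linear(vision_dim, llm_dim)        # the linear "glue" between the two spaces

image_features = torch.randn(1, 256, vision_dim)  # patch tokens from the frozen vision encoder
text_embeds = torch.randn(1, 20, llm_dim)         # LLM token embeddings of the prompt

visual_tokens = projector(image_features)                     # [1, 256, llm_dim]
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # sequence fed to the LLM
print(llm_inputs.shape)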
Flamingo (https://arxiv.org/abs/2204.14198, from the DeepMind team) -> Otter (a video version of Flamingo, https://arxiv.org/pdf/2306.05425.pdf, from the Microsoft team)
Compared with the way LLaVA connects text and images, Flamingo instead inserts cross-attention layers between the layers of an existing pretrained and frozen large language model to connect the text embeddings and the vision embeddings. Votta often hallucinates on some Lemon8 examples, and Votta's paper does not report comparisons on authoritative benchmarks such as VQAv2, so its VQA ability is questionable.
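By contrast, a Flamingo-style block looks roughly like this (a simplified sketch, not DeepMind's code; the real model also uses a Perceiver Resampler and per-layer gating): a gated cross-attention layer inserted between frozen LM layers lets text tokens attend to the visual features, with the tanh gate starting at zero so the frozen LM is undisturbed at initialization.

import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Simplified Flamingo-style layer inserted between frozen LM layers."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: a no-op at initialization

    def forward(self, text_hidden, visual_features):
        attended, _ = self.cross_attn(text_hidden, visual_features, visual_features)
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
text_hidden = torch.randn(2, 20, 512)        # hidden states from a frozen LM layer
visual_features = torch.randn(2, 64, 512)    # vision encoder / resampler outputs
out = block(text_hidden, visual_features)    # same shape as text_hidden, now vision-conditioned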
MetaLM (https://arxiv.org/pdf/2206.06336.pdf) -> the Kosmos series (https://arxiv.org/pdf/2306.14824.pdf). This line of work comes from Furu Wei's team at Microsoft. MetaLM aims higher: unifying text, vision, speech, object detection and other tasks within a semi-causal LM framework. Judging from its current VQA results it is not as strong as the earlier BEiT-3, and Kosmos-2 focuses more on fusing grounding with multimodal tasks; hopefully Kosmos-3 brings a breakthrough.