This post is a bit long; please read it patiently, and it will be worth your while. Feedback and discussion are welcome. Also attached: the paper link.
Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which are not enough to correctly answer the question. From humans' perspective, answering a visual question requires understanding the summarizations of visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities and is used to update the visual and word features. Such MLI modules can be stacked for several stages to model complex and latent relations between the two modalities and achieve highly competitive performance on public VQA benchmarks, VQA v2.0 and TDIUC. In addition, we show that the performance of our method can be significantly improved by combining it with the pre-trained language model BERT.
The authors argue that existing VQA methods mostly model the relations between individual visual regions and individual words, which is not enough to answer questions correctly. From a human perspective, answering a visual question requires understanding summaries of the visual and language information. To address this, the paper proposes the Multi-modality Latent Interaction module (MLI). Such MLI modules can be stacked over several stages to model the complex, latent relations between the two modalities (question words and image regions), and experiments on the VQA v2.0 dataset show that the method achieves highly competitive performance.
The figure below shows the overall pipeline of the proposed Multi-modality Latent Interaction Network (MLIN). It consists of a series of stacked multi-modality latent interaction modules that summarize the input visual-region and question-word information into a small number of latent summarization vectors per modality. The key idea is to propagate visual and language information between these latent summarization vectors, so that complex cross-modality interactions are modeled from a global perspective. A minimal code sketch of one such stage follows below.
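To make the idea concrete, here is a minimal PyTorch-style sketch of one MLI stage under simplifying assumptions: each modality is soft-assigned to a small number of latent slots, and the two sets of latent summaries then exchange information through attention. All layer names, dimensions, and the use of standard multi-head attention are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MLIModule(nn.Module):
    """Sketch of one Multi-modality Latent Interaction (MLI) stage (hypothetical layers)."""
    def __init__(self, dim=512, num_latents=8, num_heads=8):
        super().__init__()
        # Soft-assignment weights that pool regions / words into a few latent slots
        self.vis_assign = nn.Linear(dim, num_latents)
        self.txt_assign = nn.Linear(dim, num_latents)
        # Attention over the joint set of latent summaries, then back to regions / words
        self.latent_interact = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_readout = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_readout = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def summarize(feats, assign):
        # feats: (B, N, D); for each of K latent slots, a weight distribution over N items
        weights = assign(feats).softmax(dim=1)            # (B, N, K)
        return torch.einsum('bnk,bnd->bkd', weights, feats)  # (B, K, D)

    def forward(self, vis, txt):
        # vis: (B, Nv, D) region features; txt: (B, Nw, D) question-word features
        v_lat = self.summarize(vis, self.vis_assign)
        t_lat = self.summarize(txt, self.txt_assign)
        # Propagate visual / language information between the latent summaries
        lat = torch.cat([v_lat, t_lat], dim=1)
        lat, _ = self.latent_interact(lat, lat, lat)
        # Regions and words gather information back from the fused summaries
        vis_new, _ = self.vis_readout(vis, lat, lat)
        txt_new, _ = self.txt_readout(txt, lat, lat)
        # Residual update keeps input and output shapes identical
        return vis + vis_new, txt + txt_new
```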
After information is propagated between the latent summarization vectors, the visual-region and word features aggregate information back from the cross-modality summaries to update themselves. Because the input and output of an MLI module have the same dimensions, the network stacks MLI modules over several stages to progressively refine the visual and language features. In the final stage, the averaged visual features and the averaged question features are multiplied element-wise to predict the final answer (see the sketch after this paragraph). The framework is analyzed in detail below.
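Building on the MLIModule sketch above, the stacking and final fusion described here might look roughly like the following. The number of stages and the answer-vocabulary size are illustrative guesses, not values taken from the paper.

```python
class MLINSketch(nn.Module):
    """Stacks MLI stages and fuses averaged features to predict an answer (illustrative)."""
    def __init__(self, dim=512, num_stages=3, num_answers=3000):
        super().__init__()
        self.stages = nn.ModuleList(MLIModule(dim) for _ in range(num_stages))
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, vis, txt):
        # Progressively refine visual and language features through stacked MLI stages
        for stage in self.stages:
            vis, txt = stage(vis, txt)
        # Element-wise product of the averaged visual and question features, then classify
        fused = vis.mean(dim=1) * txt.mean(dim=1)
        return self.classifier(fused)


# Usage example with random features: 36 region vectors and 14 word vectors per sample
model = MLINSketch()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))  # (2, num_answers)
```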