
Paper:《Multimodal Machine Learning: A Survey and Taxonomy,多模态机器学习:综述与分类》翻译与解读

目录

《Multimodal Machine Learning: A Survey and Taxonomy》翻译与解读

Abstract

1 INTRODUCTION

2 Applications: a historical perspective 应用:历史视角

3 Multimodal Representations多模态表示

3.1 Joint Representations 联合表示

3.2 Coordinated Representations协调表示

3.3 Discussion讨论

4 Translation翻译

4.1 Example-based 基于实例

4.2 Generative approaches生成方法

4.3 Model evaluation and discussion模型评价与讨论

5 Alignment对齐

5.1 Explicit alignment显式对齐

5.2 Implicit alignment隐式对齐

5.3 Discussion讨论

6 Fusion融合

6.1 Model-agnostic approaches与模型无关的方法

6.2 Model-based approaches基于模型的方法

6.3 Discussion讨论

7 Co-learning共同学习

7.1 Parallel data并行数据

7.2 Non-parallel data非并行数据

7.3 Hybrid data混合数据

7.4 Discussion讨论

8 Conclusion结论


《Multimodal Machine Learning: A Survey and Taxonomy》翻译与解读

作者:Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency
时间:2017年5月26日
地址:https://arxiv.org/abs/1705.09406

Abstract

Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

我们对世界的体验是多模态的(五大感官)——我们看到物体(视觉)、听到声音(听觉)、感觉到质地(触觉)、闻到气味(嗅觉)、品尝味道(味觉)。模态是指事物发生或被体验的方式,当一个研究问题包含多种模态时,它就被称为多模态问题。为了让人工智能在理解我们周围的世界方面取得进展,它需要能够同时解读这些多模态信号。多模态机器学习旨在建立能够处理和关联来自多种模态的信息的模型。这是一个充满活力的多学科领域,其重要性和潜力都在不断增加。本文不关注具体的多模态应用,而是对多模态机器学习本身的最新进展进行了综述,并将它们纳入一个统一的分类体系。我们超越了典型的早期融合和晚期融合的划分,确定了多模态机器学习面临的更广泛的挑战,即:表示、翻译、对齐、融合和共同学习。这种新的分类方法将使研究人员更好地了解该领域的现状,并确定未来的研究方向。

Index Terms—Multimodal, machine learning, introductory, survey.

索引术语:多模态、机器学习、入门、综述。

1 INTRODUCTION

THE world surrounding us involves multiple modalities— we see objects, hear sounds, feel texture, smell odors, and so on. In general terms, a modality refers to the way in which something happens or is experienced. Most people associate the word modality with the sensory modalities which represent our primary channels of communication and sensation, such as vision or touch. A research problem or dataset is therefore characterized as multimodal when it includes multiple such modalities. In this paper we focus primarily, but not exclusively, on three modalities: natural language which can be both written or spoken; visual signals which are often represented with images or videos; and vocal signals which encode sounds and para-verbal information such as prosody and vocal expressions.

In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret and reason about multimodal messages. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. From early research on audio-visual speech recognition to the recent explosion of interest in language and vision models, multimodal machine learning is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.

我们周围的世界包含多种模态——我们看到物体,听到声音,感觉到质地,闻到气味,等等。一般来说,模态是指某事发生或被体验的方式。大多数人把“modality”一词与感官模态联系在一起,后者代表我们沟通和感知的主要渠道,如视觉或触觉。因此,当一个研究问题或数据集包含多个这样的模态时,它就被称为多模态的。在本文中,我们主要(但不限于)关注三种模态:可以是书面或口头形式的自然语言;通常用图像或视频表示的视觉信号;以及编码声音和副语言(para-verbal)信息(如韵律和声音表达)的语音信号。

为了让人工智能在理解我们周围世界方面取得进展,它需要能够对多模态信息进行解释和推理。多模态机器学习旨在建立能够处理和关联来自多种模态的信息的模型。从早期的视听语音识别研究到最近对语言和视觉模型兴趣的激增,多模态机器学习是一个充满活力的多学科领域,其重要性日益增加,并具有非凡的潜力。

The research field of Multimodal Machine Learning brings some unique challenges for computational researchers given the heterogeneity of the data. Learning from multimodal sources offers the possibility of capturing correspondences between modalities and gaining an in-depth understanding of natural phenomena. In this paper we identify and explore five core technical challenges (and related sub-challenges) surrounding multimodal machine learning. They are central to the multimodal setting and need to be tackled in order to progress the field. Our taxonomy goes beyond the typical early and late fusion split, and consists of the five following challenges:

1)、Representation A first fundamental challenge is learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals.

2)、Translation A second challenge addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image and one perfect translation may not exist.

3)、Alignment A third challenge is to identify the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities.

4)、Fusion A fourth challenge is to join information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities.

5)、Co-learning A fifth challenge is to transfer knowledge between modalities, their representation, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., annotated data).

考虑到数据的异质性,多模态机器学习的研究领域给计算研究人员带来了一些独特的挑战。从多模态来源学习提供了捕获模态之间对应关系、并深入理解自然现象的可能性。在本文中,我们确定并探讨了围绕多模态机器学习的五个核心技术挑战(以及相关的子挑战)。它们是多模态环境的核心,需要加以解决以推动该领域的发展。我们的分类超越了典型的早期融合与晚期融合的划分,包括以下五个挑战:

1)、表示:第一个基本挑战是学习如何以一种利用多模态的互补性和冗余性的方式来表示和总结多模态数据。多模态数据的异质性使得构造这样的表示具有挑战性。例如,语言通常是符号化的,而音频和视觉模态将被表示为信号。

2)、翻译:第二个挑战是如何将数据从一种模态转换(映射)到另一种模态。不仅数据是异质的,而且模态之间的关系往往是开放式的或主观的。例如,存在许多描述图像的正确方法,而且可能不存在一种完美的翻译。

3)、对齐:第三个挑战是识别来自两个或更多不同模态的(子)元素之间的直接关系。例如,我们可能想要将菜谱中的步骤与展示菜肴制作过程的视频对齐。为了应对这一挑战,我们需要衡量不同模态之间的相似性,并处理可能的长程依赖和歧义。

4)、融合:第四个挑战是将来自两个或更多模态的信息结合起来进行预测。例如,在视听语音识别中,将嘴唇运动的视觉描述与语音信号融合在一起来预测口语单词。来自不同模态的信息可能具有不同的预测能力和噪声拓扑,并且至少一种模态中可能存在数据缺失。

5)、共同学习:第五项挑战是如何在模态、模态的表示及其预测模型之间传递知识。协同训练、概念接地(conceptual grounding)和零样本学习等算法就是例证。共同学习探索从一种模态学到的知识如何帮助在另一种模态上训练的计算模型。当其中一种模态的资源有限(例如,标注数据)时,这个挑战尤其重要。

Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it.

APPLICATIONS: REPRESENTATION, TRANSLATION, ALIGNMENT, FUSION, CO-LEARNING

1、Speech recognition and synthesis: Audio-visual speech recognition, (Visual) speech synthesis

2、Event detection: Action classification, Multimedia event detection

3、Emotion and affect: Recognition, Synthesis

4、Media description: Image description, Video description, Visual question-answering, Media summarization

5、Multimedia retrieval: Cross modal retrieval, Cross modal hashing

表1:多模态机器学习支持的应用的总结。对于每个应用领域,我们确定了需要解决的核心技术挑战。

应用:表示、翻译、对齐、融合、共同学习

1、语音识别与合成:视听语音识别、(视觉)语音合成

2、事件检测:动作分类、多媒体事件检测

3、情绪与情感:识别、合成

4、媒体描述:图像描述、视频描述、视觉问答、媒体摘要

5、多媒体检索:跨模态检索、跨模态哈希

For each of these five challenges, we define taxonomic classes and sub-classes to help structure the recent work in this emerging research field of multimodal machine learning. We start with a discussion of the main applications of multimodal machine learning (Section 2) followed by a discussion on the recent developments on all of the five core technical challenges facing multimodal machine learning: representation (Section 3), translation (Section 4), alignment (Section 5), fusion (Section 6), and co-learning (Section 7). We conclude with a discussion in Section 8.

对于这五个挑战中的每一个,我们都定义了分类类别和子类别,以帮助梳理多模态机器学习这一新兴研究领域的最新工作。我们首先讨论多模态机器学习的主要应用(第2节),随后讨论多模态机器学习面临的全部五个核心技术挑战的最新进展:表示(第3节)、翻译(第4节)、对齐(第5节)、融合(第6节)和共同学习(第7节)。最后,我们在第8节中进行总结讨论。

2 Applications: a historical perspective 应用:历史视角

Multimodal machine learning enables a wide range of applications: from audio-visual speech recognition to image captioning. In this section we present a brief history of multimodal applications, from its beginnings in audio-visual speech recognition to a recently renewed interest in language and vision applications.

One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) [243]. It was motivated by the McGurk effect [138] — an interaction between hearing and vision during speech perception. When human subjects heard the syllable /ba-ba/ while watching the lips of a person saying /ga-ga/, they perceived a third sound: /da-da/. These results motivated many researchers from the speech community to extend their approaches with visual information. Given the prominence of hidden Markov models (HMMs) in the speech community at the time [95], it is without surprise that many of the early models for AVSR were based on various HMM extensions [24], [25]. While research into AVSR is not as common these days, it has seen renewed interest from the deep learning community [151].

While the original vision of AVSR was to improve speech recognition performance (e.g., word error rate) in all contexts, the experimental results showed that the main advantage of visual information was when the speech signal was noisy (i.e., low signal-to-noise ratio) [75], [151], [243]. In other words, the captured interactions between modalities were supplementary rather than complementary. The same information was captured in both, improving the robustness of the multimodal models but not improving the speech recognition performance in noiseless scenarios.

多模态机器学习实现了广泛的应用:从视听语音识别到图像字幕。在本节中,我们将简要介绍多模态应用的历史,从它在视听语音识别方面的起步,到最近在语言和视觉应用方面重新燃起的兴趣。

多模态研究最早的例子之一是视听语音识别(AVSR)[243]。它的动机是McGurk效应[138]——言语感知过程中听觉和视觉之间的交互作用。当受试者一边听到音节/ba-ba/、一边观察一个人说/ga-ga/时的嘴唇时,他们会感知到第三个声音:/da-da/。这些结果促使语音研究领域的许多研究人员用视觉信息来扩展他们的方法。考虑到隐马尔可夫模型(HMM)当时在语音领域的主导地位[95],许多早期的AVSR模型都是基于各种HMM扩展[24]、[25],这一点也不令人惊讶。虽然目前对AVSR的研究并不常见,但深度学习社区对它重新燃起了兴趣[151]。

虽然AVSR最初的愿景是在所有场景下提高语音识别性能(例如,降低单词错误率),但实验结果表明,视觉信息的主要优势体现在语音信号有噪声(即低信噪比)时[75]、[151]、[243]。换句话说,所捕捉到的模态之间的交互是补充性的(supplementary)而非互补性的(complementary)。两种模态捕获的是相同的信息,这提高了多模态模型的鲁棒性,但没有提高无噪声场景下的语音识别性能。

A second important category of multimodal applications comes from the field of multimedia content indexing and retrieval [11], [188]. With the advance of personal computers and the internet, the quantity of digitized multimedia content has increased dramatically [2]. While earlier approaches for indexing and searching these multimedia videos were keyword-based [188], new research problems emerged when trying to search the visual and multimodal content directly. This led to new research topics in multimedia content analysis such as automatic shot-boundary detection [123] and video summarization [53]. These research projects were supported by the TrecVid initiative from the National Institute of Standards and Technology which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011 [1].

第二个重要的多模态应用类别来自多媒体内容索引和检索领域[11]、[188]。随着个人电脑和互联网的发展,数字化多媒体内容的数量急剧增加[2]。虽然早期对这些多媒体视频进行索引和搜索的方法是基于关键词的[188],但当试图直接搜索视觉和多模态内容时,出现了新的研究问题。这催生了多媒体内容分析的新研究课题,如自动镜头边界检测[123]和视频摘要[53]。这些研究项目得到了美国国家标准与技术研究院(NIST)TrecVid计划的支持,该计划引入了许多高质量的数据集,包括2011年开始的多媒体事件检测(MED)任务[1]。

A third category of applications was established in the early 2000s around the emerging field of multimodal interaction with the goal of understanding human multimodal behaviors during social interactions. One of the first landmark datasets collected in this field is the AMI Meeting Corpus which contains more than 100 hours of video recordings of meetings, all fully transcribed and annotated [33]. Another important dataset is the SEMAINE corpus which allowed to study interpersonal dynamics between speakers and listeners [139]. This dataset formed the basis of the first audio-visual emotion challenge (AVEC) organized in 2011 [179]. The fields of emotion recognition and affective computing bloomed in the early 2010s thanks to strong technical advances in automatic face detection, facial landmark detection, and facial expression recognition [46]. The AVEC challenge continued annually afterward with the later instantiation including healthcare applications such as automatic assessment of depression and anxiety [208]. A great summary of recent progress in multimodal affect recognition was published by D’Mello et al. [50]. Their meta-analysis revealed that a majority of recent work on multimodal affect recognition show improvement when using more than one modality, but this improvement is reduced when recognizing naturally-occurring emotions.

第三类应用建立于21世纪初,围绕着新兴的多模态交互领域,目的是理解社交互动中人类的多模态行为。该领域最早收集的里程碑式数据集之一是AMI会议语料库(AMI Meeting Corpus),它包含了100多个小时的会议视频记录,全部经过完整转录和标注[33]。另一个重要的数据集是SEMAINE语料库,它可以用于研究说话者和倾听者之间的人际互动动态[139]。该数据集构成了2011年组织的第一届视听情绪挑战赛(AVEC)的基础[179]。得益于自动人脸检测、面部关键点检测和面部表情识别[46]技术的强大进步,情绪识别和情感计算领域在2010年代早期蓬勃发展。此后,AVEC挑战赛每年都在继续,后来的赛事还包括抑郁和焦虑的自动评估等医疗保健应用[208]。D’Mello等人[50]对多模态情感识别的最新进展进行了很好的总结。他们的荟萃分析显示,最近关于多模态情感识别的大部分工作在使用多个模态时表现出改善,但当识别自然发生的情绪时,这种改善会减弱。

Most recently, a new category of multimodal applications emerged with an emphasis on language and vision: media description. One of the most representative applications is image captioning where the task is to generate a text description of the input image [83]. This is motivated by the ability of such systems to help the visually impaired in their daily tasks [20]. The main challenge facing media description is evaluation: how to evaluate the quality of the predicted descriptions. The task of visual question-answering (VQA) was recently proposed to address some of the evaluation challenges [9], where the goal is to answer a specific question about the image.

In order to bring some of the mentioned applications to the real world we need to address a number of technical challenges facing multimodal machine learning. We summarize the relevant technical challenges for the above mentioned application areas in Table 1. One of the most important challenges is multimodal representation, the focus of our next section.

最近,出现了一类强调语言和视觉的新多模态应用:媒体描述。最具代表性的应用之一是图像描述(image captioning),其任务是为输入图像生成文本描述[83]。其动机在于此类系统能够帮助视障人士完成日常任务[20]。媒体描述面临的主要挑战是评估:如何评估预测描述的质量。最近提出的视觉问答任务(VQA)就是为了解决其中一些评估难题[9],其目标是回答关于图像的特定问题。

为了将上述一些应用带入现实世界,我们需要解决多模态机器学习所面临的一系列技术挑战。我们在表1中总结了上述应用领域的相关技术挑战。最重要的挑战之一是多模态表示,这是我们下一节的重点。

3 Multimodal Representations多模态表示

Representing raw data in a format that a computational model can work with has always been a big challenge in machine learning. Following the work of Bengio et al. [18] we use the term feature and representation interchangeably, with each referring to a vector or tensor representation of an entity, be it an image, audio sample, individual word, or a sentence. A multimodal representation is a representation of data using information from multiple such entities. Representing multiple modalities poses many difficulties: how to combine the data from heterogeneous sources; how to deal with different levels of noise; and how to deal with missing data. The ability to represent data in a meaningful way is crucial to multimodal problems, and forms the backbone of any model.

Good representations are important for the performance of machine learning models, as evidenced behind the recent leaps in performance of speech recognition [79] and visual object classification [109] systems. Bengio et al. [18] identify a number of properties for good representations: smoothness, temporal and spatial coherence, sparsity, and natural clustering amongst others. Srivastava and Salakhutdinov [198] identify additional desirable properties for multimodal representations: similarity in the representation space should reflect the similarity of the corresponding concepts, the representation should be easy to obtain even in the absence of some modalities, and finally, it should be possible to fill-in missing modalities given the observed ones.

以一种计算模型可以使用的格式表示原始数据一直是机器学习的一大挑战。在Bengio等人[18]的工作之后,我们交替使用术语“特征”和“表示”,每一个都指一个实体的向量或张量表示,无论是图像、音频样本、单个单词还是一个句子。多模态表示是使用来自多个此类实体的信息的数据表示。表示多种模态带来了许多困难:如何组合来自不同来源的数据;如何处理不同程度的噪音;以及如何处理丢失的数据。以有意义的方式表示数据的能力对多模态问题至关重要,并构成任何模型的支柱。

良好的表示对机器学习模型的性能非常重要,语音识别[79]和视觉对象分类[109]系统近来的性能飞跃就证明了这一点。Bengio等人[18]指出了良好表示应具备的若干属性:平滑性、时间和空间一致性、稀疏性和自然聚类等。Srivastava和Salakhutdinov[198]进一步指出了多模态表示的其他理想属性:表示空间中的相似性应反映相应概念的相似性;即使缺失某些模态,表示也应易于获得;最后,应能够根据已观测到的模态补全缺失的模态。

The development of unimodal representations has been extensively studied [5], [18], [122]. In the past decade there has been a shift from hand-designed for specific applications to data-driven. For example, one of the most famous image descriptors in the early 2000s, the scale invariant feature transform (SIFT) was hand designed [127], but currently most visual descriptions are learned from data using neural architectures such as convolutional neural networks (CNN) [109]. Similarly, in the audio domain, acoustic features such as Mel-frequency cepstral coefficients (MFCC) have been superseded by data-driven deep neural networks in speech recognition [79] and recurrent neural networks for para-linguistic analysis [207]. In natural language processing, the textual features initially relied on counting word occurrences in documents, but have been replaced by data-driven word embeddings that exploit the word context [141]. While there has been a huge amount of work on unimodal representation, up until recently most multimodal representations involved simple concatenation of unimodal ones [50], but this has been rapidly changing.

单模态表示的发展已得到广泛研究[5]、[18]、[122]。在过去的十年里,出现了从为特定应用手工设计到数据驱动的转变。例如,21世纪初最著名的图像描述符之一——尺度不变特征变换(SIFT)是手工设计的[127],但目前大多数视觉描述都是使用卷积神经网络(CNN)等神经体系结构从数据中学习得到的[109]。同样,在音频领域,梅尔频率倒谱系数(MFCC)等声学特征已被语音识别中的数据驱动深度神经网络[79]和副语言分析中的循环神经网络[207]所取代。在自然语言处理中,文本特征最初依赖于统计文档中的单词出现次数,但现在已被利用单词上下文的数据驱动词嵌入所取代[141]。尽管在单模态表示方面已经有大量工作,但直到最近,大多数多模态表示仍只是单模态表示的简单串联[50],不过这种情况正在迅速改变。

To help understand the breadth of work, we propose two categories of multimodal representation: joint and coordinated. Joint representations combine the unimodal signals into the same representation space, while coordinated representations process unimodal signals separately, but enforce certain similarity constraints on them to bring them to what we term a coordinated space. An illustration of different multimodal representation types can be seen in Figure 1.

Mathematically, the joint representation is expressed as:

$$x_m = f(x_1, \ldots, x_n), \tag{1}$$

where the multimodal representation x_m is computed using function f (e.g., a deep neural network, restricted Boltzmann machine, or a recurrent neural network) that relies on unimodal representations x_1, ..., x_n. While coordinated representation is as follows:

$$f(x_1) \sim g(x_2), \tag{2}$$

where each modality has a corresponding projection function (f and g above) that maps it into a coordinated multimodal space. While the projection into the multimodal space is independent for each modality, the resulting space is coordinated between them (indicated as ∼). Examples of such coordination include minimizing cosine distance [61], maximizing correlation [7], and enforcing a partial order [212] between the resulting spaces.

为了帮助理解工作的广度,我们提出了两种类型的多模态表示:联合的和协调的。联合表示将单模态信号组合到相同的表示空间中,而协调表示则分别处理单模态信号,但对它们施加某种相似性约束,使它们进入我们所说的协调空间。图1展示了不同的多模态表示类型。

在数学上,联合表示可表示为:

$$x_m = f(x_1, \ldots, x_n), \tag{1}$$

其中,多模态表示 x_m 是由依赖于各单模态表示 x_1, …, x_n 的函数 f(例如,深度神经网络、受限玻尔兹曼机或循环神经网络)计算得到的。而协调表示则如下:

$$f(x_1) \sim g(x_2), \tag{2}$$

其中,每个模态都有一个相应的投影函数(上式中的 f 和 g),将其映射到一个协调的多模态空间中。虽然每个模态向多模态空间的投影是相互独立的,但得到的空间在它们之间是协调的(记为 ∼)。这种协调的例子包括最小化余弦距离[61]、最大化相关性[7],以及在结果空间之间强制执行偏序[212]。
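
下面给出一个极简的示意性草图(非论文内容,只是基于公式(1)和(2)的假设性示例,使用NumPy,特征维度等均为随意假设),用于直观对比联合表示 x_m = f(x_1, …, x_n) 与协调表示 f(x_1) ∼ g(x_2) 的计算形式。

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的单模态特征:x1 为图像特征(2048 维),x2 为文本特征(300 维)
x1 = rng.normal(size=2048)
x2 = rng.normal(size=300)

# 联合表示(公式 1):函数 f 把所有模态映射进同一空间,这里用"拼接 + 线性层 + tanh"示意
W_joint = rng.normal(scale=0.01, size=(512, 2048 + 300))
x_m = np.tanh(W_joint @ np.concatenate([x1, x2]))            # x_m ∈ R^512

# 协调表示(公式 2):每个模态各自的投影函数 f、g,再用相似性(如余弦距离)来"协调"
W_f = rng.normal(scale=0.01, size=(512, 2048))
W_g = rng.normal(scale=0.01, size=(512, 300))
z1, z2 = np.tanh(W_f @ x1), np.tanh(W_g @ x2)
cosine_similarity = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))

print(x_m.shape, z1.shape, z2.shape, cosine_similarity)
# 训练时,联合表示通常直接接任务损失;协调表示则通过最小化余弦距离(或最大化相关性)这类约束来学习
```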

3.1 Joint Representations 联合表示

We start our discussion with joint representations that project unimodal representations together into a multimodal space (Equation 1). Joint representations are mostly (but not exclusively) used in tasks where multimodal data is present both during training and inference steps. The simplest example of a joint representation is a concatenation of individual modality features (also referred to as early fusion [50]). In this section we discuss more advanced methods for creating joint representations starting with neural networks, followed by graphical models and recurrent neural networks (representative works can be seen in Table 2). Neural networks have become a very popular method for unimodal data representation [18]. They are used to represent visual, acoustic, and textual data, and are increasingly used in the multimodal domain [151], [156], [217]. In this section we describe how neural networks can be used to construct a joint multimodal representation, how to train them, and what advantages they offer.

我们从联合表示开始讨论,联合表示将各单模态表示共同投射到多模态空间中(公式1)。联合表示大多(但不限于)用于训练和推理阶段都存在多模态数据的任务。联合表示最简单的例子是各单模态特征的串联(也称为早期融合[50])。在本节中,我们将讨论创建联合表示的更高级方法,首先是神经网络,然后是概率图模型和循环神经网络(代表性工作见表2)。神经网络已经成为单模态数据表示[18]的一种非常流行的方法。它们被用来表示视觉、听觉和文本数据,并在多模态领域中得到越来越多的应用[151]、[156]、[217]。在本节中,我们将描述如何使用神经网络来构建联合多模态表示,如何训练它们,以及它们具有哪些优势。

In general, neural networks are made up of successive building blocks of inner products followed by non-linear activation functions. In order to use a neural network as a way to represent data, it is first trained to perform a specific task (e.g., recognizing objects in images). Due to the multilayer nature of deep neural networks each successive layer is hypothesized to represent the data in a more abstract way [18], hence it is common to use the final or penultimate neural layers as a form of data representation. To construct a multimodal representation using neural networks each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space [9], [145], [156], [227]. The joint multimodal representation is then passed through multiple hidden layers itself or used directly for prediction. Such models can be trained end-to-end — learning both to represent the data and to perform a particular task. This results in a close relationship between multimodal representation learning and multimodal fusion when using neural networks.

一般来说,神经网络由连续的内积构建块加非线性激活函数组成。要将神经网络用作表示数据的方法,首先要训练它执行某个特定任务(例如,识别图像中的物体)。由于深度神经网络的多层特性,假设每一个后续层都以更抽象的方式表示数据[18],因此通常使用最后一层或倒数第二层神经层作为数据表示的一种形式。要使用神经网络构建多模态表示,每个模态先经过若干各自独立的神经层,随后是一个将各模态投射到联合空间的隐藏层[9]、[145]、[156]、[227]。然后,联合多模态表示本身再经过多个隐藏层,或直接用于预测。这样的模型可以端到端训练——同时学习表示数据和执行特定任务。因此,在使用神经网络时,多模态表示学习与多模态融合之间存在密切关系。
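
作为理解辅助,下面给出一个假设性的 PyTorch 草图(并非任何论文的原始实现,维度、类别数等均为随意假设):每个模态先经过各自的神经层,再由一个隐藏层投影到联合空间,最后接预测层,可端到端训练,对应上文描述的结构。

```python
import torch
import torch.nn as nn

class JointMultimodalNet(nn.Module):
    """示意性的联合表示网络:各模态独立编码 -> 投影到联合空间 -> 任务预测。"""
    def __init__(self, dim_visual=2048, dim_text=300, dim_joint=512, num_classes=10):
        super().__init__()
        self.visual_net = nn.Sequential(nn.Linear(dim_visual, 1024), nn.ReLU())
        self.text_net = nn.Sequential(nn.Linear(dim_text, 1024), nn.ReLU())
        # 将两个模态的中间表示拼接后投影到联合多模态空间
        self.joint = nn.Sequential(nn.Linear(1024 + 1024, dim_joint), nn.ReLU())
        self.classifier = nn.Linear(dim_joint, num_classes)

    def forward(self, visual, text):
        h = torch.cat([self.visual_net(visual), self.text_net(text)], dim=-1)
        joint_repr = self.joint(h)          # 联合多模态表示
        return self.classifier(joint_repr), joint_repr

# 端到端训练:表示学习与任务(此处假设为分类)同时进行
model = JointMultimodalNet()
logits, joint_repr = model(torch.randn(8, 2048), torch.randn(8, 300))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()
```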

 Figure 1: Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input. Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g. Euclidean distance) or structure constraint (e.g. partial order).

图1:联合表示和协调表示的结构。联合表示以所有模态作为输入,投射到同一空间中。而协调表示则各自存在于自己的空间中,但通过相似性约束(如欧几里得距离)或结构约束(如偏序)进行协调。

As neural networks require a lot of labeled training data, it is common to pre-train such representations using an autoencoder on unsupervised data [80]. The model proposed by Ngiam et al. [151] extended the idea of using autoencoders to the multimodal domain. They used stacked denoising autoencoders to represent each modality individually and then fused them into a multimodal representation using another autoencoder layer. Similarly, Silberer and Lapata [184] proposed to use a multimodal autoencoder for the task of semantic concept grounding (see Section 7.2). In addition to using a reconstruction loss to train the representation they introduce a term into the loss function that uses the representation to predict object labels. It is also common to fine-tune the resulting representation on a particular task at hand as the representation constructed using an autoencoder is generic and not necessarily optimal for a specific task [217].

The major advantage of neural network based joint representations comes from their often superior performance and the ability to pre-train the representations in an unsupervised manner. The performance gain is, however, dependent on the amount of data available for training. One of the disadvantages comes from the model not being able to handle missing data naturally — although there are ways to alleviate this issue [151], [217]. Finally, deep networks are often difficult to train [69], but the field is making progress in better training techniques [196].

Probabilistic graphical models are another popular way to construct representations through the use of latent random variables [18]. In this section we describe how probabilistic graphical models are used to represent unimodal and multimodal data.

由于神经网络需要大量标注的训练数据,通常会先用自编码器在无监督数据上对这类表示进行预训练[80]。Ngiam等人[151]提出的模型将使用自编码器的思想扩展到了多模态领域。他们使用堆叠去噪自编码器分别表示每个模态,然后用另一个自编码器层将它们融合成多模态表示。类似地,Silberer和Lapata[184]提出使用多模态自编码器来完成语义概念接地任务(见第7.2节)。除了使用重构损失来训练表示之外,他们还在损失函数中引入了一项,利用该表示来预测对象标签。由于使用自编码器构造的表示是通用的,对特定任务未必最优,因此针对当前具体任务对所得表示进行微调也很常见[217]。

基于神经网络的联合表示的主要优势来自于它们通常卓越的性能,以及以无监督的方式对表示进行预训练的能力。然而,性能增益取决于可供训练的数据量。缺点之一是模型不能自然地处理缺失的数据——尽管有一些方法可以缓解这个问题[151],[217]。最后,深度网络通常很难训练[69],但该领域在更好的训练技术方面正在取得进展[196]。
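
下面是一个高度简化的假设性草图(并非Ngiam等人[151]的原始模型,维度等均为随意假设),演示"各模态先各自编码、再经一层共享编码融合为多模态表示,并以重构损失做无监督预训练"的思路,使用 PyTorch。

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """简化的双模态自编码器:各模态编码 -> 共享多模态层 -> 解码重构两个模态。"""
    def __init__(self, d_audio=100, d_video=300, d_hidden=128, d_shared=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_video, d_hidden), nn.ReLU())
        self.shared = nn.Sequential(nn.Linear(2 * d_hidden, d_shared), nn.ReLU())  # 多模态表示
        self.dec_a = nn.Linear(d_shared, d_audio)
        self.dec_v = nn.Linear(d_shared, d_video)

    def forward(self, a, v):
        z = self.shared(torch.cat([self.enc_a(a), self.enc_v(v)], dim=-1))
        return self.dec_a(z), self.dec_v(z), z

model = BimodalAutoencoder()
a, v = torch.randn(16, 100), torch.randn(16, 300)
# 去噪式训练:对输入加噪(或屏蔽某一模态)后要求重构原始输入,有助于应对缺失模态
a_noisy = a + 0.1 * torch.randn_like(a)
rec_a, rec_v, z = model(a_noisy, v)
loss = nn.MSELoss()(rec_a, a) + nn.MSELoss()(rec_v, v)
loss.backward()
```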

概率图模型是另一种流行的表示构造方法,它通过使用潜在随机变量来构造表示[18]。在本节中,我们将描述如何使用概率图模型来表示单模态和多模态数据。

The most popular approaches for graphical-model based representation are deep Boltzmann machines (DBM) [176], that stack restricted Boltzmann machines (RBM) [81] as building blocks. Similar to neural networks, each successive layer of a DBM is expected to represent the data at a higher level of abstraction. The appeal of DBMs comes from the fact that they do not need supervised data for training [176]. As they are graphical models the representation of data is probabilistic, however it is possible to convert them to a deterministic neural network — but this loses the generative aspect of the model [176].

Work by Srivastava and Salakhutdinov [197] introduced multimodal deep belief networks as a multimodal representation. Kim et al. [104] used a deep belief network for each modality and then combined them into joint representation for audiovisual emotion recognition. Huang and Kingsbury [86] used a similar model for AVSR, and Wu et al. [225] for audio and skeleton joint based gesture recognition.

Multimodal deep belief networks have been extended to multimodal DBMs by Srivastava and Salakhutdinov [198]. Multimodal DBMs are capable of learning joint representations from multiple modalities by merging two or more undirected graphs using a binary layer of hidden units on top of them. They allow for the low level representations of each modality to influence each other after the joint training due to the undirected nature of the model.

Ouyang et al. [156] explore the use of multimodal DBMs for the task of human pose estimation from multi-view data. They demonstrate that integrating the data at a later stage —after unimodal data underwent nonlinear transformations— was beneficial for the model. Similarly, Suk et al. [199] use multimodal DBM representation to perform Alzheimer’s disease classification from positron emission tomography and magnetic resonance imaging data.

最流行的基于图模型的表示方法是深度玻尔兹曼机(DBM)[176],它将受限玻尔兹曼机(RBM)[81]堆叠为构建块。与神经网络类似,DBM的每一个后续层都被期望在更高的抽象级别上表示数据。DBM的吸引力在于它们不需要有监督数据进行训练[176]。由于它们是图模型,数据的表示是概率性的,不过可以将它们转换为确定性神经网络——但这会失去模型的生成特性[176]。

Srivastava和Salakhutdinov[197]的研究引入了多模态深度信念网络作为多模态表征。Kim等人[104]对每个模态使用深度信念网络,然后将它们组合成联合表征,用于视听情感识别。Huang和Kingsbury[86]在AVSR中使用了类似的模型,Wu等[225]在基于音频和骨骼关节的手势识别中使用了类似的模型。

Srivastava和Salakhutdinov[198]将多模态深度信念网络扩展为多模态DBM。多模态DBM通过在两个或多个无向图之上使用一层二值隐藏单元将它们合并,从而能够从多个模态中学习联合表示。由于模型的无向性,联合训练后每个模态的低层表示可以相互影响。

欧阳等人[156]探讨了使用多模态DBMs完成从多视图数据中估计人体姿态的任务。他们证明,在单模态数据经过非线性转换后的后期阶段对数据进行集成对模型是有益的。类似地,Suk等人[199]利用多模态DBM表示法,从正电子发射断层扫描和磁共振成像数据中进行阿尔茨海默病分类。

One of the big advantages of using multimodal DBMs for learning multimodal representations is their generative nature, which allows for an easy way to deal with missing data — even if a whole modality is missing, the model has a natural way to cope. It can also be used to generate samples of one modality in the presence of the other one, or both modalities from the representation. Similar to autoencoders the representation can be trained in an unsupervised manner enabling the use of unlabeled data. The major disadvantage of DBMs is the difficulty of training them — high computational cost, and the need to use approximate variational training methods [198].

Sequential Representation. So far we have discussed models that can represent fixed length data, however, we often need to represent varying length sequences such as sentences, videos, or audio streams. In this section we describe models that can be used to represent such sequences.

使用多模态DBMs学习多模态表示的一大优点是它们的生成特性,这允许使用一种简单的方法来处理缺失的数据——即使整个模态都缺失了,模型也有一种自然的方法来处理。它还可以用于在存在另一种模态的情况下产生一种模态的样本,或者从表示中产生两种模态的样本。与自动编码器类似,表示可以以无监督的方式进行训练,以便使用未标记的数据。DBMs的主要缺点是很难训练它们——计算成本高,而且需要使用近似变分训练方法[198]。

顺序表示。到目前为止,我们已经讨论了可以表示固定长度数据的模型,但是,我们经常需要表示不同长度的序列,例如句子、视频或音频流。在本节中,我们将描述可以用来表示这种序列的模型。

Table 2: A summary of multimodal representation techniques. We identify three subtypes of joint representations (Section 3.1) and two subtypes of coordinated ones (Section 3.2). For modalities + indicates the modalities combined.

表2:多模态表示技术的概述。我们确定了联合表示的三种子类型(章节3.1)和协调表示的两种子类型(章节3.2)。对于模态,+表示组合的模态。

Recurrent neural networks (RNNs), and their variants such as long-short term memory (LSTMs) networks [82], have recently gained popularity due to their success in sequence modeling across various tasks [12], [213]. So far RNNs have mostly been used to represent unimodal sequences of words, audio, or images, with most success in the language domain. Similar to traditional neural networks, the hidden state of an RNN can be seen as a representation of the data, i.e., the hidden state of RNN at timestep t can be seen as the summarization of the sequence up to that timestep. This is especially apparent in RNN encoder-decoder frameworks where the task of an encoder is to represent a sequence in the hidden state of an RNN in such a way that a decoder could reconstruct it [12].

The use of RNN representations has not been limited to the unimodal domain. An early use of constructing a multimodal representation using RNNs comes from work by Cosi et al. [43] on AVSR. They have also been used for representing audio-visual data for affect recognition [37],[152] and to represent multi-view data such as different visual cues for human behavior analysis [166].

循环神经网络(RNN)及其变体,如长短期记忆(LSTM)网络[82],由于在各类任务的序列建模中取得成功[12]、[213],近年来越来越受欢迎。到目前为止,RNN主要用于表示单模态的单词、音频或图像序列,在语言领域最为成功。与传统神经网络类似,RNN的隐藏状态可以看作数据的一种表示,即RNN在时间步t的隐藏状态可以看作截至该时间步的序列摘要。这在RNN编码器-解码器框架中尤为明显,其中编码器的任务是把序列表示在RNN的隐藏状态中,使解码器能够据此重构该序列[12]。

RNN表示的使用并不局限于单模态领域。利用RNN构造多模态表示的早期工作来自Cosi等人[43]在AVSR上的研究。RNN还被用于表示视听数据以进行情感识别[37]、[152],以及表示多视图数据,例如用于人类行为分析的不同视觉线索[166]。
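
为说明"RNN 的隐藏状态可视为截至当前时间步的序列摘要",下面给出一个假设性的 PyTorch 小例子(非论文内容,输入维度等均为随意假设):用 LSTM 编码一段特征序列,取最后时刻的隐藏状态作为该序列的固定长度表示。

```python
import torch
import torch.nn as nn

# 假设输入是某一模态的特征序列,例如每帧 40 维的声学特征,长度为 T
lstm = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)

seq = torch.randn(4, 50, 40)          # batch=4,T=50 的序列
outputs, (h_n, c_n) = lstm(seq)

# outputs[:, t, :] 是时间步 t 的隐藏状态,可看作前 t 步序列的摘要;
# h_n[-1] 则常被用作整段序列(一个模态)的固定长度表示
sequence_repr = h_n[-1]               # 形状 (4, 256)
print(sequence_repr.shape)
```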

3.2 Coordinated Representations协调表示

An alternative to a joint multimodal representation is a coordinated representation. Instead of projecting the modalities together into a joint space, we learn separate representations for each modality but coordinate them through a constraint. We start our discussion with coordinated representations that enforce similarity between representations, moving on to coordinated representations that enforce more structure on the resulting space (representative works of different coordinated representations can be seen in Table 2).

Similarity models minimize the distance between modalities in the coordinated space. For example such models encourage the representation of the word dog and an image of a dog to have a smaller distance between them than distance between the word dog and an image of a car [61]. One of the earliest examples of such a representation comes from the work by Weston et al. [221], [222] on the WSABIE (web scale annotation by image embedding) model, where a coordinated space was constructed for images and their annotations. WSABIE constructs a simple linear map from image and textual features such that corresponding annotation and image representation would have a higher inner product (smaller cosine distance) between them than non-corresponding ones.

联合多模态表示的另一种选择是协调表示。我们不是将各模态一起投影到联合空间中,而是为每个模态学习单独的表示,并通过一个约束来协调它们。我们从强制表示之间相似性的协调表示开始讨论,然后转向在结果空间上施加更多结构的协调表示(不同协调表示的代表性工作见表2)。

相似性模型最小化协调空间中各模态之间的距离。例如,这类模型会鼓励“狗”这个词的表示与狗的图像的表示之间的距离,小于“狗”这个词与汽车图像之间的距离[61]。这类表示最早的例子之一来自Weston等人[221]、[222]在WSABIE(web scale annotation by image embedding,基于图像嵌入的网络规模标注)模型上的工作,其中为图像及其标注构建了一个协调空间。WSABIE从图像和文本特征构造一个简单的线性映射,使得相互对应的标注与图像表示之间的内积更高(余弦距离更小),而不对应的则更低。
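
下面给出一个假设性的最小草图(并非 WSABIE 或后文 DeViSE 的原始实现,嵌入维度、margin 值均为随意假设),展示用内积相似度加成对排序(ranking)损失来学习图文协调空间的基本形式:要求匹配的图文对得分高于不匹配的对。

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb、txt_emb 形状为 (N, d),第 i 行互为匹配的图文对,其余行作为负样本。"""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()            # 两两内积(余弦相似度)矩阵
    pos = scores.diag().view(-1, 1)           # 匹配对得分
    eye = torch.eye(scores.size(0), dtype=torch.bool)
    # 要求匹配对得分比任一负样本高出 margin(图->文 与 文->图 两个方向)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(eye, 0)
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(eye, 0)
    return cost_txt.mean() + cost_img.mean()

# 示例:一个 batch 内 32 对(假设已由各自编码器产生的)图像/文本嵌入
loss = pairwise_ranking_loss(torch.randn(32, 512), torch.randn(32, 512))
print(loss)
```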

More recently, neural networks have become a popular way to construct coordinated representations, due to their ability to learn representations. Their advantage lies in the fact that they can jointly learn coordinated representations in an end-to-end manner. An example of such coordinated representation is DeViSE — a deep visual-semantic embedding [61]. DeViSE uses a similar inner product and ranking loss function to WSABIE but uses more complex image and word embeddings. Kiros et al. [105] extended this to sentence and image coordinated representation by using an LSTM model and a pairwise ranking loss to coordinate the feature space. Socher et al. [191] tackle the same task, but extend the language model to a dependency tree RNN to incorporate compositional semantics. A similar model was also proposed by Pan et al. [159], but using videos instead of images. Xu et al. [231] also constructed a coordinated space between videos and sentences using a subject, verb, object compositional language model and a deep video model. This representation was then used for the task of cross-modal retrieval and video description.

While the above models enforced similarity between representations, structured coordinated space models go beyond that and enforce additional constraints between the modality representations. The type of structure enforced is often based on the application, with different constraints for hashing, cross-modal retrieval, and image captioning.

最近,由于神经网络具有学习表示的能力,它已经成为构建协调表示的一种流行方式。其优势在于能够以端到端方式联合学习协调表示。这种协调表示的一个例子是DeViSE——一种深度视觉语义嵌入[61]。DeViSE使用与WSABIE类似的内积和排序损失函数,但使用更复杂的图像和词嵌入。Kiros等人[105]通过使用LSTM模型和成对排序损失来协调特征空间,将其扩展到句子与图像的协调表示。Socher等人[191]处理了相同的任务,但将语言模型扩展为依存树RNN以纳入组合语义。Pan等人[159]也提出了类似的模型,但使用的是视频而不是图像。Xu等人[231]同样使用〈主语、动词、宾语〉组合语言模型和深度视频模型构建了视频和句子之间的协调空间。该表示随后被用于跨模态检索和视频描述任务。

虽然上述模型强制表示之间的相似性,但结构化协调空间模型更进一步,在模态表示之间施加额外的约束。所施加的结构类型通常取决于应用,哈希、跨模态检索和图像描述各有不同的约束。

Structured coordinated spaces are commonly used in cross-modal hashing — compression of high dimensional data into compact binary codes with similar binary codes for similar objects [218]. The idea of cross-modal hashing is to create such codes for cross-modal retrieval [27], [93], [113]. Hashing enforces certain constraints on the resulting multimodal space: 1) it has to be an N-dimensional Hamming space — a binary representation with controllable number of bits; 2) the same object from different modalities has to have a similar hash code; 3) the space has to be similarity-preserving. Learning how to represent the data as a hash function attempts to enforce all of these three requirements [27], [113]. For example, Jiang and Li [92] introduced a method to learn such common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques. While Cao et al. [32] extended the approach with a more complex LSTM sentence representation and introduced an outlier insensitive bit-wise margin loss and a relevance feedback based semantic similarity constraint. Similarly, Wang et al. [219] constructed a coordinated space in which images (and sentences) with similar meanings are closer to each other.

结构化协调空间通常用于跨模态哈希——将高维数据压缩为紧凑的二进制码,并使相似对象具有相似的二进制码[218]。跨模态哈希的思想是为跨模态检索创建这样的编码[27]、[93]、[113]。哈希对得到的多模态空间施加了一定的约束:1)它必须是一个N维的汉明空间——具有可控位数的二进制表示;2)来自不同模态的同一对象必须具有相似的哈希码;3)该空间必须保持相似性。学习如何将数据表示为哈希函数,就是试图满足这三项要求[27]、[113]。例如,Jiang和Li[92]介绍了一种方法,利用端到端可训练的深度学习技术学习句子描述与相应图像之间的公共二值空间。Cao等人[32]则用更复杂的LSTM句子表示扩展了该方法,并引入了对离群值不敏感的逐位间隔损失(bit-wise margin loss)和基于相关性反馈的语义相似度约束。类似地,Wang等人[219]构建了一个协调空间,使含义相似的图像(和句子)彼此更接近。
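
下面给出一个非常简化的假设性示例(并非上述任一论文的实现,投影矩阵此处是随机的,实际需通过相似性保持的目标来学习),说明跨模态哈希三个约束的体现方式:投影后取符号得到 N 位汉明空间中的二进制码、同一对象的不同模态码应相近、相似性用汉明距离度量。

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS = 32

# 假设已学到两组投影矩阵(真实系统中由端到端训练得到)
W_img = rng.normal(size=(N_BITS, 2048))
W_txt = rng.normal(size=(N_BITS, 300))

def hash_code(W, x):
    """投影后取符号,得到 N 位(±1)哈希码,即 N 维汉明空间中的一个点。"""
    return np.sign(W @ x)

def hamming_distance(a, b):
    return int(np.sum(a != b))

img_code = hash_code(W_img, rng.normal(size=2048))
txt_code = hash_code(W_txt, rng.normal(size=300))
# 训练目标(此处未实现):拉近成对图文的汉明距离、拉开不相关对,从而做到相似性保持
print(hamming_distance(img_code, txt_code))
```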

Another example of a structured coordinated representation comes from order-embeddings of images and language [212], [249]. The model proposed by Vendrov et al. [212] enforces a dissimilarity metric that is asymmetric and implements the notion of partial order in the multimodal space. The idea is to capture a partial order of the language and image representations — enforcing a hierarchy on the space; for example image of “a woman walking her dog“ → text “woman walking her dog” → text “woman walking”. A similar model using denotation graphs was also proposed by Young et al. [238] where denotation graphs are used to induce a partial ordering. Lastly, Zhang et al. present how exploiting structured representations of text and images can create concept taxonomies in an unsupervised manner [249].

A special case of a structured coordinated space is one based on canonical correlation analysis (CCA) [84]. CCA computes a linear projection which maximizes the correlation between two random variables (in our case modalities) and enforces orthogonality of the new space. CCA models have been used extensively for cross-modal retrieval [76], [106], [169] and audiovisual signal analysis [177], [187]. Extensions to CCA attempt to construct a correlation maximizing nonlinear projection [7], [116]. Kernel canonical correlation analysis (KCCA) [116] uses reproducing kernel Hilbert spaces for projection. However, as the approach is nonparametric it scales poorly with the size of the training set and has issues with very large real-world datasets. Deep canonical correlation analysis (DCCA) [7] was introduced as an alternative to KCCA and addresses the scalability issue, it was also shown to lead to better correlated representation space. Similar correspondence autoencoder [58] and deep correspondence RBMs [57] have also been proposed for cross-modal retrieval.

CCA, KCCA, and DCCA are unsupervised techniques and only optimize the correlation over the representations, thus mostly capturing what is shared across the modalities. Deep canonically correlated autoencoders [220] also include an autoencoder based data reconstruction term. This encourages the representation to also capture modality specific information. Semantic correlation maximization method [248] also encourages semantic relevance, while retaining correlation maximization and orthogonality of the resulting space — this leads to a combination of CCA and cross-modal hashing techniques.

另一个结构化协调表示的例子来自图像和语言的顺序嵌入(order-embeddings)[212]、[249]。Vendrov等人[212]提出的模型施加了一种非对称的不相似性度量,并在多模态空间中实现了偏序的概念。其想法是捕捉语言和图像表示之间的偏序——在空间上强制执行层级结构;例如,“一位女士在遛狗”的图像→文本“女士在遛狗”→文本“女士在走路”。Young等人[238]也提出了一个类似的模型,使用指称图(denotation graph)来诱导偏序。最后,Zhang等人展示了如何利用文本和图像的结构化表示以无监督的方式构建概念分类体系[249]。
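
为帮助理解"偏序/非对称相似度"这一结构约束,下面给出一个假设性的最小草图(按顺序嵌入文献中常见的罚项形式整理,并非[212]的官方实现,数值纯属示意):若 x 在层级上"更具体"、y"更一般",则要求 x 的嵌入在每一维上都不小于 y,违反的程度作为惩罚,且该度量是非对称的。

```python
import torch
import torch.nn.functional as F

def order_violation(specific, general):
    """非对称的"违反偏序"程度:当 general 每一维都不超过 specific 时为 0。
    这是顺序嵌入中常见的罚项形式:||max(0, general - specific)||^2。"""
    return (F.relu(general - specific) ** 2).sum(dim=-1)

# 假设嵌入取非负值,层级:图像 -> 具体描述 -> 更一般的描述
img   = torch.tensor([[2.0, 3.0, 1.5]])   # "一位女士在遛狗"的图像
cap   = torch.tensor([[1.5, 2.0, 1.0]])   # 文本"女士在遛狗"
short = torch.tensor([[1.0, 0.5, 0.8]])   # 文本"女士在走路"

print(order_violation(img, cap))     # 0:满足 图像 ⪰ 描述
print(order_violation(cap, img))     # > 0:反向不成立,体现非对称性
print(order_violation(cap, short))   # 0:具体描述 ⪰ 一般描述
```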

结构化协调空间的一种特殊情况是基于典型相关分析(CCA)的空间[84]。CCA计算一种线性投影,使两个随机变量(在这里即两个模态)之间的相关性最大化,并强制新空间的正交性。CCA模型被广泛用于跨模态检索[76]、[106]、[169]和视听信号分析[177]、[187]。对CCA的扩展尝试构造使相关性最大化的非线性投影[7]、[116]。核典型相关分析(KCCA)[116]使用再生核希尔伯特空间进行投影。然而,由于该方法是非参数化的,其可扩展性随训练集规模变差,难以应对非常大的真实世界数据集。深度典型相关分析(DCCA)[7]作为KCCA的替代方案被提出,解决了可扩展性问题,并被证明能得到相关性更好的表示空间。类似的对应自编码器[58]和深度对应RBM[57]也被提出用于跨模态检索。

CCA、KCCA和DCCA是无监督技术,只优化表示之间的相关性,因此主要捕获跨模态共享的信息。深度典型相关自编码器[220]还包含基于自编码器的数据重构项,这鼓励表示同时捕获各模态特有的信息。语义相关性最大化方法[248]也鼓励语义相关性,同时保留相关性最大化和结果空间的正交性——这相当于CCA与跨模态哈希技术的结合。
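
作为补充,下面给出一个用 scikit-learn 做线性 CCA 的最小示例(示意性的,并非文中各扩展方法;数据是人为构造的两组共享潜在因子的"模态"特征):学习一对线性投影,使投影后的表示相关性最大,可用于简单的跨模态检索实验。

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 500
# 构造两组共享潜在因子的"模态"特征,模拟成对的图像/文本表示
latent = rng.normal(size=(n, 5))
X_img = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(n, 50))
X_txt = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(n, 20))

cca = CCA(n_components=5)
Z_img, Z_txt = cca.fit_transform(X_img, X_txt)   # 投影到相关性最大的协调空间

# 每个成分上两模态投影的相关系数(应较高,说明捕获了跨模态共享的信息)
corrs = [np.corrcoef(Z_img[:, k], Z_txt[:, k])[0, 1] for k in range(5)]
print(np.round(corrs, 3))
```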

3.3 Discussion讨论

In this section we identified two major types of multimodal representations — joint and coordinated. Joint representations project multimodal data into a common space and are best suited for situations when all of the modalities are present during inference. They have been extensively used for AVSR, affect, and multimodal gesture recognition. Coordinated representations, on the other hand, project each modality into a separate but coordinated space, making them suitable for applications where only one modality is present at test time, such as: multimodal retrieval and translation (Section 4), grounding (Section 7.2), and zero shot learning (Section 7.2). Finally, while joint representations have been used in situations to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two modalities.

在本节中,我们确定了两种主要类型的多模态表示——联合表示和协调表示。联合表示将多模态数据投射到公共空间中,最适合推理时所有模态都存在的情况,已被广泛用于AVSR、情感识别和多模态手势识别。另一方面,协调表示将每个模态投射到各自独立但相互协调的空间中,适合测试时只有一个模态存在的应用,例如:多模态检索和翻译(第4节)、概念接地(第7.2节)和零样本学习(第7.2节)。最后,虽然联合表示已被用于构造两种以上模态的表示,但到目前为止,协调空间大多局限于两种模态。

 Table 3: Taxonomy of multimodal translation research. For each class and sub-class, we include example tasks with references. Our taxonomy also includes the directionality of the translation: unidirectional (⇒) and bidirectional (⇔).

表3:多模态翻译研究的分类。对于每个类及其子类,我们都包含带有引用的示例任务。我们的分类还包括翻译的方向性:单向(⇒)和双向(⇔)。

4 Translation翻译

A big part of multimodal machine learning is concerned with translating (mapping) from one modality to another. Given an entity in one modality the task is to generate the same entity in a different modality. For example given an image we might want to generate a sentence describing it or given a textual description generate an image matching it. Multimodal translation is a long studied problem, with early work in speech synthesis [88], visual speech generation [136] video description [107], and cross-modal retrieval [169].

More recently, multimodal translation has seen renewed interest due to combined efforts of the computer vision and natural language processing (NLP) communities [19] and recent availability of large multimodal datasets [38], [205]. A particularly popular problem is visual scene description, also known as image [214] and video captioning [213], which acts as a great test bed for a number of computer vision and NLP problems. To solve it, we not only need to fully understand the visual scene and to identify its salient parts, but also to produce grammatically correct and comprehensive yet concise sentences describing it.

多模态机器学习的很大一部分是关于从一种模态到另一种模态的翻译(映射)。给定一个以一种形态存在的实体,任务是在不同形态中生成相同的实体。例如,给定一幅图像,我们可能想要生成一个描述它的句子,或者给定一个文本描述生成与之匹配的图像。多模态翻译是一个长期研究的问题,早期的工作包括语音合成[88]、视觉语音生成[136]、视频描述[107]和跨模态检索[169]。

最近,由于计算机视觉和自然语言处理(NLP)社区[19]和最近可用的大型多模态数据集[38]的共同努力,多模态翻译又引起了人们的兴趣[205]。一个特别流行的问题是视觉场景描述,也被称为图像[214]和视频字幕[213],它是许多计算机视觉和NLP问题的一个很好的测试平台。要解决这一问题,我们不仅需要充分理解视觉场景,识别视觉场景的突出部分,还需要生成语法正确、全面而简洁的描述视觉场景的句子。

While the approaches to multimodal translation are very broad and are often modality specific, they share a number of unifying factors. We categorize them into two types — example-based, and generative. Example-based models use a dictionary when translating between the modalities. Generative models, on the other hand, construct a model that is able to produce a translation. This distinction is similar to the one between non-parametric and parametric machine learning approaches and is illustrated in Figure 2, with representative examples summarized in Table 3.

Generative models are arguably more challenging to build as they require the ability to generate signals or sequences of symbols (e.g., sentences). This is difficult for any modality — visual, acoustic, or verbal, especially when temporally and structurally consistent sequences need to be generated. This led to many of the early multimodal translation systems relying on example-based translation. However, this has been changing with the advent of deep learning models that are capable of generating images [171], [210], sounds [157], [209], and text [12].

尽管多模态翻译的方法非常广泛,而且往往是针对特定的模态,但它们有许多共同的因素。我们将它们分为两种类型——基于实例的和生成的。在模态之间转换时,基于实例的模型使用字典。另一方面,生成模型构建的是能够生成翻译的模型。这种区别类似于非参数机器学习方法和参数机器学习方法之间的区别,如图2所示,表3总结了具有代表性的例子。

生成模型的构建更具挑战性,因为它们需要具备生成信号或符号序列(如句子)的能力。这对任何模态(视觉、听觉或语言)都是困难的,特别是当需要生成在时间和结构上一致的序列时。这导致许多早期的多模态翻译系统依赖于基于实例的翻译。然而,随着能够生成图像[171]、[210]、声音[157]、[209]和文本[12]的深度学习模型的出现,这种情况已经有所改变。

 Figure 2: Overview of example-based and generative multimodal translation. The former retrieves the best translation from a dictionary, while the latter first trains a translation model on the dictionary and then uses that model for translation.

图2:基于实例和生成式多模态翻译的概述。前者从字典中检索最佳的翻译,而后者首先根据字典训练翻译模型,然后使用该模型进行翻译。

4.1 Example-based 基于实例

Example-based algorithms are restricted by their training data — dictionary (see Figure 2a). We identify two types of such algorithms: retrieval based, and combination based. Retrieval-based models directly use the retrieved translation without modifying it, while combination-based models rely on more complex rules to create translations based on a number of retrieved instances.

Retrieval-based models are arguably the simplest form of multimodal translation. They rely on finding the closest sample in the dictionary and using that as the translated result. The retrieval can be done in unimodal space or intermediate semantic space.

Given a source modality instance to be translated, unimodal retrieval finds the closest instances in the dictionary in the space of the source — for example, visual feature space for images. Such approaches have been used for visual speech synthesis, by retrieving the closest matching visual example of the desired phoneme [26]. They have also been used in concatenative text-to-speech systems [88]. More recently, Ordonez et al. [155] used unimodal retrieval to generate image descriptions by using global image features to retrieve caption candidates [155]. Yagcioglu et al. [232] used a CNN-based image representation to retrieve visually similar images using adaptive neighborhood selection. Devlin et al. [49] demonstrated that a simple k-nearest neighbor retrieval with consensus caption selection achieves competitive translation results when compared to more complex generative approaches. The advantage of such unimodal retrieval approaches is that they only require the representation of a single modality through which we are performing retrieval. However, they often require an extra processing step such as re-ranking of retrieved translations [135], [155], [232]. This indicates a major problem with this approach — similarity in unimodal space does not always imply a good translation.

基于实例的算法受到其训练数据——字典——的限制(见图2a)。我们将这类算法分为两种:基于检索的和基于组合的。基于检索的模型直接使用检索到的翻译而不加修改,而基于组合的模型则依赖更复杂的规则,基于若干检索到的实例来构造翻译。

基于检索的模型可以说是最简单的多模态翻译形式。它们依赖于在字典中找到最接近的样本,并将其作为翻译结果。检索可以在单模态空间或中间语义空间中进行。

给定要翻译的源模态实例,单模态检索在源空间(例如,图像的视觉特征空间)中找到字典里最接近的实例。这种方法已被用于视觉语音合成,即检索与目标音素最匹配的视觉样本[26];它们也被用于拼接式文本到语音系统[88]。最近,Ordonez等人[155]使用单模态检索,通过全局图像特征检索候选描述来生成图像描述[155]。Yagcioglu等人[232]使用基于CNN的图像表示,通过自适应邻域选择来检索视觉上相似的图像。Devlin等人[49]证明,与更复杂的生成方法相比,简单的k近邻检索配合共识描述选择就能取得有竞争力的翻译结果。这类单模态检索方法的优点是只需要表示执行检索所用的那个单一模态。然而,它们通常需要额外的处理步骤,例如对检索到的翻译进行重排序[135]、[155]、[232]。这也表明了这种方法的一个主要问题——单模态空间中的相似性并不总是意味着好的翻译。
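
下面是一个假设性的最小示例(并非上述任一论文的实现,特征与描述均为随意构造),演示"单模态检索式翻译"的基本流程:用图像特征在字典(成对的图像特征与描述)中找最近邻,直接复用其描述作为翻译结果。

```python
import numpy as np

def knn_caption(query_feat, dict_feats, dict_captions, k=3):
    """在视觉特征空间中检索 k 个最近邻(余弦相似度),返回其描述作为候选翻译。"""
    q = query_feat / np.linalg.norm(query_feat)
    D = dict_feats / np.linalg.norm(dict_feats, axis=1, keepdims=True)
    sims = D @ q
    topk = np.argsort(-sims)[:k]
    return [dict_captions[i] for i in topk]     # 真实系统常再做重排序或"共识"选择

# 玩具字典:3 个已配好描述的图像特征
rng = np.random.default_rng(0)
dict_feats = rng.normal(size=(3, 2048))
dict_captions = ["a dog on the grass", "a woman walking her dog", "a red car on the road"]
print(knn_caption(rng.normal(size=2048), dict_feats, dict_captions, k=2))
```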

An alternative is to use an intermediate semantic space for similarity comparison during retrieval. An early example of a hand crafted semantic space is one used by Farhadi et al. [56]. They map both sentences and images to a space of object, action, scene, retrieval of relevant caption to an image is then performed in that space. In contrast to hand-crafting a representation, Socher et al. [191] learn a coordinated representation of sentences and CNN visual features (see Section 3.2 for description of coordinated spaces). They use the model for both translating from text to images and from images to text. Similarly, Xu et al. [231] used a coordinated space of videos and their descriptions for cross-modal retrieval. Jiang and Li [93] and Cao et al. [32] use cross-modal hashing to perform multimodal translation from images to sentences and back, while Hodosh et al. [83] use a multimodal KCCA space for image-sentence retrieval. Instead of aligning images and sentences globally in a common space, Karpathy et al. [99] propose a multimodal similarity metric that internally aligns image fragments (visual objects) together with sentence fragments (dependency tree relations).

另一种方法是在检索过程中使用中间语义空间进行相似度比较。手工构造语义空间的一个早期例子来自Farhadi等人[56]:他们将句子和图像都映射到〈对象、动作、场景〉空间中,然后在该空间中为图像检索相关描述。与手工构造表示不同,Socher等人[191]学习句子和CNN视觉特征的协调表示(关于协调空间的描述见第3.2节),并将该模型用于从文本到图像和从图像到文本的双向转换。类似地,Xu等人[231]使用视频及其描述的协调空间进行跨模态检索。Jiang和Li[93]以及Cao等人[32]使用跨模态哈希进行图像到句子及其反向的多模态翻译,而Hodosh等人[83]使用多模态KCCA空间进行图像-句子检索。Karpathy等人[99]没有在公共空间中对图像和句子做全局对齐,而是提出了一种多模态相似性度量,在内部将图像片段(视觉对象)与句子片段(依存树关系)对齐。

Retrieval approaches in semantic space tend to perform better than their unimodal counterparts as they are retrieving examples in a more meaningful space that reflects both modalities and that is often optimized for retrieval. Furthermore, they allow for bi-directional translation, which is not straightforward with unimodal methods. However, they require manual construction or learning of such a semantic space, which often relies on the existence of large training dictionaries (datasets of paired samples).

Combination-based models take the retrieval based approaches one step further. Instead of just retrieving examples from the dictionary, they combine them in a meaningful way to construct a better translation. Combination based media description approaches are motivated by the fact that sentence descriptions of images share a common and simple structure that could be exploited. Most often the rules for combinations are hand crafted or based on heuristics.

Kuznetsova et al. [114] first retrieve phrases that describe visually similar images and then combine them to generate novel descriptions of the query image by using Integer Linear Programming with a number of hand crafted rules. Gupta et al. [74] first find k images most similar to the source image, and then use the phrases extracted from their captions to generate a target sentence. Lebret et al. [119] use a CNN-based image representation to infer phrases that describe it. The predicted phrases are then combined using a trigram constrained language model.

A big problem facing example-based approaches for translation is that the model is the entire dictionary — making the model large and inference slow (although, optimizations such as hashing alleviate this problem). Another issue facing example-based translation is that it is unrealistic to expect that a single comprehensive and accurate translation relevant to the source example will always exist in the dictionary — unless the task is simple or the dictionary is very large. This is partly addressed by combination models that are able to construct more complex structures. However, they are only able to perform translation in one direction, while semantic space retrieval-based models are able to perform it both ways.

语义空间中的检索方法往往比单模态检索方法表现更好,因为它们是在一个更有意义、同时反映两种模态且通常针对检索优化过的空间中进行检索。此外,它们允许双向翻译,而这在单模态方法中并不容易做到。然而,它们需要手工构建或学习这样的语义空间,而这往往依赖于大型训练字典(成对样本的数据集)的存在。

基于组合的模型将基于检索的方法又向前推进了一步。它们不只是从字典中检索示例,而是以一种有意义的方式将它们组合在一起,从而构建出更好的翻译。基于组合的媒体描述方法是基于这样一个事实,即图像的句子描述具有共同的、简单的结构,可以被利用。最常见的组合规则是手工制作的或基于启发式。

Kuznetsova等人[114]首先检索描述视觉上相似图像的短语,然后用带有若干手工规则的整数线性规划将它们组合起来,生成查询图像的新描述。Gupta等人[74]首先找到与源图像最相似的k张图像,然后使用从这些图像的描述中提取的短语来生成目标句。Lebret等人[119]使用基于CNN的图像表示来推断描述该图像的短语,然后用受三元组(trigram)约束的语言模型将预测出的短语组合起来。

基于实例的翻译方法面临的一个大问题是,模型是整个字典——这使得模型变大,推理速度变慢(尽管,哈希等优化可以缓解这个问题)。基于示例的翻译面临的另一个问题是,期望与源示例相关的单个全面而准确的翻译总是存在于词典中是不现实的——除非任务很简单或词典非常大。这可以通过能够构建更复杂结构的组合模型部分地解决。然而,它们只能在一个方向上进行翻译,而基于语义空间检索的模型可以以两种方式进行翻译。

4.2 Generative approaches生成方法

Generative approaches to multimodal translation construct models that can perform multimodal translation given a unimodal source instance. It is a challenging problem as it requires the ability to both understand the source modality and to generate the target sequence or signal. As discussed in the following section, this also makes such methods much more difficult to evaluate, due to large space of possible correct answers.

In this survey we focus on the generation of three modalities: language, vision, and sound. Language generation has been explored for a long time [170], with a lot of recent attention for tasks such as image and video description [19]. Speech and sound generation has also seen a lot of work with a number of historical [88] and modern approaches [157], [209]. Photo-realistic image generation has been less explored, and is still in early stages [132], [171], however, there have been a number of attempts at generating abstract scenes [253], computer graphics [45], and talking heads [6].

多模态翻译的生成方法构建的模型能够在给定单模态源实例的情况下执行多模态翻译。这是一个具有挑战性的问题,因为它既要求理解源模态,又要求生成目标序列或信号。正如下一节所讨论的,由于可能的正确答案空间很大,这也使得此类方法更难评估。

在这个调查中,我们关注三种模态的生成:语言、视觉和声音。语言生成已经被探索了很长一段时间[170],最近很多人关注的是图像和视频描述[19]等任务。语音和声音生成也见证了许多历史[88]和现代方法[157]、[209]的大量工作。真实感图像生成的研究较少,仍处于早期阶段[132],[171],然而,在生成抽象场景[253]、计算机图形学[45]和会说话的头[6]方面已经有了一些尝试。

We identify three broad categories of generative models: grammar-based, encoder-decoder, and continuous generation models. Grammar based models simplify the task by restricting the target domain by using a grammar, e.g., by generating restricted sentences based on a subject, object, verb template. Encoder-decoder models first encode the source modality to a latent representation which is then used by a decoder to generate the target modality. Continuous generation models generate the target modality continuously based on a stream of source modality inputs and are most suited for translating between temporal sequences — such as text-to-speech.

Grammar-based models rely on a pre-defined grammar for generating a particular modality. They start by detecting high level concepts from the source modality, such as objects in images and actions from videos. These detections are then incorporated together with a generation procedure based on a pre-defined grammar to result in a target modality.

Kojima et al. [107] proposed a system to describe human behavior in a video using the detected position of the person’s head and hands and rule based natural language generation that incorporates a hierarchy of concepts and actions. Barbu et al. [14] proposed a video description model that generates sentences of the form: who did what to whom and where and how they did it. The system was based on handcrafted object and event classifiers and used a restricted grammar suitable for the task. Guadarrama et al.[73] predict subject, verb, object triplets describing a video using semantic hierarchies that use more general words in case of uncertainty. Together with a language model their approach allows for translation of verbs and nouns not seen in the dictionary.

我们确定了生成模型的三大类:基于语法的、编码器-解码器和连续生成模型。基于语法的模型通过使用语法限制目标领域来简化任务,例如,通过基于主语、宾语、动词模板生成限制句。编码器-解码器模型首先将源模态编码为一个潜在的表示,然后由解码器使用它来生成目标模态。连续生成模型基于源模态输入流连续地生成目标模态,最适合于时间序列之间的转换——比如文本到语音。

基于语法的模型依赖于预定义的语法来生成特定的模态。他们首先从源模态检测高级概念,如图像中的对象和视频中的动作。然后将这些检测与基于预定义语法的生成过程合并在一起,以产生目标模态。

Kojima等人[107]提出了一个系统,利用检测到的人的头和手的位置,以及包含概念与动作层级的基于规则的自然语言生成,来描述视频中的人类行为。Barbu等人[14]提出了一个视频描述模型,生成“谁对谁做了什么、在哪里以及如何做”形式的句子。该系统基于手工构造的对象和事件分类器,并使用了适合该任务的受限语法。Guadarrama等人[73]预测描述视频的〈主语、动词、宾语〉三元组,使用的语义层级在不确定时会选择更一般的词汇;结合语言模型,他们的方法还能翻译字典中未出现的动词和名词。
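
作为理解辅助,下面给出一个假设性的极简模板填充示例(并非上述任何系统的实现,检测结果与模板均为虚构):上游检测器先给出高层概念(主语、动词、宾语等),再按预定义的受限语法/模板组合成句——这也解释了此类方法"句法正确但表达公式化"的特点。

```python
# 假设上游的视觉检测器已给出高层概念(真实系统中来自目标/动作/场景分类器)
detections = {"subject": "woman", "verb": "walk", "object": "dog", "place": "park"}

def ing_form(verb: str) -> str:
    """极简的现在分词变形规则,仅作示意,并不覆盖英语的全部变形。"""
    return (verb[:-1] if verb.endswith("e") else verb) + "ing"

# 预定义的受限"语法"模板:谁(subject)对什么(object)做了什么(verb)、在哪里(place)
TEMPLATE = "The {subject} is {verb} the {object} in the {place}."

sentence = TEMPLATE.format(subject=detections["subject"],
                           verb=ing_form(detections["verb"]),
                           object=detections["object"],
                           place=detections["place"])
print(sentence)   # The woman is walking the dog in the park.
```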

To describe images, Yao et al. [235] propose to use an and-or graph-based model together with domain-specific lexicalized grammar rules, targeted visual representation scheme, and a hierarchical knowledge ontology. Li et al. [121] first detect objects, visual attributes, and spatial relationships between objects. They then use an n-gram language model on the visually extracted phrases to generate subject, preposition, object style sentences. Mitchell et al. [142] use a more sophisticated tree-based language model to generate syntactic trees instead of filling in templates, leading to more diverse descriptions. A majority of approaches represent the whole image jointly as a bag of visual objects without capturing their spatial and semantic relationships. To address this, Elliott et al. [51] propose to explicitly model proximity relationships of objects for image description generation.

Some grammar-based approaches rely on graphical models to generate the target modality. An example includes BabyTalk [112], which given an image generates ⟨object, preposition, object⟩ triplets, that are used together with a conditional random field to construct the sentences. Yang et al. [233] predict a set of ⟨noun, verb, scene, preposition⟩ candidates using visual features extracted from an image and combine them into a sentence using a statistical language model and hidden Markov model style inference. A similar approach has been proposed by Thomason et al. [204], where a factor graph model is used for video description of the form ⟨subject, verb, object, place⟩. The factor model exploits language statistics to deal with noisy visual representations. Going the other way Zitnick et al. [253] propose to use conditional random fields to generate abstract visual scenes based on language triplets extracted from sentences.

为了描述图像,Yao等人[235]提出使用基于和或图的模型,以及特定领域的词汇化语法规则、有针对性的视觉表示方案和层次知识本体。Li等人[121]首先检测对象、视觉属性和对象之间的空间关系。然后,他们在视觉提取的短语上使用一个n-gram语言模型,生成主语、介词、宾语式的句子。Mitchell等人[142]使用更复杂的基于树的语言模型来生成语法树,而不是填充模板,从而产生更多样化的描述。大多数方法将整个图像共同表示为一袋视觉对象,而没有捕捉它们的空间和语义关系。为了解决这个问题,Elliott et al.[51]提出明确地建模物体的接近关系,以生成图像描述。

一些基于语法的方法依赖于图模型来生成目标模态。例如BabyTalk[112],它在给定一幅图像后生成〈宾语, 介词, 宾语〉三元组,再与条件随机场一起用来构造句子。Yang等人[233]利用从图像中提取的视觉特征预测一组〈名词, 动词, 场景, 介词〉候选,并使用统计语言模型和隐马尔可夫模型式的推理将它们组合成句子。Thomason等人[204]提出了类似的方法,用因子图模型生成〈主语, 动词, 宾语, 地点〉形式的视频描述,该因子模型利用语言统计来处理带噪声的视觉表示。反过来,Zitnick等人[253]则提出利用条件随机场,基于从句子中提取的语言三元组生成抽象视觉场景。

An advantage of grammar-based methods is that they are more likely to generate syntactically (in case of language) or logically correct target instances as they use predefined templates and restricted grammars. However, this limits them to producing formulaic rather than creative translations. Furthermore, grammar-based methods rely on complex pipelines for concept detection, with each concept requiring a separate model and a separate training dataset. Encoder-decoder models based on end-to-end trained neural networks are currently some of the most popular techniques for multimodal translation. The main idea behind the model is to first encode a source modality into a vectorial representation and then to use a decoder module to generate the target modality, all this in a single pass pipeline. Although first used for machine translation [97], such models have been successfully used for image captioning [134], [214], and video description [174], [213]. So far, encoder-decoder models have been mostly used to generate text, but they can also be used to generate images [132], [171], and for continuous generation of speech and sound [157], [209].

The first step of the encoder-decoder model is to encode the source object, which is done in a modality specific way. Popular models to encode acoustic signals include RNNs [35] and DBNs [79]. Most of the work on encoding words and sentences uses distributional semantics [141] and variants of RNNs [12]. Images are most often encoded using convolutional neural networks (CNN) [109], [185]. While learned CNN representations are common for encoding images, this is not the case for videos where hand-crafted features are still commonly used [174], [204]. While it is possible to use unimodal representations to encode the source modality, it has been shown that using a coordinated space (see Section 3.2) leads to better results [105], [159], [231].

基于语法的方法的一个优点是,由于使用了预定义模板和受限语法,它们更有可能生成语法上(对于语言而言)或逻辑上正确的目标实例。然而,这也限制了它们只能产出公式化而非创造性的翻译。此外,基于语法的方法依赖复杂的概念检测流水线,每个概念都需要单独的模型和单独的训练数据集。基于端到端训练神经网络的编码器-解码器模型是目前最流行的多模态翻译技术之一。该模型背后的主要思想是,先将源模态编码为向量表示,再用解码器模块生成目标模态,所有这些都在一条单次流水线中完成。虽然该模型最初用于机器翻译[97],但已成功应用于图像描述[134]、[214]和视频描述[174]、[213]。到目前为止,编码器-解码器模型大多用于生成文本,但它们也可以用于生成图像[132]、[171],以及语音和声音的连续生成[157]、[209]。

编码器-解码器模型的第一步是对源对象进行编码,这一步以特定于模态的方式完成。常用的声学信号编码模型包括RNN[35]和DBN[79]。大多数关于单词和句子编码的工作使用分布式语义[141]和RNN的变体[12]。图像通常使用卷积神经网络(CNN)进行编码[109],[185]。虽然学习得到的CNN表示在图像编码中很常见,但视频编码中仍普遍使用手工设计的特征[174],[204]。虽然可以用单模态表示对源模态进行编码,但已有研究表明使用协调空间(见3.2节)能得到更好的结果[105]、[159]、[231]。
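下面是一个基于 PyTorch 的极简编码器-解码器示意(仅为说明思路的草图,不代表文中任何具体模型;ImageEncoder、CaptionDecoder 等名称均为本文虚构):CNN 先把图像编码为一个向量,LSTM 再以该向量作为初始隐状态逐词解码出目标语言。

```python
import torch
import torch.nn as nn

# 编码器:用 CNN 把图像编码为一个潜在向量(简化的示意结构)
class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                 # images: (B, 3, H, W)
        feats = self.conv(images).flatten(1)   # (B, 64)
        return self.fc(feats)                  # (B, embed_dim)

# 解码器:以编码向量作为 LSTM 的初始隐状态,逐词生成目标序列
class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_vec, captions):    # captions: (B, T) 的词索引
        h0 = image_vec.unsqueeze(0)            # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)             # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                # (B, T, vocab_size) 的词分布

encoder, decoder = ImageEncoder(), CaptionDecoder(vocab_size=10000)
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)   # 训练时与目标词做交叉熵
```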

Decoding is most often performed by an RNN or an LSTM using the encoded representation as the initial hidden state [54], [132], [214], [215]. A number of extensions have been proposed to traditional LSTM models to aid in the task of translation. A guide vector could be used to tightly couple the solutions in the image input [91]. Venugopalan et al.[213] demonstrate that it is beneficial to pre-train a decoder LSTM for image captioning before fine-tuning it to video description. Rohrbach et al. [174] explore the use of various LSTM architectures (single layer, multilayer, factored) and a number of training and regularization techniques for the task of video description.

A problem facing translation generation using an RNN is that the model has to generate a description from a single vectorial representation of the image, sentence, or video. This becomes especially difficult when generating long sequences as these models tend to forget the initial input. This has been partly addressed by neural attention models (see Section 5.2) that allow the network to focus on certain parts of an image [230], sentence [12], or video [236] during generation.

Generative attention-based RNNs have also been used for the task of generating images from sentences [132], while the results are still far from photo-realistic they show a lot of promise. More recently, a large amount of progress has been made in generating images using generative adversarial networks [71], which have been used as an alternative to RNNs for image generation from text [171].

解码通常由RNN或LSTM执行,使用编码表示作为初始隐藏状态[54],[132],[214],[215]。人们对传统的LSTM模型进行了大量的扩展,以帮助完成翻译任务。一个引导向量可以用来紧耦合图像输入中的解[91]。Venugopalan等人[213]证明,在将解码器LSTM微调为视频描述之前,对图像字幕进行预训练是有益的。Rohrbach等人[174]探讨了在视频描述任务中使用各种LSTM架构(单层、多层、因子)和多种训练和正则化技术。

使用RNN进行翻译生成面临的一个问题是,模型必须从图像、句子或视频的单个矢量表示生成描述。这在生成长序列时变得特别困难,因为这些模型往往会忘记最初的输入。神经注意力模型已经部分解决了这一问题(见5.2节),神经注意力模型允许网络在生成时聚焦于图像[230]、句子[12]或视频[236]的某些部分。

基于注意力的生成式RNN也被用于从句子生成图像的任务[132],尽管其结果还远谈不上照片级真实,但展现了很大的潜力。最近,使用生成对抗网络生成图像方面取得了大量进展[71],它已被用来替代RNN完成从文本生成图像的任务[171]。

While neural network based encoder-decoder systems have been very successful they still face a number of issues. Devlin et al. [49] suggest that it is possible that the network is memorizing the training data rather than learning how to understand the visual scene and generate it. This is based on the observation that k-nearest neighbor models perform very similarly to those based on generation. Furthermore, such models often require large quantities of data for train-ing.

Continuous generation models are intended for sequence translation and produce outputs at every timestep in an online manner. These models are useful when translating from a sequence to a sequence such as text to speech, speech to text, and video to text. A number of different techniques have been proposed for such modeling — graphical models, continuous encoder-decoder approaches, and various other regression or classification techniques. The extra difficulty that needs to be tackled by these models is the requirement of temporal consistency between modalities.

A lot of early work on sequence to sequence translation used graphical or latent variable models. Deena and Galata [47] proposed to use a shared Gaussian process latent variable model for audio-based visual speech synthesis. The model creates a shared latent space between audio and visual features that can be used to generate one space from the other, while enforcing temporal consistency of visual speech at different timesteps. Hidden Markov models (HMM) have also been used for visual speech generation [203] and text-to-speech [245] tasks. They have also been extended to use cluster adaptive training to allow for training on multiple speakers, languages, and emotions allowing for more control when generating speech signal [244] or visual speech parameters [6].

虽然基于神经网络的编码器-解码器系统已经非常成功,但它们仍然面临一些问题。Devlin et al.[49]认为,网络可能是在记忆训练数据,而不是学习如何理解视觉场景并生成它。这是基于k近邻模型与基于生成的模型非常相似的观察得出的。此外,这种模型通常需要大量的数据进行训练。

连续生成模型用于序列转换,并以在线方式在每个时间步中产生输出。这些模型在将一个序列转换为另一个序列时非常有用,比如文本到语音、语音到文本和视频到文本。为这种建模提出了许多不同的技术——图形模型、连续编码器-解码器方法,以及各种其他回归或分类技术。这些模型需要解决的额外困难是对模态之间时间一致性的要求。

许多早期的序列到序列转换的工作使用图形或潜在变量模型。Deena和Galata[47]提出了一种共享高斯过程潜变量模型用于基于音频的可视语音合成。该模型在音频和视觉特征之间创建了一个共享的潜在空间,可用于从另一个空间生成一个空间,同时在不同的时间步长强制实现视觉语音的时间一致性。隐马尔可夫模型(HMM)也被用于视觉语音生成[203]和文本-语音转换[245]任务。它们还被扩展到使用聚类自适应训练,以允许对多种说话人、语言和情绪进行训练,从而在产生语音信号[244]或视觉语音参数[6]时进行更多的控制。

Encoder-decoder models have recently become popular for sequence to sequence modeling. Owens et al. [157] used an LSTM to generate sounds resulting from drumsticks based on video. While their model is capable of generating sounds by predicting a cochleogram from CNN visual features, they found that retrieving a closest audio sample based on the predicted cochleogram led to best results. Directly modeling the raw audio signal for speech and music generation has been proposed by van den Oord et al. [209]. The authors propose using hierarchical fully convolutional neural networks, which show a large improvement over previous state-of-the-art for the task of speech synthesis. RNNs have also been used for speech to text translation (speech recognition) [72]. More recently an encoder-decoder based continuous approach was shown to be good at predicting letters from a speech signal represented as a filter bank spectra [35] — allowing for more accurate recognition of rare and out of vocabulary words. Collobert et al. [42] demonstrate how to use a raw audio signal directly for speech recognition, eliminating the need for audio features.

A lot of earlier work used graphical models for multimodal translation between continuous signals. However, these methods are being replaced by neural network encoder-decoder based techniques. Especially as they have recently been shown to be able to represent and generate complex visual and acoustic signals.

编码器-解码器模型是近年来序列对序列建模的流行方法。Owens等人[157]使用LSTM来产生基于视频的鼓槌的声音。虽然他们的模型能够通过预测CNN视觉特征的耳蜗图来产生声音,但他们发现,根据预测的耳蜗图检索最近的音频样本会带来最好的结果。van den Oord等人提出直接对原始音频信号建模以生成语音和音乐[209]。作者建议使用分层全卷积神经网络,这表明在语音合成的任务中,比以前的最先进技术有了很大的改进。rnn也被用于语音到文本的翻译(语音识别)[72]。最近,基于编码器-解码器的连续方法被证明能够很好地从表示为滤波器组光谱[35]的语音信号中预测字母,从而能够更准确地识别罕见的和词汇之外的单词。Collobert等人的[42]演示了如何直接使用原始音频信号进行语音识别,消除了对音频特征的需求。

许多早期的工作使用图形模型来实现连续信号之间的多模态转换。然而,这些方法正在被基于神经网络的编码器-解码器技术所取代。特别是它们最近被证明能够表示和产生复杂的视觉和听觉信号。

4.3 Model evaluation and discussion模型评价与讨论

A major challenge facing multimodal translation methods is that they are very difficult to evaluate. While some tasks such as speech recognition have a single correct translation, tasks such as speech synthesis and media description do not. Sometimes, as in language translation, multiple answers are correct and deciding which translation is better is often subjective. Fortunately, there are a number of approximate automatic metrics that aid in model evaluation.

Often the ideal way to evaluate a subjective task is through human judgment. That is by having a group of people evaluating each translation. This can be done on a Likert scale where each translation is evaluated on a certain dimension: naturalness and mean opinion score for speech synthesis [209], [244], realism for visual speech synthesis [6],[203], and grammatical and semantic correctness, relevance, order, and detail for media description [38], [112], [142],[213]. Another option is to perform preference studies where two (or more) translations are presented to the participant for preference comparison [203], [244]. However, while user studies will result in evaluation closest to human judgments they are time consuming and costly. Furthermore, they require care when constructing and conducting them to avoid fluency, age, gender and culture biases.

多模态翻译方法面临的一个主要挑战是它们很难评估。语音识别等任务只有一个正确的翻译,而语音合成和媒体描述等任务则没有。有时,就像在语言翻译中,多重答案是正确的,决定哪个翻译更好往往是主观的。幸运的是,有许多有助于模型评估的近似自动指标

评估主观任务的理想方法通常是通过人的判断。那就是让一群人评估每一个翻译。这可以通过李克特量表来完成,其中每一篇翻译都在一个特定的维度上进行评估:语音合成的自然度和平均意见得分[209],[244],视觉语音合成的真实感[6],[203],以及媒体描述[38],[112],[142],[213]的语法和语义正确性、相关性、顺序和细节。另一种选择是进行偏好研究,将两种(或更多)翻译呈现给参与者进行偏好比较[203],[244]。然而,虽然用户研究将导致最接近人类判断的评估,但它们既耗时又昂贵。此外,在构建和指导这些活动时,需要小心谨慎,以避免流利性、年龄、性别和文化偏见。

While human studies are a gold standard for evaluation, a number of automatic alternatives have been proposed for the task of media description: BLEU [160], ROUGE [124], Meteor [48], and CIDEr [211]. These metrics are directly taken from (or are based on) work in machine translation and compute a score that measures the similarity between the generated and ground truth text. However, the use of them has faced a lot of criticism. Elliott and Keller [52] showed that sentence-level unigram BLEU is only weakly correlated with human judgments. Huang et al. [87] demonstrated that the correlation between human judgments and BLEU and Meteor is very low for the visual story telling task. Furthermore, the ordering of approaches based on human judgments did not match that of the ordering using automatic metrics on the MS COCO challenge [38] — with a large number of algorithms outperforming humans on all the metrics. Finally, the metrics only work well when the number of reference translations is high [211], which is often unavailable, especially for current video description datasets [205].

虽然人工研究是评估的黄金标准,但针对媒体描述任务也提出了许多自动化的替代指标:BLEU[160]、ROUGE[124]、Meteor[48]和CIDEr[211]。这些指标直接取自(或基于)机器翻译领域的工作,计算一个分数来衡量生成文本与真实标注文本之间的相似性。然而,它们的使用面临许多批评。Elliott和Keller[52]表明,句子级的unigram BLEU与人类判断只有弱相关。Huang等[87]的研究表明,在视觉讲故事任务中,人类判断与BLEU和Meteor之间的相关性非常低。此外,基于人类判断的方法排序与在MS COCO挑战[38]上使用自动指标得到的排序并不一致——大量算法在所有指标上都优于人类。最后,这些指标只有在参考翻译数量较多时才能很好地工作[211],而这通常难以满足,特别是对于当前的视频描述数据集[205]。
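作为示意,下面用 nltk 计算一个句子级 BLEU 分数(假设环境中已安装 nltk;参考句与生成句均为虚构示例)。正如正文所批评的,这类指标只衡量 n-gram 重叠,与人类判断的相关性有限。

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a horse along the shore".split(),
]
hypothesis = "a man rides a horse on the beach".split()

score = sentence_bleu(
    references, hypothesis,
    weights=(0.5, 0.5),                      # 只统计 1-gram 和 2-gram 的重叠
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```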

These criticisms have led to Hodosh et al. [83] proposing to use retrieval as a proxy for image captioning evaluation, which they argue better reflects human judgments. Instead of generating captions, a retrieval based system ranks the available captions based on their fit to the image, and is then evaluated by assessing if the correct captions are given a high rank. As a number of caption generation models are generative they can be used directly to assess the likelihood of a caption given an image and are being adapted by im-age captioning community [99], [105]. Such retrieval based evaluation metrics have also been adopted by the video captioning community [175].

Visual question-answering (VQA) [130] task was proposed partly due to the issues facing evaluation of image captioning. VQA is a task where given an image and a question about its content the system has to answer it. Evaluating such systems is easier due to the presence of a correct answer. However, it still faces issues such as ambiguity of certain questions and answers and question bias.

We believe that addressing the evaluation issue will be crucial for further success of multimodal translation systems. This will allow not only for better comparison be-tween approaches, but also for better objectives to optimize.

这些批评导致Hodosh等人[83]提出使用检索作为图像标题评价的代理,他们认为检索可以更好地反映人类的判断。基于检索的系统不是生成标题,而是根据它们与图像的契合度对可用的标题进行排序,然后通过评估正确的标题是否被给予较高的级别来进行评估。由于许多字幕生成模型是可生成的,它们可以直接用于评估给定图像的字幕的可能性,并被图像字幕社区改编[99],[105]。这种基于检索的评价指标也被视频字幕社区采用[175]。

视觉问答(Visual question-answer, VQA)[130]任务的提出,部分是由于图像字幕评价面临的问题。VQA是一个任务,在这个任务中,给定一个图像和一个关于其内容的问题,系统必须回答它。由于存在正确的答案,评估这些系统更容易。然而,它仍然面临一些问题,如某些问题和答案的模糊性和问题的偏见。

我们认为,解决评价问题将是多模态翻译系统进一步成功的关键。这不仅可以更好地比较不同的方法,而且还可以优化更好的目标。

5 Alignment对齐

We define multimodal alignment as finding relationships and correspondences between sub-components of instances from two or more modalities. For example, given an image and a caption we want to find the areas of the image corresponding to the caption’s words or phrases [98]. Another example is, given a movie, aligning it to the script or the book chapters it was based on [252].

We categorize multimodal alignment into two types – implicit and explicit. In explicit alignment, we are explicitly interested in aligning sub-components between modalities, e.g., aligning recipe steps with the corresponding instructional video [131]. Implicit alignment is used as an intermediate (often latent) step for another task, e.g., image retrieval based on text description can include an alignment step between words and image regions [99]. An overview of such approaches can be seen in Table 4 and is presented in more detail in the following sections.

我们将多模态对齐定义为寻找来自两个或多个模态实例的子组件之间的关系和对应关系。例如,给定一张图片和一个标题,我们想要找到图片中与标题中的单词或短语相对应的区域[98]。另一个例子是,给定一部电影,将其与剧本或书中的章节对齐[252]。

我们将多模态对齐分为两种类型:隐式和显式。在显式对齐中,我们明确地关注模态之间子组件的对齐,例如,将菜谱步骤与相应的教学视频对齐[131]。隐式对齐则作为另一个任务的中间(通常是潜在的)步骤,例如,基于文本描述的图像检索可以包含单词与图像区域之间的对齐步骤[99]。这些方法的概述见表4,并在下面几节中给出更详细的介绍。

 Table 4: Summary of our taxonomy for the multimodal alignment challenge. For each sub-class of our taxonomy, we include reference citations and modalities aligned.

表4:我们对多模态对齐挑战的分类总结。对于我们分类法的每一个子类,我们包括参考引文和模态对齐。

5.1 Explicit alignment显式对齐

We categorize papers as performing explicit alignment if their main modeling objective is alignment between sub-components of instances from two or more modalities. A very important part of explicit alignment is the similarity metric. Most approaches rely on measuring similarity between sub-components in different modalities as a basic building block. These similarities can be defined manually or learned from data.

We identify two types of algorithms that tackle explicit alignment — unsupervised and (weakly) supervised. The first type operates with no direct alignment labels (i.e., labeled correspondences) between instances from the different modalities. The second type has access to such (sometimes weak) labels.

Unsupervised multimodal alignment tackles modality alignment without requiring any direct alignment labels. Most of the approaches are inspired from early work on alignment for statistical machine translation [28] and genome sequences [3], [111]. To make the task easier the approaches assume certain constraints on alignment, such as temporal ordering of sequence or an existence of a similarity metric between the modalities.

如果论文的主要建模目标是对齐来自两个或更多模态的实例的子组件,那么我们将其分类为执行显式对齐。显式对齐的一个非常重要的部分是相似性度量。大多数方法都依赖于度量不同模态的子组件之间的相似性作为基本构建块。这些相似点可以手工定义,也可以从数据中学习。

我们确定了两种处理显式对齐的算法-无监督和(弱)监督。第一种类型在不同模态的实例之间没有直接对齐标签(即标签对应)。第二种类型可以访问这样的标签(有时是弱标签)。

无监督多模态对齐处理模态对齐,而不需要任何直接对齐标签。大多数方法的灵感来自于早期对统计机器翻译[28]和基因组序列[3]的比对工作,[111]。为了使任务更容易,这些方法在对齐上假定了一定的约束,例如序列的时间顺序或模态之间存在相似性度量。

Dynamic time warping (DTW) [3], [111] is a dynamic programming approach that has been extensively used to align multi-view time series. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames). It requires the timesteps in the two sequences to be comparable and requires a similarity measure between them. DTW can be used directly for multimodal alignment by hand-crafting similarity metrics between modalities; for example Anguera et al. [8] use a manually defined similarity between graphemes and phonemes; and Tapaswi et al. [201] define a similarity between visual scenes and sentences based on appearance of same characters to align TV shows and plot synopses. DTW-like dynamic programming approaches have also been used for multimodal alignment of text to speech [77] and video [202].

As the original DTW formulation requires a pre-defined similarity metric between modalities, it was extended using canonical correlation analysis (CCA) to map the modalities to a coordinated space. This allows for both aligning (through DTW) and learning the mapping (through CCA) between different modality streams jointly and in an unsupervised manner [180], [250], [251]. While CCA based DTW models are able to find multimodal data alignment under a linear transformation, they are not able to model non-linear relationships. This has been addressed by the deep canonical time warping approach [206], which can be seen as a generalization of deep CCA and DTW.

动态时间翘曲(DTW)[3],[111]是一种动态规划方法,被广泛用于对齐多视图时间序列。DTW度量两个序列之间的相似性,并通过时间翘曲(插入帧)找到它们之间的最优匹配。它要求两个序列中的时间步具有可比性,并需要一个度量二者相似性的函数。通过手工设计模态之间的相似性度量,DTW可以直接用于多模态对齐;例如,Anguera等人[8]使用手工定义的字素与音素之间的相似度;Tapaswi等人[201]则根据相同角色的出现来定义视觉场景与句子之间的相似性,以对齐电视剧与情节概要。类似DTW的动态规划方法也被用于文本与语音[77]以及文本与视频[202]的多模态对齐。

由于原始的DTW公式需要预定义的模态之间的相似性度量,因此使用典型相关分析(CCA)对其进行了扩展,以将模态映射到协调空间。这既允许(通过DTW)对齐,也允许(通过CCA)以非监督的方式共同学习不同模态流之间的映射[180]、[250]、[251]。虽然基于CCA的DTW模型能够在线性变换下找到多模态数据对齐,但它们不能建模非线性关系。深度标准时间翘曲方法已经解决了这一问题[206],该方法可以看作是深度CCA和DTW的推广。
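下面是一个一维序列上的极简 DTW 实现(仅为示意动态规划的对齐思想;在多模态场景中,dist 需要换成两个模态之间手工定义或学习得到的相似性度量,例如上文提到的 CCA 协调空间中的距离)。

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """极简 DTW:给定两条序列和一个距离/相似性度量,返回最小累计对齐代价。"""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            # 允许匹配、插入、删除三种走法(即时间上的"翘曲")
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# 两条时间步数不同、但形状相似的一维序列
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 3, 2, 1, 0]
print(dtw(a, b))   # 累计代价越小,说明两条序列越容易对齐
```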

Various graphical models have also been popular for multimodal sequence alignment in an unsupervised manner. Early work by Yu and Ballard [239] used a generative graphical model to align visual objects in images with spoken words. A similar approach was taken by Cour et al. [44] to align movie shots and scenes to the corresponding screenplay. Malmaud et al. [131] used a factored HMM to align recipes to cooking videos, while Noulas et al. [154] used a dynamic Bayesian network to align speakers to videos. Naim et al. [147] matched sentences with corresponding video frames using a hierarchical HMM model to align sentences with frames and a modified IBM [28] algorithm for word and object alignment [15]. This model was then extended to use latent conditional random fields for alignments [146] and to incorporate verb alignment to actions in addition to nouns and objects [195].

Both DTW and graphical model approaches for alignment allow for restrictions on alignment, e.g. temporal consistency, no large jumps in time, and monotonicity. While DTW extensions allow for learning both the similarity metric and alignment jointly, graphical model based approaches require expert knowledge for construction [44], [239]. Supervised alignment methods rely on labeled aligned instances. They are used to train similarity measures that are used for aligning modalities.

各种图形模型也流行于无监督方式的多模态序列比对。Yu和Ballard的早期工作[239]使用生成图形模型,将图像中的视觉对象与口语对齐。Cour et al.[44]采用了类似的方法,将电影镜头和场景与相应的剧本对齐。Malmaud等人[131]使用一种经过分解的HMM将食谱与烹饪视频进行对齐,而Noulas等人[154]使用动态贝叶斯网络将说话者与视频进行对齐。Naim等人[147]使用分层HMM模型对句子和帧进行对齐,并使用改进的IBM[28]算法对单词和对象进行对齐[15],将句子与相应的视频帧进行匹配。随后,该模型被扩展到使用潜在条件随机场进行对齐[146],并将动词对齐合并到动作中,除了名词和对象之外[195]。

DTW和图形模型的对齐方法都允许对对齐的限制,例如时间一致性、时间上没有大的跳跃和单调性。虽然DTW扩展可以同时学习相似性度量和对齐,但基于图形模型的方法需要专家知识来构建[44],[239]。监督对齐方法依赖于标记对齐的实例。它们被用来训练用于对齐模态的相似性度量。

A number of supervised sequence alignment techniques take inspiration from unsupervised ones. Bojanowski et al. [22], [23] proposed a method similar to canonical time warping, but have also extended it to take advantage of existing (weak) supervisory alignment data for model training. Plummer et al. [161] used CCA to find a coordinated space between image regions and phrases for alignment. Gebru et al. [65] trained a Gaussian mixture model and performed semi-supervised clustering together with an unsupervised latent-variable graphical model to align speakers in an audio channel with their locations in a video. Kong et al. [108] trained a Markov random field to align objects in 3D scenes to nouns and pronouns in text descriptions.

Deep learning based approaches are becoming popular for explicit alignment (specifically for measuring similarity) due to very recent availability of aligned datasets in the lan-guage and vision communities [133], [161]. Zhu et al. [252] aligned books with their corresponding movies/scripts by training a CNN to measure similarities between scenes and text. Mao et al. [133] used an LSTM language model and a CNN visual one to evaluate the quality of a match between a referring expression and an object in an image. Yu et al.[242] extended this model to include relative appearance and context information that allows to better disambiguate between objects of the same type. Finally, Hu et al. [85] used an LSTM based scoring function to find similarities between image regions and their descriptions.

许多有监督的序列对齐技术从无监督方法中汲取灵感。Bojanowski等人[22],[23]提出了一种类似典型时间翘曲的方法,并对其进行了扩展,以利用现有的(弱)监督对齐数据进行模型训练。Plummer等人[161]利用CCA在图像区域和短语之间找到一个协调空间来进行对齐。Gebru等人[65]训练了高斯混合模型,并将半监督聚类与无监督的潜变量图模型结合,将音频通道中的说话人与其在视频中的位置对齐。Kong等人[108]训练了马尔可夫随机场,将3D场景中的物体与文本描述中的名词和代词对齐。

基于深度学习的方法在显式对齐(特别是度量相似性)方面正变得流行起来,这是由于最近在语言和视觉社区中对齐数据集的可用性[133],[161]。Zhu等人[252]通过训练CNN来衡量场景和文本之间的相似性,将书籍与相应的电影/脚本对齐。Mao等人[133]使用LSTM语言模型和CNN视觉模型来评估参考表达和图像中物体匹配的质量。Yu等人[242]将该模型扩展到包含相对外观和上下文信息,从而可以更好地消除同一类型对象之间的歧义。最后,Hu等[85]使用基于LSTM的评分函数来寻找图像区域与其描述之间的相似点。

5.2 Implicit alignment隐式对齐

In contrast to explicit alignment, implicit alignment is used as an intermediate (often latent) step for another task. This allows for better performance in a number of tasks including speech recognition, machine translation, media description, and visual question-answering. Such models do not explic-itly align data and do not rely on supervised alignment examples, but learn how to latently align the data during model training. We identify two types of implicit alignment models: earlier work based on graphical models, and more modern neural network methods.

Graphical models have seen some early work used to better align words between languages for machine translation [216] and alignment of speech phonemes with their transcriptions [186]. However, they require manual construction of a mapping between the modalities, for example a generative phone model that maps phonemes to acoustic features [186]. Constructing such models requires training data or human expertise to define them manually.

Neural networks Translation (Section 4) is an example of a modeling task that can often be improved if alignment is performed as a latent intermediate step. As we mentioned before, neural networks are popular ways to address this translation problem, using either an encoder-decoder model or through cross-modal retrieval. When translation is per-formed without implicit alignment, it ends up putting a lot of weight on the encoder module to be able to properly summarize the whole image, sentence or a video with a single vectorial representation.

与显式对齐相反,隐式对齐用作另一个任务的中间(通常是潜在的)步骤。这允许在许多任务中有更好的表现,包括语音识别、机器翻译、媒体描述和视觉问题回答。这些模型不显式地对齐数据,也不依赖于监督对齐示例,而是学习如何在模型训练期间潜在地对齐数据。我们确定了两种类型的隐式对齐模型:基于图形模型的早期工作,以及更现代的神经网络方法。

图模型的一些早期工作被用于在机器翻译中更好地对齐不同语言之间的单词[216],以及对齐语音音素与其转录文本[186]。然而,它们需要人工构建模态之间的映射,例如把音素映射到声学特征的生成式音素模型[186]。构建这类模型需要训练数据或人类专业知识来手动定义。

神经网络 翻译(第4节)是这样一类建模任务的例子:如果把对齐作为一个潜在的中间步骤,通常可以改进其效果。正如前面提到的,神经网络是解决翻译问题的常用方法,既可以使用编码器-解码器模型,也可以通过跨模态检索来实现。如果在翻译时不做隐式对齐,就会给编码器模块带来很大压力,要求它用单个向量表示恰当地概括整个图像、句子或视频。

A very popular way to address this is through attention [12], which allows the decoder to focus on sub-components of the source instance. This is in contrast with encoding all source sub-components together, as is performed in a conventional encoder-decoder model. An attention module will tell the decoder to look more at targeted sub-components of the source to be translated — areas of an image [230], words of a sentence [12], segments of an audio sequence [35], [39], frames and regions in a video [236], [241], and even parts of an instruction [140]. For example, in image captioning instead of encoding an entire image using a CNN, an attention mechanism will allow the decoder (typically an RNN) to focus on particular parts of the image when generating each successive word [230]. The attention module which learns what part of the image to focus on is typically a shallow neural network and is trained end-to-end together with a target task (e.g., translation).

Attention models have also been successfully applied to question answering tasks, as they allow for aligning the words in a question with sub-components of an information source such as a piece of text [228], an image [62], or a video sequence [246]. This both allows for better performance in question answering and leads to better model interpretability [4]. In particular, different types of attention models have been proposed to address this problem, including hierarchical [128], stacked [234], and episodic memory attention [228].

解决这一问题的一种非常流行的方法是注意力机制[12],它允许解码器聚焦于源实例的子组件。这与传统编码器-解码器模型中把所有源子组件一起编码的做法形成对比。注意力模块会告诉解码器更多地关注待翻译源中有针对性的子组件——图像中的区域[230]、句子中的单词[12]、音频序列的片段[35],[39]、视频中的帧和区域[236],[241],甚至指令的某些部分[140]。例如,在图像描述中,注意力机制允许解码器(通常是RNN)在生成每个后续单词时聚焦于图像的特定部分,而不是用CNN对整个图像做一次性编码[230]。学习关注图像哪一部分的注意力模块通常是一个浅层神经网络,并与目标任务(如翻译)一起进行端到端训练。

注意力模型也已成功应用于问答任务,因为它们允许将问题中的单词与信息源的子组件(如一段文本[228]、一幅图像[62]或一段视频序列[246])对齐。这既允许更好的问题回答性能,也导致更好的模型可解释性[4]。特别是,人们提出了不同类型的注意模型来解决这个问题,包括层次结构注意[128]、堆叠注意[234]和情景记忆注意[228]。
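下面给出一个软注意力模块的极简 PyTorch 草图(遵循 [12]、[230] 中加性注意力的一般思路,并非任何论文的原始实现;SoftAttention 等名称为本文虚构):解码器在生成每个词时,根据当前隐状态为各图像区域计算权重,并用加权和作为上下文向量。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """极简软注意力:根据解码器当前隐状态,为每个图像区域打分并加权求和。"""
    def __init__(self, region_dim, hidden_dim, attn_dim=128):
        super().__init__()
        self.w_region = nn.Linear(region_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):        # regions: (B, N, region_dim)
        e = self.score(torch.tanh(
            self.w_region(regions) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)            # (B, N, 1) 每个区域的注意力权重
        context = (alpha * regions).sum(dim=1) # (B, region_dim) 加权上下文向量
        return context, alpha.squeeze(-1)

attn = SoftAttention(region_dim=512, hidden_dim=256)
regions = torch.randn(2, 49, 512)              # 例如 7x7 的 CNN 特征图展开为 49 个区域
hidden = torch.randn(2, 256)                   # 解码器生成当前词时的隐状态
context, alpha = attn(regions, hidden)         # context 供解码器预测下一个词
```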

Another neural alternative for aligning images with captions for cross-modal retrieval was proposed by Karpathy et al. [98], [99]. Their proposed model aligns sentence fragments to image regions by using a dot product similarity measure between image region and word representations. While it does not use attention, it extracts a latent alignment between modalities through a similarity measure that is learned indirectly by training a retrieval model.

Karpathy等人[98],[99]提出了另一种用于图像与描述文字对齐、以进行跨模态检索的神经方法。他们提出的模型使用图像区域表示与单词表示之间的点积相似度,将句子片段与图像区域对齐。虽然它不使用注意力,但它通过训练检索模型间接学得的相似度度量,提取出模态之间的潜在对齐。
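作为示意,下面用点积相似度给一个"图像区域-句子"对打分(参照 [99] 中"每个词对齐到最相似区域、再对所有词求和"的打分思路,具体形式做了简化;区域嵌入和词嵌入假定已经映射到同一空间)。

```python
import torch

def image_sentence_score(region_feats, word_feats):
    """点积相似度打分的示意实现:每个词与所有区域做点积,取最大值后对词求和。"""
    sim = word_feats @ region_feats.t()        # (num_words, num_regions) 相似度矩阵
    return sim.max(dim=1).values.sum()         # 每个词对齐到与其最相似的区域

regions = torch.randn(19, 300)                 # 19 个候选图像区域的嵌入(虚构数据)
words = torch.randn(8, 300)                    # 句子中 8 个词的嵌入(虚构数据)
print(image_sentence_score(regions, words))
```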

5.3 Discussion讨论

Multimodal alignment faces a number of difficulties:

1) there are few datasets with explicitly annotated alignments;

2) it is difficult to design similarity metrics between modalities;

3) there may exist multiple possible alignments and not all elements in one modality have correspondences in another.

Earlier work on multimodal alignment focused on aligning multimodal sequences in an unsupervised manner using graphical models and dynamic programming techniques. It relied on hand-defined measures of similarity between the modalities or learnt them in an unsupervised manner. With recent availability of labeled training data supervised learning of similarities between modalities has become possible. However, unsupervised techniques of learning to jointly align and translate or fuse data have also become popular.

多模态对齐面临许多困难:

1)具有明确注释对齐的数据集很少

2) 难以设计模态之间的相似性度量

3) 可能存在多种可能的对齐方式,并且并非一种模态中的所有元素在另一种模态中都有对应关系。

早期关于多模态对齐的工作侧重于使用图形模型和动态规划技术以无监督方式对齐多模态序列。 它依靠手动定义的模态之间的相似性度量或以无人监督的方式学习它们。 随着最近标记训练数据的可用性,对模态之间相似性的监督学习成为可能。 然而,学习联合对齐和翻译或融合数据的无监督技术也变得流行起来。

6 Fusion融合

Multimodal fusion is one of the original topics in multimodal machine learning, with previous surveys emphasizing early, late and hybrid fusion approaches [50], [247]. In technical terms, multimodal fusion is the concept of integrating information from multiple modalities with the goal of predicting an outcome measure: a class (e.g., happy vs. sad) through classification, or a continuous value (e.g., positivity of sentiment) through regression. It is one of the most researched aspects of multimodal machine learning with work dating to 25 years ago [243].

The interest in multimodal fusion arises from three main benefits it can provide. First, having access to multiple modalities that observe the same phenomenon may allow for more robust predictions. This has been especially ex-plored and exploited by the AVSR community [163]. Second, having access to multiple modalities might allow us to capture complementary information — something that is not visible in individual modalities on their own. Third, a multimodal system can still operate when one of the modalities is missing, for example recognizing emotions from the visual signal when the person is not speaking [50].

多模态融合是多模态机器学习中最原始的主题之一,以往的研究强调早期、晚期和混合融合方法[50],[247]。用技术术语来说,多模态融合是将来自多种模态的信息整合在一起的概念,目的是预测一个结果度量:通过分类得到一个类别(例如,快乐vs.悲伤),或者通过回归得到一个连续值(例如,情绪的积极性)。这是多模态机器学习研究最多的方面之一,可追溯到25年前的工作[243]。

人们对多模态融合的兴趣源于它能提供的三个主要好处。首先,能够获取观察同一现象的多种模态,可能带来更稳健的预测。AVSR社区对此进行了特别的探索和利用[163]。其次,获取多种模态可能让我们捕捉到互补信息——这些信息在任何单一模态中都无法单独看到。第三,当其中一种模态缺失时,多模态系统仍然可以运行,例如,当一个人不说话时,从视觉信号中识别其情绪[50]。

Multimodal fusion has a very broad range of appli-cations, including audio-visual speech recognition (AVSR)[163], multimodal emotion recognition [192], medical image analysis [89], and multimedia event detection [117]. There are a number of reviews on the subject [11], [163], [188],[247]. Most of them concentrate on multimodal fusion for a particular task, such as multimedia analysis, information retrieval or emotion recognition. In contrast, we concentrate on the machine learning approaches themselves and the technical challenges associated with these approaches.

While some prior work used the term multimodal fu-sion to include all multimodal algorithms, in this survey paper we classify approaches as fusion category when the multimodal integration is performed at the later prediction stages, with the goal of predicting outcome measures. In recent work, the line between multimodal representation and fusion has been blurred for models such as deep neural networks where representation learning is interlaced with classification or regression objectives. As we will describe in this section, this line is clearer for other approaches such as graphical models and kernel-based methods.

We classify multimodal fusion into two main categories: model-agnostic approaches (Section 6.1) that are not directly dependent on a specific machine learning method; and model-based (Section 6.2) approaches that explicitly address fusion in their construction — such as kernel-based approaches, graphical models, and neural networks. An overview of such approaches can be seen in Table 5.

多模态融合有非常广泛的应用,包括视听语音识别[163]、多模态情感识别[192]、医学图像分析[89]、多媒体事件检测[117]。关于这个主题[11],[163],[188],[247]有许多评论。它们大多集中于针对特定任务的多模态融合,如多媒体分析、信息检索或情感识别。相比之下,我们专注于机器学习方法本身以及与这些方法相关的技术挑战。

虽然之前的一些工作使用术语多模态融合来包括所有的多模态算法,但在本调查论文中,当多模态集成在后期预测阶段进行时,我们将方法归类为融合类别,目的是预测结果度量。在最近的工作中,多模态表示和融合之间的界限已经模糊,例如在深度神经网络中,表示学习与分类或回归目标交织在一起。正如我们将在本节中描述的那样,这一行对于其他方法(如图形模型和基于内核的方法)来说更清晰。

我们将多模态融合分为两大类:不直接依赖于特定机器学习方法的模型无关方法(章节6.1);和基于模型(第6.2节)的方法,这些方法在其构造中明确地处理融合——例如基于内核的方法、图形模型和神经网络。这些方法的概述见表5。

Table 5: A summary of our taxonomy of multimodal fusion approaches. OUT — output type (class — classification or reg — regression), TEMP — whether temporal modeling is possible.

表5:我们对多模态融合方法的分类总结。OUT:输出类型(class为分类,reg为回归);TEMP:是否可进行时间建模。

6.1 Model-agnostic approaches与模型无关的方法

Historically, the vast majority of multimodal fusion has been done using model-agnostic approaches [50]. Such approaches can be split into early (i.e., feature-based), late (i.e., decision-based) and hybrid fusion [11]. Early fusion integrates features immediately after they are extracted (often by simply concatenating their representations). Late fusion on the other hand performs integration after each of the modalities has made a decision (e.g., classification or regression). Finally, hybrid fusion combines outputs from early fusion and individual unimodal predictors. An advantage of model agnostic approaches is that they can be implemented using almost any unimodal classifiers or regressors.

Early fusion could be seen as an initial attempt by multimodal researchers to perform multimodal representation learning — as it can learn to exploit the correlation and interactions between low level features of each modality. Furthermore it only requires the training of a single model, making the training pipeline easier compared to late and hybrid fusion.

历史上,绝大多数的多模态融合都是使用模型无关的方法[50]完成的。这样的方法可以分为早期(即基于特征的)、后期(即基于决策的)和混合融合[11]。早期融合会在特征被提取后立即进行整合(通常是简单地将它们的表示连接起来)。另一方面,晚期融合在每种模态做出决定(如分类或回归)后进行整合。最后,混合融合结合早期融合和单个单模态预测的结果。模型无关方法的一个优点是,它们可以使用几乎任何单模态分类器或回归器来实现。

早期融合可以被看作是多模态研究人员进行多模态表征学习的初步尝试,因为它能够学习利用各模态低层特征之间的相关性和相互作用。而且,它只需要训练单个模型,与晚期融合和混合融合相比,训练流程更简单。

In contrast, late fusion uses unimodal decision values and fuses them using a fusion mechanism such as averaging [181], voting schemes [144], weighting based on channel noise [163] and signal variance [53], or a learned model [68], [168]. It allows for the use of different models for each modality as different predictors can model each individual modality better, allowing for more flexibility. Furthermore, it makes it easier to make predictions when one or more of the modalities is missing and even allows for training when no parallel data is available. However, late fusion ignores the low level interaction between the modalities.

Hybrid fusion attempts to exploit the advantages of both of the above described methods in a common framework. It has been used successfully for multimodal speaker identification [226] and multimedia event detection (MED) [117].

相反,后期融合使用单模态决策值,并使用一种融合机制来融合它们,如平均[181]、投票方案[144]、基于信道噪声[163]和信号方差[53]的加权或学习模型[68]、[168]。它允许为每个模态使用不同的模型,因为不同的预测器可以更好地为每个模态建模,从而具有更大的灵活性。此外,当一个或多个模态缺失时,它可以更容易地进行预测,甚至可以在没有并行数据可用时进行训练。然而,晚期融合忽略了模态之间低水平的相互作用。

混合融合尝试在一个公共框架中利用上述两种方法的优点。它已成功地用于多模态说话人识别[226]和多媒体事件检测(MED)[117]。
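下面用 scikit-learn 给出早期融合与晚期融合的对照示意(数据为随机生成,仅说明"拼接特征"与"融合各模态决策"这两种与模型无关的做法)。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_audio = rng.randn(200, 20)                   # 两个模态的(虚构)特征
X_visual = rng.randn(200, 30)
y = rng.randint(0, 2, 200)

# 早期融合:直接拼接各模态特征,训练单一分类器
early_clf = LogisticRegression(max_iter=1000).fit(
    np.hstack([X_audio, X_visual]), y)

# 晚期融合:每个模态单独训练分类器,再对预测概率取平均(也可投票或加权)
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_visual, y)

def late_fusion_predict(xa, xv):
    p = (clf_a.predict_proba(xa) + clf_v.predict_proba(xv)) / 2
    return p.argmax(axis=1)

print(late_fusion_predict(X_audio[:5], X_visual[:5]))
```

可以看到,早期融合只训练一个模型,而晚期融合允许为每个模态选择不同的模型,并且在某个模态缺失时仍可工作,这与正文的讨论一致。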

6.2 Model-based approaches基于模型的方法

While model-agnostic approaches are easy to implement using unimodal machine learning methods, they end up using techniques that are not designed to cope with multimodal data. In this section we describe three categories of approaches that are designed to perform multimodal fusion: kernel-based methods, graphical models, and neural networks.

Multiple kernel learning (MKL) methods are an extension to kernel support vector machines (SVM) that allow for the use of different kernels for different modalities/views of the data [70]. As kernels can be seen as similarity functions between data points, modality-specific kernels in MKL allow for better fusion of heterogeneous data.

MKL approaches have been an especially popular method for fusing visual descriptors for object detection [31], [66] and only recently have been overtaken by deep learning methods for the task [109]. They have also seen use for multimodal affect recognition [36], [90], [182], multimodal sentiment analysis [162], and multimedia event detection (MED) [237]. Furthermore, McFee and Lanckriet [137] proposed to use MKL to perform musical artist similarity ranking from acoustic, semantic and social view data. Finally, Liu et al. [125] used MKL for multimodal fusion in Alzheimer’s disease classification. Their broad applicability demonstrates the strength of such approaches in various domains and across different modalities.

虽然使用单模态机器学习方法很容易实现模型无关的方法,但它们最终使用的技术不是用来处理多模态数据的。在本节中,我们将描述用于执行多模态融合的三类方法:基于核的方法、图形模型和神经网络。

多核学习(MKL)方法是对核支持向量机(SVM)的一种扩展,它允许对数据的不同模态/视图使用不同的核[70]。由于内核可以被视为数据点之间的相似函数,因此MKL中的特定于模态的内核可以更好地融合异构数据。

MKL方法曾是融合视觉描述符进行目标检测的一种特别流行的方法[31],[66],直到最近才在该任务上被深度学习方法超越[109]。它们也被用于多模态情感(affect)识别[36],[90],[182]、多模态情感分析[162]以及多媒体事件检测(MED)[237]。此外,McFee和Lanckriet[137]提出使用MKL,基于声学、语义和社交视图数据进行音乐艺术家相似度排序。最后,Liu等[125]将MKL用于阿尔茨海默病分类中的多模态融合。它们广泛的适用性表明了这类方法在不同领域和不同模态中的优势。
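下面是一个多核组合的极简示意(并非完整的 MKL 算法):为每个模态选用不同的核函数,把核矩阵加权求和后交给带预计算核的 SVM。真正的 MKL 会联合学习这些核权重,这里使用固定权重只为说明思路,数据均为随机虚构。

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.RandomState(0)
X_text = rng.randn(100, 50)                    # 模态 1(例如文本特征)
X_image = rng.randn(100, 128)                  # 模态 2(例如视觉特征)
y = rng.randint(0, 2, 100)

# 为每个模态选择各自的核,再按固定权重线性组合
K = 0.5 * linear_kernel(X_text) + 0.5 * rbf_kernel(X_image, gamma=0.01)

svm = SVC(kernel="precomputed").fit(K, y)
print(svm.score(K, y))                         # 训练集精度,仅验证流程可以跑通
```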

Besides flexibility in kernel selection, an advantage of MKL is the fact that the loss function is convex, allowing for model training using standard optimization packages and global optimum solutions [70]. Furthermore, MKL can be used to perform both regression and classification. One of the main disadvantages of MKL is the reliance on training data (support vectors) during test time, leading to slow inference and a large memory footprint.

Graphical models are another family of popular methods for multimodal fusion. In this section we overview work done on multimodal fusion using shallow graphical models. A description of deep graphical models such as deep belief networks can be found in Section 3.1.

The majority of graphical models can be classified into two main categories: generative — modeling joint probability; or discriminative — modeling conditional probability [200]. Some of the earliest approaches to use graphical models for multimodal fusion include generative models such as coupled [149] and factorial hidden Markov models [67] alongside dynamic Bayesian networks [64]. A more recently-proposed multi-stream HMM method proposes dynamic weighting of modalities for AVSR [75].

Arguably, generative models lost popularity to discriminative ones such as conditional random fields (CRF) [115] which sacrifice the modeling of joint probability for predictive power. A CRF model was used to better segment images by combining visual and textual information of image description [60]. CRF models have been extended to model latent states using hidden conditional random fields [165] and have been applied to multimodal meeting segmentation [173]. Other multimodal uses of latent variable discriminative graphical models include multi-view hidden CRF [194] and latent variable models [193]. More recently Jiang et al. [93] have shown the benefits of multimodal hidden conditional random fields for the task of multimedia classification. While most graphical models are aimed at classification, CRF models have been extended to a continuous version for regression [164] and applied in multimodal settings [13] for audio visual emotion recognition.

除了在核选择上的灵活性外,MKL的一个优点是损失函数是凸的,允许使用标准优化包和全局最优解进行模型训练[70]。此外,MKL可用于回归和分类。MKL的主要缺点之一是在测试期间依赖于训练数据(支持向量),导致推理缓慢和占用大量内存。

图形模型是另一类流行的多模态融合方法。在本节中,我们将概述使用浅层图形模型进行多模态融合的工作。深度图形模型(如深度信念网络)的描述可以在3.1节中找到。

大多数图模型可分为两大类:生成式(建模联合概率)和判别式(建模条件概率)[200]。最早将图模型用于多模态融合的方法包括生成式模型,如耦合隐马尔可夫模型[149]、阶乘隐马尔可夫模型[67]以及动态贝叶斯网络[64]。最近提出的一种多流HMM方法则为AVSR中的各模态引入了动态加权[75]。

可以说,生成式模型的热度已被条件随机场(CRF)[115]等判别式模型取代,后者牺牲了对联合概率的建模以换取预测能力。有工作通过结合图像描述的视觉和文本信息,用CRF模型对图像进行更好的分割[60]。CRF模型已被扩展为使用隐条件随机场来建模潜在状态[165],并被应用于多模态会议分割[173]。潜变量判别式图模型的其他多模态应用包括多视图隐CRF[194]和潜变量模型[193]。最近,Jiang等人[93]展示了多模态隐条件随机场在多媒体分类任务上的优势。虽然大多数图模型面向分类,但CRF模型也已扩展出用于回归的连续版本[164],并在多模态设置下用于视听情感识别[13]。

The benefit of graphical models is their ability to easily exploit spatial and temporal structure of the data, making them especially popular for temporal modeling tasks, such as AVSR and multimodal affect recognition. They also allow human expert knowledge to be built into the models, and often lead to interpretable models.

Neural Networks have been used extensively for the task of multimodal fusion [151]. The earliest examples of using neural networks for multi-modal fusion come from work on AVSR [163]. Nowadays they are being used to fuse information for visual and media question answering [63],[130], [229], gesture recognition [150], affect analysis [96],[153], and video description generation [94]. While the modalities used, architectures, and optimization techniques might differ, the general idea of fusing information in joint hidden layer of a neural network remains the same.

Neural networks have also been used for fusing temporal multimodal information through the use of RNNs and LSTMs. One of the earlier such applications used a bidirectional LSTM to perform audio-visual emotion classification [224]. More recently, Wöllmer et al. [223] used LSTM models for continuous multimodal emotion recognition, demonstrating its advantage over graphical models and SVMs. Similarly, Nicolaou et al. [152] used LSTMs for continuous emotion prediction. Their proposed method used an LSTM to fuse the results from modality specific (audio and facial expression) LSTMs.

图模型的优点是能够轻松利用数据的空间和时间结构,这使它们在时间建模任务(如AVSR和多模态情感识别)中特别受欢迎。它们还允许把人类专家知识融入模型,并且通常会产生可解释的模型。

神经网络已被广泛用于多模态融合的任务[151]。使用神经网络进行多模态融合的最早例子来自于AVSR的研究[163]。如今,它们被用于融合信息,用于视觉和媒体问答[63]、[130]、[229]、手势识别[150]、情感分析[96]、[153]和视频描述生成[94]。虽然所使用的模态、架构和优化技术可能不同,但在神经网络的联合隐层中融合信息的一般思想是相同的。

神经网络也通过使用RNN和LSTM来融合时序多模态信息。较早的此类应用之一使用双向LSTM进行视听情绪分类[224]。最近,Wöllmer等人[223]使用LSTM模型进行连续多模态情绪识别,证明了其优于图模型和支持向量机。同样,Nicolaou等人[152]使用LSTM进行连续情绪预测,他们提出的方法用一个LSTM来融合各特定模态(音频和面部表情)LSTM的结果。
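下面是一个在联合隐层中做融合的极简 PyTorch 草图(结构为本文虚构,仅说明"各模态先各自编码、再在共享隐层中融合"这一通用思路;时序场景下可把全连接层换成 LSTM)。

```python
import torch
import torch.nn as nn

class JointFusionNet(nn.Module):
    """在联合隐层中融合两个模态的极简网络(示意性结构)。"""
    def __init__(self, audio_dim, visual_dim, hidden_dim=64, num_classes=6):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # 融合发生在这里:两个模态的隐层表示拼接后再经过共享层
        self.joint = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio, visual):
        h = torch.cat([self.audio_net(audio), self.visual_net(visual)], dim=-1)
        return self.joint(h)                   # 例如情感类别的打分

model = JointFusionNet(audio_dim=40, visual_dim=512)
logits = model(torch.randn(8, 40), torch.randn(8, 512))
```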

Approaching modality fusion through recurrent neural networks has been used in various image captioning tasks, example models include: neural image captioning [214] where a CNN image representation is decoded using an LSTM language model, gLSTM [91] which incorporates the image data together with sentence decoding at every time step fusing the visual and sentence data in a joint repre-sentation. A more recent example is the multi-view LSTM (MV-LSTM) model proposed by Rajagopalan et al. [166]. MV-LSTM model allows for flexible fusion of modalities in the LSTM framework by explicitly modeling the modality-specific and cross-modality interactions over time.

A big advantage of deep neural network approaches in data fusion is their capacity to learn from large amount of data. Secondly, recent neural architectures allow for end-to-end training of both the multimodal representation compo-nent and the fusion component. Finally, they show good performance when compared to non neural network based system and are able to learn complex decision boundaries that other approaches struggle with.

The major disadvantage of neural network approaches is their lack of interpretability. It is difficult to tell what the prediction relies on, and which modalities or features play an important role. Furthermore, neural networks require large training datasets to be successful.

通过递归神经网络实现模态融合已被用于各种图像字幕任务,示例模型包括:神经图像字幕[214],其中CNN图像表示使用LSTM语言模型进行解码,gLSTM[91]将图像数据和每一步的句子解码结合在一起,将视觉数据和句子数据融合在一个联合表示中。最近的一个例子是Rajagopalan等人提出的多视图LSTM (MV-LSTM)模型[166]。MV-LSTM模型通过显式地建模随时间变化的特定模态和跨模态交互,允许LSTM框架中模态的灵活融合。

深度神经网络方法在数据融合中的一大优势是能够从大量数据中学习。其次,最近的神经体系结构允许端到端训练多模态表示组件和融合组件。最后,与基于非神经网络的系统相比,它们表现出了良好的性能,并且能够学习其他方法难以处理的复杂决策边界。

神经网络方法的主要缺点是缺乏可解释性。很难判断预测的依据是什么,以及哪种模态或特征发挥了重要作用。此外,神经网络需要大量的训练数据集才能成功。

6.3 Discussion讨论

Multimodal fusion has been a widely researched topic with a large number of approaches proposed to tackle it, including model agnostic methods, graphical models, multiple kernel learning, and various types of neural networks. Each approach has its own strengths and weaknesses, with some more suited for smaller datasets and others performing better in noisy environments. Most recently, neural networks have become a very popular way to tackle multimodal fusion, however graphical models and multiple kernel learning are still being used, especially in tasks with limited training data or where model interpretability is important.

多模态融合是一个被广泛研究的课题,已有大量方法被提出来解决它,包括与模型无关的方法、图模型、多核学习和各种类型的神经网络。每种方法都有自己的优缺点:一些更适合较小的数据集,另一些则在嘈杂环境中表现更好。最近,神经网络已成为处理多模态融合的一种非常流行的方法,但图模型和多核学习仍在使用,特别是在训练数据有限或模型可解释性很重要的任务中。

Despite these advances multimodal fusion still faces the following challenges:

1) signals might not be temporally aligned (possibly dense continuous signal and a sparse event);

2) it is difficult to build models that exploit supplementary and not only complementary information;

3) each modality might exhibit different types and different levels of noise at different points in time.

尽管有这些进展,但多模态融合仍面临以下挑战:

1)信号可能没有时间对齐(可能是密集的连续信号和稀疏事件);

2)很难建立能够利用补充性(supplementary)信息、而不仅仅是互补性信息的模型;

3)各模态在不同时间点可能表现出不同类型和不同水平的噪声。

7 Co-learning共同学习

The final multimodal challenge in our taxonomy is co-learning — aiding the modeling of a (resource poor) modality by exploiting knowledge from another (resource rich) modality. It is particularly relevant when one of the modalities has limited resources — lack of annotated data, noisy input, and unreliable labels. We call this challenge co-learning as most often the helper modality is used only during model training and is not used during test time. We identify three types of co-learning approaches based on their training resources: parallel, non-parallel, and hybrid. Parallel-data approaches require training datasets where the observations from one modality are directly linked to the observations from other modalities. In other words, when the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples are from the same speaker. In contrast, non-parallel data approaches do not require direct links between observations from different modalities. These approaches usually achieve co-learning by using overlap in terms of categories. For example, in zero shot learning when the conventional visual object recognition dataset is expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the modalities are bridged through a shared modality or a dataset. An overview of methods in co-learning can be seen in Table 6 and summary of data parallelism in Figure 3.

我们分类法中的最后一个多模态挑战是共同学习——通过利用另一种(资源丰富的)模态的知识来帮助(资源贫乏的)模态建模。当其中一种模态资源有限——缺乏标注数据、输入带噪声、标签不可靠——时,这一点尤其重要。我们称这种挑战为共同学习,因为大多数情况下,辅助模态只在模型训练时使用,而在测试时并不使用。我们根据训练资源将共同学习方法分为三类:并行、非并行和混合。并行数据方法要求训练数据集中一个模态的观测与其他模态的观测直接关联,也就是说,多模态观测来自相同的实例,例如在视听语音数据集中,视频和语音样本来自同一个说话人。相反,非并行数据方法不要求不同模态的观测之间有直接联系,这些方法通常借助类别上的重叠来实现共同学习。例如,在零样本学习中,用来自维基百科的纯文本数据集扩充传统的视觉对象识别数据集,以提高视觉对象识别的泛化能力。在混合数据设置中,各模态通过共享的模态或数据集来桥接。共同学习方法的概述见表6,数据并行性的总结见图3。

7.1 Parallel data并行数据

In parallel data co-learning both modalities share a set of instances — audio recordings with the corresponding videos, images and their sentence descriptions. This allows for two types of algorithms to exploit that data to better model the modalities: co-training and representation learning.

Co-training is the process of creating more labeled training samples when we have few labeled samples in a multimodal problem [21]. The basic algorithm builds weak classifiers in each modality to bootstrap each other with labels for the unlabeled data. In the seminal work of Blum and Mitchell [21], it has been shown to discover more training samples for web-page classification based on the web-page itself and the hyper-links leading to it. By definition this task requires parallel data as it relies on the overlap of multimodal samples.

在并行数据协同学习中,两种模态都共享一组实例—音频记录与相应的视频、图像及其句子描述。这就允许了两种类型的算法来利用这些数据来更好地为模态建模:协同训练和表示学习。

协同训练是在多模态问题[21]中有少量标记样本的情况下,创建更多标记训练样本的过程。基本算法在每个模态中构建弱分类器,对未标记的数据进行标签引导。Blum和Mitchell[21]的开创性工作表明,基于网页本身和超链接,可以发现更多的训练样本用于网页分类。根据定义,该任务需要并行数据,因为它依赖于多模态样本的重叠。
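下面是协同训练思想的一个极简草图(接口与变量名均为本文虚构,假设 Xa、Xb 是同一批实例在两个模态下的特征、labeled_idx 为少量已标注样本的下标):在两个模态各自训练弱分类器,让它们把各自最自信的预测作为伪标签加入共同的训练集,迭代扩充标注数据。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(Xa, Xb, y, labeled_idx, rounds=5, k=10):
    """极简协同训练:两个模态(视图)上的弱分类器轮流为无标注样本打伪标签。"""
    labeled = set(labeled_idx)
    pseudo_y = y.copy()
    clf_a = clf_b = None
    for _ in range(rounds):
        idx = sorted(labeled)
        clf_a = LogisticRegression(max_iter=1000).fit(Xa[idx], pseudo_y[idx])
        clf_b = LogisticRegression(max_iter=1000).fit(Xb[idx], pseudo_y[idx])
        # 每个分类器在自己的模态上挑选最自信的 k 个无标注样本,加入共同训练集
        for clf, X in ((clf_a, Xa), (clf_b, Xb)):
            unlabeled = [i for i in range(len(y)) if i not in labeled]
            if not unlabeled:
                break
            proba = clf.predict_proba(X[unlabeled])
            top = np.argsort(proba.max(axis=1))[-k:]
            for j in top:
                pseudo_y[unlabeled[j]] = proba[j].argmax()
                labeled.add(unlabeled[j])
    return clf_a, clf_b
```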

Figure 3: Types of data parallelism used in co-learning: parallel — modalities are from the same dataset and there is a direct correspondence between instances; non-parallel — modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts; hybrid — the instances or concepts are bridged by a third modality or a dataset.

图3:在共同学习中使用的数据并行类型:并行模态来自相同的数据集,并且实例之间有直接对应关系;非平行模态来自不同的数据集,没有重叠的实例,但在一般类别或概念上有重叠;混合—实例或概念由第三种模态或数据集连接起来。

Co-training has been used for statistical parsing [178] to build better visual detectors [120] and for audio-visual speech recognition [40]. It has also been extended to deal with disagreement between modalities, by filtering out unreliable samples [41]. While co-training is a powerful method for generating more labeled data, it can also lead to biased training samples resulting in overfitting. Transfer learning is another way to exploit co-learning with parallel data. Multimodal representation learning (Section 3.1) approaches such as multimodal deep Boltzmann machines [198] and multimodal autoencoders [151] transfer information from representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time [151].

Moon et al. [143] show how to transfer information from a speech recognition neural network (based on audio) to a lip-reading one (based on images), leading to a better visual representation, and a model that can be used for lip-reading without need for audio information during test time. Similarly, Arora and Livescu [10] build better acoustic features using CCA on acoustic and articulatory (location of lips, tongue and jaw) data. They use articulatory data only during CCA construction and use only the resulting acoustic (unimodal) representation during test time.

协同训练已被用于统计解析[178],以构建更好的视觉检测器[120],并用于视听语音识别[40]。通过过滤掉不可靠的样本[41],它还被扩展到处理不同模态之间的分歧。虽然协同训练是一种生成更多标记数据的强大方法,但它也会导致有偏差的训练样本导致过拟合。迁移学习是利用并行数据进行共同学习的另一种方法。多模态表示学习(3.1节)方法,如多模态深度玻尔兹曼机[198]和多模态自编码器[151],将信息从一个模态的表示传递到另一个模态的表示。这不仅导致了多模态表示,而且还导致了更好的单模态表示,在测试期间只使用了一个模态[151]。

Moon等人[143]展示了如何将信息从语音识别神经网络(基于音频)传输到唇读神经网络(基于图像),从而获得更好的视觉表示,以及在测试期间可以用于唇读而不需要音频信息的模型。类似地,Arora和Livescu[10]利用声学和发音(嘴唇、舌头和下巴的位置)数据的CCA构建了更好的声学特征。他们仅在CCA构建期间使用发音数据,并在测试期间仅使用产生的声学(单模态)表示。

7.2 Non-parallel data非并行数据

Methods that rely on non-parallel data do not require the modalities to have shared instances, but only shared categories or concepts. Non-parallel co-learning approaches can help when learning representations, allow for better semantic concept understanding and even perform unseen object recognition.

 Table 6: A summary of co-learning taxonomy, based on data parallelism. Parallel data — multiple modalities can see the same instance. Non-parallel data — unimodal instances are independent of each other. Hybrid data — the modalities are pivoted through a shared modality or dataset.

依赖于非并行数据的方法不需要模态拥有共享的实例,而只需要共享的类别或概念。非平行的共同学习方法可以帮助学习表征,允许更好的语义概念理解,甚至执行看不见的对象识别。

表6:基于数据并行性的协同学习分类概述。并行数据-多种模态可以看到相同的实例。非并行数据——单模态实例彼此独立。混合数据—模态是通过共享的模态或数据集来转换的。

Transfer learning is also possible on non-parallel data and allows to learn better representations through transferring information from a representation built using a data rich or clean modality to a data scarce or noisy modality. This type of transfer learning is often achieved by using coordinated multimodal representations (see Section 3.2). For example, Frome et al. [61] used text to improve visual representations for image classification by coordinating CNN visual features with word2vec textual ones [141] trained on separate large datasets. Visual representations trained in such a way result in more meaningful errors — mistaking objects for ones of similar category [61]. Mahasseni and Todorovic [129] demonstrated how to regularize a color video based LSTM using an autoencoder LSTM trained on 3D skeleton data by enforcing similarities between their hidden states. Such an approach is able to improve the original LSTM and lead to state-of-the-art performance in action recognition. Conceptual grounding refers to learning semantic meanings or concepts not purely based on language but also on additional modalities such as vision, sound, or even smell [16]. While the majority of concept learning approaches are purely language-based, representations of meaning in humans are not merely a product of our linguistic exposure, but are also grounded through our sensorimotor experience and perceptual system [17], [126]. Human semantic knowledge relies heavily on perceptual information [126] and many concepts are grounded in the perceptual system and are not purely symbolic [17]. This implies that learning semantic meaning purely from textual information might not be optimal, and motivates the use of visual or acoustic cues to ground our linguistic representations.

在非并行数据上也可以进行迁移学习:通过把信息从用数据丰富或干净模态构建的表示迁移到数据稀缺或带噪声的模态,可以学到更好的表示。这种类型的迁移学习通常借助协调的多模态表示来实现(见第3.2节)。例如,Frome等人[61]通过将CNN视觉特征与在单独的大规模数据集上训练的word2vec文本特征[141]相协调,利用文本来改善图像分类的视觉表示。以这种方式训练的视觉表征会产生更有意义的错误——把物体误认为相近类别的物体[61]。Mahasseni和Todorovic[129]演示了如何用在3D骨架数据上训练的自编码器LSTM,通过强制二者隐藏状态相似,来正则化基于彩色视频的LSTM。这种方法能够改进原有的LSTM,在动作识别上达到当时最先进的性能。概念落地(conceptual grounding)是指不单纯基于语言、而且基于视觉、声音甚至嗅觉等额外模态来学习语义含义或概念[16]。虽然大多数概念学习方法是纯语言的,但人类对意义的表征并不仅仅是语言接触的产物,它同样根植于我们的感觉运动经验和感知系统[17],[126]。人类的语义知识严重依赖于感知信息[126],许多概念建立在感知系统的基础上,而非纯粹的符号[17]。这意味着,单纯从文本信息中学习语义可能不是最优的,这促使人们使用视觉或听觉线索来为语言表征提供落地基础。

Starting from work by Feng and Lapata [59], grounding is usually performed by finding a common latent space between the representations [59], [183] (in case of parallel datasets) or by learning unimodal representations sepa-rately and then concatenating them to lead to a multimodal one [29], [101], [172], [181] (in case of non-parallel data). Once a multimodal representation is constructed it can be used on purely linguistic tasks. Shutova et al. [181] and Bruni et al. [29] used grounded representations for better classification of metaphors and literal language. Such representations have also been useful for measuring conceptual similarity and relatedness — identifying how semantically or conceptually related two words are [30], [101], [183] or actions [172]. Furthermore, concepts can be grounded not only using visual signals, but also acoustic ones, leading to better performance especially on words with auditory associations [103], or even olfactory signals [102] for words with smell associations. Finally, there is a lot of overlap between multimodal alignment and conceptual grounding, as aligning visual scenes to their descriptions leads to better textual or visual representations [108], [161], [172], [240].

Conceptual grounding has been found to be an effective way to improve performance on a number of tasks. It also shows that language and vision (or audio) are com-plementary sources of information and combining them in multimodal models often improves performance. However, one has to be careful as grounding does not always lead to better performance [102], [103], and only makes sense when grounding has relevance for the task — such as grounding using images for visually-related concepts.

从Feng和Lapata[59]的工作开始,概念落地通常通过两种方式实现:在表示之间找到一个共同的潜在空间[59],[183](适用于平行数据集),或者分别学习各单模态表示再把它们拼接成多模态表示[29],[101],[172],[181](适用于非平行数据)。一旦构建了多模态表示,它就可以用于纯语言任务。Shutova等人[181]和Bruni等人[29]使用落地的表示来更好地区分隐喻和字面语言。这类表示在度量概念相似度和相关性方面也很有用——即识别两个词[30],[101],[183]或动作[172]在语义或概念上的关联程度。此外,概念不仅可以用视觉信号落地,也可以用听觉信号落地,这尤其能提升具有听觉关联的词的表现[103];对具有嗅觉关联的词,甚至可以使用嗅觉信号[102]。最后,多模态对齐与概念落地之间有很多重叠,因为把视觉场景与其描述对齐会带来更好的文本或视觉表征[108],[161],[172],[240]。

概念落地已被证明是在许多任务上提升性能的有效手段。它也表明语言与视觉(或音频)是互补的信息来源,在多模态模型中把它们结合起来通常能提升性能。然而需要注意的是,落地并不总能带来更好的性能[102],[103],只有当落地与任务相关时才有意义——例如用图像为与视觉相关的概念提供落地。

Zero shot learning (ZSL) refers to recognizing a concept without having explicitly seen any examples of it, for example classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address as, in a number of tasks such as visual object classification, it is prohibitively expensive to provide training examples for every imaginable object of interest.

There are two main types of ZSL — unimodal and multimodal. The unimodal ZSL looks at component parts or attributes of the object, such as phonemes to recognize an unheard word, or visual attributes such as color, size, and shape to predict an unseen visual class [55]. The multimodal ZSL recognizes objects in the primary modality with the help of the secondary one, in which the object has been seen. By definition, the multimodal version of ZSL faces a non-parallel data problem, as the overlap of seen classes differs between the modalities.

Socher et al. [190] map image features to a conceptual word space and are able to classify between seen and unseen concepts. The unseen concepts can then be assigned to a word that is close to the visual representation — this is enabled by the semantic space being trained on a separate dataset that has seen more concepts. Instead of learning a mapping from visual to concept space, Frome et al. [61] learn a coordinated multimodal representation between concepts and images that allows for ZSL. Palatucci et al. [158] perform prediction of words people are thinking of based on functional magnetic resonance images, showing how it is possible to predict unseen words through the use of an intermediate semantic space. Lazaridou et al. [118] present a fast mapping method for ZSL by mapping extracted visual feature vectors to text-based vectors through a neural network.

零样本学习(ZSL)指的是在没有明确见过任何样例的情况下识别一个概念,例如在从未见过(带标注的)猫的图像的情况下,对图像中的猫进行分类。这是一个需要解决的重要问题:在视觉物体分类等许多任务中,为每一个可能感兴趣的物体提供训练样例的代价高得令人望而却步。

ZSL主要有两种类型——单模态和多模态。单模态ZSL关注物体的组成部分或属性,例如利用音素来识别未听过的单词,或利用颜色、大小和形状等视觉属性来预测未见过的视觉类别[55]。多模态ZSL则借助辅助模态(物体在其中已被见过)来识别主模态中的物体。按照定义,多模态版本的ZSL面对的是非并行数据问题,因为已见类别的重叠在不同模态之间是不同的。

Socher等人[190]将图像特征映射到概念词空间,从而能够对已见和未见的概念进行分类。未见的概念随后可以被分配给与其视觉表示接近的单词——这得益于语义空间是在一个见过更多概念的独立数据集上训练的。Frome等人[61]没有学习从视觉空间到概念空间的映射,而是学习概念和图像之间的协调多模态表示,从而实现ZSL。Palatucci等人[158]基于功能性磁共振图像预测人们正在思考的单词,他们展示了如何通过使用中间语义空间来预测未见过的单词。Lazaridou等人[118]提出了一种用于ZSL的快速映射方法,通过神经网络将提取的视觉特征向量映射到基于文本的向量。
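
To make the mapping-based flavour of multimodal ZSL concrete, the Python sketch below classifies an image among labels never seen during training by projecting its visual features into a word-embedding space and picking the nearest label embedding, which is the general recipe behind approaches such as [190] and [118]. The projection matrix is assumed to have been learned on seen classes only; the shapes, names, and toy data are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_feat, proj_matrix, label_embeddings):
    """Assign a (possibly unseen) class label to an image.

    image_feat:       (d_v,) visual feature, e.g. from a pretrained CNN
    proj_matrix:      (d_w, d_v) visual-to-semantic mapping learned on seen classes only
    label_embeddings: dict {label: (d_w,) word vector}, may include unseen labels
    """
    z = proj_matrix @ image_feat
    z = z / (np.linalg.norm(z) + 1e-8)
    scores = {
        label: float(z @ (w / (np.linalg.norm(w) + 1e-8)))
        for label, w in label_embeddings.items()
    }
    # Nearest label embedding (by cosine similarity) wins, seen or not.
    return max(scores, key=scores.get)

# Toy usage: a random "mapping" and random label embeddings.
rng = np.random.default_rng(0)
proj = rng.normal(size=(300, 4096))
labels = {w: rng.normal(size=300) for w in ["cat", "dog", "zebra"]}  # "zebra" never seen in training
print(zero_shot_classify(rng.normal(size=4096), proj, labels))
```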

7.3 Hybrid data混合数据

In the hybrid data setting two non-parallel modalities are bridged by a shared modality or a dataset (see Figure 3c). The most notable example is the Bridge Correlational Neural Network [167], which uses a pivot modality to learn coordinated multimodal representations in the presence of non-parallel data. For example, in the case of multilingual image captioning, the image modality would always be paired with at least one caption in any language. Such methods have also been used to bridge languages that might not have parallel corpora but have access to a shared pivot language, such as for machine translation [148], [167] and document transliteration [100].

在混合数据设置中,两个非并行模态由一个共享的模态或数据集桥接起来(见图3c)。最著名的例子是桥接相关神经网络(Bridge Correlational Neural Network)[167],它使用一个枢轴模态,在存在非并行数据的情况下学习协调的多模态表示。例如,在多语言图像字幕的任务中,图像模态总是与至少一种语言的字幕配对。这类方法也被用于桥接那些可能没有并行语料库、但都能访问某个共享枢轴语言的语言,例如机器翻译[148],[167]和文档音译[100]。
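
The following Python sketch conveys the pivot idea in its simplest form and is not the actual Bridge Correlational Neural Network [167]: two non-parallel modalities are each coordinated with a shared pivot modality, so their encoders end up in a common space even though no paired examples between them exist. The encoders, dimensions, and cosine-based coordination loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coordination_loss(a, b):
    """Pull paired items from two modalities together in the shared space
    (negative mean cosine similarity of L2-normalised embeddings)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return -(a * b).sum(dim=-1).mean()

# Hypothetical encoders: captions in language X, captions in language Y, and
# images acting as the pivot P. X and Y are never paired with each other.
enc_x = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
enc_y = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
enc_p = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(), nn.Linear(128, 64))

def training_step(x, p_for_x, y, p_for_y):
    # Each non-pivot modality is only ever paired with the pivot, yet both
    # encoders are driven into the same 64-d shared space through it.
    return coordination_loss(enc_x(x), enc_p(p_for_x)) + \
           coordination_loss(enc_y(y), enc_p(p_for_y))

# Toy usage with random tensors standing in for real features.
loss = training_step(torch.randn(4, 300), torch.randn(4, 2048),
                     torch.randn(4, 300), torch.randn(4, 2048))
```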

Instead of using a separate modality for bridging, some methods rely on the existence of large datasets from a similar or related task to achieve better performance in a task that only contains limited annotated data. Socher and Fei-Fei [189] use the existence of large text corpora to guide image segmentation, while Hendricks et al. [78] use a separately trained visual model and language model to build a better image and video description system for which only limited data is available.

另一些方法不使用单独的模态进行桥接,而是依赖来自相似或相关任务的大型数据集,以便在只有有限标注数据的任务中获得更好的性能。Socher和Fei-Fei[189]利用现有的大型文本语料库来指导图像分割;而Hendricks等人[78]则使用分别训练的视觉模型和语言模型,来构建更好的图像和视频描述系统,该任务本身只有有限的数据可用。

7.4 Discussion讨论

Multimodal co-learning allows for one modality to influence the training of another, exploiting the complementary information across modalities. It is important to note that co-learning is task independent and could be used to create better fusion, translation, and alignment models. This challenge is exemplified by algorithms such as co-training, multimodal representation learning, conceptual grounding, and zero shot learning (ZSL) and has found many applications in visual classification, action recognition, audio-visual speech recognition, and semantic similarity estimation.

多模态共同学习允许一种模态影响另一种模态的训练,利用不同模态之间的互补信息。值得注意的是,共同学习是独立于任务的,可以用来创建更好的融合、翻译和对齐模型。协同训练、多模态表示学习、概念基础和零样本学习(ZSL)等算法都体现了这一挑战,并在视觉分类、动作识别、视听语音识别和语义相似度估计等方面得到了许多应用。

8 Conclusion结论

As part of this survey, we introduced a taxonomy of multimodal machine learning: representation, translation, fusion, alignment, and co-learning.

Some of them such as fusion have been studied for a long time, but more recent interest in representation and translation have led to a large number of new multimodal algorithms and exciting multimodal applications.

We believe that our taxonomy will help to catalog future research papers and also better understand the remaining unresolved problems facing multimodal machine learning.

作为调查的一部分,我们介绍了多模态机器学习的分类:表示翻译对齐融合共同学习

其中一些如融合已经被研究了很长时间,但最近对表示翻译的兴趣导致了大量新的多模态算法和令人兴奋的多模态应用。

我们相信我们的分类法将有助于对未来的研究论文进行分类,并更好地理解多模态机器学习面临的剩余未解决问题。
