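Because my research concerns fusing three modalities of data (text, speech, and images), this summary covers feature fusion methods for these three modalities. The article is organized into methods, curated link collections, and survey and other papers.
Methods:
After feature extraction, multimodal feature fusion methods fall into four categories: feature-level fusion, decision-level fusion, hybrid-level fusion, and model-level fusion. In turn: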
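Python code (feature-level fusion):
    import torch
    import torch.nn as nn


    class Feature_level(nn.Module):
        def __init__(self):
            super(Feature_level, self).__init__()
            # A single linear predictor over 200-dimensional features.
            self.fc = nn.Linear(in_features=200, out_features=1)

        def forward(self, feature_a, feature_t, feature_v):
            # Join the audio, text, and visual features along dimension 0.
            x = torch.cat((feature_a, feature_t, feature_v), 0)
            # Keep the first prediction; assumes 3-D inputs whose last dimension is 200.
            out = self.fc(x)[:, 0][0]
            return out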
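The class above joins the three modalities along dimension 0. A common alternative reading of feature-level (early) fusion concatenates along the feature axis so that one predictor sees a single joint vector per sample. The following is a minimal sketch of that variant, not the author's implementation; the class name `EarlyFusion`, the 2-D `[batch, dim]` inputs, and the per-modality sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Feature-level fusion: join the modality features, then predict once."""

    def __init__(self, dim_a=200, dim_t=200, dim_v=200):
        super().__init__()
        # One predictor over the joint feature vector (sizes are hypothetical).
        self.fc = nn.Linear(dim_a + dim_t + dim_v, 1)

    def forward(self, feature_a, feature_t, feature_v):
        # Concatenate along the last (feature) axis: [batch, dim_a + dim_t + dim_v].
        fused = torch.cat((feature_a, feature_t, feature_v), dim=-1)
        return self.fc(fused).squeeze(-1)  # [batch]


torch.manual_seed(0)
batch = 4
a, t, v = (torch.randn(batch, 200) for _ in range(3))
early_out = EarlyFusion()(a, t, v)
print(early_out.shape)  # torch.Size([4])
```

With this layout the modalities may have different feature sizes, since only the batch dimension has to match.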
Python code (decision-level fusion):
    import torch
    import torch.nn as nn


    class decision_level(nn.Module):
        def __init__(self):
            super(decision_level, self).__init__()
            # fc1 is shared by all three modalities; fc2 maps to the final score.
            self.fc1 = nn.Linear(in_features=200, out_features=100)
            self.fc2 = nn.Linear(in_features=100, out_features=1)

        def forward(self, feature_a, feature_t, feature_v):
            x_a = self.fc1(feature_a)
            x_t = self.fc1(feature_t)
            x_v = self.fc1(feature_v)
            # Weight each modality (audio 0.3, text 0.4, visual 0.3), then join along dimension 0.
            x = torch.cat((0.3 * x_a, 0.4 * x_t, 0.3 * x_v), 0)
            # Keep the first prediction; assumes 3-D inputs whose last dimension is 200.
            out = self.fc2(x)[:, 0][0]
            return out
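Decision-level (late) fusion is often described as combining per-modality predictions rather than per-modality features. As a hedged sketch of that reading, the example below gives each modality its own prediction head and takes a weighted sum of the three predictions. The 0.3/0.4/0.3 weights come from the code above; the class name `LateFusion`, the separate heads, and the 2-D `[batch, 200]` inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Decision-level fusion: predict per modality, then combine the predictions."""

    def __init__(self, in_features=200, weights=(0.3, 0.4, 0.3)):
        super().__init__()
        # One independent prediction head per modality (hypothetical layout).
        self.head_a = nn.Linear(in_features, 1)
        self.head_t = nn.Linear(in_features, 1)
        self.head_v = nn.Linear(in_features, 1)
        # Fixed fusion weights, stored as a buffer so they follow the module's device.
        self.register_buffer("weights", torch.tensor(weights))

    def forward(self, feature_a, feature_t, feature_v):
        preds = torch.cat(
            (self.head_a(feature_a), self.head_t(feature_t), self.head_v(feature_v)),
            dim=-1,
        )  # [batch, 3], one prediction per modality
        return (preds * self.weights).sum(dim=-1)  # weighted sum -> [batch]


torch.manual_seed(0)
a, t, v = (torch.randn(4, 200) for _ in range(3))
late_model = LateFusion()
late_out = late_model(a, t, v)
print(late_out.shape)  # torch.Size([4])
```

The fusion weights here are fixed constants; making them learnable (for example with `nn.Parameter`) would be a different design choice.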
Python code (hybrid-level fusion):
    import torch
    import torch.nn as nn


    class Hybrid_level(nn.Module):
        def __init__(self):
            super(Hybrid_level, self).__init__()
            # fc1 produces a per-modality score; fc2 predicts from features plus score (200 + 1).
            self.fc1 = nn.Linear(in_features=200, out_features=1)
            self.fc2 = nn.Linear(in_features=201, out_features=1)

        def forward(self, feature_a, feature_t, feature_v):
            # Inputs are assumed to be 3-D with a last dimension of 200.
            a_x = self.fc1(feature_a)
            t_x = self.fc1(feature_t)
            v_x = self.fc1(feature_v)
            # Append each modality's score to that modality's own features.
            a = torch.cat((feature_a, a_x), 2)
            t = torch.cat((feature_t, t_x), 2)
            v = torch.cat((feature_v, v_x), 2)
            x = torch.cat((a, t, v), 0)
            out = self.fc2(x)[:, 0][0]
            return out
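Model-level fusion. This method aims to obtain a joint feature representation of the three modalities, and its implementation depends mainly on the fusion model used. Model-level fusion is a deeper form of fusion that produces a better-optimized joint discriminative representation for classification and regression tasks. Take ML-LSTM as an example. Multi-layer LSTM (Multi-layers LSTM, ML-LSTM) is one model-level fusion method: it combines a multi-layer network with the conventional LSTM model and, by taking the relationships between utterances into account, handles utterance-level multimodal fusion during learning. Here it is used for depression recognition. The fusion proceeds as follows: the text features are fed into the first LSTM layer (Layer1) to obtain the hidden state of each cell; the audio features are concatenated with the Layer1 hidden states and fed into the second LSTM layer (Layer2) to obtain the second layer's hidden states; the visual features are concatenated with the Layer2 hidden states and fed into the third LSTM layer (Layer3) to obtain the third layer's hidden states; finally, the fused features are fed into an FC layer to obtain the final prediction.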
Python code:
    import torch
    import torch.nn as nn


    class MLLSTM(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, batch_size, num_layers, num_directions):
            super(MLLSTM, self).__init__()

            self.input_size = input_size
            self.hidden_size = hidden_size
            self.output_size = output_size
            self.batch_size = batch_size
            self.num_layers = num_layers
            self.num_directions = num_directions

            # Move the whole module to a GPU with .to(device) instead of hard-coding .cuda() per layer.
            bidirectional = (num_directions == 2)
            self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers,
                                 bidirectional=bidirectional)
            self.lstm2 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers,
                                 bidirectional=bidirectional)
            self.lstm3 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers,
                                 bidirectional=bidirectional)
            # self.attention_weights_layer = nn.Sequential(
            #     nn.Linear(hidden_size, hidden_size),
            #     nn.ReLU(inplace=True)
            # )
            # Shared projection of LSTM outputs; output_size must equal input_size
            # so the projection can be added to the next modality's features.
            self.fc = nn.Linear(hidden_size * num_directions, output_size)
            self.pred = nn.Linear(output_size, 1)

        def forward(self, x, y, z):
            # x, y, z: text, audio, and visual features, each [seq_len, batch, input_size]
            # (the layout nn.LSTM expects by default).

            # The dataset size may not be a multiple of the preset batch_size,
            # so read the actual batch size from the input.
            batch_size = x.size(1)

            # Initial hidden and cell states for the first LSTM.
            h_0 = torch.randn(self.num_layers * self.num_directions, batch_size, self.hidden_size, device=x.device)
            c_0 = torch.randn(self.num_layers * self.num_directions, batch_size, self.hidden_size, device=x.device)

            # out: [seq_len, batch, num_directions * hidden_size]; for a multi-layer LSTM,
            # out holds only the last layer's output h_t at each time step t.
            # h_n, c_n: [num_layers * num_directions, batch, hidden_size]
            text_out, (text_h_n, text_c_n) = self.lstm1(x, (h_0, c_0))
            text = self.fc(text_out)
            audio_out, (audio_h_n, audio_c_n) = self.lstm2(y + text, (text_h_n, text_c_n))
            audio = self.fc(audio_out)
            visual_out, (visual_h_n, visual_c_n) = self.lstm3(z + audio, (audio_h_n, audio_c_n))
            visual = self.fc(visual_out)
            out = self.pred(visual)
            return out[:, 0][0]
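The `MLLSTM` class above combines modalities by addition and passes the LSTM states from one layer to the next, whereas the description speaks of concatenating each new modality with the previous layer's hidden states. The following is a minimal sketch of that concatenation variant under stated assumptions: the class name `ConcatMLLSTM`, the `batch_first` layout, single-layer unidirectional LSTMs, the feature sizes in the usage lines, and the use of the last time step for prediction are all illustrative choices, not details taken from the ML-LSTM paper.

```python
import torch
import torch.nn as nn


class ConcatMLLSTM(nn.Module):
    """Three stacked LSTMs; each later layer sees a new modality plus the previous hidden states."""

    def __init__(self, text_dim, audio_dim, visual_dim, hidden_size):
        super().__init__()
        self.layer1 = nn.LSTM(text_dim, hidden_size, batch_first=True)
        # Layer2 and Layer3 take the new modality concatenated with the previous hidden states.
        self.layer2 = nn.LSTM(audio_dim + hidden_size, hidden_size, batch_first=True)
        self.layer3 = nn.LSTM(visual_dim + hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, text, audio, visual):
        # Each input: [batch, seq_len, dim]; the sequences are assumed to be aligned in length.
        h1, _ = self.layer1(text)                              # [batch, seq_len, hidden]
        h2, _ = self.layer2(torch.cat((audio, h1), dim=-1))    # audio + Layer1 hidden states
        h3, _ = self.layer3(torch.cat((visual, h2), dim=-1))   # visual + Layer2 hidden states
        # Predict from the fused representation at the last time step (an assumption).
        return self.fc(h3[:, -1]).squeeze(-1)                  # [batch]


torch.manual_seed(0)
text = torch.randn(2, 5, 300)
audio = torch.randn(2, 5, 40)
visual = torch.randn(2, 5, 20)
mllstm_out = ConcatMLLSTM(300, 40, 20, hidden_size=64)(text, audio, visual)
print(mllstm_out.shape)  # torch.Size([2])
```

Concatenation lets the modalities keep different feature sizes, while the additive version requires every modality to share the same `input_size`.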
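Curated link collections:
The following are practical sites that specifically collect papers on multimodal fusion (all of them list high-quality multimodal papers):
Papers:
1. Survey papers:
Deep Multimodal Representation Learning: A Survey. This paper summarizes and discusses deep learning methods for multimodal data, analyzing the categories of methods and their respective strengths and weaknesses. Link: Deep Multimodal Representation Learning: A Survey (IEEE Xplore)
2. Feature-level fusion papers:
A Multi-Modal Hierarchical Recurrent Neural Network for Depression Detection: integrates the three modalities with a hierarchical BiLSTM. The first level encodes the audio and visual features and weights the hidden states by confidence scores to obtain the corresponding features; the second level fuses the three modality features through their confidence-weighted average.
Multi-level Attention network using text, audio and video for Depression Prediction: obtains audio and visual features with BiLSTMs and text features with a pretrained tool, then applies an attention layer over the three modalities and takes the weighted sum to obtain the fused feature.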
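The attention-weighted sum over three modalities mentioned in the last entry can be sketched as follows. This is only an illustration of the general idea, not the architecture from that paper: the class name `AttentionFusion`, the single linear scoring layer, and the shared 200-dimensional feature size are assumptions.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Score each modality, normalize the scores with softmax, and take the weighted sum."""

    def __init__(self, dim=200):
        super().__init__()
        # Hypothetical scoring layer: one scalar relevance score per modality vector.
        self.score = nn.Linear(dim, 1)

    def forward(self, feature_a, feature_t, feature_v):
        stacked = torch.stack((feature_a, feature_t, feature_v), dim=1)  # [batch, 3, dim]
        weights = torch.softmax(self.score(stacked), dim=1)              # [batch, 3, 1]
        fused = (weights * stacked).sum(dim=1)                           # [batch, dim]
        return fused, weights.squeeze(-1)


torch.manual_seed(0)
a, t, v = (torch.randn(4, 200) for _ in range(3))
att_fused, att_weights = AttentionFusion()(a, t, v)
print(att_fused.shape, att_weights.shape)  # torch.Size([4, 200]) torch.Size([4, 3])
```

Because the weights are computed per sample, the model can lean on different modalities for different inputs, unlike a fixed weighting.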
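3. Decision-level fusion papers:
Depression recognition based on dynamic facial and vocal expression features using partial least square regression
Detect depression from communication: how computer vision, signal processing, and sentiment analysis join forces
4. Hybrid-level fusion papers:
A Linguistically-Informed Fusion Approach for Multimodal Depression Detection: trains a separate prediction model for each modality, obtains a prediction from each unimodal model, and then uses these predictions as new feature vectors to train a new model for the final prediction.
Cross-cultural detection of depression from nonverbal behaviour: concatenates each modality's results into an early-fusion feature vector for model prediction, then uses majority voting to evaluate the results.
Depression Status Estimation by Deep Learning based Hybrid Multi-Modal Fusion Model: feeds the three modality features into separate Linear layers to obtain a score for each, concatenates the scores into a fused feature, and feeds the fused feature into a Linear layer for training.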
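The score-then-fuse pattern described in the last entry (one Linear layer per modality, concatenated scores, a final Linear layer) can be sketched as below. This is a minimal illustration under assumptions, not the paper's model: the class name `ScoreFusion`, the 200-dimensional inputs, and the single-score outputs are hypothetical.

```python
import torch
import torch.nn as nn


class ScoreFusion(nn.Module):
    """Per-modality Linear layers produce scores; a final Linear layer fuses them."""

    def __init__(self, dim_a=200, dim_t=200, dim_v=200):
        super().__init__()
        self.score_a = nn.Linear(dim_a, 1)
        self.score_t = nn.Linear(dim_t, 1)
        self.score_v = nn.Linear(dim_v, 1)
        # The fused feature is just the three scores, so the final layer has 3 inputs.
        self.fuse = nn.Linear(3, 1)

    def forward(self, feature_a, feature_t, feature_v):
        scores = torch.cat(
            (self.score_a(feature_a), self.score_t(feature_t), self.score_v(feature_v)),
            dim=-1,
        )  # [batch, 3]
        return self.fuse(scores).squeeze(-1), scores


torch.manual_seed(0)
a, t, v = (torch.randn(4, 200) for _ in range(3))
hybrid_out, hybrid_scores = ScoreFusion()(a, t, v)
print(hybrid_out.shape, hybrid_scores.shape)  # torch.Size([4]) torch.Size([4, 3])
```

Returning the intermediate scores alongside the fused prediction makes it possible to supervise or inspect each modality separately.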
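5. Model-level fusion papers:
Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition: ML-LSTM, as described under model-level fusion above.
Deep-HOSeq: Deep Higher Order Sequence Fusion for Multimodal Sentiment Analysis
6. Other papers:
To be updated.
If anything here infringes on your rights, I will revise or remove it immediately. If you spot technical errors, feedback is welcome.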