Venue: ICLR 2021
Authors: Mingjian Chen, Xu Tan
Affiliation: Microsoft
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    def __init__(self, normal_shape, epsilon=1e-5):
        super(ConditionalLayerNorm, self).__init__()
        if isinstance(normal_shape, int):
            self.normal_shape = normal_shape
        else:
            raise TypeError("normal_shape must be an int")
        self.speaker_embedding_dim = 256
        self.epsilon = epsilon
        # Linear layers predict the scale and bias from the speaker embedding
        self.W_scale = nn.Linear(self.speaker_embedding_dim, self.normal_shape)
        self.W_bias = nn.Linear(self.speaker_embedding_dim, self.normal_shape)
        self.reset_parameters()

    def reset_parameters(self):
        # Initialize so the module starts as a plain LayerNorm:
        # scale = 1 and bias = 0, independent of the speaker embedding
        torch.nn.init.constant_(self.W_scale.weight, 0.0)
        torch.nn.init.constant_(self.W_scale.bias, 1.0)
        torch.nn.init.constant_(self.W_bias.weight, 0.0)
        torch.nn.init.constant_(self.W_bias.bias, 0.0)

    def forward(self, x, speaker_embedding):
        # Standard layer normalization over the last (hidden) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
        std = (var + self.epsilon).sqrt()
        y = (x - mean) / std
        # Speaker-conditional affine transform (broadcast over the time axis)
        scale = self.W_scale(speaker_embedding)
        bias = self.W_bias(speaker_embedding)
        y = y * scale.unsqueeze(1) + bias.unsqueeze(1)
        return y
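A minimal standalone sketch of the same idea, useful as a sanity check: with the zero/one initialization above, conditional LayerNorm reduces to a plain (affine-free) LayerNorm regardless of the speaker embedding. The tensor sizes here (hidden size 384, a batch of 2, 10 time steps) are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

emb_dim, hidden = 256, 384  # hypothetical sizes for illustration
W_scale = nn.Linear(emb_dim, hidden)
W_bias = nn.Linear(emb_dim, hidden)
# Same initialization as ConditionalLayerNorm.reset_parameters
nn.init.constant_(W_scale.weight, 0.0)
nn.init.constant_(W_scale.bias, 1.0)
nn.init.constant_(W_bias.weight, 0.0)
nn.init.constant_(W_bias.bias, 0.0)

x = torch.randn(2, 10, hidden)   # (batch, time, hidden)
spk = torch.randn(2, emb_dim)    # (batch, speaker embedding)

# Conditional LayerNorm, written out inline
mean = x.mean(dim=-1, keepdim=True)
std = (((x - mean) ** 2).mean(dim=-1, keepdim=True) + 1e-5).sqrt()
y = (x - mean) / std
y = y * W_scale(spk).unsqueeze(1) + W_bias(spk).unsqueeze(1)

# Reference: plain LayerNorm without a learned affine transform
ref = nn.LayerNorm(hidden, eps=1e-5, elementwise_affine=False)(x)
```

At initialization `y` matches `ref`; during fine-tuning only `W_scale` and `W_bias` need to be updated, which is what makes the adaptation parameter-efficient.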
PPGs-based: an ASR model extracts PPGs (phonetic posteriorgrams), which are fed into an extra PPG encoder; since PPGs carry purely textual content, this setting serves as an upper bound for the mel encoder.
Joint training: the mel encoder and phoneme encoder are trained simultaneously; this comparison demonstrates that the staged training proposed in this paper yields smaller deviation.
Comparison of adaptation strategies
Comparison of adaptation data
Difficulties of zero-shot TTS: (1) prior work has focused on making the speaker encoder extract more reliable speaker embeddings, but in the one-shot setting this is not very dependable, and speaker identity is tied not only to timbre but also strongly to prosody and style; (2) prior work concatenates the speaker embedding directly with the phoneme embeddings before feeding the decoder, but unseen speaker embeddings hurt the decoder's generalization; (3) generating a mel-spectrogram whose timbre is consistent with the reference is hard in practice; some work adds a speaker-classification loss as an auxiliary objective, but the improvement is marginal.