MuseTalk 2024.4.2
GPU: NVIDIA 4070 (12 GB)
How to generate high-quality videos with MuseTalk (usage tips)
MuseTalk was trained in latent space, where the images were encoded by a frozen VAE. The audio was encoded by a frozen whisper-tiny model. The architecture of the generation network was borrowed from the UNet of stable-diffusion-v1-4, where the audio embeddings are fused into the image embeddings by cross-attention.
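To make the cross-attention fusion concrete, here is a minimal sketch in plain Python. It is not MuseTalk's actual code: the function names, the single attention head, the toy dimensions, and the random weights are all assumptions chosen for illustration. It only shows the general mechanism, in which image tokens act as queries and audio tokens supply the keys and values.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(image_tokens, audio_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, hypothetical helper).

    image_tokens: [n_img][d_img]  -> projected to queries
    audio_tokens: [n_aud][d_aud]  -> projected to keys and values
    Returns one fused vector per image token: [n_img][d].
    """
    q = matmul(image_tokens, w_q)
    k = matmul(audio_tokens, w_k)
    v = matmul(audio_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    fused = []
    for q_row in q:
        # Similarity of this image token to every audio token.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted average of the audio value vectors.
        fused.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))]
        )
    return fused


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image_tokens = random_matrix(4, 8, rng)  # 4 latent image tokens, width 8 (assumed)
audio_tokens = random_matrix(3, 6, rng)  # 3 audio tokens, width 6 (assumed)
w_q = random_matrix(8, 5, rng)
w_k = random_matrix(6, 5, rng)
w_v = random_matrix(6, 5, rng)

fused = cross_attention(image_tokens, audio_tokens, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # one fused vector of width 5 per image token
```

In a full UNet block the attention output would typically be projected back to the image channel width and added to the image features through a residual connection; that step is omitted here to keep the sketch short.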
Note that although we use an architecture very similar to that of Stable Diffusion, MuseTalk is distinct in that it is NOT a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with a single