Recent advances in 3D-aware generative models (3D-aware GANs) combined with Neural Radiance Fields (NeRF) have achieved impressive results. However, no prior work investigates 3D-aware GANs for 3D-consistent multi-class image-to-image (3D-aware I2I) translation. Naively using 2D-I2I translation methods suffers from unrealistic shape/identity changes. To perform 3D-aware multi-class I2I translation, we decouple the learning process into a multi-class 3D-aware GAN step and a 3D-aware I2I translation step. In the first step, we propose two novel techniques: a new conditional architecture and an effective training strategy. In the second step, based on the well-trained multi-class 3D-aware GAN architecture, which preserves view consistency, we construct a 3D-aware I2I translation system. To further reduce view-consistency problems, we propose several new techniques, including a U-net-like adaptor network design, a hierarchical representation constraint and a relative regularization loss. In extensive experiments on two datasets, quantitative and qualitative results demonstrate that we successfully perform 3D-aware I2I translation with multi-view consistency. Code is available at 3DI2I.
Figure 1: 3D-aware I2I translation: given a view-consistent 3D scene (the input), our method maps it into a high-quality target-specific image. Our approach produces consistent results across viewpoints.
Neural Radiance Fields (NeRF) have increasingly gained attention with their outstanding capacity to synthesize high-quality view-consistent images [41, 32, 68]. Benefiting from the adversarial mechanism [12], StyleNeRF [13] and concurrent works [46, 5, 9, 71] have successfully synthesized high-quality view-consistent, detailed 3D scenes by combining NeRF with StyleGAN-like generator design [23]. This recent progress in 3D-aware image synthesis has not yet been extended to 3D-aware I2I translation, where the aim is to translate in a 3D-consistent manner from a source scene to a target scene of another class (see Figure 1).
A naive strategy is to use well-designed 2D-I2I translation methods [17, 48, 72, 16, 29, 67, 27, 65]. These methods, however, suffer from unrealistic shape/identity changes when the viewpoint changes, which are especially noticeable in videos: main target-class characteristics, such as hair, ears, and noses, are not geometrically realistic, leading to results that are particularly disturbing when applying I2I to translate videos. Also, these methods typically underestimate the viewpoint change, resulting in target videos with less viewpoint change than the source video. Another direction is to apply video-to-video synthesis methods [55, 2, 3, 7, 31]. These approaches, however, rely heavily on either labeled data or multi-view frames of each object. In this work, we assume that we only have access to single-view RGB data.
To perform 3D-aware I2I translation, we extend the theory developed for 2D-I2I with recent developments in 3D-aware image synthesis. We decouple the learning process into a multi-class 3D-aware generative model step and a 3D-aware I2I translation step. The former can synthesize view-consistent 3D scenes given a scene label, thereby addressing the 3D inconsistency problems we discussed for 2D-I2I. We use this 3D-aware generative model to initialize our 3D-aware I2I model, which therefore inherits the capacity to synthesize 3D-consistent images. To effectively train a multi-class 3D-aware generative model (see Figure 2(b)), we provide a new training strategy consisting of: (1) training an unconditional 3D-aware generative model (i.e., StyleNeRF) and (2) partially initializing the multi-class 3D-aware generative model (i.e., multi-class StyleNeRF) with the weights learned from StyleNeRF. In the 3D-aware I2I translation step, we design a 3D-aware I2I translation architecture (Figure 2(f)) adapted from the trained multi-class StyleNeRF network. To be specific, we use the main network of the pretrained discriminator (Figure 2(b)) to initialize the encoder E of the 3D-aware I2I translation model (Figure 2(f)), and correspondingly, the pretrained generator (Figure 2(b)) to initialize the 3D-aware I2I generator (Figure 2(f)). This initialization preserves the networks' sensitivity to view information.
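To make the partial-initialization strategy concrete, a minimal PyTorch sketch is shown below. The `partial_init` helper and the toy modules are our own illustration, not the released code: it copies only the parameters whose names and shapes match, so newly added class-conditional layers keep their fresh initialization.

```python
import torch
from torch import nn

def partial_init(target: nn.Module, source: nn.Module) -> None:
    """Copy every parameter whose name and shape match from `source` into
    `target`; weights without a pretrained counterpart (e.g., the class
    embedding of the conditional model) keep their fresh initialization."""
    src = source.state_dict()
    dst = target.state_dict()
    dst.update({k: v for k, v in src.items()
                if k in dst and v.shape == dst[k].shape})
    target.load_state_dict(dst)

# Toy stand-ins (assumptions, not the paper's architecture): the
# multi-class module adds an extra embedding for the scene label.
uncond = nn.ModuleDict({"backbone": nn.Linear(512, 512)})
multi = nn.ModuleDict({"backbone": nn.Linear(512, 512),
                       "label_embed": nn.Embedding(3, 512)})
partial_init(multi, uncond)  # backbone transferred, label embedding fresh
```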
Directly using the constructed 3D-aware I2I translation model (Figure 2(f)), some view-consistency problems still remain. This is due to the lack of multi-view consistency regularization and to training on only single-view images. To address these problems, we introduce several techniques, including a U-net-like adaptor network design, a hierarchical representation constraint and a relative regularization loss.
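As a rough illustration of what a U-net-like adaptor could look like (layer sizes and structure are our assumptions, not the paper's exact network), the sketch below adapts encoder features while skip connections let fine, view-dependent detail bypass the bottleneck, which is one way to help preserve multi-view consistency:

```python
import torch
from torch import nn

class UNetAdaptor(nn.Module):
    """Minimal U-net-style adaptor sketch (hypothetical layer sizes).
    Skip connections carry view-dependent detail past the bottleneck."""
    def __init__(self, ch: int = 256):
        super().__init__()
        self.down1 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d1 = self.act(self.down1(x))      # 1/2 resolution
        d2 = self.act(self.down2(d1))     # 1/4 resolution (bottleneck)
        u1 = self.act(self.up1(d2)) + d1  # skip connection
        return self.act(self.up2(u1)) + x # skip connection, input resolution

feats = torch.randn(1, 256, 32, 32)
out = UNetAdaptor()(feats)  # same shape as the input features
```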
In sum, our work makes the following contributions:
We are the first to explore 3D-aware multi-class I2I translation, which allows generating 3D consistent videos.
We decouple 3D-aware I2I translation into two steps. First, we propose a multi-class StyleNeRF, together with a new training strategy to train it effectively. Second, we propose a 3D-aware I2I translation architecture.
To further address the view-inconsistency problem of 3D-aware I2I translation, we propose several techniques: a U-net-like adaptor, a hierarchical representation constraint and a relative regularization loss.
In extensive experiments, our 3D-aware I2I method considerably outperforms existing 2D-I2I systems when evaluating temporal consistency.
Neural Implicit Fields. Representing 3D scenes with neural implicit fields has shown unprecedented quality. [40, 39, 47, 50, 53, 45] use 3D supervision to predict neural implicit fields. Recently, NeRF has demonstrated the power of neural implicit representations: NeRF and its variants [41, 32, 68] utilize volume rendering to reconstruct a 3D scene as a combination of neural radiance and density fields and to synthesize novel views.
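For reference, the volume-rendering step these methods build on composites the predicted colors and densities along each camera ray (standard NeRF notation as in [41]):

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),
```

where c_i and σ_i are the color and density predicted at the i-th sample along ray r, and δ_i is the distance between adjacent samples.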
3D-aware GANs. Recent approaches [42, 6, 44, 43, 36, 14, 10, 20, 64, 70, 54] learn neural implicit representations without 3D or multi-view supervision. Combined with an adversarial loss, these methods typically sample random viewpoints, render photorealistic 2D images, and optimize their 3D representations. StyleNeRF [13] and concurrent works [46, 5, 9, 71] have successfully synthesized high-quality, view-consistent, detailed 3D scenes with a StyleGAN-like generator design [23]. In this paper, we investigate 3D-aware image-to-image (3D-aware I2I) translation, where the aim is to translate in a 3D-consistent manner from a source scene to a target scene of another class. We also build on transfer learning for GANs [62, 57].
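Schematically, one training iteration of such a 3D-aware GAN can be pictured as follows. This is a generic adversarial recipe with a non-saturating loss, not the code of any cited work; `sample_random_camera` is a hypothetical pose-sampling helper.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real_images):
    """One generic 3D-aware GAN iteration (illustrative sketch).
    G volume-renders a 2D image from a latent code and a randomly
    sampled camera pose; D is an ordinary 2D image discriminator."""
    z = torch.randn(real_images.size(0), 512)
    pose = sample_random_camera(real_images.size(0))  # hypothetical helper

    # Discriminator update: real photos vs. rendered fakes.
    fake = G(z, pose).detach()
    d_loss = F.softplus(D(fake)).mean() + F.softplus(-D(real_images)).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: fooling D indirectly optimizes the underlying
    # 3D representation, since every fake is rendered through it.
    g_loss = F.softplus(-D(G(z, pose))).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```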
I2I translation. I2I translation with GANs [17, 61, 63, 59] has gained increasing attention in computer vision. Depending on the task setting, recent works focus on paired I2I translation [11, 17, 73], unpaired I2I translation [25, 33, 38, 48, 66, 72, 65, 52, 1, 19, 58, 60, 28], diverse I2I translation [25, 33, 38, 48, 66, 72] and scalable I2I translation [8, 30, 67].
However, none of these approaches addresses the problem of 3D-aware I2I. For 3D scenes represented by neural implicit fields, directly using these methods leads to view inconsistency.
Problem setting. Our goal is to achieve 3D consistent multi-class I2I translation trained on single-view data only. The system is designed to translate a viewpoint-video consisting of multiple images (source domain) into a new, photorealistic viewpoint-video scene of a target class. Furthermore, the system should be able to handle multi-class target domains. We decouple our learning into a multi-class 3D-aware generative model step and a multi-class 3D-aware I2I translation step.
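Concretely, inference over a viewpoint-video can be pictured as the per-view loop below; the components E (encoder), A (adaptor) and G (generator) mirror Figure 2(f), and all call signatures are assumptions for illustration, not the released interface:

```python
import torch

@torch.no_grad()
def translate_video(E, A, G, source_frames, target_label):
    """Per-view translation sketch (assumed signatures). Each source view
    is encoded, its features are adapted toward the target class, and the
    frozen generator renders the translated view."""
    outputs = []
    for frame in source_frames:           # one RGB image per viewpoint
        feats = E(frame)                  # view-aware features from encoder E
        adapted = A(feats, target_label)  # class-conditioned adaptor A
        outputs.append(G(adapted))        # frozen generator G renders the view
    return torch.stack(outputs)
```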
Figure 2: Overview of our method. (a) We first train a 3D-aware generative model (i.e., StyleNeRF) with single-view photos. (b) We extend StyleNeRF to multi-class StyleNeRF and introduce an effective training strategy: initializing multi-class StyleNeRF with StyleNeRF. (c) Training of the proposed 3D-aware I2I translation model. It consists of the encoder E, the adaptor A, the generator G and two mapping networks M1 and M2. We freeze all networks except the adaptor A, which is the only component we train. The encoder is initialized with the main network of the pretrained discriminator. We introduce several techniques to address the view-consistency problems, including a U-net-like adaptor A, (d) a relative regularization loss and (e) a hierarchical representation constraint. (f) Usage of the proposed model at inference time.