Paper Notes on BigGAN: Translation and Commentary on "Large Scale GAN Training For High Fidelity Natural Image Synthesis"
Contents
Translation and Commentary on "Large Scale GAN Training For High Fidelity Natural Image Synthesis"
3.1 TRADING OFF VARIETY AND FIDELITY WITH THE TRUNCATION TRICK
4.1 CHARACTERIZING INSTABILITY: THE GENERATOR
4.2 CHARACTERIZING INSTABILITY: THE DISCRIMINATOR
5.2 ADDITIONAL EVALUATION ON JFT-300M
Appendix B Architectural Details
How good are the results? Start with the numbers. Trained on ImageNet at 128×128 resolution, BigGAN reaches an Inception Score (IS) of 166.5, improving on the previous best of 52.52 by more than 100 points and moving much closer to the 233 of real images, while the Fréchet Inception Distance (FID) improves from the previous best of 18.65 to 7.4.
Can you tell which of the images below are AI-generated fakes and which are real photographs?
One more: of the following eight, which are fake?
Now for the answer: all twelve images above are generated. It is easy to see why the results were met with such astonishment and praise.
Authors | Andrew Brock (Heriot-Watt University, ajb5@hw.ac.uk), Jeff Donahue (DeepMind, jeffdonahue@google.com), Karen Simonyan (DeepMind, simonyan@google.com)
Venue | ICLR 2019
Link |
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Fréchet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.65.
Figure 1: Class-conditional samples generated by our model.
The state of generative image modeling has advanced dramatically in recent years, with Generative Adversarial Networks (GANs, Goodfellow et al. (2014)) at the forefront of efforts to generate high-fidelity, diverse images with models learned directly from data. GAN training is dynamic, and sensitive to nearly every aspect of its setup (from optimization parameters to model architecture), but a torrent of research has yielded empirical and theoretical insights enabling stable training in a variety of settings. Despite this progress, the current state of the art in conditional ImageNet modeling (Zhang et al., 2018) achieves an Inception Score (Salimans et al., 2016) of 52.5, compared to 233 for real data.
In this work, we set out to close the gap in fidelity and variety between images generated by GANs and real-world images from the ImageNet dataset. We make the following three contributions towards this goal:
· We demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art. We introduce two simple, general architectural changes that improve scalability, and modify a regularization scheme to improve conditioning, demonstrably boosting performance.
· As a side effect of our modifications, our models become amenable to the "truncation trick," a simple sampling technique that allows explicit, fine-grained control of the trade-off between sample variety and fidelity.
· We discover instabilities specific to large scale GANs, and characterize them empirically. Leveraging insights from this analysis, we demonstrate that a combination of novel and existing techniques can reduce these instabilities, but complete training stability can only be achieved at a dramatic cost to performance.
Our modifications substantially improve class-conditional GANs. When trained on ImageNet at 128×128 resolution, our models (BigGANs) improve the state-of-the-art Inception Score (IS) and Fréchet Inception Distance (FID) from 52.52 and 18.65 to 166.5 and 7.4 respectively. We also successfully train BigGANs on ImageNet at 256×256 and 512×512 resolution, and achieve IS and FID of 232.5 and 8.1 at 256×256 and IS and FID of 241.5 and 11.5 at 512×512. Finally, we train our models on an even larger dataset – JFT-300M – and demonstrate that our design choices transfer well from ImageNet. Code and weights for our pretrained generators are publicly available.
A Generative Adversarial Network (GAN) involves Generator (G) and Discriminator (D) networks whose purpose, respectively, is to map random noise to samples and discriminate real and generated samples. Formally, the GAN objective, in its original form (Goodfellow et al., 2014) involves finding a Nash equilibrium to the following two player min-max problem:
$$\min_{G}\max_{D} \; \mathbb{E}_{x\sim q_{\mathrm{data}}(x)}\!\left[\log D(x)\right] + \mathbb{E}_{z\sim p(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right], \qquad (1)$$

where z ∈ R^{d_z} is a latent variable drawn from a distribution p(z) such as N(0, I) or U[−1, 1]. When applied to images, G and D are usually convolutional neural networks (Radford et al., 2016). Without auxiliary stabilization techniques, this training procedure is notoriously brittle, requiring finely-tuned hyperparameters and architectural choices to work at all.
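To make the objective concrete, here is a minimal sketch (not the authors' implementation) of how the two expectations in Equation (1) are typically estimated for one batch, assuming PyTorch modules `G` and `D`, where `D` returns an unnormalized logit and the generator uses the common non-saturating variant of its loss:

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G, x_real, dz=128):
    """Monte Carlo estimate of the two terms in Eq. (1) for one batch."""
    z = torch.randn(x_real.size(0), dz, device=x_real.device)   # z ~ N(0, I)
    x_fake = G(z)

    logits_real = D(x_real)
    logits_fake = D(x_fake.detach())                             # no G gradient on the D step
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; with sigmoid(logit) playing the role
    # of D(x), that is the negative of this binary cross-entropy loss.
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) +
              F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))

    # Non-saturating G loss: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))).
    logits_gen = D(x_fake)
    g_loss = F.binary_cross_entropy_with_logits(logits_gen, torch.ones_like(logits_gen))
    return d_loss, g_loss
```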
Much recent research has accordingly focused on modifications to the vanilla GAN procedure to impart stability, drawing on a growing body of empirical and theoretical insights (Nowozin et al., 2016; Sønderby et al., 2017; Fedus et al., 2018). One line of work is focused on changing the objective function (Arjovsky et al., 2017; Mao et al., 2016; Lim & Ye, 2017; Bellemare et al., 2017; Salimans et al., 2018) to encourage convergence. Another line is focused on constraining D through gradient penalties (Gulrajani et al., 2017; Kodali et al., 2017; Mescheder et al., 2018) or normalization (Miyato et al., 2018), both to counteract the use of unbounded loss functions and ensure D provides gradients everywhere to G.

Of particular relevance to our work is Spectral Normalization (Miyato et al., 2018), which enforces Lipschitz continuity on D by normalizing its parameters with running estimates of their first singular values, inducing backwards dynamics that adaptively regularize the top singular direction. Relatedly, Odena et al. (2018) analyze the condition number of the Jacobian of G and find that performance is dependent on G's conditioning. Zhang et al. (2018) find that employing Spectral Normalization in G improves stability, allowing for fewer D steps per iteration. We extend on these analyses to gain further insight into the pathology of GAN training.

Other works focus on the choice of architecture, such as SA-GAN (Zhang et al., 2018), which adds the self-attention block from (Wang et al., 2018) to improve the ability of both G and D to model global structure. ProGAN (Karras et al., 2018) trains high-resolution GANs in the single-class setting by training a single model across a sequence of increasing resolutions.
In conditional GANs (Mirza & Osindero, 2014) class information can be fed into the model in various ways. In (Odena et al., 2017) it is provided to G by concatenating a 1-hot class vector to the noise vector, and the objective is modified to encourage conditional samples to maximize the corresponding class probability predicted by an auxiliary classifier. de Vries et al. (2017) and Dumoulin et al. (2017) modify the way class conditioning is passed to G by supplying it with class-conditional gains and biases in BatchNorm (Ioffe & Szegedy, 2015) layers. In Miyato & Koyama (2018), D is conditioned by using the cosine similarity between its features and a set of learned class embeddings as additional evidence for distinguishing real and generated samples, effectively encouraging generation of samples whose features match a learned class prototype.
Objectively evaluating implicit generative models is difficult (Theis et al., 2015). A variety of works have proposed heuristics for measuring the sample quality of models without tractable likelihoods (Salimans et al., 2016; Heusel et al., 2017; Bińkowski et al., 2018; Wu et al., 2017). Of these, the Inception Score (IS, Salimans et al. (2016)) and Fréchet Inception Distance (FID, Heusel et al. (2017)) have become popular despite their notable flaws (Barratt & Sharma, 2018). We employ them as approximate measures of sample quality, and to enable comparison against previous work.
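For reference, FID compares Gaussian statistics of Inception activations for real and generated images via the Fréchet distance. A minimal sketch under that standard definition, assuming `real_feats` and `fake_feats` are precomputed activation matrices of shape (N, d); this is an illustration, not the exact evaluation code used in the paper:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    """FID between two sets of Inception activations, each of shape (N, d).

    FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})
    """
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)

    covmean = linalg.sqrtm(cov_r @ cov_f)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # discard tiny imaginary parts from numerics
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```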
Table 1: Fréchet Inception Distance (FID, lower is better) and Inception Score (IS, higher is better) for ablations of our proposed modifications. Batch is the batch size, Param is the total number of parameters, Ch. is the channel multiplier representing the number of units in each layer, Shared indicates shared embeddings, Skip-z indicates skip connections from the latent to multiple layers, and Ortho. is Orthogonal Regularization. Itr indicates whether the setting is stable to 10^6 iterations, or the iteration at which it collapses. Other than rows 1-4, results are computed across 8 random initializations.
In this section, we explore methods for scaling up GAN training to reap the performance benefits of larger models and larger batches. As a baseline, we employ the SA-GAN architecture of Zhang et al. (2018), which uses the hinge loss (Lim & Ye, 2017; Tran et al., 2017) GAN objective. We provide class information to G with class-conditional BatchNorm (Dumoulin et al., 2017; de Vries et al., 2017) and to D with projection (Miyato & Koyama, 2018). The optimization settings follow Zhang et al. (2018) (notably employing Spectral Norm in G), with the modification that we halve the learning rates and take two D steps per G step. For evaluation, we employ moving averages of G's weights following Karras et al. (2018); Mescheder et al. (2018); Yazıcı et al. (2018), with a decay of 0.9999. We use Orthogonal Initialization (Saxe et al., 2014), whereas previous works used N(0, 0.02I) (Radford et al., 2016) or Xavier initialization (Glorot & Bengio, 2010). Each model is trained on 128 to 512 cores of a Google TPUv3 Pod (Google, 2018), and computes BatchNorm statistics in G across all devices, rather than per-device as is typical. We find progressive growing (Karras et al., 2018) unnecessary even for our 512×512 models. Additional details are in Appendix C.
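Several of these baseline choices are straightforward to illustrate: the hinge-loss objective with two D steps per G step, an exponential moving average of G's weights with decay 0.9999, and orthogonal initialization. Below is a hedged sketch, not the paper's code; `sample_real`, `sample_z`, the class labels `y`, and the conditional call signatures `G(z, y)` / `D(x, y)` are assumptions, and `G_ema` is assumed to be a copy of `G` updated only by the moving average:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(real_logits, fake_logits):
    """Hinge loss for D (Lim & Ye, 2017; Tran et al., 2017), as in the SA-GAN baseline."""
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def g_hinge_loss(fake_logits):
    """Hinge loss for G: push D's logits on generated samples up."""
    return -fake_logits.mean()

def orthogonal_init(module):
    """Orthogonal Initialization (Saxe et al., 2014) for conv and linear weights."""
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.orthogonal_(module.weight)

@torch.no_grad()
def update_ema(G, G_ema, decay=0.9999):
    """Moving average of G's weights, kept only for evaluation."""
    for p_ema, p in zip(G_ema.parameters(), G.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def train_step(G, G_ema, D, g_opt, d_opt, sample_real, sample_z, y):
    """One G update preceded by two D updates, followed by the EMA update."""
    for _ in range(2):                                   # two D steps per G step
        x_real = sample_real()
        with torch.no_grad():
            x_fake = G(sample_z(), y)
        d_loss = d_hinge_loss(D(x_real, y), D(x_fake, y))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

    g_loss = g_hinge_loss(D(G(sample_z(), y), y))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    update_ema(G, G_ema)                                 # decay 0.9999 as in the text
```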
We begin by increasing the batch size for the baseline model, and immediately find tremendous benefits in doing so. Rows 1-4 of Table 1 show that simply increasing the batch size by a factor of 8 improves the state-of-the-art IS by 46%. We conjecture that this is a result of each batch covering more modes, providing better gradients for both networks. One notable side effect of this scaling is that our models reach better final performance in fewer iterations, but become unstable and undergo complete training collapse. We discuss the causes and ramifications of this in Section 4. For these experiments, we report scores from checkpoints saved just before collapse.

We then increase the width (number of channels) in each layer by 50%, approximately doubling the number of parameters in both models. This leads to a further IS improvement of 21%, which we posit is due to the increased capacity of the model relative to the complexity of the dataset. Doubling the depth did not initially lead to improvement – we addressed this later in the BigGAN-deep model, which uses a different residual block structure.
We note that the class embeddings c used for the conditional BatchNorm layers in G contain a large number of weights. Instead of having a separate layer for each embedding (Miyato et al., 2018; Zhang et al., 2018), we opt to use a shared embedding, which is linearly projected to each layer's gains and biases (Perez et al., 2018). This reduces computation and memory costs, and improves training speed (in number of iterations required to reach a given performance) by 37%. Next, we add direct skip connections (skip-z) from the noise vector z to multiple layers of G rather than just the initial layer. The intuition behind this design is to allow G to use the latent space to directly influence features at different resolutions and levels of hierarchy. In BigGAN, this is accomplished by splitting z into one chunk per resolution, and concatenating each chunk to the conditional vector c which gets projected to the BatchNorm gains and biases. In BigGAN-deep, we use an even simpler design, concatenating the entire z with the conditional vector without splitting it into chunks. Previous works (Goodfellow et al., 2014; Denton et al., 2015) have considered variants of this concept; our implementation is a minor modification of this design. Skip-z provides a modest performance improvement of around 4%, and improves training speed by a further 18%.
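A rough sketch of this conditioning path, with hypothetical names and shapes rather than the paper's exact architecture: a single shared class embedding is concatenated with one chunk of z per resolution block, and linear layers map that vector to the BatchNorm gains and biases of the block.

```python
import torch
import torch.nn as nn

class SharedConditionalBatchNorm(nn.Module):
    """BatchNorm2d whose gain and bias are predicted from a shared conditioning vector."""
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)   # no per-layer affine parameters
        self.to_gain = nn.Linear(cond_dim, num_features)
        self.to_bias = nn.Linear(cond_dim, num_features)

    def forward(self, h, cond):
        gain = 1.0 + self.to_gain(cond).unsqueeze(-1).unsqueeze(-1)
        bias = self.to_bias(cond).unsqueeze(-1).unsqueeze(-1)
        return self.bn(h) * gain + bias

def per_layer_conditions(z, class_embedding, num_blocks):
    """Skip-z: split z into one chunk per resolution block and pair each chunk with the
    shared class embedding; each resulting vector feeds one block's conditional BatchNorms."""
    z_chunks = torch.chunk(z, num_blocks, dim=1)
    return [torch.cat([class_embedding, chunk], dim=1) for chunk in z_chunks]
```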
Figure 2: (a) The effects of increasing truncation. From left to right, the threshold is set to 2, 1, 0.5, 0.04. (b) Saturation artifacts from applying truncation to a poorly conditioned model.
Unlike models which need to backpropagate through their latents, GANs can employ an arbitrary prior p(z), yet the vast majority of previous works have chosen to draw z from either N(0, I) or U[−1, 1]. We question the optimality of this choice and explore alternatives in Appendix E. Remarkably, our best results come from using a different latent distribution for sampling than was used in training. Taking a model trained with z ∼ N(0, I) and sampling z from a truncated normal (where values which fall outside a range are resampled to fall inside that range) immediately provides a boost to IS and FID. We call this the Truncation Trick: truncating a z vector by resampling the values with magnitude above a chosen threshold leads to improvement in individual sample quality at the cost of a reduction in overall sample variety. Figure 2(a) demonstrates this: as the threshold is reduced, and elements of z are truncated towards zero (the mode of the latent distribution), individual samples approach the mode of G's output distribution. Related observations about this trade-off were made in (Marchesi, 2016; Pieters & Wiering, 2018).
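The truncation trick itself needs only a few lines: sample z from N(0, I) as usual, then redraw any coordinate whose magnitude exceeds the chosen threshold. Rejection sampling of this kind is distributionally equivalent to a coordinate-wise truncated normal (`scipy.stats.truncnorm` is a common alternative). A minimal sketch, assuming a positive threshold:

```python
import torch

def truncated_normal(batch_size, dz, threshold=0.5):
    """Sample z ~ N(0, I) and resample any coordinate with |z_i| > threshold.

    Smaller thresholds push samples toward the mode of G's output distribution
    (higher fidelity, lower variety); a very large threshold recovers plain N(0, I).
    """
    z = torch.randn(batch_size, dz)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))   # redraw only the offending entries
```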
This technique allows fine-grained, post-hoc selection of the trade-off between sample quality and variety for a given G. Notably, we can compute FID and IS for a range of thresholds, obtaining the variety-fidelity curve reminiscent of the precision-recall curve (Figure 17). As IS does not penalize lack of variety in class-conditional models, reducing the truncation threshold leads to a direct increase in IS (analogous to precision). FID penalizes lack of variety (analogous to recall) but also rewards precision, so we initially see a moderate improvement in FID, but as truncation approaches zero and variety diminishes, the FID sharply drops. The distribution shift caused by sampling with different latents than those seen in training is problematic for many models. Some of our larger models are not amenable to truncation, producing saturation artifacts (Figure 2(b)) when fed truncated noise. To counteract this, we seek to enforce amenability to truncation by conditioning G to be smooth, so that the full space of z will map to good output samples. For this, we turn to Orthogonal Regularization (Brock et al., 2017), which directly enforces the orthogonality condition:

$$R_{\beta}(W) = \beta \left\lVert W^{\top}W - I \right\rVert_{\mathrm{F}}^{2}, \qquad (2)$$
where W is a weight matrix and β a hyperparameter. This regularization is known to often be too limiting (Miyato et al., 2018), so we explore several variants designed to relax the constraint while still imparting the desired smoothness to our models. The version we find to work best removes the diagonal terms from the regularization, and aims to minimize the pairwise cosine similarity between filters but does not constrain their norm:

$$R_{\beta}(W) = \beta \left\lVert W^{\top}W \odot (\mathbf{1} - I) \right\rVert_{\mathrm{F}}^{2}, \qquad (3)$$
where 1 denotes a matrix with all elements set to 1. We sweep β values and select 10^-4, finding this small added penalty sufficient to improve the likelihood that our models will be amenable to truncation. Across runs in Table 1, we observe that without Orthogonal Regularization, only 16% of models are amenable to truncation, compared to 60% when trained with Orthogonal Regularization.
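For concreteness, a sketch of both regularizers as they might be applied to a generator's weight matrices: the original penalty of Equation (2) and the relaxed, diagonal-free variant of Equation (3) with β = 10^-4. This is an illustration, not the authors' code; flattening convolutional kernels into rows is an assumption about how filters are treated.

```python
import torch

def orthogonal_regularization(model, beta=1e-4, relaxed=True):
    """Sum of R_beta(W) over the weight matrices of `model`.

    relaxed=True uses Eq. (3): beta * || (W W^T) * (1 - I) ||_F^2 over the filter Gram
    matrix, penalizing pairwise filter similarity without constraining filter norms.
    relaxed=False uses Eq. (2): beta * || W W^T - I ||_F^2.
    """
    penalty = 0.0
    for param in model.parameters():
        if param.ndim < 2:                        # skip biases and BatchNorm parameters
            continue
        W = param.view(param.size(0), -1)         # flatten conv filters into rows
        gram = W @ W.t()
        eye = torch.eye(gram.size(0), device=gram.device)
        if relaxed:
            penalty = penalty + (gram * (1.0 - eye)).pow(2).sum()
        else:
            penalty = penalty + (gram - eye).pow(2).sum()
    return beta * penalty
```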
We find that current GAN techniques are sufficient to enable scaling to large models and distributed, large-batch training. We find that we can dramatically improve the state of the art and train models up to 512×512 resolution without need for explicit multiscale methods like Karras et al. (2018). Despite these improvements, our models undergo training collapse, necessitating early stopping in practice. In the next two sections we investigate why settings which were stable in previous works become unstable when applied at scale.
Figure 3: A typical plot of the first singular value σ0 in the layers of G (a) and D (b) before Spectral Normalization. Most layers in G have well-behaved spectra, but without constraints a small subset grow throughout training and explode at collapse. D's spectra are noisier but otherwise better-behaved. Colors from red to violet indicate increasing depth.
Much previous work has investigated GAN stability from a variety of analytical angles and on toy problems, but the instabilities we observe occur for settings which are stable at small scale, necessitating direct analysis at large scale. We monitor a range of weight, gradient, and loss statistics during training, in search of a metric which might presage the onset of training collapse, similar to (Odena et al., 2018). We found the top three singular values σ0, σ1, σ2 of each weight matrix to be the most informative. They can be efficiently computed using the Arnoldi iteration method (Golub & Van der Vorst, 2000), which extends the power iteration method, used in Miyato et al. (2018), to estimation of additional singular vectors and values. A clear pattern emerges, as can be seen in Figure 3(a) and Appendix F: most G layers have well-behaved spectral norms, but some layers (typically the first layer in G, which is over-complete and not convolutional) are ill-behaved, with spectral norms that grow throughout training and explode at collapse.
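Monitoring the leading singular value during training is cheap; below is a sketch of plain power iteration for σ0 on a 2-D weight matrix (the Arnoldi method mentioned above generalizes this to σ1 and σ2, and an exact decomposition can be used offline). The function name and iteration count are illustrative.

```python
import torch

def top_singular_value(W, num_iters=10):
    """Estimate sigma_0 of a 2-D weight matrix by power iteration (cf. Miyato et al., 2018)."""
    W = W.detach()
    v = torch.randn(W.size(1), device=W.device)
    v = v / v.norm()
    for _ in range(num_iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.t() @ u
        v = v / (v.norm() + 1e-12)
    # For exact offline values, torch.linalg.svdvals(W)[:3] returns sigma_0, sigma_1, sigma_2.
    return torch.dot(u, W @ v).item()            # sigma_0 ≈ u^T W v
```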
To ascertain whether this pathology is a cause of collapse or merely a symptom, we study the effects of imposing additional conditioning on G to explicitly counteract spectral explosion. First, we directly regularize the top singular value σ0 of each weight, either towards a fixed value σ_reg or towards some ratio r of the second singular value, r · sg(σ1) (with sg the stop-gradient operation, which prevents the regularization from increasing σ1). Alternatively, we employ a partial singular value decomposition to instead clamp σ0. Given a weight W, its first singular vectors u0 and v0, and σ_clamp the value to which σ0 will be clamped, our weights become:

$$W = W - \max(0, \sigma_0 - \sigma_{\mathrm{clamp}})\, v_0 u_0^{\top}, \qquad (4)$$

where σ_clamp is set to either σ_reg or r · sg(σ1). We observe that both with and without Spectral Normalization these techniques have the effect of preventing the gradual increase and explosion of either σ0 or σ0/σ1, but even though in some cases they mildly improve performance, no combination prevents training collapse. This evidence suggests that while conditioning G might improve stability, it is insufficient to ensure stability. We accordingly turn our attention to D.
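Equation (4) can be applied as a post-step weight update. A sketch using a full SVD for clarity (a partial decomposition of only the top singular triplet suffices in practice); it assumes a 2-D weight, so convolutional kernels would first be reshaped:

```python
import torch

@torch.no_grad()
def clamp_top_singular_value(W, sigma_clamp):
    """Eq. (4): subtract max(0, sigma_0 - sigma_clamp) along the top singular direction."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    excess = torch.clamp(S[0] - sigma_clamp, min=0.0)
    W -= excess * torch.outer(U[:, 0], Vh[0])    # rank-1 correction; reduces sigma_0 to sigma_clamp
    return W
```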
As with G, we analyze the spectra of D's weights to gain insight into its behavior, then seek to stabilize training by imposing additional constraints. Figure 3(b) displays a typical plot of σ0 for D (with further plots in Appendix F). Unlike G, we see that the spectra are noisy, σ0 is well-behaved, and the singular values grow throughout training but only jump at collapse, instead of exploding.

The spikes in D's spectra might suggest that it periodically receives very large gradients, but we observe that the Frobenius norms are smooth (Appendix F), suggesting that this effect is primarily concentrated on the top few singular directions. We posit that this noise is a result of optimization through the adversarial training process, where G periodically produces batches which strongly perturb D. If this spectral noise is causally related to instability, a natural counter is to employ gradient penalties, which explicitly regularize changes in D's Jacobian. We explore the R1 zero-centered gradient penalty from Mescheder et al. (2018):

$$R_1 := \frac{\gamma}{2}\, \mathbb{E}_{p_{\mathcal{D}}(x)}\!\left[\lVert \nabla D(x) \rVert_{\mathrm{F}}^{2}\right]. \qquad (5)$$
With the default suggested γ strength of 10, training becomes stable and improves the smoothness and boundedness of spectra in both G and D, but performance severely degrades, resulting in a 45% reduction in IS. Reducing the penalty partially alleviates this degradation, but results in increasingly ill-behaved spectra; even with the penalty strength reduced to 1 (the lowest strength for which sudden collapse does not occur) the IS is reduced by 20%. Repeating this experiment with various strengths of Orthogonal Regularization, DropOut (Srivastava et al., 2014), and L2 (see Appendix I for details) reveals similar behaviors for these regularization strategies: with high enough penalties on D, training stability can be achieved, but at a substantial cost to performance.

We also observe that D's loss approaches zero during training, but undergoes a sharp upward jump at collapse (Appendix F). One possible explanation for this behavior is that D is overfitting to the training set, memorizing training examples rather than learning some meaningful boundary between real and generated images. As a simple test for D's memorization (related to Gulrajani et al. (2017)), we evaluate uncollapsed discriminators on the ImageNet training and validation sets, and measure what percentage of samples are classified as real or generated. While the training accuracy is consistently above 98%, the validation accuracy falls in the range of 50-55%, no better than random guessing (regardless of regularization strategy). This confirms that D is indeed memorizing the training set; we deem this in line with D's role, which is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G. Additional experiments and discussion are provided in Appendix G.
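For reference, the R1 penalty of Equation (5) is usually implemented with a double backward pass over the real batch. A sketch assuming an unconditional call signature for D (the class-conditional case would pass labels as well); this is illustrative, not the paper's training code:

```python
import torch

def r1_penalty(D, x_real, gamma=10.0):
    """R1 = (gamma / 2) * E[ ||grad_x D(x)||^2 ] over real samples (Mescheder et al., 2018)."""
    x_real = x_real.detach().requires_grad_(True)
    logits = D(x_real)
    # Gradient of the summed logits w.r.t. the inputs; create_graph=True keeps it differentiable
    # so the penalty can be backpropagated into D's parameters.
    grads, = torch.autograd.grad(outputs=logits.sum(), inputs=x_real, create_graph=True)
    return 0.5 * gamma * grads.pow(2).flatten(start_dim=1).sum(dim=1).mean()
```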
Table 2: Evaluation of models at different resolutions. We report scores without truncation (Column 3), scores at the best FID (Column 4), scores at the IS of validation data (Column 5), and scores at the max IS (Column 6). Standard deviations are computed over at least three random initializations.
We find that stability does not come solely from G or D, but from their interaction through the adversarial training process. While the symptoms of their poor conditioning can be used to track and identify instability, ensuring reasonable conditioning proves necessary for training but insufficient to prevent eventual training collapse. It is possible to enforce stability by strongly constraining D, but doing so incurs a dramatic cost in performance. With current techniques, better final performance can be achieved by relaxing this conditioning and allowing collapse to occur at the later stages of training, by which time a model is sufficiently trained to achieve good results.
Figure 4: Samples from our BigGAN model with truncation threshold 0.5 (a-c) and an example of class leakage in a partially trained model (d).
We evaluate our models on ImageNet ILSVRC 2012 (Russakovsky et al., 2015) at 128×128, 256×256, and 512×512 resolutions, employing the settings from Table 1, row 8. The samples generated by our models are presented in Figure 4, with additional samples in Appendix A and online. We report IS and FID in Table 2. As our models are able to trade sample variety for quality, it is unclear how best to compare against prior art; we accordingly report values at three settings, with complete curves in Appendix D. First, we report the FID/IS values at the truncation setting which attains the best FID. Second, we report the FID at the truncation setting for which our model's IS is the same as that attained by the real validation data, reasoning that this is a passable measure of maximum sample variety achieved while still achieving a good level of "objectness." Third, we report FID at the maximum IS achieved by each model, to demonstrate how much variety must be traded off to maximize quality. In all three cases, our models outperform the previous state-of-the-art IS and FID scores achieved by Miyato et al. (2018) and Zhang et al. (2018).

In addition to the BigGAN model introduced in the first version of the paper and used in the majority of experiments (unless otherwise stated), we also present a 4x deeper model (BigGAN-deep) which uses a different configuration of residual blocks. As can be seen from Table 2, BigGAN-deep substantially outperforms BigGAN across all resolutions and metrics. This confirms that our findings extend to other architectures, and that increased depth leads to improvement in sample quality. Both BigGAN and BigGAN-deep architectures are described in Appendix B.
Our observation that D overfits to the training set, coupled with our model's sample quality, raises the obvious question of whether or not G simply memorizes training points. To test this, we perform class-wise nearest neighbors analysis in pixel space and the feature space of pre-trained classifier networks (Appendix A). In addition, we present both interpolations between samples and class-wise interpolations (where z is held constant) in Figures 8 and 9. Our model convincingly interpolates between disparate samples, and the nearest neighbors for its samples are visually distinct, suggesting that our model does not simply memorize training data.

We note that some failure modes of our partially-trained models are distinct from those previously observed. Most previous failures involve local artifacts (Odena et al., 2016), images consisting of texture blobs instead of objects (Salimans et al., 2016), or the canonical mode collapse. We observe class leakage, where images from one class contain properties of another, as exemplified by Figure 4(d). We also find that many classes on ImageNet are more difficult than others for our model; our model is more successful at generating dogs (which make up a large portion of the dataset, and are mostly distinguished by their texture) than crowds (which comprise a small portion of the dataset and have more large-scale structure). Further discussion is available in Appendix A.
Table 3: BigGAN results on JFT-300M at 256×256 resolution. The FID and IS columns report these scores given by the JFT-300M-trained Inception v2 classifier with noise distributed as z ∼ N(0, I) (non-truncated). The (min FID) / IS and FID / (max IS) columns report scores at the best FID and IS from a sweep across truncated noise distributions ranging from σ = 0 to σ = 2. Images from the JFT-300M validation set have an IS of 50.88 and FID of 1.94.
To confirm that our design choices are effective for even larger and more complex and diverse datasets, we also present results of our system on a subset of JFT-300M (Sun et al., 2017). The full JFT-300M dataset contains 300M real-world images labeled with 18K categories. Since the category distribution is heavily long-tailed, we subsample the dataset to keep only images with the 8.5K most common labels. The resulting dataset contains 292M images – two orders of magnitude larger than ImageNet. For images with multiple labels, we sample a single label randomly and independently whenever an image is sampled. To compute IS and FID for the GANs trained on this dataset, we use an Inception v2 classifier (Szegedy et al., 2016) trained on this dataset.

Quantitative results are presented in Table 3. All models are trained with batch size 2048. We compare an ablated version of our model – comparable to SA-GAN (Zhang et al., 2018) but with the larger batch size – against a "full" BigGAN model that makes use of all of the techniques applied to obtain the best results on ImageNet (shared embedding, skip-z, and orthogonal regularization). Our results show that these techniques substantially improve performance even in the setting of this much larger dataset at the same model capacity (64 base channels). We further show that for a dataset of this scale, we see significant additional improvements from expanding the capacity of our models to 128 base channels, while for ImageNet GANs that additional capacity was not beneficial.
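The dataset preparation described above reduces to two steps: keep only images carrying one of the 8.5K most frequent labels, and draw one label uniformly whenever a multi-label image is sampled. A schematic sketch over hypothetical in-memory (image_id, labels) pairs; the real pipeline of course operates on JFT-300M at far larger scale:

```python
import random
from collections import Counter

def filter_by_common_labels(examples, num_labels=8500):
    """Keep only images that carry at least one of the `num_labels` most frequent labels.

    `examples` is an iterable of (image_id, [labels]) pairs.
    """
    examples = list(examples)
    counts = Counter(label for _, labels in examples for label in labels)
    keep = {label for label, _ in counts.most_common(num_labels)}
    return [(image_id, [l for l in labels if l in keep])
            for image_id, labels in examples
            if any(l in keep for l in labels)]

def sample_label(labels):
    """A multi-label image contributes one independently drawn label each time it is sampled."""
    return random.choice(labels)
```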
In Figure 19 (Appendix D), we present truncation plots for models trained on this dataset. Unlike for ImageNet, where truncation limits of σ ≈ 0 tend to produce the highest fidelity scores, IS is typically maximized for our JFT-300M models when the truncation value σ ranges from 0.5 to 1. We suspect that this is at least partially due to the intra-class variability of JFT-300M labels, as well as the relative complexity of the image distribution, which includes images with multiple objects at a variety of scales. Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization (Section 4), the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues.

The improvement over the baseline GAN model that we achieve on this dataset without changes to the underlying models or training and regularization techniques (beyond expanded capacity) demonstrates that our findings extend from ImageNet to datasets with scale and complexity thus far unprecedented for generative models of images.
We have demonstrated that Generative Adversarial Networks trained to model natural images of multiple categories highly benefit from scaling up, both in terms of fidelity and variety of the generated samples. As a result, our models set a new level of performance among ImageNet GAN models, improving on the state of the art by a large margin. We have also presented an analysis of the training behavior of large scale GANs, characterized their stability in terms of the singular values of their weights, and discussed the interplay between stability and performance.
We would like to thank Kai Arulkumaran, Matthias Bauer, Peter Buchlovsky, Jeffrey Defauw, Sander Dieleman, Ian Goodfellow, Ariel Gordon, Karol Gregor, Dominik Grewe, Chris Jones, Jacob Menick, Augustus Odena, Suman Ravuri, Ali Razavi, Mihaela Rosca, and Jeff Stanway.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
Shane Barratt and Rishi Sharma. A note on the Inception Score. In arXiv preprint arXiv:1801.01973, 2018.
Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. In arXiv preprint arXiv:1705.10743, 2017.
Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
Gene Golub and Henk Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123:35–65, 2000.
Google. Cloud TPUs. https://cloud.google.com/tpu/, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. In arXiv preprint arXiv:1705.07215, 2017.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
Jae Hyun Lim and Jong Chul Ye. Geometric GAN. In arXiv preprint arXiv:1705.02894, 2017.
Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Least squares generative adversarial networks. In arXiv preprint arXiv:1611.04076, 2016.
Marco Marchesi. Megapixel size image creation using generative adversarial networks. In arXiv preprint arXiv:1706.00082, 2016.
Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. In arXiv preprint arXiv:1411.1784, 2014.
Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.
Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
Mathijs Pieters and Marco Wiering. Comparing generative adversarial network techniques for image creation and modification. In arXiv preprint arXiv:1803.09093, 2018.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. ImageNet large scale visual recognition challenge. IJCV, 115:211–252, 2015.
Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In arXiv preprint arXiv:1511.01844, 2015.
Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. In arXiv preprint arXiv:1806.04498, 2018.
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In arXiv preprint arXiv:1805.08318, 2018.
Appendix A Additional Samples, Interpolations, and Nearest Neighbors from ImageNet Models
Figure 5: Samples generated by our BigGAN model at 256×256 resolution.
Figure 6: Samples generated by our BigGAN model at 512×512 resolution.
Figure 7: Comparing easy classes (a) with difficult classes (b) at 512×512. Classes such as dogs, which are largely textural and common in the dataset, are far easier to model than classes involving unaligned human faces or crowds. Such classes are more dynamic and structured, and often have details to which human observers are more sensitive. The difficulty of modeling global structure is further exacerbated when producing high-resolution images, even with non-local blocks.
Figure 8: Interpolations between z, c pairs.
Figure 9: Interpolations between c with z held constant. Pose semantics are frequently maintained between endpoints (particularly in the final row). Row 2 demonstrates that grayscale is encoded in the joint z, c space, rather than in z.
Figure 10: Nearest neighbors in VGG-16-fc7 (Simonyan & Zisserman, 2015) feature space. The generated image is in the top left.
Figure 11: Nearest neighbors in ResNet-50-avgpool (He et al., 2016) feature space. The generated image is in the top left.
Figure 12: Nearest neighbors in pixel space. The generated image is in the top left.
Figure 13: Nearest neighbors in VGG-16-fc7 (Simonyan & Zisserman, 2015) feature space. The generated image is in the top left.
Figure 14: Nearest neighbors in ResNet-50-avgpool (He et al., 2016) feature space. The generated image is in the top left.
Figure 15: (a) A typical architectural layout for BigGAN’s G; details are in the following tables.(b) A Residual Block (ResBlock up) in BigGAN’s G. (c) A Residual Block (ResBlock down) in BigGAN’s D.
Figure 16: (a) A typical architectural layout for BigGAN-deep’s G; details are in the following tables. (b) A Residual Block (ResBlock up) in BigGAN-deep’s G. (c) A Residual Block (ResBlock down) in BigGAN-deep’s D. A ResBlock (without up or down) in BigGAN-deep does not include the Upsample or Average Pooling layers, and has identity skip connections.