Generative Adversarial Text to Image Synthesis_two sub-problems

Reference [original post]: Joselynzhao.top & 夏木青 | Generative Adversarial Text to Image Synthesis


In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels.


In this work we are interested in translating text in the form of single-sentence human-written descriptions directly into image pixels.

Motivated by these works, we aim to learn a mapping directly from words and characters to image pixels.

To solve this challenging problem requires solving two sub-problems:
first, learn a text feature representation that captures the important visual details;
and second, use these features to synthesize a compelling image that a human might mistake for real.


However, one difficult remaining issue not solved by deep learning alone is that
the distribution of images conditioned on a text description is highly multimodal, in the sense that there are very many plausible configurations of pixels that correctly illustrate the description.


This conditional multi-modality is thus a very natural application for generative adversarial networks (Goodfellow et al., 2014), in which the generator network is optimized to fool the adversarially-trained discriminator into predicting that synthetic images are real.
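
The adversarial game described above can be made concrete with a small sketch. It assumes the standard GAN losses from Goodfellow et al. (2014): the discriminator is rewarded for log D(x) + log(1 - D(G(z))), while the generator is trained so that D(G(z)) becomes large. The function names and the example scores are hypothetical, and the non-saturating generator loss used here is a common practical variant rather than something stated in these notes.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of log D(x) + log(1 - D(G(z))), averaged over each batch.

    d_real, d_fake: lists of discriminator outputs, each strictly in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating generator loss: -log D(G(z)), averaged over the batch."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator scores on synthetic images.
fooled = [0.9, 0.8]       # discriminator believes the fakes are real
not_fooled = [0.1, 0.2]   # discriminator correctly rejects the fakes

# The generator's loss is lower when the discriminator is fooled.
print(generator_loss(fooled) < generator_loss(not_fooled))
```

Because many different images can fool the discriminator for the same description, this objective does not force the generator toward a single "average" output, which is why it suits the multimodal setting described above.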

Our main contribution in this work is to develop a simple and effective GAN architecture and training strategy that enables compelling text to image synthesis of bird and flower images from human-written descriptions.


In this section we briefly describe several previous works that our method is built upon.


Generative adversarial networks


Deep symmetric structured joint embedding



Our approach is to train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features encoded by a hybrid character-level convolutional-recurrent neural network.
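
One way to picture "conditioned on text features" is that the generator receives the noise vector together with a compressed version of the text embedding. The sketch below assumes the compression is a single fully connected layer followed by a leaky ReLU, and that the result is concatenated to the noise. The function names, the toy dimensions, the weights and the 0.2 slope are all illustrative assumptions, not values taken from these notes.

```python
import random

def project(text_embedding, weights, slope=0.2):
    """Toy fully connected layer followed by a leaky ReLU.

    weights: list of rows, one row per output dimension (an assumption).
    """
    out = []
    for row in weights:
        s = sum(w * x for w, x in zip(row, text_embedding))
        out.append(s if s > 0 else slope * s)
    return out

def generator_input(noise, text_embedding, weights):
    """Concatenate the noise vector with the projected text embedding."""
    return list(noise) + project(text_embedding, weights)

# Hypothetical toy sizes: 3-dim noise, 2-dim text embedding, 2-dim projection.
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(3)]
phi_t = [2.0, -1.0]
w = [[1.0, 0.0], [0.0, 1.0]]

x = generator_input(z, phi_t, w)
print(len(x))  # 3 noise dimensions + 2 projected text dimensions
```

In a real DC-GAN the concatenated vector would then pass through a stack of transposed convolutions to produce the image; that part is omitted here.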

Network architecture



Matching-aware discriminator (GAN-CLS)




Learning with manifold interpolation (GAN-INT)


Note that t1 and t2 may come from different images and even different categories.
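
A minimal sketch of the interpolation this note refers to, assuming that a synthetic text embedding is formed as beta * t1 + (1 - beta) * t2 from two caption embeddings t1 and t2. The function name, the toy 2-dimensional embeddings and the default beta = 0.5 are illustrative assumptions.

```python
def interpolate_embeddings(t1, t2, beta=0.5):
    """Return beta * t1 + (1 - beta) * t2, computed element-wise.

    t1, t2: text embeddings of equal length; they need not describe
    the same image or even the same category.
    """
    if len(t1) != len(t2):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * a + (1.0 - beta) * b for a, b in zip(t1, t2)]

# Hypothetical embeddings of two captions from different categories.
t_bird = [1.0, 0.0]
t_flower = [0.0, 1.0]

print(interpolate_embeddings(t_bird, t_flower))  # → [0.5, 0.5]
```

The interpolated embedding has no matching real image, so it can only be used on the generator side: the generator is asked to produce an image for it that the discriminator accepts as real.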

Inverting the generator for style transfer


Our implementation was built on top of dcgan.torch.


