GANs workshop, CVPR18. (Funding: NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla, Amazon, DARPA-SRC)

Do GANs actually learn the distribution? Some theory and empirics.
Sanjeev Arora, Princeton University + Institute for Advanced Study

Based on:
[Generalization and Equilibrium in GANs, ICML17; with Ge, Liang, Ma, Zhang]
[Do GANs actually learn the distribution? Some theory and empirics, ICLR18; with Risteski, Zhang]

Coauthors: Rong Ge (Duke), Yingyu Liang (UW Madison), Tengyu Ma (FAIR + Stanford), Andrej Risteski (MIT), Yi Zhang

For purposes of this talk:

GANs = distribution learners (in line with the canonical framework of [Goodfellow et al.'14] + many follow-up works). NB: in many vision applications GANs are used for image-to-image maps; not much theory for those that I'm aware of.

How to quantify success of training for such implicit models? Dominant mode of analysis: assume very large deep nets, training time, # training samples, etc.

Main message of the rest of the talk: signs of trouble.

Things we'd like to understand about GANs: What distributions can be generated (possibly approximately) by your favorite ConvNet architecture of size N?

Does an equilibrium exist? Under what conditions (incl. how much training time + samples) does training converge to an equilibrium? Assuming an equilibrium is reached, what guarantees do we have about the generator's distribution? Is it close to the training distribution?

[Goodfellow et al.'14]: All's well when nets, training time, # samples are large enough.
[This talk]: If the discriminator is finite and modest-sized, this message is incorrect (regardless of training time, # samples, training objective, etc.).

Part 1: Equilibrium and generalization for finite-size nets (from A, Ge, Liang, Ma, Zhang ICML17)

In search of a generalization notion for generative models. Traditional training: a generative model maps a seed from N(0, I) onto the data manifold; maximize E_x[log p(x)]. Generalization = similar value of the objective on training and test data. Examples: denoising autoencoders (Vincent et al.'08),

variational autoencoders (Kingma-Welling'14). But GANs come with no estimate of log-likelihood.

Generative Adversarial Nets (GANs) [Goodfellow et al. 2014]: Discriminator D_v trained to output 1 ("real") on inputs from the data distribution D_real, and 0 ("fake") on synthetic inputs.

Generator G_u trained to produce synthetic inputs D_synth that make the discriminator output high values. u = trainable parameters of the generator net; v = trainable parameters of the discriminator net. [Excellent resource: Goodfellow's survey.]

GANs (W-GAN variant): objective = difference in expected discriminator output on real vs. synthetic images. Wasserstein GAN

[Arjovsky et al.'17]** Discriminator trained to output 1 on real inputs and 0 on synthetic inputs D_synth = G_u(h), h ~ N(0, I); generator trained to make the discriminator output 1 on synthetic inputs. Concretely: objective = E_{x ~ D_real}[D_v(x)] − E_{h ~ N(0,I)}[D_v(G_u(h))]; the discriminator maximizes it, the generator minimizes it.

(** NB: Our results will apply to most standard objectives.)

GANs: Winning/Equilibrium
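The W-GAN objective above can be sketched numerically. Everything below (the toy discriminator, toy generator, dimensions) is a hypothetical stand-in for real deep nets D_v and G_u; this is only a minimal sketch of the objective, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x):
    # Toy stand-in for D_v: one bounded score per input row.
    return np.tanh(x.sum(axis=1))

def generator(h):
    # Toy stand-in for G_u: maps Gaussian seeds h to "images".
    return 0.5 * h + 1.0

def wgan_objective(real_batch, seed_batch):
    """Difference in expected discriminator output on real vs. synthetic
    inputs. The discriminator ascends this quantity; the generator descends it."""
    fake_batch = generator(seed_batch)
    return discriminator(real_batch).mean() - discriminator(fake_batch).mean()

real = rng.normal(loc=1.0, scale=0.5, size=(1000, 8))   # samples from "D_real"
seeds = rng.normal(size=(1000, 8))                      # h ~ N(0, I)
print(wgan_objective(real, seeds))
```

Because the toy discriminator is bounded in (−1, 1), the objective lies in (−2, 2); training would alternate gradient steps on v (ascent) and u (descent).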

Discriminator trained to output 1 on real inputs, and 0 on synthetic inputs; generator trained to produce synthetic inputs that make the discriminator output 1. Generator wins if the objective ≈ 0 and further training of the discriminator doesn't help. (An equilibrium.) u = trainable parameters of the generator net; v = trainable parameters of the discriminator net.

Bad news [A., Ge, Liang, Ma, Zhang ICML17]: If discriminator size = N, then there is a generator that generates a distribution supported on O(N log N) images and still wins against all possible discriminators. (Holds for all standard training objectives: JS, Wasserstein, etc.) (NB: the set of all possible images presumably has infinite support.) So small discriminators are inherently incapable of detecting mode collapse.

Counterintuitive nature of high-dimensional geometry: in R^d there are exp(d) directions whose pairwise angle is > 60 degrees, and exp(d/ε) special directions s.t. every other direction has angle at most ε with one of these (an ε-net).

Why a low-support distribution can fool small discriminators (ε-net argument; used in most proofs of the paper):
Idea 1: Deep nets are Lipschitz with respect to trainable parameters. (Changing the parameters by δ changes the deep net's output by < Cδ for some small C.)
Idea 2: If # of parameters = N, then there are exp(N/ε) fundamentally distinct deep nets; all others are ε-close to one of these (an ε-net).
Let D_synth = a random sample of N log N / ε² images from D_real. For every fixed discriminator, Pr[it ε-discriminates between D_real and D_synth] ≤ exp(−N/ε). Idea 2 + union bound => no discriminator can ε-discriminate.

"Interesting. The original GANs idea seems to have problems. But I think Encoder-Decoder GANs get around the issues you show." (BiGAN [Donahue et al.'17]; Adversarially Learned Inference [Dumoulin et al.'17])
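A hedged toy illustration of the concentration step in the ε-net argument (the discriminator D, dimensions, and constants below are all invented for illustration): for a single fixed bounded discriminator, a modest random sample from D_real already matches D_real's expected output to within ε, so that discriminator almost never ε-discriminates. The ε-net plus a union bound then extends this from one fixed discriminator to all of them at once.

```python
import numpy as np

rng = np.random.default_rng(1)

def D(x):
    # Arbitrary FIXED bounded "discriminator" on 1-D toy "images"; values in [0, 1].
    return (np.sin(3 * x) + 1) / 2

eps, m, trials = 0.05, 4000, 200
population = rng.normal(size=200_000)          # stand-in for D_real
true_mean = D(population).mean()               # E_{x ~ D_real}[D(x)]

failures = 0
for _ in range(trials):
    sample = rng.choice(population, size=m, replace=False)   # a random "D_synth"
    if abs(D(sample).mean() - true_mean) > eps:              # eps-discriminates?
        failures += 1
print(failures / trials)   # Hoeffding predicts this is essentially 0
```

The Hoeffding bound for one fixed D is 2·exp(−2mε²) ≈ 4×10⁻⁹ here, which is why the empirical failure rate comes out at (or extremely near) zero.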

Berkeley, Spring 2017

Part 2: Encoder-Decoder GANs do not get around these issues (A, Risteski, Zhang ICLR18)

Intuition behind BiGANs: the manifold assumption. x : image; z : its code on the manifold. Player 2 trains an encoder E (image -> code) and a generator G (code -> image). Player 1 (discriminator D_v, output 0/1) sees image/code pairs from two settings:
Setting 1: (x, E(x)) — a real image with its encoding.
Setting 2: (G(z), z) — a generated image with its seed.
Player 1 trains to maximize the discrimination probability between the two settings; Player 2 trains to minimize it.

Theorem: There is an approximate equilibrium s.t. Player 1 is fooled, but the generator produces a distribution of low support and the encoder maps images to white noise. (Worst imaginable failure mode!)

Proof idea: Assume all images are corrupted with white noise in a fixed pattern: e.g., every 50th pixel is random. (Clearly, such images still look realistic.) Denote such an image by x + h.
Encoder E: map image x + h to its noise pattern h.
Generator G: randomly partition all Gaussian seeds into a small number of regions and associate each region with a single image. Given seed h, take the image x associated with h's region and output x + h.

How to check the support size of the generator's distribution?? The theorem suggests the GANs training objective is not guaranteed to avoid mode collapse (the generator can win using distributions with low support).
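The proof-idea construction can be sketched in code. All sizes are hypothetical, "images" are random vectors, and a deterministic hash stands in for the random partition of Gaussian seeds; this is a toy illustration, not the paper's construction verbatim.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, STEP = 500, 50
noise_idx = np.arange(0, DIM, STEP)           # fixed noise pattern: every 50th pixel

# A small pool of "real" images; the generator's whole support will be this pool.
memorized = rng.normal(size=(8, DIM))

def encoder(img):
    # E: map a corrupted image x + h straight to its noise pattern h.
    return img[noise_idx]

def generator(h):
    # G: assign the seed h to one of a few regions (a stand-in for the random
    # partition of Gaussian seeds), pick the single image x associated with
    # that region, and output x + h.
    region = int.from_bytes(h.tobytes()[:8], "little") % len(memorized)
    img = memorized[region].copy()
    img[noise_idx] = h                        # paste the seed in as the noise
    return img

h = rng.normal(size=len(noise_idx))           # Gaussian seed
print(np.allclose(encoder(generator(h)), h))  # prints True
```

Pairs (G(h), h) are then indistinguishable from (x + h, E(x + h)), fooling a pair-discriminator, even though the generator's support is just the 8 memorized images and the encoder outputs pure noise.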

Does this happen during real-life training???

Part 3: Empirically detecting mode collapse (the Birthday Paradox test, from A, Risteski, Zhang ICLR18)

If you put 23 random people in a room, the chance is > 1/2 that two of them share a birthday. Suppose a distribution is supported on N images. Then Pr[a sample of size ≈ √N has a duplicate image] > 1/2.

Birthday paradox test* [A, Risteski, Zhang]: If a sample of size s has near-duplicates with prob. > 1/2, then the distribution has only about s² distinct images.

Duplicates: DC-GAN on CelebA [Radford et al.'15] (faces).
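The test can be sketched with a Monte Carlo estimate. This toy version uses a uniform distribution with exact duplicates (the real test hunts for near-duplicate images via visual inspection); the support sizes below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

def duplicate_prob(support_size, sample_size, trials=2000):
    """Monte Carlo: Pr[a sample of `sample_size` draws from a uniform
    distribution over `support_size` items contains a duplicate]."""
    hits = 0
    for _ in range(trials):
        draws = rng.integers(0, support_size, size=sample_size)
        if len(np.unique(draws)) < sample_size:
            hits += 1
    return hits / trials

# Classic birthday paradox: 23 people, 365 birthdays -> prob just over 1/2.
print(duplicate_prob(365, 23))
# Duplicates become likely once sample_size reaches the scale sqrt(support_size),
# which is what lets ~s^2 be read off as an upper estimate of the support.
print(duplicate_prob(250_000, 500))
```

Reading the test in reverse: if near-duplicates appear with probability > 1/2 at sample size s, the support can be at most roughly s² distinct images.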

Near-duplicates found among 500 samples (implied support size ≈ 500² = 250k). (Training set has size ≈ 200k.) Diversity ≈ 1M for BiGANs [Donahue et al.'17] and ALI [Dumoulin et al.'17] (Encoder/Decoder + GANs). Support size seems to grow near-linearly with discriminator capacity; consistent with the theory (needs more complete testing). Figure 1: Duplicate pairs found in a batch of 640 generated face samples from a DCGAN.

Stacked GAN [Huang et al. 2016] on CIFAR-10: duplicate images vs. the nearest image in the training set. Sample size for collisions: Truck: 100; Horse: 200; Dog: 300. (Training set: 5K per category.)

Recap of talk (expository articles at www.offconvex.org, "Off the convex path"):
- Theory suggests that GANs training can appear to succeed (near-equilibrium; generator wins, etc.) and yet the learnt distribution is far from the target distribution. Tinkering with the objective doesn't help.
- A follow-up empirical study suggested this problem does arise in real-life training.
- Open: can real-life GANs training be modified to avoid such issues? (Analysis must take into account some properties of the optimization path, or of the real-life distribution.)

(A recent paper [Bai, Ma, Risteski'18] takes a first step in accounting for the distribution.)

Where to go from here? (Expository articles at www.offconvex.org, "Off the convex path")
- Empirically check all GANs variants to see if any of them evade mode collapse.
- Formalize other ways in which GANs are useful (e.g., as image-to-image mappings). Experiments suggest usefulness, but a theoretical formalization is needed.
- Perhaps distribution learning is not the right way to do representation learning (see the expository article on www.offconvex.org).
- A more complete theory of GANs, and a formalization of generalization (ours is a first step). Also, an understanding of the dynamics of training.

THANK YOU!