METHOD AND SYSTEM FOR TRAINING A MODEL FOR IMAGE GENERATION
20220237905 · 2022-07-28
Inventors
- Daniel Olmeda Reino (Brussels, BE)
- Apratim BHATTACHARYYA (Saarbrücken, DE)
- Mario FRITZ (Saarbrücken, DE)
- Bernt SCHIELE (Saarbrücken, DE)
CPC Classification
International Classification
G06V10/774
Abstract
A method and system for training a model for image generation. The model includes a hybrid variational auto-encoder (VAE)-generative adversarial network (GAN) framework. The method includes the steps of: inputting an input image multiple times into the VAE, which outputs in response multiple distinct output image samples; determining the best of the multiple output image samples as a best-of-many sample, the best-of-many sample having the minimum reconstruction cost; and training the model based on a predefined training objective, the predefined training objective integrating the best-of-many sample reconstruction cost and a GAN-based synthetic likelihood term.
Claims
1.-15. (canceled)
16. A method of training a model for image generation, the model comprising a hybrid variational auto-encoder (VAE)-generative adversarial network (GAN) framework, the method comprising the steps of: a—inputting an input image multiple times into the VAE, which outputs in response multiple distinct output image samples, b—determining the best of the multiple output image samples as a best-of-many sample, the best-of-many sample having the minimum reconstruction cost, and c—training the model based on a predefined training objective, the predefined training objective integrating the best-of-many sample reconstruction cost and a GAN-based synthetic likelihood term.
17. The method according to claim 16, wherein the model is trained by using only the best-of-many sample for training the model and by disregarding the further multiple output image samples.
18. The method according to claim 16, wherein the model is trained based on the best-of-many sample in relation to the input image according to a predefined VAE objective.
19. The method according to claim 16, wherein the model is a deep neural network or comprises at least one deep neural network.
20. The method according to claim 16, wherein the model comprises: a variational auto-encoder (VAE) including a recognition network and a generator, and a generative adversarial network (GAN) including a generator and a discriminator.
21. The method according to claim 20, wherein the variational auto-encoder (VAE) and the generative adversarial network (GAN) share a common generator.
22. The method according to claim 16, wherein the model is trained in step c based on the GAN-based synthetic likelihood term to learn generating sharper images by leveraging a discriminator of the GAN which is jointly trained to distinguish between real and generated images.
23. The method according to claim 22, wherein during each training iteration the latent distribution of the input image is sampled by: multiple input of the input image into a recognition network which outputs in response respective regions in a latent space, and generation of respective output image samples in the image space by inputting the respective regions in the latent space into a generator.
24. The method according to claim 16, wherein the output image samples are inputted into a discriminator of the GAN which outputs the GAN-based synthetic likelihood term.
25. The method according to claim 16, wherein only the worst of the multiple output image samples is inputted into a discriminator of the GAN which outputs the GAN-based synthetic likelihood term.
26. The method according to claim 16, wherein the Lipschitz constant of the GAN-based synthetic likelihood term is constrained to be equal to a predetermined value using Spectral Normalization.
27. The method according to claim 26, wherein the predetermined value is equal to 1.
28. A system for training a model for image generation, the model comprising a hybrid variational auto-encoder (VAE)—generative adversarial network (GAN) framework, the system comprising: a module A configured for a multiple input of an input image into the VAE which outputs in response multiple distinct output image samples, a module B for determining the best of the multiple output image samples as a best-of-many sample, the best-of-many sample having the minimum reconstruction cost, and a module C for training the model based on a predefined training objective, the predefined training objective integrating the best-of-many sample reconstruction cost and a GAN-based synthetic likelihood term.
29. The system according to claim 28, further comprising the model.
30. A system for generating an image sample, comprising one of the trained model of step c of claim 16 and the trained module C of claim 16, wherein the Lipschitz constant of the GAN-based synthetic likelihood term is constrained to be equal to a predetermined value using Spectral Normalization.
31. A computer program comprising instructions for executing the steps of the method according to claim 16, when the program is executed by a computer.
32. A recording medium readable by a computer and having recorded thereon a computer program including instructions for executing the steps of a method according to claim 16.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0037]
[0038]
[0039]
DESCRIPTION OF THE EMBODIMENTS
[0040] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0041]
[0042] The aim of the training method is to learn generative models for image distributions x ~ p(x) that transform a latent distribution z ~ p(z) into a learned distribution x̂ ~ p_θ(x) approximating p(x). The samples from the learned distribution x̂ ~ p_θ(x) must be sharp and realistic (likely under p(x)) as well as diverse, covering all modes of the distribution p(x).
[0043] In a first step S01, the same input image is inputted multiple times into the VAE, which outputs in response respective multiple distinct output image samples. This allows the encoder multiple chances to draw desired samples.
[0044] In a subsequent step S02 the best of the multiple output image samples is determined. Said best output image is referred to in the following as a “best-of-many sample”. The best-of-many sample is characterized by having the minimum reconstruction cost compared to the other output samples.
[0045] In a further step S03 the model is trained based on a predefined training objective. Said predefined training objective integrates (or is based on or comprises) the best-of-many sample reconstruction cost and a GAN-based synthetic likelihood term.
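Steps S01 to S03 can be illustrated with a minimal, self-contained sketch. The `encode` and `decode` functions, the Gaussian noise scale, and the L1 cost below are illustrative assumptions standing in for the recognition network and generator of the embodiments, not the disclosed networks themselves.

```python
import random

def encode(x, rng):
    """Stand-in recognition network: one noisy latent sample for input x."""
    return x + rng.gauss(0.0, 0.5)

def decode(z):
    """Stand-in generator: maps a latent sample back to image space."""
    return z

def best_of_many(x, T, seed=0):
    """S01: decode T distinct samples for the same input x.
    S02: keep the sample with minimum L1 reconstruction cost."""
    rng = random.Random(seed)
    samples = [decode(encode(x, rng)) for _ in range(T)]
    best = min(samples, key=lambda s: abs(x - s))
    return best, abs(x - best)

best, cost = best_of_many(1.0, T=10)
```

In step S03, only this minimum reconstruction cost enters the training objective (together with the GAN-based synthetic likelihood term); the remaining samples are disregarded, as in claim 17. Drawing more samples can only lower the best cost.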
[0046] Due to this objective, the encoder is enabled to maintain low divergence to the prior while generating realistic images. Further desirable details of the training method are described in the following, also in the context of the figures.
[0047]
[0048] In this figure, a system 200 for training a model for image generation has been represented. The model comprises a hybrid variational auto-encoder (VAE)-generative adversarial network (GAN) framework. This system 200, which may be a computer, comprises a processor 201 and a non-volatile memory 202. The system 200 may not only be configured for training the model for image generation; it may also apply the trained model to another algorithm 400. For example, the trained model may be applied to a computer vision system 400. In other words, a computer vision system 400 for processing an input image sample may comprise a pre-processor module configured to generate image samples, the pre-processor module comprising said trained model.
[0049] As an option, the system 200 may further be connected to a (passive) optical sensor 300, in particular a digital camera. The digital camera 300 is configured such that it can take pictures which may be used as input image samples provided to the model.
[0050] In the non-volatile memory 202, a set of instructions is stored and this set of instructions comprises instructions to perform a method for training a model.
[0051] In particular, these instructions and the processor 201 may respectively form a plurality of modules:
a module A configured for a multiple input of an input image into the VAE which outputs in response multiple distinct output image samples,
a module B for determining the best of the multiple output image samples as a best-of-many sample, the best-of-many sample having the minimum reconstruction cost, and
a module C for training the model based on a predefined training objective, the predefined training objective integrating the best-of-many sample reconstruction cost and a GAN-based synthetic likelihood term.
[0052]
[0053] The model thus leverages the strengths of VAEs and GANs to attain the two goals set out above. The GAN portion (G_θ, D_I) alone can generate realistic images, but has trouble covering all modes. The VAE portion (R_ϕ, G_θ, D_L) can cover all modes of the distribution p(x). However, this comes at a cost: it is difficult to keep the VAE latent space close to Gaussian and to cover all modes of the distribution p(x) at the same time. Therefore, in contrast to previous hybrid VAE-GAN approaches (Rosca et al. as cited above), a novel objective is employed which leverages "Best-of-Many" samples to cover all modes of the distribution p(x) while generating realistic images and maintaining a latent space as close to Gaussian as possible.
[0054] The following detailed description begins with an explanation of the VAE objective and its shortcomings, followed by the proposed "Best-of-Many" objective for image generation, which addresses these shortcomings.
Shortcomings of the VAE Objective
[0055] The VAE objective maximizes the log-likelihood of the data (x ~ p(x)). Assuming the latent space to be distributed according to p(z), the log-likelihood is
log p_θ(x) = log ( ∫ p_θ(x|z) p(z) dz )  (1)
[0056] Here, p(z) is usually Gaussian and the likelihood p_θ(x|z) is usually an L1/L2 norm based reconstruction likelihood (e^(−λ∥x−x̂∥_n)). This requires the generator G_θ to generate samples that reconstruct every training example x for a likely z ~ p(z). This ensures that the generator G_θ covers all modes of the data distribution x ~ p(x). In contrast, GANs never directly maximize the (reconstruction based) likelihood, and there is no direct incentive to cover all modes.
[0057] However, the integral in (1) is intractable. Variational inference may use an (approximate) variational distribution q_ϕ(z|x), which is jointly learned using an encoder,
log p_θ(x) = log ( ∫ p_θ(x|z) ( p(z) / q_ϕ(z|x) ) q_ϕ(z|x) dz )  (2)
[0058] During training, samples may be drawn instead from a recognition network q_ϕ(z|x) (R_ϕ) and the variational auto-encoder based objective may be maximized,
L_VAE = E_{q_ϕ(z|x)} [ log p_θ(x|z) ] − KL( q_ϕ(z|x) ∥ p(z) )  (3)
[0059] This objective has two important shortcomings. Firstly, it severely constrains the recognition network q_ϕ(z|x) (R_ϕ), as high data log-likelihood and low divergence to the prior are at odds. Because the expected log-likelihood is considered, the recognition network has to always generate latent samples ẑ which are decoded by the generator close to x; otherwise, the expected data log-likelihood would be low. Thus, the encoder is forced to trade off between a good estimate of the data log-likelihood and the divergence to the true latent distribution p(z), which causes the latent space generated by the recognition network to be far from Gaussian. Secondly, it considers only a reconstruction-based log-likelihood, which is known to lead to blurry image generations.
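The single-sample objective and its trade-off can be made concrete with a short sketch. The closed-form KL below is the standard expression for a one-dimensional Gaussian posterior against a standard normal prior; the L1 likelihood weight `lam` is an illustrative assumption.

```python
import math

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) )."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - math.log(sigma)

def single_sample_elbo(x, x_hat, mu, sigma, lam=1.0):
    """Reconstruction log-likelihood, log e^(-lam*|x - x_hat|), minus the
    KL term: every single drawn sample must reconstruct x well, or the
    objective drops."""
    return -lam * abs(x - x_hat) - kl_to_std_normal(mu, sigma)
```

Shrinking sigma toward a point estimate improves the expected reconstruction term but inflates the KL term, which is exactly the trade-off described above.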
[0060] Next, it is described how multiple samples can be effectively leveraged from q.sub.ϕ(z|x) to deal with the first shortcoming. Finally, a synthetic likelihood term is integrated to deal with blurriness.
Leveraging Multiple Samples
[0061] An alternative variational approximation of (1) may be derived, which uses multiple samples to relax the constraints on the recognition network. For example, the mean-value theorem of integration may be used in order to derive an unconditional version of the (conditional) multi-sample objective starting from (2) (full derivation in the supplementary material),
L_MS = log ( ∫ p_θ(x|z) q_ϕ(z|x) dz ) − KL( p(z) ∥ q_ϕ(z|x) )  (4)
[0062] In comparison to the VAE objective (3), the likelihood in (4) is computed considering all the generated samples. The recognition network gets multiple chances to draw samples with high likelihood. This encourages diversity in the generated samples, and the recognition network can provide a good estimate of the data log-likelihood while not diverging from the prior p(z), without any trade-off.
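The advantage of taking the likelihood over all samples, as in (4), over the per-sample expectation of the VAE objective can be checked numerically. The per-sample reconstruction costs below are made-up values, under an assumed e^(−c) likelihood.

```python
import math

def expected_log_lik(costs):
    """Per-sample average, as in the standard VAE objective:
    one badly decoded sample drags the whole value down."""
    return sum(-c for c in costs) / len(costs)

def multi_sample_log_lik(costs):
    """log of the averaged likelihood, log( (1/T) * sum_i e^(-c_i) ),
    computed stably: one good sample among many is enough."""
    m = max(-c for c in costs)
    return m + math.log(sum(math.exp(-c - m) for c in costs) / len(costs))

costs = [0.1, 5.0, 6.0]   # one good reconstruction, two poor ones
```

For these costs, the multi-sample estimate stays close to the best sample's log-likelihood, while the expectation is pulled down by the poor samples; this is why the recognition network is no longer forced into the trade-off.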
[0063] However, a good estimate of the likelihood p_θ(x|z) is also desirable. Considering only L1 or L2 reconstruction based likelihoods would lead to the generation of blurry images. Therefore (and because of the intractability of (1)), GANs instead use an adversary that provides indirect information about the likelihood: a classifier that is jointly trained to distinguish between generated samples and real data samples.
[0064] Next, it is described how such a classifier can be leveraged to directly obtain synthetic estimates of the likelihood that lead to the generation of crisp images.
Integrating Synthetic Likelihoods with the “Best-of-Many” Samples
[0065] Synthetic estimates of the likelihood lead to the generation of sharper images by leveraging a classifier which is jointly trained to distinguish between real and generated images. A generated image which is indistinguishable from a real image is assigned a higher likelihood. Starting from (4), a synthetic likelihood term (with weight 1−α) is integrated both to encourage the generator to generate realistic images and to cover all modes (L1 reconstruction loss), thus meeting the initial two goals. First, the likelihood term is converted to a likelihood-ratio form which allows for synthetic estimates.
[0066] Now the likelihood ratio p_θ(x|z)/p(x) can be estimated using a classifier. To do this, the auxiliary variable y is introduced, where y=1 denotes that the sample was generated and y=0 denotes that the sample is from the true distribution. Now (6) can be rewritten using Bayes' theorem.
[0067] The probability p_θ(y=1|z,x) may be estimated using a classifier D_I(x) (the image discriminator).
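The density-ratio trick behind this estimate can be sketched in a few lines. Under the assumption of equal class priors, a classifier trained to separate samples of a distribution p (label 1) from samples of a distribution q (label 0), with output d ≈ p(label=1 | x), estimates log( p(x) / q(x) ) as log( d / (1 − d) ). This is a generic sketch of the trick, not the embodiment's discriminator.

```python
import math

def synthetic_log_ratio(d):
    """Density-ratio trick: from a classifier output d = p(label=1 | x),
    estimate log( p(x) / q(x) ) as log( d / (1 - d) ), assuming the two
    classes were sampled with equal priors."""
    return math.log(d) - math.log1p(-d)
```

An undecided classifier (d = 0.5) yields a log-ratio of zero; as d approaches 0 or 1 the ratio diverges, which is the instability addressed by the Lipschitz control discussed further below.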
[0068] Note that the synthetic likelihood D_I(x) is usually estimated using a softmax layer, and the likelihood p_θ(x|z) takes the form e^(−∥x−x̂∥_n) in (7). Both of these log-sum-exps are numerically unstable. The first log-sum-exp can be dealt with using the Jensen-Shannon inequality.
[0069] As stochastic gradient descent is performed, the second log-sum-exp can be dealt with after stochastic (MC) sampling of the data points. The log-sum-exp can be well estimated using the max, i.e., the "Best-of-Many" sample.
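How well the maximum estimates the log-sum-exp can be verified directly, since max(v) ≤ logsumexp(v) ≤ max(v) + log T for T terms. The per-sample log-likelihood values below are illustrative.

```python
import math

def log_sum_exp(vals):
    """Numerically stable log( sum_i e^(v_i) )."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

log_liks = [-8.0, -3.0, -0.2]   # per-sample log-likelihoods for T = 3
bom = max(log_liks)             # the "Best-of-Many" estimate
exact = log_sum_exp(log_liks)
```

The gap between the two quantities is bounded by the constant log(T), which is why that term can be ignored in the objective.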
[0070] The "Best-of-Many" samples objective then takes the form (ignoring the constant log(T) term, with λ≥(1−α)).
[0071] Furthermore, the generator G_θ may be penalized using only the least realistic sample, and the likelihood ratio may be estimated directly using D_I.
[0072] To further ensure smoothness, the Lipschitz constant K of D_I may be directly controlled by setting it equal to 1, using Spectral Normalization (T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," ICLR, 2018).
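Spectral Normalization divides each weight matrix by an estimate of its largest singular value, obtained by power iteration, so that the induced Lipschitz constant of the layer is approximately 1. The NumPy sketch below is a minimal offline illustration of that idea, not Miyato et al.'s implementation (which keeps a single persistent power-iteration step per training update).

```python
import numpy as np

def spectral_normalize(W, n_iter=200):
    """Scale W by its spectral norm (largest singular value),
    estimated with power iteration, so that ||W_sn||_2 ~= 1."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # estimate of the largest singular value
    return W / sigma

W = np.random.default_rng(1).standard_normal((4, 3))
W_sn = spectral_normalize(W)
```

A linear map x → W_sn @ x then has Lipschitz constant close to 1, which is the constraint referred to in claims 26 and 27.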
[0073] The synthetic likelihood ratio term is unstable during training: as it is a ratio of classifier outputs, any instability in the output of the classifier is magnified. Therefore, it is proposed to directly estimate the ratio using a network with a controlled Lipschitz constant, which leads to significantly improved stability.
[0074] In contrast to prior work (e.g. Rosca et al.), (8) provides the recognition network multiple chances to generate samples that are likely under the reconstruction based likelihood. Furthermore, the synthetic likelihood term ensures that every generated sample is realistic.
[0075] Intuitively, this objective can be seen as a generalization of prior hybrid VAE-GAN based models. If T=1 is set in (8), the exact objective used in the α-GAN model is recovered. Moreover, in e.g. Rosca et al., for every sample x ~ p(x) the recognition network is used to obtain the exact ẑ from the latent space. In contrast, the objective (8) only requires the recognition network to point to the appropriate region in the latent space.
[0076] Next, a detailed description of the optimization of the hybrid VAE-GAN model is provided using the “Best-of-Many” samples objective, which is called BMS-GAN.
Optimization
[0077] As recent works (e.g. Rosca et al.) have shown, point-wise minimization of the KL-divergence using its analytical form leads to degradation in generated image quality. The KL-divergence term can also be recast in a likelihood-ratio form (similar to (6)), allowing synthetic likelihoods to be leveraged using a classifier and the divergence to be minimized globally instead of point-wise. The latent space discriminator D_L is used to enforce the KL-divergence constraint KL( p(z) ∥ q_ϕ(z|x) ) in (8).
[0078] During optimization, samples from the true data distribution x ~ p(x) are first drawn. For each x, the recognition network R_ϕ gives a region of the latent space q_ϕ(z|x). It is assumed that q_ϕ(z|x) = N( μ(x), σ(x) ). The generator G_θ then generates samples in the data (image) space, x̂ ~ p_θ(x|z) with z ~ q_ϕ(z|x), from that region of the latent space. These samples are then given as input to the data (image) discriminator D_I, which provides a synthetic estimate of the likelihood. The latent space discriminator D_L uses the latent samples ẑ ~ q_ϕ(z|x) to provide a synthetic estimate of the divergence KL( p(z) ∥ q_ϕ(z|x) ).
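The sampling path above can be sketched with toy recognition outputs. The constants `mu` and `sigma` below are illustrative stand-ins for the network outputs μ(x) and σ(x); reparameterization is a standard way to draw such samples, assumed here rather than stated in the text.

```python
import random

def sample_latents(mu, sigma, T, rng):
    """Reparameterized draws from q(z|x) = N(mu, sigma^2):
    z_i = mu + sigma * eps_i with eps_i ~ N(0, 1), so that gradients
    can flow back to mu and sigma through the samples."""
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(T)]

rng = random.Random(0)
zs = sample_latents(mu=0.5, sigma=0.1, T=1000, rng=rng)
```

Each z_i would then be decoded by G_θ and scored by D_I, while the latent samples themselves are scored by D_L against draws from the prior p(z).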
[0079] Based on the generated samples and the synthetic likelihood estimates, the following are then updated:
1. D_I and D_L, using the standard GAN update rule (using true and generated samples x and x̂, z and ẑ).
2. R_ϕ, using the synthetic likelihood estimates from D_I and D_L and the "Best-of-Many" reconstruction cost max_i log p_θ(x|ẑ^i).
3. G_θ, using the synthetic likelihood estimate from D_I and the "Best-of-Many" reconstruction cost.
[0080] Throughout the description, including the claims, the term “comprising a” should be understood as being synonymous with “comprising at least one” unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms “substantially” and/or “approximately” and/or “generally” should be understood to mean falling within such accepted tolerances.
[0081] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0082] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.