Stable Diffusion Models

A brief introduction to image synthesis, stable diffusion models, and their applications.

Image Synthesis

  • Artificially generating images with some desired content
  • Typically provided with some prompts for generation
  • Multi-modal generation
  • Generative models as solutions -
    • Guided Synthesis
    • Image editing

Previous Work

Generative Adversarial Networks

A Generative Adversarial Network consists of a generator and a discriminator pair to synthesize images representative of the dataset. These two ‘agents’ compete against each other, and try to optimize opposite cost functions. The generator is responsible for creating synthetic data that should resemble the real data from a specific dataset, such as images, audio, or text. The discriminator, on the other hand, evaluates the data it receives and tries to distinguish between real data from the dataset and fake data generated by the generator.

GANs operate in a latent space: a multi-dimensional space where each point corresponds to a set of parameters that can be mapped to data in the target distribution. The generator takes a random vector sampled from this latent space as input. These vectors act as a source of randomness that the generator uses to produce diverse data samples; by exploring different points in the latent space, the generator can produce a wide variety of novel and creative outputs. Conditional GANs additionally condition the generator on side information, such as a class label or a text prompt, to steer what is generated.

GANs

The objective of a Generative Adversarial Network (GAN) couples the generator and the discriminator in a minimax game:

\[\min_G \max_D \; \mathcal{L}_{\text{GAN}}(G, D) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\]

In this equation:

  • \(\mathcal{L}_{\text{GAN}}(G, D)\) represents the GAN loss.
  • \(G\) is the generator network.
  • \(D\) is the discriminator network.
  • \(x\) represents real data samples drawn from the true data distribution \(p_{\text{data}}(x)\).
  • \(z\) represents noise samples drawn from a prior distribution \(p_z(z)\).
  • \(G(z)\) is the generator’s output when given noise \(z\).
  • \(D(x)\) represents the discriminator’s output when given a real data sample \(x\).
  • \(D(G(z))\) is the discriminator’s output when given a generated sample; \(1 - D(G(z))\) is the probability it assigns to that sample being fake.

The discriminator is trained to maximize this objective, i.e. to distinguish real data from generated data, while the generator is trained to minimize it by pushing \(D(G(z))\) towards 1, i.e. by fooling the discriminator. This adversarial training process drives the generator towards producing realistic data.
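As a rough illustration of how the two objectives translate into code, the sketch below assumes PyTorch and hypothetical `generator` / `discriminator` modules (the discriminator returning logits); it uses the common non-saturating form of the generator loss rather than the exact minimax term.

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, real_images, latent_dim=100):
    """One GAN training step: returns the discriminator and generator losses."""
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)           # z ~ p_z(z)
    fake_images = generator(z)                   # G(z)

    # Discriminator: push D(x) -> 1 for real data and D(G(z)) -> 0 for fakes.
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: push D(G(z)) -> 1, i.e. fool the discriminator.
    g_out = discriminator(fake_images)
    g_loss = F.binary_cross_entropy_with_logits(g_out, torch.ones_like(g_out))
    return d_loss, g_loss
```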

Issues

  • Mode collapse - training can push the generator to produce only a few plausible examples
  • Monotonous output / lack of diversity - the generator fails to capture the complete dataset
  • Difficult to optimize - unstable training and vanishing gradients
  • Diversity vs. fidelity tradeoff
  • Cost functions may not converge under gradient descent in a minimax game


AutoRegressive (AR) Transformers

AR models treat an image as a sequence of pixels and represent its probability as the product of the conditional probabilities of all pixels
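Formally, for an image with pixels \(x_1, \dots, x_n\) taken in a fixed (e.g. raster) order:

\[p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})\]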

The model is inherently forced to capture the entire distribution, unlike GANs

This allows training by likelihood maximization

Limited to relatively low-resolution images due to memory constraints

More stable training process compared to GANs

DALL-E uses an autoregressive transformer


PixelCNN

  • Issues -
    • Accumulated errors - pixels are generated in sequence, so mistakes compound
    • Computationally expensive
      • Pixel-based generation
      • Likelihood maximization is arguably unnecessary, since it spends capacity on barely perceptible, high-frequency details

Diffusion Models

Gradually destroy the training data by adding noise, until only Gaussian noise remains.

Then learn to recover the data by reversing this process

A Diffusion Model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data.

\(x_0\) is the image and \(x_T\) is the noise


\(T\) is on the order of 1000s

The noise variances \(\beta_t\) are typically increased progressively so that \(x_T\) becomes (approximately) pure Gaussian noise
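Concretely, each forward step adds a small amount of Gaussian noise with variance \(\beta_t\):

\[q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)\]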

Smaller variances are preferred for better posterior estimation

With a large number of small steps, the process can be shown mathematically to be reversible

The training objective is motivated by VAEs (a variational bound on the likelihood)


Only a single network needs to be trained, since the forward process in diffusion models is fixed

__Reverse Diffusion__ - The parameters have to be learned via a neural network

Encoder-Decoder type architecture


https://arxiv.org/pdf/2006.11239.pdf



The forward process allows sampling \(x_t\) at an arbitrary timestep \(t\) in closed form

This allows efficient training with stochastic gradient descent, by optimizing randomly sampled terms of the objective
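Writing \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s \le t} \alpha_s\), the forward marginal is \(q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}\right)\). A minimal PyTorch sketch of this closed-form sampling (the linear noise schedule and the 4-D image shape are illustrative assumptions):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # ᾱ_t = ∏_{s<=t} (1 - β_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in one shot, without iterating over t steps."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```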

The simplified cost function in diffusion models (DDPM) is

\[L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]\]

The model learns the data distribution \(p(x)\) by denoising a normal variable

These can be interpreted as a sequence of denoising autoencoders
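A hedged sketch of one training step under this simplified objective, reusing `T` and `q_sample` from the snippet above; `eps_model` is a hypothetical noise-prediction network taking \((x_t, t)\):

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(eps_model, x0):
    """Pick a random timestep, noise the image, and regress on the added noise."""
    batch = x0.size(0)
    t = torch.randint(0, T, (batch,))        # random timestep per sample
    noise = torch.randn_like(x0)             # ε ~ N(0, I)
    x_t = q_sample(x0, t, noise)             # noisy input at step t
    pred = eps_model(x_t, t)                 # ε_θ(x_t, t)
    return F.mse_loss(pred, noise)           # L_simple
```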


https://arxiv.org/pdf/2006.11239.pdf

Difference from GANs

Not adversarial

Not in latent space


How is \(t\) added? In the DDPM paper, the timestep is fed to the network via a sinusoidal (Transformer-style) position embedding

Inductive bias of images via U-Net architecture

The downsampling path compresses the image into a low-resolution representation, which the decoder then up-scales

Skip connections preserve the fine, high-resolution details
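As a rough illustration of the encoder-decoder-with-skips structure, here is a toy U-Net skeleton (not the architecture from the paper, and omitting the timestep conditioning):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: downsample to a coarse bottleneck, then upsample while
    concatenating skip connections that carry the fine spatial detail."""
    def __init__(self, ch=64):
        super().__init__()
        self.down1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)        # H -> H/2
        self.down2 = nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1)   # H/2 -> H/4
        self.mid = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)           # bottleneck
        self.up1 = nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(2 * ch, 3, 4, stride=2, padding=1)

    def forward(self, x):
        h1 = torch.relu(self.down1(x))
        h2 = torch.relu(self.down2(h1))
        m = torch.relu(self.mid(h2))
        u1 = torch.relu(self.up1(m))
        return self.up2(torch.cat([u1, h1], dim=1))                  # skip connection
```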

  • Advantages -
    • Robust against mode collapse, since training is likelihood-based
    • Smooth coverage of the data distribution due to the diffusion process
  • Disadvantages -
    • Computationally very expensive - repeated network evaluations (e.g. 5 days on an A100 GPU)
    • Image generation (sampling) time is high


A variational encoder maps an input to a distribution over the latent space; U-Nets are encoder-decoder networks with skip connections

Latent Diffusion Models

  • Denoising every pixel is unnecessary and computationally expensive.
  • Work in the __latent space__ to deal with lower dimensions
  • Image generation is done in two steps
    • The perceptual compression stage removes high-frequency details
    • The generative model learns the semantic variation in the __semantic compression__ stage
  • __Motivation__ - a perceptually equivalent but computationally more suitable space
  • Advantages
    • Autoencoding is done only once, and various diffusion models can then be explored
    • Other conditional generation tasks are also supported

Features

High-resolution synthesis of images

Application to multiple tasks such as unconditional image synthesis, inpainting, and stochastic super-resolution

A general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training

Variable compression rate for the latent space

The latent space, once learned, can be reused for multiple diffusion models

Latent Spaces - Autoencoding

Explicit separation of the compression and generation stages

Exploit the inductive bias of DMs in the latent space via U-Nets due to perceptual equivalence

Process

Given an image \(x\) in RGB space, the encoder \(E\) encodes the image into \(z = E(x)\)

The encoder downsamples the image by a factor of \(f\)

The decoder \(D\) reconstructs the image from the latent space as \(x' = D(z) = D(E(x))\)
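A shape-level sketch of this round trip, with stand-in convolutional encoder/decoder networks, an illustrative factor \(f = 8\), and arbitrary channel counts:

```python
import torch
import torch.nn as nn

f = 8          # downsampling factor
H = W = 512    # input resolution

# Stand-in encoder/decoder: three stride-2 stages give a factor of 2^3 = 8.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),
    nn.Conv2d(128, 4, 3, stride=2, padding=1),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 128, 4, stride=2, padding=1),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

x = torch.randn(1, 3, H, W)   # RGB image x
z = encoder(x)                # z = E(x), shape (1, 4, H/f, W/f) = (1, 4, 64, 64)
x_rec = decoder(z)            # x' = D(z) = D(E(x)), back to (1, 3, 512, 512)
```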

Stable Diffusion Models

Refined version of LDMs

Employ a frozen CLIP text encoder, allowing it to generate images based on text prompts

Contrastive Language-Image Pre-Training

Generates text and image embeddings

Images and relevant text will have similar representations

A neural network trained on a variety of (image, text) pairs

It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task

Zero-shot - the model attempts to predict a class it saw zero times in the training data

Consists of an image encoder and a text encoder
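A sketch of the contrastive training idea, assuming PyTorch and hypothetical `image_encoder` / `text_encoder` modules that map their inputs to embedding vectors (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    """Symmetric contrastive loss: the matching (image, text) pair should have the
    highest similarity in each row and column of the similarity matrix."""
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (N, d)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)     # (N, d)
    logits = img_emb @ txt_emb.t() / temperature           # (N, N) cosine similarities
    targets = torch.arange(logits.size(0))                 # pair i matches pair i
    loss_i = F.cross_entropy(logits, targets)              # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)          # text -> image direction
    return (loss_i + loss_t) / 2
```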

Attention

For conditioning the generation process

The prior is often either a text, an image, or a semantic map

A Transformer network encodes the condition text/image into a latent embedding which is in turn mapped to the intermediate layers of the U-Net via a cross-attention layer

The attention mechanism learns the best way to combine the noisy input and the conditioning inputs in this latent space

These merged representations then condition each step of the denoising diffusion process.
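A minimal cross-attention sketch in PyTorch: the queries come from the U-Net's intermediate feature maps, the keys and values from the conditioning embedding. All names and dimensions below are illustrative, not those used in the paper.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries from U-Net feature maps, keys/values from conditioning embeddings."""
    def __init__(self, feat_dim=320, cond_dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, unet_feats, cond_emb):
        # unet_feats: (B, C, H, W) -> (B, H*W, C) token sequence for attention
        B, C, H, W = unet_feats.shape
        q = unet_feats.flatten(2).transpose(1, 2)
        out, _ = self.attn(query=q, key=cond_emb, value=cond_emb)
        return out.transpose(1, 2).reshape(B, C, H, W)
```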

Experiments

Perceptual Compression Tradeoffs

The downsampling factor in the universal autoencoder has to be chosen

Low downsampling factors (1, 2) result in slow training progress - the diffusion model has to do the compression job

High factors cause stagnating fidelity after comparably few training steps - sample quality is limited due to information loss

Factors of 4, 8, and 16 strike a good balance

Image Generation

This model has half the parameters and requires about four times less compute

Precision and recall are used to assess mode coverage

Precision and recall are estimated via nearest neighbours

Conditional Tasks

Transformer Encoders for LDMs

For text-to-image generation, a BERT tokenizer is used to infer a latent code that is mapped into the U-Net via cross-attention

Image-to-Image translations

Semantic representations in the latent space are simply concatenated

Super-Resolution

The low-resolution image is simply concatenated as the conditioning input after bicubic interpolation
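This conditioning amounts to only a couple of lines; the sketch below assumes PyTorch, and the tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def make_sr_input(noisy_input, low_res_image):
    """Upsample the low-resolution image with bicubic interpolation and
    concatenate it channel-wise with the denoiser's input."""
    upsampled = F.interpolate(low_res_image, size=noisy_input.shape[-2:],
                              mode="bicubic", align_corners=False)
    return torch.cat([noisy_input, upsampled], dim=1)
```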

Image Inpainting

Summary

  • Generative Models
    • Autoregressive Transformers
    • GANs
    • Diffusion Models
  • DMs
    • Principles
    • Issues
  • Latent Diffusion Models
    • Latent Spaces and Perceptual Compression
    • Cross-attention for multi-modal generation
    • Open Source