# Denoising Diffusion Implicit Models (DDIM)

This is a learning note based on a series of videos.

Paper Link: https://arxiv.org/abs/2010.02502

# 1. Background

**Background:**

DDPM Pro: high-quality image generation

DDPM Con: requires many sampling steps

**Goal:**

Accelerate sampling process

**How:**

Generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones. The resulting training objective is the same as the objective used in DDPM. Therefore, we can choose from existing generative models and only modify the sampling process.

DDIM can massively increase sampling efficiency at only a minor cost in sample quality.

**Empirical Benefits of DDIMs over DDPMs**:

- Sampling can be accelerated by 10x to 100x.

- Sample Consistency: when the initial latent variable is fixed and we generate several samples with chains of various lengths, the resulting samples share similar high-level features.

- Due to consistency, DDIMs can perform semantically meaningful image interpolation by manipulating the initial latent variables.
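The paper interpolates between initial latents using spherical linear interpolation; below is a minimal sketch (the function name `slerp` and the toy vectors are illustrative, and the degenerate case of parallel latents is not handled):

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two latent vectors z0 and z1."""
    cos = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)

zT_a = np.array([1.0, 0.0])
zT_b = np.array([0.0, 1.0])
mid = slerp(zT_a, zT_b, 0.5)   # an intermediate latent on the sphere
```

Each interpolated latent is then decoded with the same deterministic DDIM sampler, which is what makes the interpolation semantically smooth.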

# 2. Denoising Objective of DDPM

Suppose the true data distribution is $q(\bold{x}_0)$; we want to learn an approximate distribution $p_{\theta}(\bold{x}_0)$ that is close to $q(\bold{x}_0)$.

DDPMs are latent variable models of the form

$$p_{\theta}(\bold{x}_0) = \int p_{\theta}(\bold{x}_{0:T}) \, d\bold{x}_{1:T}, \quad \text{where} \quad p_{\theta}(\bold{x}_{0:T}) \coloneqq p_{\theta}(\bold{x}_T) \prod_{t=1}^T p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t),$$

where $\bold{x}_1, \dots, \bold{x}_T$ are latent variables. The parameters $\theta$ are learned to fit the data distribution $q(\bold{x}_0)$ by maximizing a variational lower bound:

$$\log p_{\theta}(\bold{x}_0) \geq \mathbb{E}_{q(\bold{x}_{1:T} | \bold{x}_0)} [\log p_{\theta}(\bold{x}_{0:T}) - \log q(\bold{x}_{1:T} | \bold{x}_0)],$$

where $q(\bold{x}_{1:T}|\bold{x}_0)$ is some inference distribution (the forward process). DDPM considers it as the following Markov chain with Gaussian transitions parameterized by a decreasing sequence $\alpha_{1:T} \in (0, 1]^T$:

$$q(\bold{x}_{1:T} | \bold{x}_0) \coloneqq \prod_{t=1}^T q(\bold{x}_t | \bold{x}_{t-1}), \quad \text{where} \quad q(\bold{x}_t | \bold{x}_{t-1}) = \mathcal{N} \left( \sqrt{\frac{\alpha_t}{\alpha_{t-1}}} \bold{x}_{t-1}, \left( 1 - \frac{\alpha_t}{\alpha_{t-1}} \right) \bold{I} \right).$$

(Note that $\alpha_t$ here corresponds to $\bar{\alpha}_t$ in the DDPM paper's notation.)

A special property of the forward process is that

$$q(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I}).$$

Note: we need $\alpha_T \rightarrow 0$, so that $\bold{x}_T$ becomes approximately standard Gaussian after adding noise.

By the reparameterization trick, $\bold{x}_t$ can be expressed as a linear combination of $\bold{x}_0$ and a noise variable $\epsilon$:

$$\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}).$$
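The overall step can be sanity-checked numerically; the sketch below (toy scalar data, arbitrary values) confirms that $\bold{x}_t$ has mean $\sqrt{\alpha_t} \bold{x}_0$ and standard deviation $\sqrt{1 - \alpha_t}$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_t = 0.5                      # toy cumulative alpha at step t
x0 = np.full(100_000, 2.0)         # a batch of identical scalar "images"

eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_t) * x0 + np.sqrt(1 - alpha_t) * eps

# Empirically, xt has mean ~ sqrt(alpha_t) * 2 and std ~ sqrt(1 - alpha_t)
print(xt.mean(), xt.std())
```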

If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the variational lower bound objective can be simplified to:

$$L_{\gamma}(\epsilon_{\theta}) \coloneqq \sum_{t=1}^T \gamma_t \, \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \, \epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})} \left[ \left\| \epsilon_{\theta}^{(t)} (\sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t) - \epsilon_t \right\|_2^2 \right],$$

where $\epsilon_{\theta} \coloneqq \{ \epsilon_{\theta}^{(t)}\}_{t=1}^T$ is a set of $T$ functions, each $\epsilon_{\theta}^{(t)}: \mathcal{X} \rightarrow \mathcal{X}$ is a function with trainable parameters $\theta^{(t)}$, and $\gamma \coloneqq [\gamma_1, \dots, \gamma_T]$ is a vector of positive coefficients in the objective that depends on $\alpha_{1:T}$.
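A toy Monte-Carlo estimate of $L_{\gamma}$ with $\gamma_t = 1$; the placeholder `eps_theta` (a stand-in for a real noise-prediction network) and the schedule below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_theta(xt, t):
    # placeholder for the noise-prediction network epsilon_theta^(t)
    return np.zeros_like(xt)

T = 10
alphas = np.linspace(0.99, 0.01, T)   # toy decreasing schedule alpha_1..alpha_T
gammas = np.ones(T)                   # gamma_t = 1 gives the "simple" DDPM loss

x0 = rng.standard_normal((64, 4))     # toy data batch
loss = 0.0
for t in range(T):
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas[t]) * x0 + np.sqrt(1 - alphas[t]) * eps
    loss += gammas[t] * np.mean((eps_theta(xt, t) - eps) ** 2)
avg_loss = loss / T   # ~ 1 for the zero predictor, since E[eps^2] = 1
```

A real implementation would sample a random $t$ per example and backpropagate through the network; the loop above only illustrates what the objective measures.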

# 3. Why DDIM uses non-Markovian forward process

Note that in the DDPM loss, each term $||\epsilon_{\theta}^{(t)} (\sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t) - \epsilon_t||_2^2$ relies only on the "overall step" (marginal) distribution $q(\bold{x}_t | \bold{x}_0)$, not directly on the joint $q(\bold{x}_{1:T} | \bold{x}_0)$.

There are many choices of inference distribution $q(\bold{x}_{1:T} | \bold{x}_0)$ that have the same $q(\bold{x}_t | \bold{x}_0)$.

If we keep the overall step $q(\bold{x}_t | \bold{x}_0)$ unchanged, then the objective $L_{\gamma}$ doesn't need to change (no retraining of the model is needed).

## Non-Markovian Forward Processes for a Discrete Case

The non-Markovian forward process also works beyond the Gaussian case (Appendix A).

Consider a categorical observation $\bold{x}_0$ that is a one-hot vector with $K$ possible values. The forward process is defined as follows. First, we define $q(\bold{x}_t | \bold{x}_0)$ as the categorical distribution

$$q(\bold{x}_t | \bold{x}_0) = \text{Cat}(\alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K),$$

where $\bold{1}_K \in \mathbb{R}^K$ is a vector with all entries being $1 / K$, and $\alpha_t$ decreases from $\alpha_0 = 1$ to $\alpha_T = 0$. Then we define $q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0)$ as the following mixture distribution:

$$q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \sigma_t \, \text{Cat}(\bold{x}_t) + (1 - \sigma_t) \, \text{Cat} \left( \frac{(\alpha_{t-1} - \sigma_t \alpha_t) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K}{1 - \sigma_t} \right),$$

or equivalently:

$$q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \text{Cat}(\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K).$$

This definition is consistent with how we defined $q(\bold{x}_t | \bold{x}_0)$.

**Why consistent?**

Plug $\bold{x}_t = \alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K$ (the parameter of $q(\bold{x}_t | \bold{x}_0)$) into $\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K$, and we get:

$$\sigma_t (\alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K) + (\alpha_{t-1} - \sigma_t \alpha_t) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K = \alpha_{t-1} \bold{x}_0 + (1 - \alpha_{t-1}) \bold{1}_K,$$

which is exactly the parameter of $q(\bold{x}_{t-1} | \bold{x}_0)$.

**End**
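The consistency check above can be reproduced numerically; the values for $K$, $\alpha_t$, $\alpha_{t-1}$, and $\sigma_t$ below are arbitrary toy choices:

```python
import numpy as np

K = 5
a_t, a_prev, sigma_t = 0.6, 0.8, 0.3   # toy values, alpha decreasing in t
ones_K = np.full(K, 1.0 / K)
x0 = np.eye(K)[2]                      # a one-hot observation

q_t = a_t * x0 + (1 - a_t) * ones_K    # parameter of q(x_t | x_0)

# Marginalizing x_t out of q(x_{t-1} | x_t, x_0) amounts to plugging the
# parameter of q(x_t | x_0) into the combined categorical parameter:
q_prev = sigma_t * q_t + (a_prev - sigma_t * a_t) * x0 \
         + ((1 - a_prev) - (1 - a_t) * sigma_t) * ones_K

expected = a_prev * x0 + (1 - a_prev) * ones_K   # parameter of q(x_{t-1} | x_0)
print(np.allclose(q_prev, expected))  # True
```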

Similarly, we can define the reverse process $p_{\theta}(\bold{x}_{t-1} | \bold{x}_t)$ as:

$$p_{\theta}(\bold{x}_{t-1} | \bold{x}_t) = \text{Cat}(\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t) f_{\theta}^{(t)}(\bold{x}_t) + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K),$$

where $f_{\theta}^{(t)}(\bold{x}_t)$ maps $\bold{x}_t$ to a $K$-dimensional probability vector (a prediction of $\bold{x}_0$). As $(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t \rightarrow 0$, the sampling process becomes less stochastic.

The objective will be the KL divergence:

$$\mathbb{E}_q \left[ \sum_{t=2}^T D_{KL} \big( q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,\|\, p_{\theta}(\bold{x}_{t-1} | \bold{x}_t) \big) \right],$$

which is simply a sum of KL divergences between two categoricals. Since the KL divergence is jointly convex in its arguments, and $q$ and $p_{\theta}$ are mixtures that share the $\text{Cat}(\bold{x}_t)$ and uniform components, we can apply Jensen's inequality and get the objective upper bound:

$$D_{KL} \big( q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,\|\, p_{\theta}(\bold{x}_{t-1} | \bold{x}_t) \big) \leq (\alpha_{t-1} - \sigma_t \alpha_t) \, D_{KL} \big( \text{Cat}(\bold{x}_0) \,\|\, \text{Cat}(f_{\theta}^{(t)}(\bold{x}_t)) \big).$$

This also shows that changing $\sigma_t$ does not affect the objective (up to re-weighting): $\sigma_t$ enters only through the weight $\alpha_{t-1} - \sigma_t \alpha_t$, not through the KL term itself.
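The Jensen step relies on the joint convexity of KL over mixtures that share components; below is a numerical illustration with random categoricals (all names and weights are arbitrary toy choices):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two strictly positive categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
def simplex(k=4):
    v = rng.random(k) + 0.1            # strictly positive, then normalized
    return v / v.sum()

shared_a, shared_b = simplex(), simplex()  # components common to both mixtures
qx0, pf = simplex(), simplex()             # stand-ins for Cat(x0), Cat(f_theta)
w = np.array([0.3, 0.5, 0.2])              # mixture weights (sum to 1)

q_mix = w[0] * shared_a + w[1] * qx0 + w[2] * shared_b
p_mix = w[0] * shared_a + w[1] * pf  + w[2] * shared_b

# Joint convexity: KL of the mixtures is bounded by the weighted KL of the
# one differing component pair.
print(kl(q_mix, p_mix) <= w[1] * kl(qx0, pf))  # True
```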

# 4. The Core Sampling Method of DDIM (and why)

## Non-Markovian Forward Processes

Consider a family $\mathcal{Q}$ of inference distributions, indexed by a real vector $\sigma \in \mathbb{R}^T_{\geq 0}$:

$$q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0),$$

where $q_{\sigma}(\bold{x}_T | \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_T} \bold{x}_0 , (1 - \alpha_T) \bold{I})$ and for all $t > 1$,

$$q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \mathcal{N} \left( \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I} \right).$$

The mean function is chosen to ensure that $q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$ for all $t$, so that the joint inference distribution has the desired marginals (it matches DDPM's overall step, so the model does not need to be retrained).

## How is the sampling process defined?

**(1) DDPM Sampling**: $\bold{x}_{t} \rightarrow \bold{x}_{t-1} \rightarrow \dots \rightarrow \bold{x}_{0}$

The overall steps $p(\bold{x}_t | \bold{x}_0)$ and $p(\bold{x}_{t-1} | \bold{x}_0)$ are fixed, and $p(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0)$ simplifies to $p(\bold{x}_t | \bold{x}_{t-1})$ by the Markov property, so Bayes' rule gives:

$$p(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \frac{p(\bold{x}_t | \bold{x}_{t-1}) \, p(\bold{x}_{t-1} | \bold{x}_0)}{p(\bold{x}_t | \bold{x}_0)}.$$

**(2) Skip Steps (non-Markovian)**

Consider $s < k$. The sampling distribution can be written as $p(\bold{x}_s | \bold{x}_k, \bold{x}_0)$.

The overall steps are still known and do not change (here we use the DDPM notation $\bar{\alpha}_t \coloneqq \prod_{i=1}^t \alpha_i$):

$$p(\bold{x}_s | \bold{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_s} \bold{x}_0, (1 - \bar{\alpha}_s) \bold{I}), \qquad p(\bold{x}_k | \bold{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_k} \bold{x}_0, (1 - \bar{\alpha}_k) \bold{I}).$$

Now the problem is how to set the sampling distribution; we use the method of undetermined coefficients.

We still set $p(\bold{x}_s | \bold{x}_k, \bold{x}_0)$ to be a Gaussian. Assume that the mean function is a linear combination of $\bold{x}_k$ and $\bold{x}_0$:

$$p(\bold{x}_s | \bold{x}_k, \bold{x}_0) = \mathcal{N}(n \bold{x}_k + m \bold{x}_0, \ \sigma^2 \bold{I}),$$

where $n, m, \sigma^2$ are unknowns to be determined.

Reparameterization gives:

$$\bold{x}_s = n \bold{x}_k + m \bold{x}_0 + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}).$$

Reparameterization of the overall step gives:

$$\bold{x}_k = \sqrt{\bar{\alpha}_k} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_k} \epsilon', \quad \epsilon' \sim \mathcal{N}(\bold{0}, \bold{I}).$$

Therefore we can derive (merging the two independent Gaussian noises into one):

$$\bold{x}_s = n (\sqrt{\bar{\alpha}_k} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_k} \epsilon') + m \bold{x}_0 + \sigma \epsilon = (n \sqrt{\bar{\alpha}_k} + m) \bold{x}_0 + \sqrt{n^2 (1 - \bar{\alpha}_k) + \sigma^2} \, \bar{\epsilon}, \quad \bar{\epsilon} \sim \mathcal{N}(\bold{0}, \bold{I}).$$

The overall step also gives $\bold{x}_s$:

$$\bold{x}_s = \sqrt{\bar{\alpha}_s} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_s} \epsilon, \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}).$$

Using the method of undetermined coefficients (matching the two expressions for $\bold{x}_s$), we have:

$$n \sqrt{\bar{\alpha}_k} + m = \sqrt{\bar{\alpha}_s}, \qquad n^2 (1 - \bar{\alpha}_k) + \sigma^2 = 1 - \bar{\alpha}_s.$$

Here we choose $\sigma$ as the free variable and solve for $n$ and $m$:

$$n = \frac{\sqrt{1 - \bar{\alpha}_s - \sigma^2}}{\sqrt{1 - \bar{\alpha}_k}}, \qquad m = \sqrt{\bar{\alpha}_s} - \sqrt{\bar{\alpha}_k} \cdot \frac{\sqrt{1 - \bar{\alpha}_s - \sigma^2}}{\sqrt{1 - \bar{\alpha}_k}}.$$

So we have:

$$p(\bold{x}_s | \bold{x}_k, \bold{x}_0) = \mathcal{N} \left( \sqrt{\bar{\alpha}_s} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\bold{x}_k - \sqrt{\bar{\alpha}_k} \bold{x}_0}{\sqrt{1 - \bar{\alpha}_k}}, \ \sigma^2 \bold{I} \right).$$
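A quick numerical check that the solved $n$ and $m$ satisfy both matching conditions (the values of $\bar{\alpha}_s$, $\bar{\alpha}_k$, and $\sigma$ are arbitrary toy choices):

```python
import numpy as np

ab_s, ab_k, sigma = 0.8, 0.5, 0.1   # toy cumulative alphas (s < k), free sigma
n = np.sqrt(1 - ab_s - sigma**2) / np.sqrt(1 - ab_k)
m = np.sqrt(ab_s) - n * np.sqrt(ab_k)

print(np.isclose(n * np.sqrt(ab_k) + m, np.sqrt(ab_s)))    # mean matches: True
print(np.isclose(n**2 * (1 - ab_k) + sigma**2, 1 - ab_s))  # variance matches: True
```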

Changing from the DDPM notation ($\bar{\alpha}$) to the DDIM notation ($\alpha$), and setting $s = t - 1$, $k = t$:

$$q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \mathcal{N} \left( \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I} \right).$$

When $\sigma_t = 0$, it is DDIM (deterministic given $\bold{x}_t$ and the predicted $\bold{x}_0$). When $\sigma_t = \sqrt{(1 - \alpha_{t-1})/(1 - \alpha_t)} \sqrt{1 - \alpha_t / \alpha_{t-1}}$, it recovers DDPM. One can also interpolate in between.
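The generalized reverse step can be sketched as below, with `eta` interpolating between deterministic DDIM (`eta=0`) and the DDPM variance (`eta=1`); `eps_pred` stands in for $\epsilon_{\theta}^{(t)}(\bold{x}_t)$, and all names are illustrative:

```python
import numpy as np

def ddim_step(xt, eps_pred, a_t, a_prev, eta=0.0, rng=None):
    """One reverse step x_t -> x_{t-1}.

    a_t, a_prev are cumulative alphas at steps t and t-1 (DDIM's alpha_t).
    eta=0 gives deterministic DDIM; eta=1 recovers the DDPM variance."""
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    x0_pred = (xt - np.sqrt(1 - a_t) * eps_pred) / np.sqrt(a_t)  # predicted x_0
    # Note (x_t - sqrt(a_t) x0_pred)/sqrt(1 - a_t) equals eps_pred exactly,
    # so the "direction pointing to x_t" term reduces to eps_pred:
    mean = np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev - sigma**2) * eps_pred
    if eta > 0 and rng is not None:
        return mean + sigma * rng.standard_normal(np.shape(xt))
    return mean
```

Because the step only needs the two cumulative alphas, the same function supports skipping from step $k$ directly to an earlier step $s$, which is what accelerates sampling.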

# 5. DDIM’s Overall Step is Unchanged

(Appendix B, Lemma 1)

**Lemma 1.** For $q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ and $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \sigma_t^2 \bold{I})$, we have:

$$q_{\sigma}(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I}) \quad \text{for all } t.$$

**Proof:**

We prove by induction on $t$, going from $T$ down to $1$.

By definition of $q_{\sigma}(\bold{x}_T | \bold{x}_0)$, the case $t = T$ holds.

Assume $q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$ holds for some $t \leq T$; we show it also holds for $t - 1$.

First, we have:

$$q_{\sigma}(\bold{x}_{t-1} | \bold{x}_0) = \int q_{\sigma}(\bold{x}_t | \bold{x}_0) \, q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \, d\bold{x}_t,$$

and

$$q_{\sigma}(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I}), \quad q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \mathcal{N} \left( \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I} \right).$$

From Bishop (2006), Eq. (2.115), we have that $q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0)$ is Gaussian, denoted as $\mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$, where

$$\mu_{t-1} = \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\sqrt{\alpha_t} \bold{x}_0 - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}} = \sqrt{\alpha_{t-1}} \bold{x}_0$$

and

$$\Sigma_{t-1} = \sigma_t^2 \bold{I} + \frac{1 - \alpha_{t-1} - \sigma_t^2}{1 - \alpha_t} \cdot (1 - \alpha_t) \bold{I} = (1 - \alpha_{t-1}) \bold{I}.$$

Therefore, $q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{t-1}} \bold{x}_0, (1 - \alpha_{t-1}) \bold{I})$, which completes the induction step.
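Lemma 1 can also be checked by simulation: sample $\bold{x}_t$ from the marginal, then $\bold{x}_{t-1}$ from $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$, and compare the empirical statistics with $\mathcal{N}(\sqrt{\alpha_{t-1}} \bold{x}_0, (1 - \alpha_{t-1}) \bold{I})$ (scalar toy values below, chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
a_t, a_prev, sigma = 0.5, 0.8, 0.1
x0, N = 1.5, 200_000

# sample x_t ~ q_sigma(x_t | x_0)
xt = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * rng.standard_normal(N)
# then x_{t-1} ~ q_sigma(x_{t-1} | x_t, x_0)
mean = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev - sigma**2) \
       * (xt - np.sqrt(a_t) * x0) / np.sqrt(1 - a_t)
xtm1 = mean + sigma * rng.standard_normal(N)

print(xtm1.mean())  # ~ sqrt(0.8) * 1.5
print(xtm1.var())   # ~ 1 - 0.8 = 0.2
```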

**End of Proof.**

# 6. Generative Process and Unified Variational Inference Objective

## Generative Process

Intuitively, given a noisy observation $\bold{x}_t$, we first make a prediction of the corresponding $\bold{x}_0$, and then use it to obtain a sample $\bold{x}_{t-1}$ through the reverse conditional distribution $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ defined above.

For some $\bold{x}_0 \sim q(\bold{x}_0)$ and $\epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})$, $\bold{x}_t$ can be obtained by the overall step of the forward process:

$$\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t.$$

The model $\epsilon_{\theta}^{(t)}(\bold{x}_t)$ then attempts to predict $\epsilon_t$ without knowledge of $\bold{x}_0$. By rewriting the overall step above, we get the prediction of $\bold{x}_0$ given $\bold{x}_t$:

$$f_{\theta}^{(t)}(\bold{x}_t) \coloneqq \frac{\bold{x}_t - \sqrt{1 - \alpha_t} \cdot \epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{\alpha_t}}.$$
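A sanity check of the $\bold{x}_0$-prediction formula: with a perfect noise prediction, $f_{\theta}^{(t)}$ recovers $\bold{x}_0$ exactly (toy values below; `eps` plays the role of a perfect $\epsilon_{\theta}^{(t)}(\bold{x}_t)$):

```python
import numpy as np

a_t = 0.4
x0 = np.array([0.2, -1.0, 0.7])
eps = np.array([0.5, 0.1, -0.3])
xt = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps

# Inverting the overall step with a perfect noise prediction recovers x0:
f = (xt - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
print(np.allclose(f, x0))  # True
```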

We can define a fixed prior $p_{\theta}(\bold{x}_T) = \mathcal{N}(\bold{0}, \bold{I})$ for the generative process and

$$p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t) = \begin{cases} \mathcal{N}(f_{\theta}^{(1)}(\bold{x}_1), \sigma_1^2 \bold{I}) & \text{if } t = 1, \\ q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t)) & \text{if } t > 1. \end{cases}$$

## Variational Inference Objective

The parameters $\theta$ can be optimized with the following variational inference objective (which is a functional over $\epsilon_{\theta}$):

$$J_{\sigma}(\epsilon_{\theta}) \coloneqq \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} [\log q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) - \log p_{\theta}(\bold{x}_{0:T})],$$

where $q_{\sigma} (\bold{x}_{1:T} | \bold{x}_0)$ factorizes as $q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ and $p_{\theta}(\bold{x}_{0:T})$ factorizes as $p_{\theta} (\bold{x}_{0:T}) \coloneqq p_{\theta} (\bold{x}_T) \prod_{t=1}^T p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)$.

It can be proved that $J_{\sigma}$ is equivalent to $L_{\gamma}$ (the DDPM objective) for certain weights $\gamma$. Therefore, the model $\epsilon_{\theta}$ does not require retraining, and we can reuse a model trained with the DDPM objective.

**Theorem 1.** For all $\sigma > 0$, there exist $\gamma \in \mathbb{R}_{>0}^T$ and $C \in \mathbb{R}$, such that $J_{\sigma} = L_{\gamma} + C$.

**Proof:**

From the definition of $J_{\sigma}$:

$$J_{\sigma}(\epsilon_{\theta}) \equiv \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} \left[ \sum_{t=2}^T \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} - \log p_{\theta}^{(1)}(\bold{x}_0 | \bold{x}_1) \right],$$

where we use $\equiv$ to denote “equal up to a value that does not depend on $\epsilon_{\theta}$ (but may depend on $q_{\sigma}$)”.

How is the $\equiv$ derived?

Consider the part $\sum_{t=2}^T \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)}$: taking the expectation over $\bold{x}_{t-1} \sim q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ turns each term into the KL divergence $D_{KL}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,\|\, p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t))$.

For $t > 1$, by the definition $p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t) = q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t))$:

$$p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t) = \mathcal{N} \left( \sqrt{\alpha_{t-1}} f_{\theta}^{(t)}(\bold{x}_t) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} f_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I} \right).$$