Score-Based Generative Modeling through Stochastic Differential Equations

This is a learning note for this series of videos.

Paper Link: https://arxiv.org/abs/2011.13456

1. Why use an SDE to describe the diffusion process?

We want to perturb the data with multiple noise scales (why?). The idea is to use an SDE to provide an infinite number of noise scales (continuously changing).

The SDE (which is continuous) can be used for theoretical analysis. In practice, we discretize the SDE for numerical computation.

Goal: construct a diffusion process $\{\bold{x}(t)\}_{t=0}^T$ indexed by a continuous time variable $t \in [0, T]$, such that $\bold{x}(0) \sim p_0$, for which we have a dataset of i.i.d. samples, and $\bold{x}(T) \sim p_T$, for which we have a tractable form to generate samples efficiently. This diffusion process can be modeled as the solution to an Itô SDE:

$$d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$$

$\bold{w}$: a Brownian motion whose increments are Gaussian, with variance equal to the time increment:

$$\bold{w}(t+\Delta t) - \bold{w}(t) \sim \mathcal{N}(\bold{0}, \Delta t\, \bold{I})$$
$$d\bold{w} = \sqrt{\Delta t}\, \epsilon \qquad \text{where } \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

$\bold{f}(\cdot, t): \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a vector-valued function called the drift coefficient of $\bold{x}(t)$, and $g(\cdot): \mathbb{R} \rightarrow \mathbb{R}$ is a scalar function known as the diffusion coefficient of $\bold{x}(t)$.
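As a concrete picture of what this equation means numerically, below is a minimal NumPy sketch of simulating the forward SDE with the Euler–Maruyama scheme; the drift and diffusion passed in at the bottom are toy placeholders, not the schedules used in the paper.

```python
import numpy as np

def euler_maruyama_forward(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw with the Euler-Maruyama scheme."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)  # dw ~ N(0, dt * I)
        x = x + f(x, t) * dt + g(t) * dw
    return x

# Toy example (placeholder coefficients, not the paper's schedules):
x_T = euler_maruyama_forward(
    x0=np.zeros(2),
    f=lambda x, t: -0.5 * x,   # assumed affine drift, just for illustration
    g=lambda t: 1.0,           # assumed constant diffusion
)
```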

Starting from samples of $\bold{x}(T) \sim p_T$ and reversing the process, we can obtain samples $\bold{x}(0) \sim p_0$. It has been proved that the reverse of a diffusion process is also a diffusion process, running backwards in time and given by the reverse-time SDE:

$$d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$$

where $\bar{\bold{w}}$ is a standard Wiener process when time flows backwards from $T$ to $0$, and $dt$ is an infinitesimal negative timestep. Once the score of each marginal distribution, $\nabla_{\bold{x}} \log p_t(\bold{x})$, is known for all $t$, we can derive the above reverse diffusion process and simulate it to sample from $p_0$.
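A minimal NumPy sketch of simulating this reverse-time SDE is below, assuming a function `score(x, t)` that approximates $\nabla_{\bold{x}} \log p_t(\bold{x})$ (in practice, a trained network); the backward Euler–Maruyama integrator here is just one simple discretization.

```python
import numpy as np

def reverse_sde_sampler(x_T, f, g, score, T=1.0, n_steps=1000, rng=None):
    """Integrate dx = [f(x, t) - g(t)^2 score(x, t)] dt + g(t) dw_bar
    from t = T down to t = 0 (dt is negative)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps                    # magnitude of the (negative) time step
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x - drift * dt + g(t) * dw  # minus sign: time runs backwards
    return x
```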

2. How to derive the reverse-time SDE?

Forward SDE: $d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$

Reverse SDE: $d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$

Proof 1:

Important Assumption: the diffusion coefficient is $g(t)$ rather than $g(\bold{x}, t)$.

Characteristics of the Brownian Motion $\{w(t), t \geq 0\}$:

  1. Gaussian Increments: $w(t) - w(s) \sim \mathcal{N}(0, t-s)$; in particular, $w(t) - w(0) \sim \mathcal{N}(0, t)$.
  1. Independent Increments: if $0 \leq u \leq s \leq t$, then $w(t) - w(s)$ and $w(s) - w(u)$ are independent.
  1. Path Continuity: $w(t)$ is a continuous function of $t$.

Discretization of the forward SDE gives:

$$\bold{x}_{t+\Delta t} - \bold{x}_t = \bold{f}(\bold{x}_t, t)\, \Delta t + g(t)\sqrt{\Delta t}\, \epsilon \qquad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$
$$\Leftrightarrow \bold{x}_{t+\Delta t} = \bold{x}_t + \bold{f}(\bold{x}_t, t)\, \Delta t + g(t)\sqrt{\Delta t}\, \epsilon$$

This indicates that:

$$\begin{align*} p(\bold{x}_{t+\Delta t} | \bold{x}_t) &= \mathcal{N}(\bold{x}_{t+\Delta t} \,|\, \bold{x}_t + \bold{f}(\bold{x}_t, t)\Delta t,\ g(t)^2 \Delta t \cdot \bold{I}) \\ &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} \right) \end{align*}$$

According to Bayes' theorem:

$$\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &= \frac{p(\bold{x}_{t+\Delta t} | \bold{x}_t)\, p(\bold{x}_t)}{p(\bold{x}_{t+\Delta t})} = p(\bold{x}_{t+\Delta t} | \bold{x}_t) \exp\left( \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t}) \right) \\ &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} + \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t}) \right) \end{align*}$$

Note: the first-order Taylor expansion of $f(x)$ at $x_0$ is $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$.

Note: $\log p(\bold{x}_t)$ is actually a function of both $\bold{x}_t$ and $t$, therefore when taking derivatives, both of them should be considered.

By Taylor Expansion, we have

$$\log p(\bold{x}_{t+\Delta t}) \approx \log p(\bold{x}_t) + (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) + \Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t)$$

Plugging the above equation into $p(\bold{x}_t | \bold{x}_{t+\Delta t})$:

$$\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) - \Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t) \right) \end{align*}$$
$$\begin{align*} & -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( \|\bold{x}_{t+\Delta t} - \bold{x}_t\|^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \bold{f}(\bold{x}_t, t)\Delta t + \|\bold{f}(\bold{x}_t, t)\|^2 \Delta t^2 + 2 g(t)^2 \Delta t\, (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) \right) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( \|\bold{x}_{t+\Delta t} - \bold{x}_t\|^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\Delta t + C(\Delta t) \right) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( \|\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\Delta t\|^2 + C(\Delta t) \right) \end{align*}$$

Here $C(\Delta t)$ is a polynomial of $\Delta t$ without a constant term, so it goes to $0$ as $\Delta t \rightarrow 0$ (the term $\Delta t\, \frac{\partial}{\partial t}\log p(\bold{x}_t)$ vanishes in the same limit and is dropped as well). Therefore,

$$\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\Delta t\|^2}{2 g(t)^2 \Delta t} \right) \\ &\approx \exp\left( -\frac{\|\bold{x}_t - \bold{x}_{t+\Delta t} - [\bold{f}(\bold{x}_{t+\Delta t}, t+\Delta t) - g(t+\Delta t)^2 \nabla_{\bold{x}_{t+\Delta t}} \log p(\bold{x}_{t+\Delta t})](-\Delta t)\|^2}{2 g(t+\Delta t)^2 \Delta t} \right) \end{align*}$$

Previously, we derived $p(\bold{x}_{t+\Delta t} | \bold{x}_t) \propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} \right)$ from $d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$. Therefore we can conclude that the corresponding reverse-time SDE is:

$$d\bold{x} = [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\, dt + g(t)\, d\bar{\bold{w}}$$

3. Core settings of two diffusion models: SMLD and DDPM

3.1 Denoising score matching with Langevin Dynamics (SMLD)

Let the perturbation kernel be $p_{\sigma}(\tilde{\bold{x}} | \bold{x}) \coloneqq \mathcal{N}(\tilde{\bold{x}}; \bold{x}, \sigma^2 \bold{I})$, and $p_{\sigma}(\tilde{\bold{x}}) \coloneqq \int p_{data}(\bold{x})\, p_{\sigma}(\tilde{\bold{x}} | \bold{x})\, d\bold{x}$, where $p_{data}(\bold{x})$ denotes the data distribution. Consider a sequence of positive noise scales $\sigma_{\min} = \sigma_1 < \sigma_2 < \dots < \sigma_N = \sigma_{\max}$. Usually $\sigma_{\min}$ is small enough that $p_{\sigma_{\min}}(\bold{x}) \approx p_{data}(\bold{x})$, and $\sigma_{\max}$ is large enough that $p_{\sigma_{\max}}(\bold{x}) \approx \mathcal{N}(\bold{x}; \bold{0}, \sigma_{\max}^2 \bold{I})$.

Overall Step (derived from the single step): $p(\bold{x}_i | \bold{x}_0) = \mathcal{N}(\bold{x}_i; \bold{x}_0, \sigma_i^2 \bold{I})$

Single Step (by design): $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \bold{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)\bold{I})$

Previous work proposes to train a Noise Conditional Score Network $\bold{s}_{\theta}(\bold{x}, \sigma)$ with a weighted sum of denoising score matching objectives:

$$\theta^* = \argmin_{\theta} \sum_{i=1}^N \sigma_i^2\, \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})}\left[ \|\bold{s}_{\theta}(\tilde{\bold{x}}, \sigma_i) - \nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})\|^2_2 \right]$$

Here, $\nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x}) = -\frac{\tilde{\bold{x}} - \bold{x}}{\sigma_i^2}$. The optimal score-based model $\bold{s}_{\theta^*}(\bold{x}, \sigma)$ can be obtained given sufficient data and model capacity. It matches $\nabla_{\bold{x}} \log p_{\sigma}(\bold{x})$ almost everywhere for $\sigma \in \{\sigma_i\}_{i=1}^N$.
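A minimal NumPy sketch of evaluating this weighted objective is below; `score_fn` stands in for $\bold{s}_{\theta}$ (in practice a neural network optimized with autograd), and the Monte-Carlo averaging over a batch is an assumption of how the expectations would be estimated.

```python
import numpy as np

def smld_dsm_loss(score_fn, x_batch, sigmas, rng=None):
    """Monte-Carlo estimate of the weighted denoising score matching objective.
    x_batch: (B, d) data samples; sigmas: the noise scales sigma_1 < ... < sigma_N."""
    rng = np.random.default_rng() if rng is None else rng
    loss = 0.0
    for sigma in sigmas:
        noise = sigma * rng.standard_normal(x_batch.shape)
        x_tilde = x_batch + noise            # sample from p_sigma(x_tilde | x)
        target = -noise / sigma ** 2         # grad_x_tilde log p_sigma(x_tilde | x)
        residual = score_fn(x_tilde, sigma) - target
        loss += sigma ** 2 * np.mean(np.sum(residual ** 2, axis=-1))
    return loss
```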

For sampling, the work uses $M$ steps of Langevin MCMC to get a sample for each $p_{\sigma_i}(\bold{x})$ sequentially:

$$\bold{x}_i^m = \bold{x}_i^{m-1} + \epsilon_i\, \bold{s}_{\theta^*}(\bold{x}_i^{m-1}, \sigma_i) + \sqrt{2\epsilon_i}\, \bold{z}_i^m \qquad m = 1, 2, \dots, M,$$

where $\epsilon_i > 0$ is the step size, and $\bold{z}_i^m$ is standard normal. The above process is repeated for $i = N, N-1, \dots, 1$ in turn, with $\bold{x}_N^0 \sim \mathcal{N}(\bold{0}, \sigma_{\max}^2 \bold{I})$ and $\bold{x}_i^0 = \bold{x}_{i+1}^M$ when $i < N$.

Note: $i$ indexes the noise levels, and $M$ is the number of denoising steps at each noise level. Therefore, to generate a sample, we need to run the score function $N \times M$ times:

$$\bold{x}_N^0 \rightarrow \bold{x}_N^1 \rightarrow \dots \rightarrow \bold{x}_N^M \rightarrow \bold{x}_{N-1}^1 \rightarrow \dots \rightarrow \bold{x}_{N-1}^M \rightarrow \dots \rightarrow \bold{x}_1^M$$
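The whole annealed Langevin procedure, as a NumPy sketch; `score_fn` stands in for the trained $\bold{s}_{\theta^*}$, and the step-size rule $\epsilon_i = \epsilon \cdot \sigma_i^2 / \sigma_1^2$ is the common choice from Song & Ermon (2020), taken here as an assumption rather than the only option.

```python
import numpy as np

def annealed_langevin_sampling(score_fn, sigmas, shape, n_steps_each=100,
                               step_lr=2e-5, rng=None):
    """Run M Langevin steps at each noise level, from sigma_N down to sigma_1,
    warm-starting each level from the result of the previous one.
    sigmas is assumed sorted increasingly: sigma_1 = sigma_min, ..., sigma_N = sigma_max."""
    rng = np.random.default_rng() if rng is None else rng
    x = sigmas[-1] * rng.standard_normal(shape)      # x_N^0 ~ N(0, sigma_max^2 I)
    for sigma in sigmas[::-1]:                       # i = N, N-1, ..., 1
        eps = step_lr * (sigma / sigmas[0]) ** 2     # assumed step-size schedule
        for _ in range(n_steps_each):                # m = 1, ..., M
            z = rng.standard_normal(shape)
            x = x + eps * score_fn(x, sigma) + np.sqrt(2.0 * eps) * z
    return x
```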

3.2 Denoising Diffusion Probabilistic Models (DDPM)

Consider a sequence of positive noise scales $0 < \beta_1, \beta_2, \dots, \beta_N < 1$. For each training data point $\bold{x}_0 \sim p_{data}(\bold{x})$, construct a discrete Markov chain $\{\bold{x}_0, \bold{x}_1, \dots, \bold{x}_N\}$. The single step and the overall step of this chain are:

Single step (by design): $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \sqrt{1 - \beta_i}\, \bold{x}_{i-1}, \beta_i \bold{I})$

Overall step (derived): $p_{\alpha_i}(\bold{x}_i | \bold{x}_0) = \mathcal{N}(\bold{x}_i; \sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$

Here $\alpha_i \coloneqq \Pi_{j=1}^i (1 - \beta_j)$.

Proof of overall step:

Since $\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_i$ and $\bold{x}_{i+1} = \sqrt{1 - \beta_{i+1}}\, \bold{x}_i + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1}$, we have:

$$\begin{align*} \bold{x}_{i+1} &= \sqrt{1 - \beta_{i+1}}\left(\sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_i\right) + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{(1 - \beta_{i+1})\beta_i}\, \bold{z}_i + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{\beta_i + \beta_{i+1} - \beta_i\beta_{i+1}}\, \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{1 - (1 - \beta_{i+1})(1 - \beta_i)}\, \bold{z} \end{align*}$$

By induction, we can conclude that:

$$\begin{align*} \bold{x}_i &= \sqrt{\Pi_{j=1}^i (1 - \beta_j)}\, \bold{x}_0 + \sqrt{1 - \Pi_{j=1}^i (1 - \beta_j)}\, \bold{z} \\ &= \sqrt{\alpha_i}\, \bold{x}_0 + \sqrt{1 - \alpha_i}\, \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \end{align*}$$

Therefore, $p_{\alpha_i}(\bold{x}_i | \bold{x}_0) = \mathcal{N}(\bold{x}_i; \sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$.

End of proof.
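As a quick sanity check of this closed form, the NumPy snippet below composes the single steps over many chains and compares the empirical mean/variance with $\mathcal{N}(\sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$; the $\beta$ schedule is just an assumed toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed toy schedule
alphas = np.cumprod(1.0 - betas)          # alpha_i = prod_{j<=i} (1 - beta_j)
x0 = np.array([1.0, -2.0])

# Compose the single steps x_i = sqrt(1 - beta_i) x_{i-1} + sqrt(beta_i) z_i
# for the first i = 50 steps, over many independent chains ...
n_chains, i = 20000, 50
x = np.tile(x0, (n_chains, 1))
for beta in betas[:i]:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# ... and compare against the closed form N(sqrt(alpha_i) x0, (1 - alpha_i) I).
print(x.mean(axis=0), np.sqrt(alphas[i - 1]) * x0)        # means should agree
print(x.var(axis=0), (1.0 - alphas[i - 1]) * np.ones(2))  # variances should agree
```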

The noise scales are prescribed such that $\bold{x}_N$ is approximately distributed according to $\mathcal{N}(\bold{0}, \bold{I})$.

The denoising distribution derived from Bayes' formula is:

$$p_{\theta}(\bold{x}_{i-1} | \bold{x}_i) = \mathcal{N}\left(\bold{x}_{i-1}; \frac{1}{\sqrt{1 - \beta_i}}\left(\bold{x}_i + \beta_i\, \bold{s}_{\theta}(\bold{x}_i, i)\right), \beta_i \bold{I}\right)$$

and it can be trained with a re-weighted variant of the evidence lower bound (ELBO):

$$\theta^* = \argmin_{\theta} \sum_{i=1}^N (1 - \alpha_i)\, \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\alpha_i}(\tilde{\bold{x}}|\bold{x})}\left[ \|\bold{s}_{\theta}(\tilde{\bold{x}}, i) - \nabla_{\tilde{\bold{x}}} \log p_{\alpha_i}(\tilde{\bold{x}}|\bold{x})\|^2_2 \right]$$

After solving the above problem we get the optimal model $\bold{s}_{\theta^*}(\bold{x}_i, i)$; samples can then be generated by starting from $\bold{x}_N \sim \mathcal{N}(\bold{0}, \bold{I})$ and following the estimated reverse Markov chain as below:

$$\bold{x}_{i-1} = \frac{1}{\sqrt{1 - \beta_i}}\left(\bold{x}_i + \beta_i\, \bold{s}_{\theta^*}(\bold{x}_i, i)\right) + \sqrt{\beta_i}\, \bold{z}_i, \qquad i = N, N-1, \dots, 1.$$
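This reverse chain (ancestral sampling) is easy to write down as a NumPy sketch; `score_fn` again stands in for the trained $\bold{s}_{\theta^*}$.

```python
import numpy as np

def ddpm_ancestral_sampling(score_fn, betas, shape, rng=None):
    """Sample by running the estimated reverse Markov chain:
    x_{i-1} = (x_i + beta_i * s(x_i, i)) / sqrt(1 - beta_i) + sqrt(beta_i) z_i."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)              # x_N ~ N(0, I)
    for i in range(len(betas), 0, -1):          # i = N, N-1, ..., 1
        beta = betas[i - 1]
        z = rng.standard_normal(shape)
        x = (x + beta * score_fn(x, i)) / np.sqrt(1.0 - beta) + np.sqrt(beta) * z
    return x
```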

NOTE: the weights of the $i$-th summand in the SMLD and DDPM loss functions, namely $\sigma_i^2$ and $(1 - \alpha_i)$, are related to the corresponding perturbation kernels in the same functional form: $\sigma_i^2 \propto 1 / \mathbb{E}\left[\|\nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})\|^2_2\right]$ and $(1 - \alpha_i) \propto 1 / \mathbb{E}\left[\|\nabla_{\tilde{\bold{x}}} \log p_{\alpha_i}(\tilde{\bold{x}}|\bold{x})\|^2_2\right]$.

4. VE, VP SDEs: Derive SDEs from Conditional Distributions

The noise perturbations used in SMLD and DDPM can be regarded as discretizations of two different SDEs.

VE Model

When using a total of $N$ noise scales, each perturbation kernel $p_{\sigma_i}(\bold{x} | \bold{x}_0)$ of SMLD corresponds to the distribution of $\bold{x}_i$ in the following Markov chain:

$$\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\, \bold{z}_{i-1} \qquad i = 1, \dots, N,$$

where $\bold{z}_{i-1} \sim \mathcal{N}(\bold{0}, \bold{I})$, and we have introduced $\sigma_0 = 0$ to simplify the notation.

(Recall that the single-step perturbation of SMLD is $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \bold{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)\bold{I})$.)

When $N \rightarrow \infty$, $\{\sigma_i\}_{i=1}^N$ becomes a function $\sigma(t)$, $\bold{z}_i$ becomes $\bold{z}(t)$, and the Markov chain $\{\bold{x}_i\}_{i=1}^N$ becomes a continuous stochastic process $\{\bold{x}(t)\}_{t=0}^1$, where we have used a continuous time variable $t \in [0, 1]$. The process $\{\bold{x}(t)\}_{t=0}^1$ is given by the following SDE:

$$d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w}$$

VP Model

Similarly for DDPM, the perturbation kernels are $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \sqrt{1 - \beta_i}\, \bold{x}_{i-1}, \beta_i \bold{I})$. Then the discrete Markov chain is:

$$\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_{i-1} \qquad i = 1, \dots, N.$$

As $N \rightarrow \infty$, the Markov chain converges to the following SDE:

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$

Proof for SMLD(VE):

$$\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\, \bold{z}_{i-1} \qquad i = 1, \dots, N.$$

Define some notation first: $\bold{x}(\frac{i}{N}) = \bold{x}_i$, $\sigma(\frac{i}{N}) = \sigma_i$, and $\bold{z}(\frac{i}{N}) = \bold{z}_i$ for $i = 1, \dots, N$. We can rewrite the Markov chain as follows, with $\Delta t = \frac{1}{N}$ and $t \in \{0, \frac{1}{N}, \dots, \frac{N-1}{N}\}$:

$$\bold{x}(t + \Delta t) = \bold{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)}\, \bold{z}(t) \approx \bold{x}(t) + \sqrt{\frac{d[\sigma^2(t)]}{dt}\Delta t}\, \bold{z}(t)$$

The approximation is given by the definition of the derivative: $\frac{d[\sigma^2(t)]}{dt} = \lim_{\Delta t \rightarrow 0} \frac{\sigma^2(t + \Delta t) - \sigma^2(t)}{\Delta t}$.

As $\Delta t \rightarrow 0$, $\sqrt{\Delta t}\, \bold{z}(t) \rightarrow d\bold{w}$, where $\bold{w}$ is a Wiener process. This is because $\bold{w}(t + \Delta t) - \bold{w}(t) \sim \mathcal{N}(\bold{0}, \Delta t\, \bold{I})$.

End of Proof.

Proof for DDPM(VP):

$$\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_{i-1} \qquad i = 1, \dots, N.$$

Define an auxiliary set of noise scales $\{\bar{\beta_i} = N\beta_i\}_{i=1}^N$ and rewrite the Markov chain as below:

$$\bold{x}_i = \sqrt{1 - \frac{\bar{\beta_i}}{N}}\, \bold{x}_{i-1} + \sqrt{\frac{\bar{\beta_i}}{N}}\, \bold{z}_{i-1} \qquad i = 1, \dots, N$$

In the limit of $N \rightarrow \infty$, $\{\bar{\beta_i}\}_{i=1}^N$ becomes a function $\beta(t)$ indexed by $t \in [0, 1]$. Let $\beta(\frac{i}{N}) = \bar{\beta_i}$, $\bold{x}(\frac{i}{N}) = \bold{x}_i$, $\bold{z}(\frac{i}{N}) = \bold{z}_i$. We can rewrite the above Markov chain as follows, with $\Delta t = \frac{1}{N}$ and $t \in \{0, \frac{1}{N}, \dots, \frac{N-1}{N}\}$:

$$\begin{align*} \bold{x}(t + \Delta t) &= \sqrt{1 - \beta(t + \Delta t)\Delta t}\, \bold{x}(t) + \sqrt{\beta(t + \Delta t)\Delta t}\, \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2}\beta(t + \Delta t)\Delta t\, \bold{x}(t) + \sqrt{\beta(t + \Delta t)\Delta t}\, \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2}\beta(t)\Delta t\, \bold{x}(t) + \sqrt{\beta(t)\Delta t}\, \bold{z}(t) \end{align*}$$

The first approximation comes from the Taylor expansion $\sqrt{1 - x} = 1 - \frac{x}{2} - \frac{x^2}{8} - \dots \approx 1 - \frac{x}{2}$; the second uses $\beta(t + \Delta t) \approx \beta(t)$ as $\Delta t \rightarrow 0$.

Therefore, in the limit of $\Delta t \rightarrow 0$, the Markov chain converges to the following VP SDE:

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$
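A rough numerical illustration of this convergence: simulate the discrete DDPM chain and an Euler–Maruyama discretization of the VP SDE side by side and compare the marginal statistics at $t = 1$. The linear $\beta_i$ schedule and the interpolation used for $\beta(t)$ are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
betas = np.linspace(1e-4, 0.02, N)     # assumed discrete schedule beta_i
grid = np.arange(1, N + 1) / N         # t_i = i / N

def beta_bar(t):
    """beta(t) = N * beta_i, interpolated from the discrete schedule."""
    return N * np.interp(t, grid, betas)

x0, n_paths, dt = 1.0, 20000, 1.0 / N

# Discrete DDPM chain: x_i = sqrt(1 - beta_i) x_{i-1} + sqrt(beta_i) z_i.
x_disc = np.full(n_paths, x0)
for b in betas:
    x_disc = np.sqrt(1.0 - b) * x_disc + np.sqrt(b) * rng.standard_normal(n_paths)

# VP SDE, Euler-Maruyama: dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw.
x_sde = np.full(n_paths, x0)
for i in range(N):
    t = i * dt
    x_sde = x_sde - 0.5 * beta_bar(t) * x_sde * dt \
            + np.sqrt(beta_bar(t) * dt) * rng.standard_normal(n_paths)

print(x_disc.mean(), x_sde.mean())   # the two means should be close
print(x_disc.std(), x_sde.std())     # the two stds should be close
```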

5. How to solve the SDE?

Idea: solve for $\mathbb{E}[\bold{x}_t]$ and $\text{Var}(\bold{x}_t)$; then, under the Gaussian assumption, we know that $p_{0t}(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\cdot, \cdot)$.

Theorem 5.1 (simplified from Equation (5.50), Equation (5.51) in Applied Stochastic Differential Equations)

If the SDE takes the form:

$$d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$$

Then the expectation $\bold{m}(t)$ and the covariance matrix $\bold{P}(t)$ of $\bold{x}(t)$ satisfy:

$$\frac{d\bold{m}(t)}{dt} = \mathbb{E}[\bold{f}(\bold{x}, t)]$$
$$\frac{d\bold{P}(t)}{dt} = \mathbb{E}[\bold{f}(\bold{x}, t)(\bold{x} - \bold{m}(t))^T] + \mathbb{E}[(\bold{x} - \bold{m}(t))\bold{f}^T(\bold{x}, t)] + g^2(t)\,\bold{I}$$

Solution to VP SDE:

$$\text{VP Model}: \quad d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$

By Theorem 5.1, we have

$$\begin{align*} \frac{d\bold{m}}{dt} &= \mathbb{E}\left[-\frac{1}{2}\beta(t)\bold{x}\right] = -\frac{1}{2}\beta(t)\, \mathbb{E}[\bold{x}(t)] = -\frac{1}{2}\beta(t)\, \bold{m} \\ \Rightarrow \frac{d\bold{m}}{\bold{m}} &= -\frac{1}{2}\beta(t)\, dt \\ \ln \bold{m}(t) - \ln \bold{m}(0) &= -\frac{1}{2}\int_0^t \beta(s)\, ds \\ \Rightarrow \bold{m}(t) &= e^{\ln \bold{m}(0) - \frac{1}{2}\int_0^t \beta(s) ds} = \bold{m}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds} \end{align*}$$

Therefore,

$$\mathbb{E}[\bold{x}(t) | \bold{x}(0)] = \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds}$$

For the covariance matrix $P(t)$:

$$\begin{align*} \frac{dP(t)}{dt} &= \mathbb{E}\left[-\frac{1}{2}\beta(t)\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] + \mathbb{E}\left[-\frac{1}{2}\beta(t)(\bold{x}(t) - \bold{m}(t))\bold{x}(t)^T\right] + \beta(t)\,\bold{I} \\ &= -\beta(t)\, \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] + \beta(t)\,\bold{I} \end{align*}$$

Since $\mathbb{E}[\bold{x}(t) - \bold{m}(t)] = \bold{0}$, we have:

$$\begin{align*} \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] &= \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] - \bold{0} \\ &= \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] - \bold{m}(t)\, \mathbb{E}[(\bold{x}(t) - \bold{m}(t))^T] \\ &= \mathbb{E}[(\bold{x}(t) - \bold{m}(t))(\bold{x}(t) - \bold{m}(t))^T] \\ &= P(t) \end{align*}$$
$$\begin{align*} \Rightarrow \frac{dP(t)}{dt} &= -\beta(t)P(t) + \beta(t)\,\bold{I} = \beta(t)(\bold{I} - P(t)) \\ \frac{dP(t)}{\bold{I} - P(t)} &= \beta(t)\, dt \\ -\ln(\bold{I} - P(t)) + \ln(\bold{I} - P(0)) &= \int_0^t \beta(s)\, ds \\ \bold{I} - P(t) &= \exp\left\{\ln(\bold{I} - P(0)) - \int_0^t \beta(s)\, ds\right\} \\ P(t) &= \bold{I} - (\bold{I} - P(0))\, e^{-\int_0^t \beta(s) ds} \end{align*}$$

Since $\text{Cov}(\bold{x}_0 | \bold{x}_0) = 0$, we have:

$$\text{Cov}(\bold{x}_t | \bold{x}_0) = \bold{I} - \bold{I}\, e^{-\int_0^t \beta(s) ds}$$

Therefore, the solution to the VP model is:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds},\ \bold{I} - \bold{I}\, e^{-\int_0^t \beta(s) ds}\right) \qquad (\text{VP SDE})$$
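In code, this perturbation kernel reduces to one integral of $\beta$; the sketch below assumes the commonly used linear schedule $\beta(t) = \beta_{\min} + t(\beta_{\max} - \beta_{\min})$ (the specific values are placeholders), so $\int_0^t \beta(s)\, ds$ has a closed form.

```python
import numpy as np

def vp_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """Mean and std of p_{0t}(x(t) | x(0)) for the VP SDE under an assumed
    linear schedule beta(t) = beta_min + t * (beta_max - beta_min)."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2  # \int_0^t beta(s) ds
    mean = np.exp(-0.5 * int_beta) * x0
    std = np.sqrt(1.0 - np.exp(-int_beta))
    return mean, std
```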

Solution to VE SDE:

$$\text{VE Model}: \quad d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w} \qquad (\bold{f}(\bold{x}, t) = \bold{0})$$

By Theorem 5.1:

$$\frac{d\bold{m}}{dt} = \bold{0} \Rightarrow \bold{m}(t) = C = \bold{m}(0) = \bold{x}_0$$
$$\begin{align*} \frac{dP(t)}{dt} &= \frac{d[\sigma^2(t)]}{dt}\, \bold{I} \\ dP(t) &= d[\sigma^2(t)]\, \bold{I} \\ P(t) - P(0) &= [\sigma^2(t) - \sigma^2(0)]\, \bold{I} \\ P(t) &= [\sigma^2(t) - \sigma^2(0)]\, \bold{I} \qquad (P(0) = 0) \end{align*}$$

Therefore, the solution to VE SDE is:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0),\ [\sigma^2(t) - \sigma^2(0)]\, \bold{I}\right) \qquad (\text{VE SDE})$$

6. Derive the mean and variance of the perturbation kernel from the sub-VP SDE

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)\left(1 - e^{-2\int_0^t \beta(s) ds}\right)}\, d\bold{w}$$

Why sub-VP SDE:

  1. Performs well on likelihood
  1. Its variance is bounded by that of the VP SDE

Since the VE, VP and sub-VP SDEs all have affine drift coefficients, their perturbation kernels $p_{0t}(\bold{x}(t) | \bold{x}(0))$ are all Gaussian and can be computed in closed form. This makes training with the score-matching loss tractable:

$$\theta^* = \argmin_{\theta} \mathbb{E}_t\left\{ \lambda(t)\, \mathbb{E}_{\bold{x}(0)} \mathbb{E}_{\bold{x}(t) | \bold{x}(0)}\left[ \|\bold{s}_{\theta}(\bold{x}(t), t) - \nabla_{\bold{x}(t)} \log p_{0t}(\bold{x}(t) | \bold{x}(0))\|^2_2 \right] \right\}$$
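Below is a minimal NumPy sketch of estimating this objective with the closed-form kernels: sample $t$, perturb $\bold{x}(0)$ using the kernel's mean/std, and regress the model output onto $\nabla_{\bold{x}(t)} \log p_{0t}(\bold{x}(t) | \bold{x}(0)) = -\bold{z}/\text{std}(t)$. The uniform sampling of $t$, the small cutoff `eps`, and the form of `lam` are assumptions; `kernel` can be, e.g., the `vp_kernel` sketch from Section 5.

```python
import numpy as np

def continuous_dsm_loss(score_fn, x0_batch, kernel, lam, n_t=128, eps=1e-5, rng=None):
    """Monte-Carlo estimate of the continuous-time score-matching objective.
    kernel(x0, t) -> (mean, std); lam(t) is the weighting lambda(t)."""
    rng = np.random.default_rng() if rng is None else rng
    loss = 0.0
    for _ in range(n_t):
        t = rng.uniform(eps, 1.0)             # avoid t = 0 where std = 0
        mean, std = kernel(x0_batch, t)
        z = rng.standard_normal(x0_batch.shape)
        x_t = mean + std * z                  # sample from p_{0t}(x(t) | x(0))
        target = -z / std                     # grad_{x(t)} log p_{0t}(x(t) | x(0))
        residual = score_fn(x_t, t) - target
        loss += lam(t) * np.mean(np.sum(residual ** 2, axis=-1))
    return loss / n_t
```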

Solution to sub-VP SDE:

Corollary 6.1:

Given an ODE of the form $y'(x) + p(x)y(x) = f(x)$, the solution is given by:

$$y(x) = \frac{1}{\mu(x)}\left( \int f(\xi)\mu(\xi)\, d\xi + C \right)$$

where $\mu(x) = \exp\left( \int^x p(\xi)\, d\xi \right)$.

Similar to the VP SDE, we have:

$$\mathbb{E}[\bold{x}(t) | \bold{x}(0)] = \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds}$$

By Theorem 5.1:

$$\begin{align*} \frac{dP(t)}{dt} &= -\beta(t)P(t) + \beta(t)\left(1 - \exp\left\{-2\int_0^t \beta(s)\, ds\right\}\right) \\ P'(t) + \beta(t)P(t) &= \beta(t)\left(1 - \exp\left\{-2\int_0^t \beta(s)\, ds\right\}\right) \end{align*}$$

By Corollary 6.1:

$$\begin{align*} P(t) &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \left( \int_0^t \beta(s)\left[1 - \exp\left\{-2\int_0^s \beta(\xi)\, d\xi\right\}\right]\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds + C \right) \\ &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \left( \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds - \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds + C \right) \end{align*}$$

Denote $\textcircled{1} = \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds$ and $\textcircled{2} = \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds$, and solve them separately.

$$\begin{align*} \textcircled{1} &= \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds \\ &= \left.\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\}\right|_{s=0}^{s=t} \\ &= \exp\left\{\int_0^t \beta(s)\, ds\right\} - 1 \end{align*}$$
$$\begin{align*} \textcircled{2} &= \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds \\ &= \left.-\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\}\right|_{s=0}^{s=t} \\ &= -\exp\left\{-\int_0^t \beta(s)\, ds\right\} + 1 \end{align*}$$

Therefore (absorbing the constant terms from $\textcircled{1}$ and $\textcircled{2}$ into $C$):

$$\begin{align*} P(t) &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\}\left[ \exp\left\{\int_0^t \beta(s)\, ds\right\} + \exp\left\{-\int_0^t \beta(s)\, ds\right\} + C \right] \\ &= \bold{I} \cdot \left[ 1 + \exp\left\{-2\int_0^t \beta(s)\, ds\right\} + \exp\left\{-\int_0^t \beta(s)\, ds\right\} \cdot C \right] \end{align*}$$

Plugging in $t = 0$, we have $P(0) = \bold{I}(2 + C) \Rightarrow C\bold{I} = P(0) - 2\bold{I}$. Then:

$$P(t) = \bold{I} + e^{-2\int_0^t \beta(s) ds}\, \bold{I} + e^{-\int_0^t \beta(s) ds}\, (P(0) - 2\bold{I})$$

Note: if $\lim_{t \rightarrow \infty} \int_0^t \beta(s)\, ds = \infty$, we can observe that $\lim_{t \rightarrow \infty} P(t) = \bold{I}$. This justifies the use of sub-VP SDEs for score-based generative modeling, since they can perturb any data distribution to a standard Gaussian under suitable conditions.

Since $P(0) = 0$, we have:

$$P(t) = \bold{I} + e^{-2\int_0^t \beta(s) ds}\, \bold{I} - 2e^{-\int_0^t \beta(s) ds}\, \bold{I} = \left[1 - e^{-\int_0^t \beta(s) ds}\right]^2 \bold{I}$$

Therefore, the solution to sub-VP SDEs is:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds},\ \left[1 - e^{-\int_0^t \beta(s) ds}\right]^2 \bold{I}\right) \qquad (\text{sub-VP SDE})$$
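The corresponding kernel sketch in NumPy, with a quick check of the "variance bounded by the VP SDE" remark from the start of this section; the linear $\beta(t)$ schedule is the same assumption as in the VP sketch above.

```python
import numpy as np

def subvp_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """Mean and std of p_{0t}(x(t) | x(0)) for the sub-VP SDE,
    under the same assumed linear beta(t) schedule as the VP sketch."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    mean = np.exp(-0.5 * int_beta) * x0
    std = 1.0 - np.exp(-int_beta)          # the covariance is std^2 * I
    return mean, std

# Sanity check: since 0 <= 1 - e^{-I} <= 1, the sub-VP std (1 - e^{-I}) never
# exceeds the VP std sqrt(1 - e^{-I}).
for t in (0.1, 0.5, 1.0):
    int_beta = 0.1 * t + 0.5 * (20.0 - 0.1) * t ** 2
    assert subvp_kernel(0.0, t)[1] <= np.sqrt(1.0 - np.exp(-int_beta))
```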

7. How to choose the noise scale

7.1 SMLD (VE SDEs)

In SMLD, the noise scales $\{\sigma_i\}_{i=1}^N$ are typically a geometric sequence, where $\sigma_{\text{min}} = 0.01$ and $\sigma_{\text{max}}$ is chosen according to Technique 1 in Song & Ermon (2020).

Technique 1: Choose σmax\sigma_{\text{max}} to be as large as the maximum Euclidean distance between all pairs of training data points. (Usually, SMLD models normalize image inputs to the range [0,1])

Since $\{\sigma_i\}_{i=1}^N$ is a geometric sequence, we have $\sigma(\frac{i}{N}) = \sigma_i = \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{\frac{i-1}{N-1}}$ for $i = 1, \dots, N$. When $N \rightarrow \infty$, $\sigma(t) = \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t$ for $t \in (0, 1]$.

Then the corresponding VE SDE is:

$$\begin{align*} d\bold{x} &= \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w} \\ &= \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t \sqrt{2\ln\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}}\, d\bold{w} \end{align*}$$

The perturbation kernel can be derived from the VE SDE solution in Section 5:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0),\ \sigma^2_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{2t} \bold{I}\right), \qquad t \in (0, 1]$$
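As a small sketch, the geometric $\sigma(t)$ and the kernel standard deviation are one line of NumPy; $\sigma_{\text{max}} = 50$ below is only an illustrative value, since per Technique 1 it should come from the maximum pairwise distance in the training data.

```python
import numpy as np

def ve_kernel_std(t, sigma_min=0.01, sigma_max=50.0):
    """Std of the VE perturbation kernel with the geometric sigma(t);
    sigma_max is an assumed illustrative value (should follow Technique 1)."""
    return sigma_min * (sigma_max / sigma_min) ** np.asarray(t, dtype=float)

print(ve_kernel_std(1.0))   # at t = 1 the std reaches sigma_max (here 50.0)
```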