Score-Based Generative Modeling through Stochastic Differential Equations

This is a learning note on this series of videos.

Paper Link: https://arxiv.org/abs/2011.13456

1. Why use an SDE to describe the diffusion process?

We want to perturb the data with multiple noise scales (why?). The idea is to use an SDE to provide an infinite number of noise scales that change continuously.

The SDE (which is continuous) can be used for theoretical analysis. In practice, we discretize the SDE for numerical computation.

Goal: construct a diffusion process \{\bold{x}(t)\}_{t=0}^T indexed by a continuous time variable t \in [0, T], such that \bold{x}(0) \sim p_0, for which we have a dataset of i.i.d. samples, and \bold{x}(T) \sim p_T, for which we have a tractable form to generate samples efficiently. This diffusion process can be modeled as the solution to an Itô SDE:

d \bold{x} = \bold{f}(\bold{x}, t) dt + g(t) d \bold{w}

\bold{w}: a standard Wiener process (Brownian motion) whose increments are Gaussian, with variance that grows with the time increment:

\bold{w}(t+\Delta t) - \bold{w}(t) \sim \mathcal{N}(0, \Delta t)
d \bold{w} = 0 + \sqrt{\Delta t} \epsilon \qquad \text{where } \epsilon \sim \mathcal{N}(0, \bold{I})

\bold{f} ( \cdot, t): \mathbb{R}^d \rightarrow \mathbb{R}^d is a vector-valued function called the drift coefficient of \bold{x}(t), and g(\cdot): \mathbb{R} \rightarrow \mathbb{R} is a scalar function known as the diffusion coefficient of \bold{x}(t).
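To make the notation concrete, below is a minimal NumPy sketch that simulates the forward SDE with the Euler-Maruyama scheme. The drift and diffusion in the example are illustrative toy choices (a linear β(t)), not the paper's exact configuration.

```python
import numpy as np

def euler_maruyama_forward(x0, f, g, T=1.0, N=1000, seed=0):
    """Simulate dx = f(x, t) dt + g(t) dw with the Euler-Maruyama scheme."""
    rng = np.random.default_rng(seed)
    dt = T / N
    x = np.array(x0, dtype=float)
    for i in range(N):
        t = i * dt
        eps = rng.standard_normal(x.shape)               # epsilon ~ N(0, I)
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * eps  # dw ~= sqrt(dt) * epsilon
    return x

# Toy example: a VP-style drift/diffusion with a linear beta(t) (illustrative values only).
beta = lambda t: 0.1 + t * (20.0 - 0.1)
x_T = euler_maruyama_forward(np.ones(2), f=lambda x, t: -0.5 * beta(t) * x, g=lambda t: np.sqrt(beta(t)))
print(x_T)
```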

Starting from samples of \bold{x}(T) \sim p_T and reversing the process, we can obtain samples \bold{x}(0) \sim p_0. It is proved that the reverse of a diffusion process is also a diffusion process, running backward in time and given by the reverse-time SDE:

d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})] dt + g(t) d \bar{\bold{w}}

where \bar{\bold{w}} is a standard Wiener process when time flows backwards from T to 0, and dt is an infinitesimal negative timestep. Once the score of each marginal distribution, \nabla_{\bold{x}} \log p_t(\bold{x}), is known for all t, we can derive the above reverse diffusion process and simulate it to sample from p_0.

2. How to derive the reverse-time SDE?

Forward SDE: d\bold{x} = \bold{f}(\bold{x}, t) dt + g(t) d \bold{w}

Reverse SDE: d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})] dt + g(t) d \bar{\bold{w}}

Proof 1:

Important Assumption: the diffusion coefficient is g(t)g(t) rather than g(x,t)g(\bold{x}, t).

Characteristics of the Brownian Motion {w(t),t0}\{ w(t), t\geq 0\}:

  1. Gaussian Increments: w(t)w(s)N(0,ts)w(t) - w(s) \sim \mathcal{N}(0, t-s); w(t)w(0)N(0,t)w(t) - w(0) \sim \mathcal{N}(0, t)
  1. Independent Increments: If 0ust0 \leq u \leq s \leq t, then w(t)w(s)w(t) - w(s) and w(s)w(u)w(s) - w(u) are independent.
  1. Path Continuity: w(t)w(t) is a continuous function of tt.

Discretization of the forward SDE gives:

xt+Δtxt=f(xt,t)Δt+g(t)ΔtϵϵN(0,I)\bold{x}_{t + \Delta t} - \bold{x}_t = \bold{f}(\bold{x}_t, t) \Delta t + g(t) \sqrt{\Delta t} \epsilon \qquad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})
xt+Δt=xt+f(xt,t)Δt+g(t)Δtϵ\Leftrightarrow \bold{x}_{t + \Delta t} = \bold{x}_t + \bold{f}(\bold{x}_t, t) \Delta t + g(t) \sqrt{\Delta t} \epsilon

This indicates that:

p(xt+Δtxt)=N(xt+Δtxt+f(xt,t)Δt,g(t)2ΔtI)exp(xt+Δtxtf(xt,t)Δt22g(t)2Δt)\begin{align*} p(\bold{x}_{t + \Delta t} | \bold{x}_t) &= \mathcal{N}(\bold{x}_{t + \Delta t} | \bold{x}_t + \bold{f}(\bold{x}_t, t) \Delta t, g(t)^2 \Delta t \cdot \bold{I}) \\ & \propto \exp \left( - \frac{||\bold{x}_{t + \Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t) \Delta t||^2}{2 g(t)^2 \Delta t} \right)\end{align*}

According to Bayesian theorem:

p(xtxt+Δt)=p(xt+Δtxt)p(xt)p(xt+Δt)=p(xt+Δtxt)exp(logp(xt)logp(xt+Δt))exp(xt+Δtxtf(xt,t)Δt22g(t)2Δt+logp(xt)logp(xt+Δt))\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &= \frac{p(\bold{x}_{t + \Delta t} | \bold{x}_t) p(\bold{x}_t)}{p(\bold{x}_{t + \Delta t)}} = p(\bold{x}_{t + \Delta t} | \bold{x}_t) \exp \left( \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t}) \right) \\ &\propto \exp \left( -\frac{||\bold{x}_{t + \Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t) \Delta t||^2}{2 g(t)^2 \Delta t} + \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t})\right) \end{align*}

Note: First order Taylor Expansion of f(x)f(x) at x0x_0 is f(x)f(x0)+f(x0)(xx0)f(x) \approx f(x_0) + f'(x_0) (x - x_0)

Note: logp(xt)\log p(x_t) is actually a function of both xtx_t and tt, therefore when taking derivatives, both of them should be considered.

By Taylor Expansion, we have

logp(xt+Δt)logp(xt)+(xt+Δtxt)xtlogp(xt)+Δttlogp(xt)\log p(\bold{x}_{t + \Delta t}) \approx \log p(\bold{x}_t) + (\bold{x}_{t + \Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) + \Delta t \frac{\partial}{\partial t} \log p(\bold{x}_t)

Plug in the above equation into p(xtxt+Δt)p(\bold{x}_t | \bold{x}_{t + \Delta t}):

p(xtxt+Δt)exp(xt+Δtxtf(xt,t)Δt22g(t)2Δt(xt+Δtxt)xtlogp(xt)Δttlogp(xt))\begin{align*} p(\bold{x}_t | \bold{x}_{t + \Delta t}) &\propto \exp \left( -\frac{||\bold{x}_{t + \Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t) \Delta t||^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t + \Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) - \Delta t \frac{\partial}{\partial t} \log p(\bold{x}_t) \right)\end{align*}
\begin{align*} & -\frac{||\bold{x}_{t + \Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t) \Delta t||^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t + \Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) \\ &= -\frac{1}{2 g(t)^2 \Delta t} \left( ||\bold{x}_{t+\Delta t} - \bold{x}_t||^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t) \bold{f}(\bold{x}_t, t) \Delta t + ||\bold{f}(\bold{x}_t, t)||^2 \Delta t^2 + 2 g(t)^2 \Delta t (\bold{x}_{t + \Delta t} - \bold{x}_t)\nabla_{\bold{x}_t} \log p(\bold{x}_t)\right) \\ &= -\frac{1}{2 g(t)^2 \Delta t} \left( ||\bold{x}_{t+\Delta t} - \bold{x}_t||^2 - 2(\bold{x}_{t + \Delta t} - \bold{x}_t)[\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)] \Delta t + C(\Delta t) \right) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( ||\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)] \Delta t||^2 + C(\Delta t) \right)\end{align*}

Here C(Δt)C(\Delta t) is a polynomial of Δt\Delta t without a constant term. It goes to 00 when Δt0\Delta t \rightarrow 0. Therefore,

\begin{align*} p(\bold{x}_t | \bold{x}_{t + \Delta t}) & \propto \exp \left( - \frac{||\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)] \Delta t||^2}{2g(t)^2 \Delta t} \right) \\ &\approx \exp \left( - \frac{||\bold{x}_{t} - \bold{x}_{t + \Delta t} - [\bold{f}(\bold{x}_{t+\Delta t}, t + \Delta t) - g(t+\Delta t)^2 \nabla_{\bold{x}_{t+\Delta t}} \log p(\bold{x}_{t+\Delta t})] (-\Delta t)||^2}{2g(t + \Delta t)^2 \Delta t} \right)\end{align*}

Previously, we derived p(xt+Δtxt)exp(xt+Δtxtf(xt,t)Δt22g(t)2Δt)p(\bold{x}_{t + \Delta t} | \bold{x}_t) \propto \exp \left( - \frac{||\bold{x}_{t + \Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t) \Delta t||^2}{2 g(t)^2 \Delta t} \right) from dx=f(x,t)dt+g(t)dwd\bold{x} = \bold{f}(\bold{x}, t) dt + g(t) d\bold{w}. Therefore we can conclude that the corresponding reverse-time SDE is:

d \bold{x} = [\bold{f}(\bold{x}_{t}, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_{t})]dt + g(t) d \bar{\bold{w}}

3. Core settings of two diffusion models: SMLD and DDPM

3.1 Denoising score matching with Langevin Dynamics (SMLD)

Let the perturbation kernel to be: pσ(x~x)N(x~;x,σ2I)p_{\sigma} (\tilde{\bold{x}} | \bold{x}) \coloneqq \mathcal{N}(\tilde{\bold{x}}; \bold{x} , \sigma^2 \bold{I}), and pσ(x~)pdata(x)pσ(x~x)dxp_{\sigma}(\tilde{\bold{x}}) \coloneqq \int p_{data}(\bold{x}) p_{\sigma}(\tilde{\bold{x}} | \bold{x}) d\bold{x}, where pdata(x)p_{data}(\bold{x}) denotes the data distribution. Consider a sequence of positive noise scales σmin=σ1<σ2<<σN=σmax\sigma_{\min} = \sigma_1 < \sigma_2 < \dots < \sigma_N = \sigma_{\max}. Usually σmin\sigma_{\min} is small enough such that pσmin(x)pdata(x)p_{\sigma_{\min}}(\bold{x}) \approx p_{data}(\bold{x}) and σmax\sigma_{\max} is large enough such that pσmax(x)N(x;0,σmax2I)p_{\sigma_{\max}}(\bold{x}) \approx \mathcal{N}(\bold{x}; \bold{0}, \sigma_{\max}^2 \bold{I})

Overall Step (derived from single step): p(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\bold{x}_t; \bold{x}_0, \sigma_t^2 \bold{I})

Single Step (by design): p(\bold{x}_t | \bold{x}_{t-1}) = \mathcal{N}(\bold{x}_t; \bold{x}_{t-1}, (\sigma_t^2 - \sigma_{t-1}^2) \bold{I})

Previous work proposes to train a Noise Conditional Score Network \bold{s}_{\theta}(\bold{x}, \sigma) with a weighted sum of denoising score matching objectives:

θ=arg minθi=1Nσi2Epdata(x)Epσi(x~x)[sθ(x~,σi)x~logpσi(x~x)22]\theta^* = \argmin_{\theta} \sum_{i=1}^N \sigma_i^2 \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})}\left[ ||\bold{s}_{\theta}(\tilde{\bold{x}}, \sigma_i) - \nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})||^2_2 \right]

Here, \nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x}) = - \frac{\tilde{\bold{x}} - \bold{x}}{\sigma_i^2}. The optimal score-based model \bold{s}_{\theta^*}(\bold{x}, \sigma) can be obtained given sufficient data and model capacity. It matches \nabla_{\bold{x}} \log p_{\sigma}(\bold{x}) almost everywhere for \sigma \in \{ \sigma_i \}_{i=1}^N.

For sampling, the work uses MM steps of Langevin MCMC to get a sample for each pσi(x)p_{\sigma_i}(\bold{x}) sequentially:

xim=xim1+ϵisθ(xim1,σi)+2ϵizimm=1,2,,M,\bold{x}_i^m = \bold{x}_i^{m-1} + \epsilon_i \bold{s}_{\theta^*}(\bold{x}_i^{m-1}, \sigma_i) + \sqrt{2 \epsilon_i} \bold{z}_i^m \qquad m=1,2,\dots,M,

where ϵi>0\epsilon_i > 0 is the step size, and zim\bold{z}_i^m is standard normal. The above process is repeated for i=N,N1,,1i = N, N-1, \dots, 1 in turn with xN0N(x0,σmax2I)\bold{x}_N^0 \sim \mathcal{N}(\bold{x} | \bold{0}, \sigma_{\max}^2 \bold{I}) and xi0=xi+1M\bold{x}_i^0 = \bold{x}_{i+1}^M when i<Ni < N.

Note: i indexes the noise levels, and M is the number of Langevin steps at each noise level. Therefore, to generate one sample we need to run the score function N \times M times:

\bold{x}_N^0 \rightarrow \bold{x}_N^1 \rightarrow \dots \rightarrow \bold{x}_N^M \rightarrow \bold{x}_{N-1}^1 \rightarrow \dots \rightarrow \bold{x}_{N-1}^M \rightarrow \dots \rightarrow \bold{x}_{1}^M
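A rough sketch of this sampling loop is below, assuming a trained score network is available as a callable `score_fn`; the step-size rule \epsilon_i = \epsilon_0 \sigma_i^2 / \sigma_{\min}^2 is the common heuristic from Song & Ermon, stated here as an assumption.

```python
import numpy as np

def annealed_langevin_sampling(score_fn, sigmas, shape, M=10, eps0=2e-5, seed=0):
    """SMLD sampling: M Langevin steps per noise level, from sigma_N down to sigma_1.

    score_fn : callable (x, sigma) -> s_theta*(x, sigma), a stand-in for the trained network
    sigmas   : noise scales ordered sigma_min = sigma_1 < ... < sigma_N = sigma_max
    """
    rng = np.random.default_rng(seed)
    x = sigmas[-1] * rng.standard_normal(shape)       # x_N^0 ~ N(0, sigma_max^2 I)
    for sigma in sigmas[::-1]:                        # i = N, N-1, ..., 1
        eps = eps0 * (sigma / sigmas[0]) ** 2         # heuristic step size eps_i = eps0 * sigma_i^2 / sigma_min^2
        for _ in range(M):
            z = rng.standard_normal(shape)
            x = x + eps * score_fn(x, sigma) + np.sqrt(2 * eps) * z
    return x
```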

3.2 Denoising Diffusion Probabilistic Models (DDPM)

Consider a sequence of positive noise scales 0 < \beta_1, \beta_2, \dots, \beta_N < 1. For each training data point \bold{x}_0 \sim p_{data}(\bold{x}), construct a discrete Markov chain \{\bold{x}_0, \bold{x}_1, \dots, \bold{x}_N\}. The single step and overall step of this chain are:

Single step (by design): p(xixi1)=N(xi;1βixi1,βiI)p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N} (\bold{x}_i; \sqrt{1 - \beta_i} \bold{x}_{i-1}, \beta_i \bold{I})

Overall step (derived): pαi(xix0)=N(xi;αix0,(1αi)I)p_{\alpha_i}(\bold{x}_i | \bold{x}_0) = \mathcal{N} (\bold{x}_i ; \sqrt{\alpha_i} \bold{x}_0, (1 - \alpha_i) \bold{I})

Here αiΠj=1i(1βj)\alpha_i \coloneqq \Pi_{j=1}^i (1 - \beta_j)

Proof of overall step:

Since xi=1βixi1+βizi\bold{x}_i = \sqrt{1 - \beta_i}\bold{x}_{i-1} + \sqrt{\beta_i} \bold{z}_i and xi+1=1βi+1xi+βi+1zi+1\bold{x}_{i+1} = \sqrt{1 - \beta_{i+1}} \bold{x}_i + \sqrt{\beta_{i+1}} \bold{z}_{i+1}, we have:

xi+1=1βi+1(1βixi1+βizi)+βi+1zi+1=(1βi+1)(1βi)xi1+(1βi+1)βizi+βi+1zi+1=(1βi+1)(1βi)xi1+βi+βi+1βiβi+1z(zN(0,I))=(1βi+1)(1βi)xi1+1(1βi+1)(1βi)z\begin{align*} \bold{x}_{i+1} &= \sqrt{1 - \beta_{i+1}} (\sqrt{1 - \beta_i}\bold{x}_{i-1} + \sqrt{\beta_i} \bold{z}_i) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}x_{i-1} + \sqrt{(1 - \beta_{i+1})\beta_i} \bold{z}_i + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}x_{i-1} + \sqrt {\beta_i + \beta_{i+1} - \beta_i \beta_{i+1}} \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}x_{i-1} + \sqrt{1 - (1 - \beta_{i+1})(1 - \beta_i)} \bold{z}\end{align*}

By induction, we can conclude that:

xi=Πj=1i(1βj)x0+1Πj=1i(1βj)z=αix0+1αiz(zN(0,I))\begin{align*} \bold{x}_i &= \sqrt{\Pi_{j=1}^i (1 - \beta_j)} \bold{x}_0 + \sqrt{1 - \Pi_{j=1}^i (1 - \beta_j)} \bold{z} \\ &= \sqrt{\alpha_i} \bold{x}_0 + \sqrt{1 - \alpha_i} \bold{z} \qquad (\bold{z} \sim \mathcal{N} (\bold{0}, \bold{I}))\end{align*}

Therefore, pαi(xix0)=N(xi;αix0,(1αi)I)p_{\alpha_i}(\bold{x}_i | \bold{x}_0) = \mathcal{N} (\bold{x}_i ; \sqrt{\alpha_i} \bold{x}_0, (1 - \alpha_i) \bold{I}).

End of proof.
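A quick numeric sanity check of this closed form: iterating the single-step chain many times should reproduce the mean and variance of the closed-form kernel. The linear β schedule below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, S = 1000, 2, 20000
betas = np.linspace(1e-4, 0.02, N)       # an illustrative beta schedule (assumption)
alpha_N = np.prod(1.0 - betas)           # alpha_N = prod_j (1 - beta_j)
x0 = rng.standard_normal(d)

# Run the single-step chain x_i = sqrt(1 - beta_i) x_{i-1} + sqrt(beta_i) z_i, S times in parallel.
x = np.tile(x0, (S, 1))
for b in betas:
    x = np.sqrt(1.0 - b) * x + np.sqrt(b) * rng.standard_normal((S, d))

# The empirical statistics should match the closed form N(sqrt(alpha_N) x0, (1 - alpha_N) I).
print(x.mean(axis=0), np.sqrt(alpha_N) * x0)   # means agree
print(x.var(axis=0), 1.0 - alpha_N)            # variances agree
```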

The noise scales are prescribed such that xN\bold{x}_N is approximately distributed according to N(0,I)\mathcal{N} ( \bold{0}, \bold{I})

The denoising distribution derived from Bayesian formula is:

pθ(xi1xi)=N(xi1;11βi(xi+βisθ(xi,i)),βiI)p_{\theta}(\bold{x}_{i-1} | \bold{x}_i) = \mathcal{N} (\bold{x}_{i-1}; \frac{1}{\sqrt{1 - \beta_i}}(\bold{x}_i + \beta_i \bold{s}_{\theta}(\bold{x}_i, i)), \beta_i \bold{I})

and it can be trained with a re-weighted variant of the evidence lower bound (ELBO):

θ=arg minθi=1N(1αi)Epdata(x)Epαi(x~x)[sθ(x~,i)x~logpαi(x~x)22]\theta^* = \argmin_{\theta} \sum_{i=1}^N (1 - \alpha_i) \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\alpha_i}}(\tilde{\bold{x}} | \bold{x}) [||\bold{s}_{\theta}(\tilde{\bold{x}} , i) - \nabla_{\tilde{\bold{x}}}\log p_{\alpha_i}(\tilde{\bold{x}} | \bold{x})||^2_2]

After solving the above equation we get the optimal model sθ(xi,i)\bold{s}_{\theta^*} (\bold{x}_i, i), then samples can be generated by starting from xNN(0,I)\bold{x}_N \sim \mathcal{N}(\bold{0}, \bold{I}) and following the estimated reverse Markov chain as below:

xi1=11βi(xi+βisθ(xi,i))+βizi,i=N,N1,,1.\bold{x}_{i-1} = \frac{1}{\sqrt{1 - \beta_i}} (\bold{x}_i + \beta_i \bold{s}_{\theta^*}(\bold{x}_i, i)) + \sqrt{\beta_i} \bold{z}_i, \qquad i=N, N-1, \dots, 1.

NOTE: we can notice that the weights of the i-th summand in both SMLD and DDPM loss functions, namely σi2\sigma_i^2 and (1αi)(1 - \alpha_i), are related to the corresponding perturbation kernels in the same functional form: σi21/E[xlogpσi(x~x)22]\sigma_i^2 \propto 1 / \mathbb{E} [||\nabla_{\bold{x}} \log p_{\sigma_i}(\tilde{\bold{x}}| \bold{x})||^2_2] and (1αi)1/E[xlogpαi(x~x)22](1 - \alpha_i) \propto 1 / \mathbb{E} [||\nabla_{\bold{x}} \log p_{\alpha_i}(\tilde{\bold{x}}| \bold{x})||^2_2]
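For reference, here is a minimal sketch of the estimated reverse Markov chain above, with `score_fn` standing in for the trained network \bold{s}_{\theta^*}(\bold{x}_i, i); skipping the noise on the final step is a common convention and an assumption here, not something stated in this section.

```python
import numpy as np

def ddpm_ancestral_sampling(score_fn, betas, shape, seed=0):
    """Estimated reverse Markov chain of DDPM.

    score_fn : callable (x, i) -> s_theta*(x, i), a stand-in for the trained network
    betas    : array [beta_1, ..., beta_N]
    """
    rng = np.random.default_rng(seed)
    N = len(betas)
    x = rng.standard_normal(shape)                        # x_N ~ N(0, I)
    for i in range(N, 0, -1):                             # i = N, N-1, ..., 1
        b = betas[i - 1]
        z = rng.standard_normal(shape) if i > 1 else 0.0  # convention: no noise on the last step
        x = (x + b * score_fn(x, i)) / np.sqrt(1.0 - b) + np.sqrt(b) * z
    return x
```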

4. VE, VP SDEs: Derive SDEs from Conditional Distributions

The noise perturbations used in SMLD and DDPM can be regarded as discretizations of two different SDEs.

VE Model

When using a total of N noise scales, each perturbation kernel pσi(xx0)p_{\sigma_i}(\bold{x} | \bold{x}_0) of SMLD corresponds to the distribution of xi\bold{x}_i in the following Markov chain:

xi=xi1+σi2σi12zi1i=1,,N.\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \bold{z}_{i-1} \qquad i=1,\dots,N.

where zi1N(0,I)\bold{z}_{i-1} \sim \mathcal{N}(\bold{0}, \bold{I}), and we have introduced σ0=0\sigma_0 = 0 to simplify the notation.

(Recall that the single-step perturbation of SMLD is p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \bold{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2) \bold{I}).)

When NN \rightarrow \infty, {σi}i=1N\{ \sigma_i \}_{i=1}^N becomes a function σ(t)\sigma(t), zi\bold{z}_i becomes z(t)\bold{z}(t), and the Markov chain {xi}i=1N\{ \bold{x}_i \}_{i=1}^N becomes a continuous stochastic process {x(t)}t=01\{ \bold{x}(t) \}_{t=0}^1, where we have used a continuous time variable t[0,1]t \in [0, 1]. The process {x(t)}t=01\{ \bold{x} (t) \}_{t=0}^1 is given by the following SDE:

dx=d[σ2(t)]dtdwd \bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\bold{w}

VP Model

Similarly for DDPM, the perturbation kernels are p(xixi1)=N(1βixi1,βiI)p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N} (\sqrt{1 - \beta_i} x_{i-1}, \beta_i \bold{I}). Then the discrete Markov chain is:

xi=1βixi1+βizi1i=1,,N.\bold{x}_i = \sqrt{1 - \beta_i} \bold{x}_{i-1} + \sqrt{\beta_i} \bold{z}_{i-1} \qquad i=1,\dots,N.

As NN \rightarrow \infty, the Markov chain converges to the following SDE:

dx=12β(t)xdt+β(t)dwd \bold{x} = - \frac{1}{2} \beta(t) \bold{x} dt + \sqrt{\beta(t)} d \bold{w}

Proof for SMLD(VE):

xi=xi1+σi2σi12zi1i=1,,N.\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \bold{z}_{i-1} \qquad i=1,\dots,N.

Define some notation first: \bold{x}(\frac{i}{N}) = \bold{x}_i, \sigma(\frac{i}{N}) = \sigma_i, and \bold{z} (\frac{i}{N}) = \bold{z}_i for i=1,\dots,N. We can rewrite the Markov chain as follows with \Delta t = \frac{1}{N} and t \in \{ 0, \frac{1}{N}, \dots, \frac{N-1}{N} \}:

x(t+Δt)=x(t)+σ2(t+Δt)σ2(t)z(t)x(t)+d[σ2(t)]dtΔtz(t)\bold{x}(t + \Delta t) = \bold{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2 (t)} \bold{z}(t) \approx \bold{x}(t) + \sqrt{\frac{d[\sigma^2(t)]}{dt}\Delta t} \bold{z}(t)

The approximation is given by the definition of derivative: d[σ2(t)]dt=limΔt0σ2(t+Δt)σ2(t)Δt\frac{d[\sigma^2(t)]}{dt} = \lim_{\Delta t \rightarrow 0} \frac{\sigma^2(t + \Delta t) - \sigma^2 (t)}{\Delta t}

As Δt0\Delta t \rightarrow 0, Δtz(t)=dw\sqrt{\Delta t} z(t) = d \bold{w} where w\bold{w} is a Wiener process. This is because w(t+Δt)w(t)N(0,Δt)\bold{w}(t + \Delta t) - \bold{w}(t) \sim \mathcal{N}(0, \Delta t).

End of Proof.

Proof for DDPM(VP):

xi=1βixi1+βizi1i=1,,N.\bold{x}_i = \sqrt{1 - \beta_i} \bold{x}_{i-1} + \sqrt{\beta_i} \bold{z}_{i-1} \qquad i=1,\dots,N.

Define an auxiliary set of noise scales {βiˉ=Nβi}i=1N\{ \bar{\beta_i} = N \beta_i \}_{i=1}^N and rewrite the Markov chain as below:

xi=1βiˉNxi1+βiˉNzi1i=1,,N\bold{x}_i = \sqrt{1 - \frac{\bar{\beta_i}}{N}} \bold{x}_{i-1} + \sqrt{\frac{\bar{\beta_i}}{N}} \bold{z}_{i-1} \qquad i=1,\dots, N

In the limit of N \rightarrow \infty, \{ \bar{\beta_i}\}_{i=1}^N becomes a function \beta(t) indexed by t \in [0, 1]. Let \beta(\frac{i}{N}) = \bar{\beta_i}, \bold{x}(\frac{i}{N}) = \bold{x}_i, \bold{z}(\frac{i}{N}) = \bold{z}_i. We can rewrite the above Markov chain as follows with \Delta t = \frac{1}{N} and t \in \{ 0, \frac{1}{N}, \dots, \frac{N-1}{N} \}:

x(t+Δt)=1β(t+Δt)Δtx(t)+β(t+Δt)Δtz(t)x(t)12β(t+Δt)Δtx(t)+β(t+Δt)Δtz(t)x(t)12β(t)Δtx(t)+β(t)Δtz(t)\begin{align*} \bold{x}(t + \Delta t) &= \sqrt{1 - \beta(t + \Delta t) \Delta t} \bold{x}(t) + \sqrt{\beta(t + \Delta t) \Delta t} \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2} \beta(t + \Delta t) \Delta t \bold{x}(t) + \sqrt{\beta(t + \Delta t) \Delta t} \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2} \beta(t) \Delta t \bold{x}(t) + \sqrt{\beta(t) \Delta t} \bold{z}(t)\end{align*}

The first approximation comes from Taylor Expansion: 1x=1x2x281x2\sqrt{1 - x} = 1 - \frac{x}{2} - \frac{x^2}{8} - \dots \approx 1 - \frac{x}{2}

Therefore, in the limit of Δt0\Delta t \rightarrow 0, the Markov chain converges to the following VP SDE:

dx=12β(t)xdt+β(t)dwd \bold{x} = - \frac{1}{2} \beta(t) \bold{x} dt + \sqrt{\beta(t)} d \bold{w}

5. How to solve the SDE?

Idea: Solve E(xt)\mathbb{E}(\bold{x}_t) and Var(xt)\text{Var}(\bold{x}_t), then under the Gaussian assumption, we know that p0t(xtx0)N(,)p_{0t}(\bold{x}_t | \bold{x}_0) \sim \mathcal{N}(\cdot, \cdot)

Theorem 5.1 (simplified from Equation (5.50), Equation (5.51) in Applied Stochastic Differential Equations)

If the SDE takes the form:

dx=f(x,t)dt+g(t)dwd \bold{x} = \bold{f}(\bold{x}, t) dt + g(t) d \bold{w}

Then the expectation m(t)\bold{m}(t) and covariance matrix P(t)\bold{P}(t) of x(t)\bold{x}(t) have:

dm(t)dt=E[f(x,t)]\frac{d \bold{m}(t)}{dt} = \mathbb{E} [\bold{f}(\bold{x}, t)]
\frac{d \bold{P}(t)}{dt} = \mathbb{E}[\bold{f}(\bold{x}, t) (\bold{x} - \bold{m}(t))^T] + \mathbb{E}[ (\bold{x} - \bold{m}(t))\bold{f}^T(\bold{x}, t)] + g^2(t) \bold{I}

Solution to VP SDE:

VP Model:dx=12β(t)xdt+β(t)dw\text{VP Model}: \quad d \bold{x} = -\frac{1}{2} \beta(t) \bold{x} dt + \sqrt{\beta(t)} d\bold{w}

By Theorem 5.1, we have

dmdt=E[12β(t)x]=12β(t)E[x(t)]=12β(t)mdmm=12β(t)dtlnm(t)lnm(0)=120tβ(s)dsm(t)=elnm(0)120tβ(s)ds=m(0)e120tβ(s)ds\begin{align*} \frac{d \bold{m}}{dt} &= \mathbb{E}[-\frac{1}{2} \beta(t) \bold{x}] = -\frac{1}{2} \beta(t) \mathbb{E}[\bold{x}(t)] \\ &= -\frac{1}{2} \beta(t) \bold{m} \\ \Rightarrow \frac{d\bold{m}}{\bold{m}} &= -\frac{1}{2} \beta(t) dt \\ \ln \bold{m}(t) - \ln \bold{m}(0) & = -\frac{1}{2} \int_0^t \beta(s) ds \\ \Rightarrow \bold{m}(t) &= e^{\ln \bold{m}(0) - \frac{1}{2} \int_0^t \beta(s) ds} = \bold{m}(0) e^{- \frac{1}{2} \int_0^t \beta(s) ds}\end{align*}

Therefore,

E[x(t)x(0)]=x(0)e120tβ(s)ds\mathbb{E}[\bold{x} (t) | \bold{x}(0)] = \bold{x}(0) e^{- \frac{1}{2} \int_0^t \beta(s) ds}

For covariance matrix P(t)P(t):

dP(t)dt=E[12β(t)x(t)(x(t)m(t))T]+E[12β(t)(x(t)m(t))x(t)T]+E[β(t)]=β(t)E[x(t)(x(t)m(t))T]+β(t)\begin{align*} \frac{dP(t)}{dt} &= \mathbb{E} [-\frac{1}{2} \beta(t)\bold{x}(t) (\bold{x}(t) - \bold{m}(t))^T] + \mathbb{E} [-\frac{1}{2} \beta(t) (\bold{x}(t) - \bold{m}(t)) \bold{x}(t)^T] + \mathbb{E}[\beta(t)] \\ &= -\beta(t)\mathbb{E}[\bold{x}(t) (\bold{x}(t) - \bold{m}(t))^T] + \beta(t) \end{align*}

Since E[x(t)m(t)]=0\mathbb{E}[\bold{x}(t) - \bold{m}(t)] = 0, we have:

E[x(t)(x(t)m(t))T]=E[x(t)(x(t)m(t))T]0=E[x(t)(x(t)m(t))T]m(t)E[(x(t)m(t))T]=E[(x(t)m(t))(x(t)m(t))T]=P(t)\begin{align*} \mathbb{E} [\bold{x}(t) (\bold{x}(t) - \bold{m}(t))^T] &= \mathbb{E} [\bold{x}(t) (\bold{x}(t) - \bold{m}(t))^T] - 0 \\ &= \mathbb{E} [\bold{x}(t) (\bold{x}(t) - \bold{m}(t))^T] - \bold{m}(t) \mathbb{E} [(\bold{x}(t) - \bold{m}(t))^T] \\ &= \mathbb{E} [(\bold{x}(t) - \bold{m}(t))(\bold{x}(t) - \bold{m}(t))^T] \\ &= P(t)\end{align*}
dP(t)dt=β(t)P(t)+β(t)=β(t)(IP(t))dP(t)IP(t)=β(t)dtln(IP(t))+ln(IP(0))=0tβ(s)dsIP(t)=exp{ln(IP(0))0tβ(s)ds}P(t)=I(IP(0))e0tβ(s)ds\begin{align*} \Rightarrow \frac{d P(t)}{dt} &= -\beta(t) P(t) + \beta(t) \\ &= \beta(t) (\bold{I} - P(t)) \\ \frac{d P(t)}{\bold{I} - P(t)} &= \beta(t) dt \\ -\ln (\bold{I} - P(t)) + \ln (\bold{I} - P(0)) &= \int_0^t \beta(s) ds \\ \bold{I} - P(t) &= \exp\{\ln(\bold{I} - P(0)) -\int_0^t \beta(s)ds \} \\ P(t) &= \bold{I} - (\bold{I} - P(0)) e^{-\int_0^t \beta(s)ds} \end{align*}

Since Cov(x0x0)=0Cov(\bold{x}_0 | \bold{x}_0) = 0, we have:

Cov(xtx0)=IIe0tβ(s)dsCov(\bold{x}_t | \bold{x}_0) = \bold{I} - \bold{I} e^{-\int_0^t \beta(s)ds}

Therefore, the solution to VP model is:

p0t(x(t)x(0))=N(x(t);x(0)e120tβ(s)ds,IIe0tβ(s)ds)(VP SDE)p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N} (\bold{x}(t); \bold{x}(0) e^{- \frac{1}{2} \int_0^t \beta(s) ds}, \bold{I} - \bold{I} e^{-\int_0^t \beta(s)ds}) \qquad (\text{VP SDE})

Solution to VE SDE:

VE Model:dx=d[σ2(t)]dtdw(f(x,t)=0)\text{VE Model}: \quad d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\bold{w} \qquad (\bold{f}(\bold{x}, t) = 0)

By Theorem 5.1:

dmdt=0m(t)=C=m(0)=x0\frac{d\bold{m}}{dt} = 0 \Rightarrow \bold{m}(t) = C = \bold{m}(0) = \bold{x}_0
dP(t)dt=E[d[σ2(t)]dt]=d[σ2(t)]dtdP(t)=d[σ2(t)]P(t)P(0)=σ2(t)σ2(0)P(t)=σ2(t)σ2(0)\begin{align*}\frac{d P(t)}{dt} &= \mathbb{E} [\frac{d[\sigma^2(t)]}{dt}] = \frac{d[\sigma^2(t)]}{dt} \\ dP(t) &= d[\sigma^2(t)] \\ P(t) - P(0) &= \sigma^2(t) - \sigma^2(0) \\ P(t) &= \sigma^2(t) - \sigma^2(0)\end{align*}

Therefore, the solution to VE SDE is:

p0t(x(t)x(0))=N(x(t);x(0),[σ2(t)σ2(0)]I)(VE SDE)p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N} (\bold{x}(t); \bold{x}(0) , [\sigma^2(t) - \sigma^2(0)] \bold{I}) \qquad (\text{VE SDE})

6. Derive mean and variance of the perturbation kernel from sub-VP SDE

dx=12β(t)xdt+β(t)(1e20tβ(s)ds)dwd \bold{x} = -\frac{1}{2} \beta(t) \bold{x} dt + \sqrt{\beta(t) (1 - e^{-2 \int_0^t \beta(s) ds})} d\bold{w}

Why sub-VP SDE:

  1. Perform well on likelihood
  1. Variance is bounded by VP SDE

Since VE, VP and sub-VP SDEs all have affine drift coefficients, their perturbation kernels p_{0t} (\bold{x}(t) | \bold{x}(0)) are all Gaussian and can be computed in closed form. This makes it efficient to train with the score-matching loss:

θ=arg minθEt{λ(t)Ex(0)Ex(t)x(0)[sθ(x(t),t)x(t)logp0t(x(t)x(0))22]}\theta^* = \argmin_{\theta} \mathbb{E}_t \{ \lambda(t) \mathbb{E}_{\bold{x}(0)} \mathbb{E}_{\bold{x}(t) | \bold{x}(0)} [||\bold{s}_{\theta}(\bold{x}(t), t) - \nabla_{\bold{x}(t)} \log p_{0t}(\bold{x}(t) | \bold{x}(0))||^2_2] \}
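A sketch of a Monte-Carlo estimate of this objective in PyTorch, with \lambda(t) chosen as the kernel variance so the weighted residual reduces to a unit-variance noise term; `score_model` and `mean_std_fn` (returning the kernel's mean coefficient and standard deviation, e.g. the VP/sub-VP expressions above) are placeholders:

```python
import torch

def dsm_loss(score_model, x0, t, mean_std_fn):
    """Monte-Carlo estimate of the weighted denoising score-matching objective.

    score_model : callable (x_t, t) -> s_theta(x_t, t)
    x0          : clean data batch, shape (B, d)
    t           : times in (0, 1], shape (B,)
    mean_std_fn : callable t -> (mean_coef(t), std(t)) of the perturbation kernel p_0t
    Uses lambda(t) = std(t)^2, so the weighted residual becomes || std * score + z ||^2.
    """
    mean_coef, std = mean_std_fn(t)                     # shapes (B,), (B,)
    z = torch.randn_like(x0)
    x_t = mean_coef[:, None] * x0 + std[:, None] * z    # sample from p_0t(x(t) | x(0))
    # Score of the Gaussian kernel: grad_{x_t} log p_0t(x_t | x_0) = -(x_t - mean) / std^2 = -z / std
    score = score_model(x_t, t)
    return ((std[:, None] * score + z) ** 2).sum(dim=-1).mean()
```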

Solution to sub-VP SDE:

Corollary 6.1:

Given an ODE of the form y(x)+p(x)y(x)=f(x)y'(x) + p(x)y(x) = f(x), the solution is given by:

y(x)=1μ(x)(f(ξ)μ(ξ)dξ+C)y(x) = \frac{1}{\mu(x)} \left( \int f(\xi) \mu(\xi) d\xi + C \right)

where μ(t)=exp(tp(ξ)dξ)\mu(t) = \exp \left( \int^t p(\xi)d\xi \right).

Similar to VP SDE, we have:

E[x(t)x(0)]=x(0)e120tβ(s)ds\mathbb{E} [\bold{x}(t) | \bold{x}(0)] = \bold{x}(0) e^{-\frac{1}{2} \int_0^t \beta(s) ds}

By Theorem 5.1:

dP(t)dt=β(t)P(t)+β(t)(1exp{20tβ(s)ds})P(t)+β(t)P(t)=β(t)(1exp{20tβ(s)ds})\begin{align*} \frac{d P(t)}{dt} &= -\beta(t) P(t) + \beta(t) (1 - \exp \{ -2 \int_0^t \beta(s) ds \}) \\ P'(t) + \beta(t) P(t) &= \beta(t) (1 - \exp \{ -2 \int_0^t \beta(s) ds \})\end{align*}

By Corollary 6.1:

P(t)=Iexp{0tβ(s)ds}(0tβ(s)[1exp{20sβ(ξ)dξ}]exp{0sβ(ξ)dξ}ds+C)=Iexp{0tβ(s)ds}(0tβ(s)exp{0sβ(ξ)dξ}ds0tβ(s)exp{0sβ(ξ)dξ}ds+C)\begin{align*} P(t) &= \bold{I} \cdot \exp\{ -\int_0^t \beta(s) ds \} \cdot \left( \int_0^t \beta(s) \left[ 1 - \exp\{ -2 \int_0^s \beta(\xi) d\xi \} \right] \exp \{ \int_0^s \beta(\xi) d\xi \} ds + C \right) \\ &= \bold{I} \cdot \exp\{ -\int_0^t \beta(s) ds \} \cdot \left( \int_0^t \beta(s) \exp \{ \int_0^s \beta(\xi) d\xi \} ds - \int_0^t \beta(s) \exp \{ -\int_0^s \beta(\xi) d\xi \} ds + C \right)\end{align*}

Denote 1=0tβ(s)exp{0sβ(ξ)dξ}ds\textcircled{1} = \int_0^t \beta(s) \exp \{ \int_0^s \beta(\xi) d\xi \} ds and 2=0tβ(s)exp{0sβ(ξ)dξ}ds\textcircled{2} = \int_0^t \beta(s) \exp \{ -\int_0^s \beta(\xi) d\xi \} ds. Solve them separately.

1=0tβ(s)exp{0sβ(ξ)dξ}ds=exp{0sβ(ξ)dξ}s=0s=t=exp{0tβ(s)ds}1\begin{align*} \textcircled{1} &= \int_0^t \beta(s) \exp \{ \int_0^s \beta(\xi) d\xi \} ds \\ &= \exp \{ \int_0^s \beta(\xi) d\xi \} |_{s=0}^{s=t} \\ &= \exp \{ \int_0^t \beta(s) ds \} - 1 \end{align*}
2=0tβ(s)exp{0sβ(ξ)dξ}ds=exp{0sβ(ξ)dξ}s=0s=t=exp{0tβ(s)ds}+1\begin{align*} \textcircled{2} &= \int_0^t \beta(s) \exp \{ -\int_0^s \beta(\xi) d\xi \} ds \\ &= - \exp \{ -\int_0^s \beta(\xi) d\xi \} |_{s=0}^{s=t} \\ &= - \exp \{ -\int_0^t \beta(s) ds \} + 1\end{align*}

Therefore:

P(t)=Iexp{0tβ(s)ds}[exp{0tβ(s)ds}+exp{0tβ(s)ds}+C]=I[1+exp{20tβ(s)ds}+exp{0tβ(s)ds}C]\begin{align*} P(t) &= \bold{I} \cdot \exp\{ -\int_0^t \beta(s) ds \} \cdot \left[ \exp \{ \int_0^t \beta(s) ds \} + \exp \{ - \int_0^t \beta(s) ds \} + C \right] \\ &= \bold{I} \cdot \left[ 1 + \exp \{ -2\int_0^t \beta(s) ds \} + \exp \{ -\int_0^t \beta(s) ds \} \cdot C\right]\end{align*}

Plug in t=0t=0, we have P(0)=I(2+C)CI=P(0)2IP(0) = \bold{I} (2 + C) \Rightarrow C \bold{I} = P(0) - 2\bold{I}. Then:

P(t)=I+e20tβ(s)dsI+e0tβ(s)ds(P(0)2I)P(t) = \bold{I} + e^{-2 \int_0^t \beta(s) ds} \bold{I} + e^{- \int_0^t \beta(s) ds} (P(0) - 2 \bold{I})

Note: If limt0tβ(s)ds=\lim_{t \rightarrow \infty} \int_0^t \beta(s) ds = \infty, we can observe that limtP(t)=I\lim_{t \rightarrow \infty} P(t) = \bold{I}. This justifies the use of sub-VP SDEs for score-based generative modeling, since they can perturb any data distribution to standard Gaussian under suitable conditions.

Since P(0)=0P(0) = 0, we have:

P(t)=I+e20tβ(s)dsI2e0tβ(s)dsI=[1e0tβ(s)ds]2IP(t) = \bold{I} + e^{-2 \int_0^t \beta(s) ds} \bold{I} - 2 e^{- \int_0^t \beta(s) ds} \bold{I} = [1 - e^{- \int_0^t \beta(s) ds}]^2 \bold{I}

Therefore, the solution to sub-VP SDEs is:

p0t(x(t)x(0))=N(x(t);x(0)e120tβ(s)ds,[1e0tβ(s)ds]2I)(sub-VP SDE)p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N} (\bold{x}(t); \bold{x}(0) e^{-\frac{1}{2} \int_0^t \beta(s) ds} , [1 - e^{- \int_0^t \beta(s) ds}]^2 \bold{I}) \qquad \text{(sub-VP SDE)}
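A small numeric check of the claim that the sub-VP variance is bounded by the VP variance, using a linear \beta(t) as in Section 7 (the specific \beta values are assumptions):

```python
import numpy as np

# Compare VP and sub-VP perturbation-kernel variances for a linear beta(t) (assumed values).
beta_min, beta_max = 0.1, 20.0
t = np.linspace(1e-3, 1.0, 5)
B = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2   # \int_0^t beta(s) ds

var_vp = 1.0 - np.exp(-B)              # VP variance
var_subvp = (1.0 - np.exp(-B)) ** 2    # sub-VP variance

print(np.all(var_subvp <= var_vp))     # True: the sub-VP variance is bounded by the VP variance
```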

7. How to choose the noise scale

7.1 SMLD (VE SDEs)

In SMLD, the noise scales \{ \sigma_i \}_{i=1}^N typically form a geometric sequence, where \sigma_{\text{min}} = 0.01 and \sigma_{\text{max}} is chosen according to Technique 1 in Song & Ermon (2020).

Technique 1: Choose σmax\sigma_{\text{max}} to be as large as the maximum Euclidean distance between all pairs of training data points. (Usually, SMLD models normalize image inputs to the range [0,1])

Since {σi}i=1N\{ \sigma_i \}_{i=1}^N is a geometric sequence, we have σ(iN)=σi=σmin(σmaxσmin)i1N1\sigma(\frac{i}{N}) = \sigma_i = \sigma_{\text{min}} (\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^{\frac{i-1}{N-1}} for i=1,,Ni = 1, \dots, N. When NN \rightarrow \infty, σ(t)=σmin(σmaxσmin)t\sigma(t) = \sigma_{\text{min}} (\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^t for t(0,1]t \in (0, 1].

Then the corresponding VE SDE is:

dx=d[σ2(t)]dtdw=σmin(σmaxσmin)t2lnσmaxσmindw\begin{align*} d \bold{x} &= \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\bold{w} \\ &= \sigma_{\text{min}} (\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^t \sqrt{2 \ln \frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}} d\bold{w}\end{align*}

The perturbation kernel can be derived according to Section 5 VE SDEs:

p0t(x(t)x(0))=N(x(t);x(0),σmin2(σmaxσmin)2tI),t(0,1]p_{0t} (\bold{x}(t) | \bold{x}(0)) = \mathcal{N} \left( \bold{x}(t); \bold{x}(0), \sigma^2_{\text{min}} (\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^{2t} \bold{I} \right), \quad t \in (0, 1]

7.2 DDPM (VP SDEs)

In DDPM, \{ \beta_i \}_{i=1}^N is typically an arithmetic sequence with \beta_i = \frac{\bar{\beta}_{\text{min}}}{N} + \frac{i-1}{N(N-1)}(\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}}) for i=1,\dots,N. As N \rightarrow \infty, \beta(t) = \bar{\beta}_{\text{min}} + t (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}}) for t \in [0, 1]. This corresponds to the following instantiation of the VP SDE:

dx=12(βˉmin+t(βˉmaxβˉmin))xdt+βˉmin+t(βˉmaxβˉmin)dw,t[0,1].d \bold{x} = -\frac{1}{2} (\bar{\beta}_{\text{min}} + t (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}})) \bold{x} dt + \sqrt{\bar{\beta}_{\text{min}} + t (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}})} d \bold{w}, \quad t\in[0,1].

where x(0)pdata(x)\bold{x}(0) \sim p_{\text{data}}(\bold{x}). Note: βˉmin=0.1\bar{\beta}_{\text{min}} = 0.1 and βˉmax=20\bar{\beta}_{\text{max}} = 20 is set to match previous work.

Then the perturbation kernel becomes:

p0t(x(t)x(0))=N(x(t);e14t2(βˉmaxβˉmin)12tβˉminx(0),IIe12t2(βˉmaxβˉmin)tβˉmin),t[0,1]p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N} \left( \bold{x}(t); e^{-\frac{1}{4}t^2 (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}})- \frac{1}{2} t \bar{\beta}_{\text{min}}}\bold{x}(0), \bold{I} - \bold{I} e^{-\frac{1}{2} t^2 (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}}) -t \bar{\beta}_{\text{min}}} \right), t\in[0,1]

7.3 sub-VP SDEs

For sub-VP SDEs, the same β(t)\beta(t) is used as VP SDEs. This leads to the following perturbation kernel:

p0t(x(t)x(0))=N(x(t);e14t2(βˉmaxβˉmin)12tβˉminx(0),[1e12t2(βˉmaxβˉmin)tβˉmin]2I),t[0,1]p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N} \left( \bold{x}(t); e^{-\frac{1}{4}t^2 (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}})- \frac{1}{2} t \bar{\beta}_{\text{min}}}\bold{x}(0), [1- e^{-\frac{1}{2} t^2 (\bar{\beta}_{\text{max}} - \bar{\beta}_{\text{min}}) -t \bar{\beta}_{\text{min}}}]^2 \bold{I} \right), t\in[0,1]
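Putting the three kernels together, here is a small helper that returns the mean coefficient and standard deviation of p_{0t}(\bold{x}(t) | \bold{x}(0)) for each SDE; \sigma_{\max} = 50 is a common image-data default and should be treated as a configurable assumption:

```python
import numpy as np

sigma_min, sigma_max = 0.01, 50.0      # sigma_max = 50 is a common image-data choice (assumption)
beta_min, beta_max = 0.1, 20.0

def int_beta(t):
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2   # \int_0^t beta(s) ds

def ve_kernel(t):
    return 1.0, sigma_min * (sigma_max / sigma_min) ** t          # mean coefficient, std

def vp_kernel(t):
    return np.exp(-0.5 * int_beta(t)), np.sqrt(1.0 - np.exp(-int_beta(t)))

def subvp_kernel(t):
    return np.exp(-0.5 * int_beta(t)), 1.0 - np.exp(-int_beta(t))

for t in (0.1, 0.5, 1.0):
    print(t, ve_kernel(t), vp_kernel(t), subvp_kernel(t))
```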

8. Two ways for sampling from SDEs: ODE, Reverse-SDE

Solution1: Convert an SDE to an ODE

Solution2: Convert an SDE to a reverse-SDE

8.1 Reverse Diffusion Sampling

Given a forward SDE:

dx=f(x,t)dt+G(t)dwd \bold{x} = \bold{f}(\bold{x}, t) dt + \bold{G}(t) d\bold{w}

the following iteration rule is a discretization of it:

\bold{x}_{i+1} = \bold{x}_i + \bold{f}_i(\bold{x}_i) + \bold{G}_i \bold{z}_i, \qquad i=0,1,\dots,N-1

where ziN(0,I)\bold{z}_i \sim \mathcal{N}(\bold{0}, \bold{I}). The discretization schedule of time is absorbed into the notations of fi\bold{f}_i and Gi\bold{G}_i.

Then we want to discretize the reverse-time SDE

dx=[f(x,t)G(t)GT(t)xlogpt(x)]dt+G(t)dwˉd\bold{x} = [\bold{f}(\bold{x}, t) - \bold{G}(t) \bold{G}^T(t) \nabla_{\bold{x}}\log p_t(\bold{x})]dt + \bold{G}(t) d \bold{\bar{w}}

With a similar functional form, the discretization is given by:

xi=xi+1fi+1(xi+1)+Gi+1Gi+1Tsθ(xi+1,i+1)+Gi+1zi+1\bold{x}_i = \bold{x}_{i+1} - \bold{f}_{i+1}(\bold{x}_{i+1}) + \bold{G}_{i+1}\bold{G}^T_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \bold{G}_{i+1} \bold{z}_{i+1}

where the trained score-based model sθ(xi,i)\bold{s}_{\theta^*}(\bold{x}_i, i) is conditioned on iteration number ii.

Note: the discretization is derived from dx=xixi+1d \bold{x} = \bold{x}_i - \bold{x}_{i+1}, dt=Δt=1dt = \Delta t = -1, and dwˉ=wtwt+Δt=ΔtzN(0,ΔtI)d \bold{\bar{w}} = \bold{w}_{t} - \bold{w}_{t + \Delta t} = \sqrt{|\Delta t|} \bold{z} \sim \mathcal{N}(0, |\Delta t| \bold{I}).
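A one-step sketch of this predictor update, assuming a scalar G_i (as in the VE/VP SDEs) so that \bold{G}\bold{G}^T reduces to G_i^2:

```python
import numpy as np

def reverse_diffusion_step(x_next, idx, f, G, score_fn, rng):
    """One predictor step x_{i+1} -> x_i of the reverse diffusion discretization.

    f        : drift f_{i+1}(.), G : scalar diffusion coefficient G_{i+1}
    score_fn : callable (x, i) -> s_theta*(x, i)
    With a scalar G (as for VE/VP SDEs), G G^T reduces to G**2.
    """
    z = rng.standard_normal(x_next.shape)
    return x_next - f(x_next) + (G ** 2) * score_fn(x_next, idx) + G * z
```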

8.2 Probability Flow ODE

Consider a general form SDE:

dx=f(x,t)dt+G(x,t)dwd \bold{x} = \bold{f}(\bold{x}, t) dt + \bold{G}(\bold{x}, t) d\bold{w}

where f(x,t):RdRd\bold{f}(\bold{x}, t): \mathbb{R}^d \rightarrow \mathbb{R}^d and G(x,t):RdRd×d\bold{G}(\bold{x}, t): \mathbb{R}^d \rightarrow \mathbb{R}^{d \times d}. The corresponding probability flow ODE is:

dx={f(x,t)12[G(x,t)G(x,t)T]12G(x,t)G(x,t)Txlogpt(x)}dtd\bold{x} = \{ \bold{f}(\bold{x}, t) - \frac{1}{2}\nabla \cdot [\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T] - \frac{1}{2}\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T \nabla_{\bold{x}} \log p_t(\bold{x})\} dt

where we define F(x)(f1(x),f2(x),,fd(x))T:RdRd\nabla \cdot \bold{F}(\bold{x}) \coloneqq (\nabla \cdot \bold{f}^1(\bold{x}), \nabla \cdot \bold{f}^2 (\bold{x}), \dots, \nabla \cdot \bold{f}^d (\bold{x}))^T: \mathbb{R}^d \rightarrow \mathbb{R}^d for a matrix-valued function F(x)(f1(x),f2(x),,fd(x))T:RdRd×d\bold{F} (\bold{x}) \coloneqq (\bold{f}^1(\bold{x}), \bold{f}^2(\bold{x}), \dots, \bold{f}^d(\bold{x}))^T: \mathbb{R}^d \rightarrow \mathbb{R}^{d \times d}.

Recap on Vector Differential Calculus

  1. Let \bold{u}: \mathbb{R}^d \rightarrow \mathbb{R} be a scalar-valued function that takes a d-dimensional vector as input.

    Gradient of a Scalar-valued Function: u=(1u,,du)T\nabla \bold{u} = (\partial_1 \bold{u}, \dots, \partial_d \bold{u})^T.

  1. Let \bold{u}: \mathbb{R}^d \rightarrow \mathbb{R}^d ( \bold{u}(\bold{x}) = (u_1(\bold{x}), \dots, u_d(\bold{x}))^T ) be a vector-valued function that takes a d-dimensional vector as input and outputs a d-dimensional vector.

    Divergence of a Vector-valued Function: u=i=1diui\nabla \cdot \bold{u} = \sum_{i=1}^d \partial_i u_i.

  1. Let \bold{U}(\bold{x}): \mathbb{R}^d \rightarrow \mathbb{R}^{d \times d} be a matrix-valued function that takes a d-dimensional vector as input and outputs a d \times d matrix:

    Divergence of Matrix-valued Function:

U(x)=[u11(x)u1d(x)ud1(x)udd(x)]=[u1(x)ud(x)]\bold{U}(\bold{x}) = \begin{bmatrix} u_{11}(\bold{x}) & \dots & u_{1d}(\bold{x}) \\ \vdots & \ddots & \vdots \\ u_{d1}(\bold{x}) & \dots & u_{dd}(\bold{x})\end{bmatrix} = \begin{bmatrix} \bold{u}_1(\bold{x}) \\ \vdots \\ \bold{u}_d(\bold{x})\end{bmatrix}
U(x)=[u1(x)ud(x)]=[j=1dju1j(x)j=1djudj(x)]\nabla \cdot \bold{U}(\bold{x}) = \begin{bmatrix} \nabla \cdot \bold{u}_1(\bold{x}) \\ \vdots \\ \nabla \cdot \bold{u}_d(\bold{x}) \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^d \partial_j u_{1j}(\bold{x}) \\ \vdots \\ \sum_{j=1}^d \partial_j u_{dj}(\bold{x}) \end{bmatrix}

Derivation of Probability Flow ODE:

Let pt(x(t))p_t(\bold{x}(t)) denote the marginal probability of x(t)\bold{x}(t). According to Kolmogorov’s forward equation (Fokker-Planck equation), we have:

pt(x)t=i=1dxi[fi(x,t)pt(x)]+12i=1dj=1d2xixj[k=1dGik(x,t)Gjk(x,t)pt(x)].\frac{\partial p_t(\bold{x})}{\partial t} = - \sum_{i=1}^d \frac{\partial}{\partial x_i}[f_i(\bold{x}, t) p_t(x)] + \frac{1}{2} \sum_{i=1}^d \sum_{j=1}^d \frac{\partial^2}{\partial x_i \partial x_j} \left[\sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk}(\bold{x}, t) p_t(\bold{x})\right].

This equation can be rewritten as:

pt(x)t=i=1dxi[fi(x,t)pt(x)]+12i=1dxi[j=1dxj[k=1dGik(x,t)Gjk(x,t)pt(x)]]\frac{\partial p_t(\bold{x})}{\partial t} = - \sum_{i=1}^d \frac{\partial}{\partial x_i}[f_i(\bold{x}, t) p_t(x)] + \frac{1}{2} \sum_{i=1}^d \frac{\partial}{\partial x_i} \left[ \sum_{j=1}^d \frac{\partial}{\partial x_j} \left[ \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk} (\bold{x}, t)p_t(\bold{x}) \right] \right]

Note that:

j=1dxj[k=1dGik(x,t)Gjk(x,t)pt(x)]=j=1dxj[pt(x)k=1dGik(x,t)Gjk(x,t)]=j=1dxj[k=1dGik(x,t)Gjk(x,t)]pt(x)+j=1dk=1dGik(x,t)Gjk(x,t)pt(x)xjlogpt(x)\begin{align*} &\sum_{j=1}^d \frac{\partial}{\partial x_j} \left[ \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk} (\bold{x}, t)p_t(\bold{x}) \right] = \sum_{j=1}^d \frac{\partial}{\partial x_j} \left[ p_t(\bold{x}) \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk} (\bold{x}, t) \right] \\ =& \sum_{j=1}^d \frac{\partial}{\partial x_j} \left[ \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk} (\bold{x}, t) \right] p_t(\bold{x}) + \sum_{j=1}^d \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk}(\bold{x}, t) p_t(\bold{x}) \frac{\partial}{\partial x_j} \log p_t(\bold{x}) \end{align*}

Explanation 1: f(x)dlogf(x)dx=f(x)1f(x)df(x)dx=df(x)dxf(x) \frac{d \log f(x)}{dx} = f(x) \cdot \frac{1}{f(x)} \frac{d f(x)}{dx} = \frac{df(x)}{dx}.

For simplicity, let’s denote F(x,t)=(Fij(x,t))d×d=G(x,t)G(x,t)T\bold{F} (\bold{x}, t) = (\bold{F}_{ij}(\bold{x}, t))_{d \times d} = \bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T , where Fij(x,t)=k=1dGik(x,t)Gjk(x,t)\bold{F}_{ij}(\bold{x}, t) = \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk}(\bold{x}, t). Then we can rewrite the above equation:

\begin{align*} &\sum_{j=1}^d \frac{\partial}{\partial x_j} \left[ \sum_{k=1}^d G_{ik}(\bold{x}, t) G_{jk} (\bold{x}, t)p_t(\bold{x}) \right] \\ =& \sum_{j=1}^d \frac{\partial}{\partial x_j}\bold{F}_{ij}(\bold{x}, t)\, p_t(\bold{x}) + \sum_{j=1}^d \bold{F}_{ij}(\bold{x}, t) p_t(\bold{x}) \frac{\partial}{\partial x_j} \log p_t(\bold{x}) \\ =& p_t(\bold{x})\sum_{j=1}^d \frac{\partial}{\partial x_j}\bold{F}_{ij}(\bold{x}, t) + p_t(\bold{x})\sum_{j=1}^d \bold{F}_{ij}(\bold{x}, t) \frac{\partial}{\partial x_j} \log p_t(\bold{x}) \\ =& p_t(\bold{x}) [\nabla \cdot \bold{F} (\bold{x}, t)]_i + p_t(\bold{x}) [\bold{F} (\bold{x}, t)^T \cdot \nabla_{\bold{x}} \log p_t(\bold{x})]_i \end{align*}

Therefore, go back to the Fokker-Planck Equation:

pt(x)t=i=1dxi[fi(x,t)pt(x)]+12i=1dxi[pt(x)[F(x,t)]i+pt(x)[F(x,t)Txlogpt(x)]i]=i=1dxi{pt(x)[fi(x,t)12[F(x,t)]i12[F(x,t)Txlogpt(x)]i]}=i=1dxi[f~i(x,t)pt(x)]\begin{align*} \frac{\partial p_t(\bold{x})}{\partial t} &= - \sum_{i=1}^d \frac{\partial}{\partial x_i}[f_i(\bold{x}, t) p_t(x)] + \frac{1}{2} \sum_{i=1}^d \frac{\partial}{\partial x_i} \left[p_t(\bold{x}) [\nabla \cdot \bold{F} (\bold{x}, t)]_i + p_t(\bold{x}) [\bold{F} (\bold{x}, t)^T \cdot \nabla_{\bold{x}} \log p_t(\bold{x})]_i \right] \\ &= -\sum_{i=1}^d \frac{\partial}{\partial x_i} \left\{ p_t(\bold{x}) \left[ f_i(\bold{x}, t) - \frac{1}{2} [\nabla \cdot \bold{F} (\bold{x}, t)]_i - \frac{1}{2} [\bold{F} (\bold{x}, t)^T \cdot \nabla_{\bold{x}} \log p_t(\bold{x})]_i \right] \right\} \\ &= -\sum_{i=1}^d \frac{\partial}{\partial x_i}[\tilde{f}_i(\bold{x}, t)p_t(\bold{x})] \end{align*}

Here:

\begin{align*} \bold{\tilde{f}}(\bold{x}, t) &= \bold{f}(\bold{x}, t) - \frac{1}{2} \nabla \cdot \bold{F}(\bold{x}, t) - \frac{1}{2} \bold{F}(\bold{x}, t)^T \nabla_{\bold{x}} \log p_t(\bold{x}) \\ &= \bold{f}(\bold{x}, t) - \frac{1}{2} \nabla \cdot [\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T] - \frac{1}{2} [\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T] \nabla_{\bold{x}} \log p_t(\bold{x}) \end{align*}

Therefore, this Fokker-Planck Equation can not only be derived from dx=f(x,t)dt+G(x,t)dwd \bold{x} = \bold{f}(\bold{x}, t) dt + \bold{G}(\bold{x}, t) d\bold{w}, but also can be derived from another SDE with G~(x,t)0\bold{\tilde{G}}(\bold{x}, t) \coloneqq 0:

dx=f~(x,t)dt+G~(x,t)dwd \bold{x} = \bold{\tilde{f}}(\bold{x}, t) dt + \bold{\tilde{G}}(\bold{x}, t) d\bold{w}

which is essentially an ODE:

d \bold{x} = \bold{\tilde{f}}(\bold{x}, t) dt = \left\{ \bold{f}(\bold{x}, t) - \frac{1}{2} \nabla \cdot [\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T] - \frac{1}{2} [\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T] \nabla_{\bold{x}} \log p_t(\bold{x}) \right\} dt

Hence, we have derived the ODE corresponding to the original SDE.
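Since the probability flow ODE is deterministic, sampling can use any black-box ODE solver. Below is a sketch with scipy's `solve_ivp`, integrating backwards from t = 1 to a small t_end and assuming a state-independent g(t) so the divergence term vanishes (`score_fn` is again a stand-in for the trained network):

```python
import numpy as np
from scipy.integrate import solve_ivp

def probability_flow_sample(score_fn, f, g, x_T, t_start=1.0, t_end=1e-3):
    """Integrate dx = [f(x, t) - 0.5 g(t)^2 * score(x, t)] dt backwards from t_start to t_end."""
    shape = x_T.shape

    def ode_fn(t, x_flat):
        x = x_flat.reshape(shape)
        dx = f(x, t) - 0.5 * (g(t) ** 2) * score_fn(x, t)
        return dx.reshape(-1)

    sol = solve_ivp(ode_fn, (t_start, t_end), x_T.reshape(-1), method="RK45", rtol=1e-5, atol=1e-5)
    return sol.y[:, -1].reshape(shape)
```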

9. Sample from VE SDE and VP SDE

Recall that the forward SDE takes the form:

dx=f(x,t)dt+G(t)dwd \bold{x} = \bold{f}(\bold{x}, t) dt + \bold{G}(t) d \bold{w}

9.1 Reverse Diffusion Sampling

In Section 8.1, given the reverse-time SDE:

dx=[f(x,t)G(t)G(t)Txlogpt(x)]dt+G(t)dwˉd \bold{x} = \left[ \bold{f}(\bold{x}, t) - \bold{G}(t)\bold{G}(t)^T \nabla_{\bold{x}} \log p_t(\bold{x}) \right] dt + \bold{G}(t) d\bold{\bar{w}}

we have derived its discretization form:

xi=xi+1fi+1(xi+1)+Gi+1Gi+1Tsθ(xi+1,i+1)+Gi+1zi+1\bold{x}_i = \bold{x}_{i+1} - \bold{f}_{i+1}(\bold{x}_{i+1}) + \bold{G}_{i+1}\bold{G}_{i+1}^T \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \bold{G}_{i+1}\bold{z}_{i+1}

Then, by applying this equation to VE SDE and VP SDE, we can obtain their sampler respectively.

VE SDE Sampler:

dx=d[σ2(t)]dtdwd \bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\bold{w}

For the VE SDE, \bold{f}(\bold{x}, t) = \bold{0} and G(t) = \sqrt{\frac{d[\sigma^2(t)]}{dt}}, whose discretization gives \bold{G}_{i+1} = \sqrt{\sigma_{i+1}^2 - \sigma_i^2}. Therefore, the VE SDE sampler is:

\begin{align*} \bold{x}_i &= \bold{x}_{i+1} + \bold{G}_{i+1}^2 \, \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \bold{G}_{i+1} \bold{z}_{i+1} \\ &= \bold{x}_{i+1} + (\sigma_{i+1}^2 - \sigma_i^2) \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\sigma_{i+1}^2 - \sigma_i^2}\, \bold{z}_{i+1} \end{align*}

VP SDE Sampler:

dx=12β(t)xdt+β(t)dwd \bold{x} = -\frac{1}{2} \beta(t) \bold{x} dt + \sqrt{\beta(t)} d\bold{w}

For VP SDE, f(x,t)=12β(t)x\bold{f}(\bold{x}, t) = -\frac{1}{2} \beta(t) \bold{x} and G(t)=β(t)\bold{G}(t) = \sqrt{\beta(t)}. Therefore, the VP SDE sampler is:

xi=xi+1+12βi+1xi+1+βi+1sθ(xi+1,i+1)+βi+1zi+1=(1+12βi+1)xi+1+βi+1sθ(xi+1,i+1)+βi+1zi+1(21βi+1)xi+1+βi+1sθ(xi+1,i+1)+βi+1zi+1\begin{align*} \bold{x}_i &= \bold{x}_{i+1} + \frac{1}{2} \beta_{i+1} x_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &= (1 + \frac{1}{2} \beta_{i+1}) x_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &\approx (2 - \sqrt{1 - \beta_{i+1}}) x_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \end{align*}

9.2 Probability Flow Sampling

In section 8.2, given a general form SDE:

dx=f(x,t)dt+G(x,t)dwd \bold{x} = \bold{f}(\bold{x}, t) dt + \bold{G}(\bold{x}, t) d\bold{w}

we derived the corresponding probability flow ODE:

d \bold{x} = \left\{ \bold{f}(\bold{x}, t) - \frac{1}{2} \nabla \cdot [\bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T] - \frac{1}{2} \bold{G}(\bold{x}, t)\bold{G}(\bold{x}, t)^T \nabla_{\bold{x}} \log p_t(\bold{x}) \right\} dt

When \bold{G}(\bold{x}, t) = \bold{G}(t) does not depend on \bold{x}, the divergence term vanishes, and the ODE becomes:

d \bold{x} = \left\{ \bold{f}(\bold{x}, t) - \frac{1}{2} \bold{G}(t)\bold{G}(t)^T \nabla_{\bold{x}} \log p_t(\bold{x}) \right\} dt

We can employ any numerical method to integrate the probability flow ODE backwards in time and substitute the score function with the trained neural network for sample generation. One discretization takes the following form:

xi=xi+1fi+1(xi+1)+12Gi+1Gi+1Tsθ(xi+1,i+1),i=0,1,,N1\bold{x}_i = \bold{x}_{i+1} - \bold{f}_{i+1} (\bold{x}_{i+1}) + \frac{1}{2} \bold{G}_{i+1} \bold{G}_{i+1}^T \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1), \qquad i=0,1,\dots,N-1

By analogy to section 9.1, we can obtain the probability flow sampling rules for SMLD(VE model) and DDPM(VP model).

SMLD ODE Sampling:

xi=xi+1+12(σi+12σi2)sθ(xi+1,i+1)\bold{x}_i = \bold{x}_{i+1} + \frac{1}{2}(\sigma_{i+1}^2 - \sigma_i^2) \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1)

DDPM ODE Sampling:

xi=(21βi+1)xi+1+12βi+1sθ(xi+1,i+1)\bold{x}_i = (2 - \sqrt{1 - \beta_{i+1}}) x_{i+1} + \frac{1}{2}\beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1)

10. Equivalence between DDPM Sampling and VP Reverse-SDE

Claim: The DDPM sampling is equivalent to a numerical solution of VP reverse-SDE

The DDPM sampling takes the form:

xi=11βi+1(xi+1+βi+1sθ(xi+1,i+1))+βi+1zi+1\bold{x}_i = \frac{1}{\sqrt{1 - \beta_{i+1}}} (\bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1)) + \sqrt{\beta_{i+1}} \bold{z}_{i+1}

According to the first order Taylor expansion of f(x)=(1x)12f(x) = (1-x)^{-\frac{1}{2}} at x=0x=0,

xi=(1+12βi+1+o(βi+1))(xi+1+βi+1sθ(xi+1,i+1))+βi+1zi+1(1+12βi+1)(xi+1+βi+1sθ(xi+1,i+1))+βi+1zi+1=(1+12βi+1)xi+1+βi+1sθ(xi+1,i+1)+12βi+12sθ(xi+1,i+1)+βi+1zi+1(1+12βi+1)xi+1+βi+1sθ(xi+1,i+1)+βi+1zi+1=[2(112βi+1)]xi+1+βi+1sθ(xi+1,i+1)+βi+1zi+1(21βi+1)xi+1+βi+1sθ(xi+1,i+1)+βi+1zi+1\begin{align*} \bold{x}_i &= \left(1 + \frac{1}{2} \beta_{i+1} + o(\beta_{i+1})\right)(\bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1)) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &\approx \left( 1 + \frac{1}{2} \beta_{i+1} \right)(\bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1)) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &= \left( 1 + \frac{1}{2} \beta_{i+1} \right) \bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \frac{1}{2}\beta_{i+1}^2 \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) +\sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &\approx \left( 1 + \frac{1}{2} \beta_{i+1} \right) \bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &= \left[ 2 - (1 - \frac{1}{2} \beta_{i+1}) \right] \bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}} \bold{z}_{i+1} \\ &\approx ( 2 - \sqrt{1 - \beta_{i+1}} ) \bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}} \bold{z}_{i+1}\end{align*}

11. Predictor-Corrector Samplers


Algorithm 1 Predictor-Corrector (PC) sampling

Require:

NN: Number of discretization steps for the reverse-time SDE

MM: Number of corrector steps

Initialize XNpT(x)\bold{X}_N \sim p_T(\bold{x})

for i=N1i = N - 1 to 00 do

xiPredictor(xi+1)\bold{x}_i \leftarrow \text{Predictor}(\bold{x}_{i+1})

for j=1j=1 to MM do

xiCorrector(xi)\bold{x}_i \leftarrow \text{Corrector}(\bold{x}_i)

return x0\bold{x}_0


Predictor-Corrector (PC) sampling

The predictor can be any numerical solver for the reverse-time SDE with a fixed discretization strategy. The corrector can be any score-based MCMC approach.

For example, when we use the reverse diffusion SDE solver (Section 9.1) as the predictor and annealed Langevin dynamics as the corrector, we have the following two algorithms for VE and VP SDEs respectively, where {ϵi}i=0N1\{ \epsilon_i \}_{i=0}^{N-1} are step sizes for Langevin dynamics.


Algorithm 2 PC sampling (VE SDE)

xNN(0,σmax2I)\bold{x}_N \sim \mathcal{N}(\bold{0}, \sigma_{\text{max}}^2 \bold{I})

for i=N1i=N-1 to 00 do

xixi+1+(σi+12σi2)sθ(xi+1,σi+1)\bold{x}_i' \leftarrow \bold{x}_{i+1} + (\sigma_{i+1}^2 - \sigma_i^2) \bold{s}_{\theta^*}(\bold{x}_{i+1}, \sigma_{i+1})

zN(0,I)\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})

xixi+σi+12σi2z\bold{x}_i \leftarrow \bold{x}_i' + \sqrt{\sigma_{i+1}^2 - \sigma_i^2} \bold{z}

for j=1j=1 to MM do

zN(0,I)\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})

xixi+ϵisθ(xi,σi)+2ϵiz\bold{x}_i \leftarrow \bold{x}_i + \epsilon_i \bold{s}_{\theta^*}(\bold{x}_i, \sigma_i) + \sqrt{2 \epsilon_i} \bold{z}

return x0\bold{x}_0



Algorithm 3 PC sampling (VP SDE)

xNN(0,I)\bold{x}_N \sim \mathcal{N}(\bold{0}, \bold{I})

for i=N1i = N-1 to 00 do

xi(21βi+1)xi+1+βi+1sθ(xi+1,i+1)\bold{x}_i' \leftarrow (2-\sqrt{1-\beta_{i+1}}) \bold{x}_{i+1} + \beta_{i+1} \bold{s}_{\theta^*}(\bold{x}_{i+1}, i+1)

zN(0,I)\bold{z} \sim \mathcal{N} (\bold{0}, \bold{I})

xixi+βi+1z\bold{x}_i \leftarrow \bold{x}_i' + \sqrt{\beta_{i+1}} \bold{z}

for j=1j = 1 to MM do

zN(0,I)\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})

xixi+ϵisθ(xi,i)+2ϵiz\bold{x}_i \leftarrow \bold{x}_i + \epsilon_i \bold{s}_{\theta^*}(\bold{x}_i, i) + \sqrt{2 \epsilon_i} \bold{z}

return x0\bold{x}_0


The corrector algorithms

The corrector algorithms are given in Algorithms 4 and 5 below, where r is the "signal-to-noise" ratio. The Langevin dynamics step size \epsilon is determined from the norm of the Gaussian noise ||\bold{z}||_2, the norm of the score-based model output ||\bold{s}_{\theta^*}||_2, and the signal-to-noise ratio r.

Trick: when sampling a large batch of samples together, the norm 2||\cdot||_2 is replaced with the average norm across the mini-batch. When the batch size is small, z2||\bold{z}||_2 is replaced by d\sqrt{d}, where dd is the dimensionality of z\bold{z}.
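A sketch of one corrector update implementing this step-size rule (the value r = 0.16 is only an illustrative signal-to-noise ratio; the paper tunes it per dataset, and the VP variant additionally multiplies \epsilon by \alpha_i):

```python
import numpy as np

def langevin_corrector_step(x, score_fn, noise_level, r=0.16, rng=None):
    """One corrector step with the SNR-based step size (VE form; VP multiplies eps by alpha_i)."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal(x.shape)
    g = score_fn(x, noise_level)
    eps = 2.0 * (r * np.linalg.norm(z) / np.linalg.norm(g)) ** 2   # eps = 2 (r ||z|| / ||g||)^2
    return x + eps * g + np.sqrt(2.0 * eps) * z
```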


Algorithm 4 Corrector algorithm (VE SDE).

Require: {σi}i=1N,r,N,M\{ \sigma_i \}_{i=1}^N, r, N, M

xN0N(0,σmax2I)\bold{x}_N^0 \sim \mathcal{N} (\bold{0}, \sigma_{\text{max}}^2 \bold{I})

for iNi \leftarrow N to 11 do

for j1j \leftarrow 1 to MM do

zN(0,I)\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})

gsθ(xij1,σi)\bold{g} \leftarrow \bold{s}_{\theta^*}(\bold{x}_i^{j-1}, \sigma_i)

ϵ2(rz2/g2)2\epsilon \leftarrow 2 (r ||\bold{z}||_2 / ||\bold{g}||_2)^2

xijxij1+ϵg+2ϵz\bold{x}_i^j \leftarrow \bold{x}_i^{j-1} + \epsilon \bold{g} + \sqrt{2 \epsilon} \bold{z}

xi10xiM\bold{x}_{i-1}^0 \leftarrow \bold{x}_i^M

return x00\bold{x}_0^0



Algorithm 5 Corrector algorithm (VP SDE).

Require: {βi}i=1N,{αi}i=1N,r,N,M\{ \beta_i \}_{i=1}^N , \{ \alpha_i \}_{i=1}^N, r, N, M

xN0N(0,I)\bold{x}_N^0 \sim \mathcal{N} (\bold{0}, \bold{I})

for iNi \leftarrow N to 11 do

for j1j \leftarrow 1 to MM do

zN(0,I)\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})

gsθ(xij1,i)\bold{g} \leftarrow \bold{s}_{\theta^*}(\bold{x}_i^{j-1}, i)

ϵ2αi(rz2/g2)2\epsilon \leftarrow 2 \alpha_i (r ||\bold{z}||_2 / ||\bold{g}||_2)^2

xijxij1+ϵg+2ϵz\bold{x}_i^j \leftarrow \bold{x}_i^{j-1} + \epsilon \bold{g} + \sqrt{2 \epsilon} \bold{z}

xi10xiM\bold{x}_{i-1}^0 \leftarrow \bold{x}_i^M

return x00\bold{x}_0^0


Denoising

For both SMLD and DDPM models, the generated samples typically contain small noise that is hard for humans to detect. This is part of the reason why NCSN models trained with SMLD have performed worse than DDPM models in terms of FID: the former do not use a denoising step at the end of sampling. One can apply Tweedie's formula at the end of the sampling process to remove this residual noise.

For a Gaussian variable zN(z;μz,Σz)\bold{z} \sim \mathcal{N} (\bold{z}; \mu_{z}, \Sigma_z), Tweedie’s Formula states that:

E[μzz]=z+Σzzlogp(z)\mathbb{E}[\mu_z | \bold{z}] = \bold{z} + \Sigma_z \nabla_{\bold{z}} \log p(\bold{z})

This is equivalent to running a predictor step without adding the random noise.
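A sketch of this final denoising step; the `variance` argument is the perturbation-kernel variance at the smallest noise level (e.g. \sigma_{\min}^2 for VE), which is an assumption about the caller's convention:

```python
import numpy as np

def tweedie_denoise(x, score_fn, noise_level, variance):
    """Final denoising step via Tweedie's formula: E[mu | x] = x + Sigma * score(x)."""
    return x + variance * score_fn(x, noise_level)
```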