Score-Based Generative Modeling through Stochastic Differential Equations

This is a learning note for this series of videos.

Paper Link: https://arxiv.org/abs/2011.13456

1. Why use an SDE to describe the diffusion process?

We want to perturb the data with multiple noise scales (why?). The idea is to use an SDE to provide an infinite number of noise scales (continuously changing).

The SDE (which is continuous) can be used for theoretical analysis. In practice, we discretize the SDE for numerical computation.

Goal: construct a diffusion process $\{\bold{x}(t)\}_{t=0}^T$ indexed by a continuous time variable $t \in [0, T]$, such that $\bold{x}(0) \sim p_0$, for which we have a dataset of i.i.d. samples, and $\bold{x}(T) \sim p_T$, for which we have a tractable form to generate samples efficiently. This diffusion process can be modeled as the solution to an Itô SDE:

$$d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$$

$\bold{w}$: a Brownian motion whose increments are Gaussian, with variance equal to the time increment:

$$\bold{w}(t+\Delta t) - \bold{w}(t) \sim \mathcal{N}(\bold{0}, \Delta t\, \bold{I})$$
$$d\bold{w} = \sqrt{\Delta t}\, \epsilon \qquad \text{where } \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

$\bold{f}(\cdot, t): \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a vector-valued function called the drift coefficient of $\bold{x}(t)$, and $g(\cdot): \mathbb{R} \rightarrow \mathbb{R}$ is a scalar function known as the diffusion coefficient of $\bold{x}(t)$.
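As a concrete picture of what this equation means numerically, below is a minimal NumPy sketch of simulating the forward SDE with the Euler–Maruyama scheme; the drift and diffusion passed in at the bottom are toy placeholders, not the schedules used in the paper.

```python
import numpy as np

def euler_maruyama_forward(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw with the Euler-Maruyama scheme."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)  # dw ~ N(0, dt * I)
        x = x + f(x, t) * dt + g(t) * dw
    return x

# Toy example (placeholder coefficients, not the paper's schedules):
x_T = euler_maruyama_forward(
    x0=np.zeros(2),
    f=lambda x, t: -0.5 * x,   # assumed affine drift, just for illustration
    g=lambda t: 1.0,           # assumed constant diffusion
)
```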

Starting from samples of $\bold{x}(T) \sim p_T$ and reversing the process, we can obtain samples $\bold{x}(0) \sim p_0$. It has been proved that the reverse of a diffusion process is also a diffusion process, running backwards in time and given by the reverse-time SDE:

$$d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$$

where $\bar{\bold{w}}$ is a standard Wiener process when time flows backwards from $T$ to $0$, and $dt$ is an infinitesimal negative timestep. Once the score of each marginal distribution, $\nabla_{\bold{x}} \log p_t(\bold{x})$, is known for all $t$, we can derive the above reverse diffusion process and simulate it to sample from $p_0$.
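A minimal NumPy sketch of simulating this reverse-time SDE is below, assuming a function `score(x, t)` that approximates $\nabla_{\bold{x}} \log p_t(\bold{x})$ (in practice, a trained network); the backward Euler–Maruyama integrator here is just one simple discretization.

```python
import numpy as np

def reverse_sde_sampler(x_T, f, g, score, T=1.0, n_steps=1000, rng=None):
    """Integrate dx = [f(x, t) - g(t)^2 score(x, t)] dt + g(t) dw_bar
    from t = T down to t = 0 (dt is negative)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps                    # magnitude of the (negative) time step
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x - drift * dt + g(t) * dw  # minus sign: time runs backwards
    return x
```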

2. How to derive the reverse-time SDE?

Forward SDE: $d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$

Reverse SDE: $d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$

Proof 1:

Important Assumption: the diffusion coefficient is $g(t)$ rather than $g(\bold{x}, t)$.

Characteristics of the Brownian Motion $\{w(t), t \geq 0\}$:

  1. Gaussian Increments: $w(t) - w(s) \sim \mathcal{N}(0, t-s)$; in particular, $w(t) - w(0) \sim \mathcal{N}(0, t)$.
  1. Independent Increments: if $0 \leq u \leq s \leq t$, then $w(t) - w(s)$ and $w(s) - w(u)$ are independent.
  1. Path Continuity: $w(t)$ is a continuous function of $t$.

Discretization of the forward SDE gives:

$$\bold{x}_{t+\Delta t} - \bold{x}_t = \bold{f}(\bold{x}_t, t)\, \Delta t + g(t)\sqrt{\Delta t}\, \epsilon \qquad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$
$$\Leftrightarrow \bold{x}_{t+\Delta t} = \bold{x}_t + \bold{f}(\bold{x}_t, t)\, \Delta t + g(t)\sqrt{\Delta t}\, \epsilon$$

This indicates that:

$$\begin{align*} p(\bold{x}_{t+\Delta t} | \bold{x}_t) &= \mathcal{N}(\bold{x}_{t+\Delta t} \,|\, \bold{x}_t + \bold{f}(\bold{x}_t, t)\Delta t,\ g(t)^2 \Delta t \cdot \bold{I}) \\ &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} \right) \end{align*}$$

According to Bayes' theorem:

$$\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &= \frac{p(\bold{x}_{t+\Delta t} | \bold{x}_t)\, p(\bold{x}_t)}{p(\bold{x}_{t+\Delta t})} = p(\bold{x}_{t+\Delta t} | \bold{x}_t) \exp\left( \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t}) \right) \\ &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} + \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t}) \right) \end{align*}$$

Note: the first-order Taylor expansion of $f(x)$ at $x_0$ is $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$.

Note: $\log p(\bold{x}_t)$ is actually a function of both $\bold{x}_t$ and $t$, therefore when taking derivatives, both of them should be considered.

By Taylor Expansion, we have

$$\log p(\bold{x}_{t+\Delta t}) \approx \log p(\bold{x}_t) + (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) + \Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t)$$

Plugging the above equation into $p(\bold{x}_t | \bold{x}_{t+\Delta t})$:

$$\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) - \Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t) \right) \end{align*}$$
$$\begin{align*} & -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( \|\bold{x}_{t+\Delta t} - \bold{x}_t\|^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \bold{f}(\bold{x}_t, t)\Delta t + \|\bold{f}(\bold{x}_t, t)\|^2 \Delta t^2 + 2 g(t)^2 \Delta t\, (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) \right) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( \|\bold{x}_{t+\Delta t} - \bold{x}_t\|^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\Delta t + C(\Delta t) \right) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\left( \|\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\Delta t\|^2 + C(\Delta t) \right) \end{align*}$$

Here $C(\Delta t)$ is a polynomial of $\Delta t$ without a constant term, so it goes to $0$ as $\Delta t \rightarrow 0$ (the term $\Delta t\, \frac{\partial}{\partial t}\log p(\bold{x}_t)$ vanishes in the same limit and is dropped as well). Therefore,

$$\begin{align*} p(\bold{x}_t | \bold{x}_{t+\Delta t}) &\propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\Delta t\|^2}{2 g(t)^2 \Delta t} \right) \\ &\approx \exp\left( -\frac{\|\bold{x}_t - \bold{x}_{t+\Delta t} - [\bold{f}(\bold{x}_{t+\Delta t}, t+\Delta t) - g(t+\Delta t)^2 \nabla_{\bold{x}_{t+\Delta t}} \log p(\bold{x}_{t+\Delta t})](-\Delta t)\|^2}{2 g(t+\Delta t)^2 \Delta t} \right) \end{align*}$$

Previously, we derived $p(\bold{x}_{t+\Delta t} | \bold{x}_t) \propto \exp\left( -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\Delta t\|^2}{2 g(t)^2 \Delta t} \right)$ from $d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$. Therefore we can conclude that the corresponding reverse-time SDE is:

$$d\bold{x} = [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\, dt + g(t)\, d\bar{\bold{w}}$$

3. Core settings of two diffusion models: SMLD and DDPM

3.1 Denoising score matching with Langevin Dynamics (SMLD)

Let the perturbation kernel be $p_{\sigma}(\tilde{\bold{x}} | \bold{x}) \coloneqq \mathcal{N}(\tilde{\bold{x}}; \bold{x}, \sigma^2 \bold{I})$, and $p_{\sigma}(\tilde{\bold{x}}) \coloneqq \int p_{data}(\bold{x})\, p_{\sigma}(\tilde{\bold{x}} | \bold{x})\, d\bold{x}$, where $p_{data}(\bold{x})$ denotes the data distribution. Consider a sequence of positive noise scales $\sigma_{\min} = \sigma_1 < \sigma_2 < \dots < \sigma_N = \sigma_{\max}$. Usually $\sigma_{\min}$ is small enough that $p_{\sigma_{\min}}(\bold{x}) \approx p_{data}(\bold{x})$, and $\sigma_{\max}$ is large enough that $p_{\sigma_{\max}}(\bold{x}) \approx \mathcal{N}(\bold{x}; \bold{0}, \sigma_{\max}^2 \bold{I})$.

Overall Step (derived from the single step): $p(\bold{x}_i | \bold{x}_0) = \mathcal{N}(\bold{x}_i; \bold{x}_0, \sigma_i^2 \bold{I})$

Single Step (by design): $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \bold{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)\bold{I})$

Previous work proposes to train a Noise Conditional Score Network $\bold{s}_{\theta}(\bold{x}, \sigma)$ with a weighted sum of denoising score matching objectives:

$$\theta^* = \argmin_{\theta} \sum_{i=1}^N \sigma_i^2\, \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})}\left[ \|\bold{s}_{\theta}(\tilde{\bold{x}}, \sigma_i) - \nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})\|^2_2 \right]$$

Here, $\nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x}) = -\frac{\tilde{\bold{x}} - \bold{x}}{\sigma_i^2}$. The optimal score-based model $\bold{s}_{\theta^*}(\bold{x}, \sigma)$ can be obtained given sufficient data and model capacity. It matches $\nabla_{\bold{x}} \log p_{\sigma}(\bold{x})$ almost everywhere for $\sigma \in \{\sigma_i\}_{i=1}^N$.
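A minimal NumPy sketch of evaluating this weighted objective is below; `score_fn` stands in for $\bold{s}_{\theta}$ (in practice a neural network optimized with autograd), and the Monte-Carlo averaging over a batch is an assumption of how the expectations would be estimated.

```python
import numpy as np

def smld_dsm_loss(score_fn, x_batch, sigmas, rng=None):
    """Monte-Carlo estimate of the weighted denoising score matching objective.
    x_batch: (B, d) data samples; sigmas: the noise scales sigma_1 < ... < sigma_N."""
    rng = np.random.default_rng() if rng is None else rng
    loss = 0.0
    for sigma in sigmas:
        noise = sigma * rng.standard_normal(x_batch.shape)
        x_tilde = x_batch + noise            # sample from p_sigma(x_tilde | x)
        target = -noise / sigma ** 2         # grad_x_tilde log p_sigma(x_tilde | x)
        residual = score_fn(x_tilde, sigma) - target
        loss += sigma ** 2 * np.mean(np.sum(residual ** 2, axis=-1))
    return loss
```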

For sampling, the work uses $M$ steps of Langevin MCMC to get a sample for each $p_{\sigma_i}(\bold{x})$ sequentially:

$$\bold{x}_i^m = \bold{x}_i^{m-1} + \epsilon_i\, \bold{s}_{\theta^*}(\bold{x}_i^{m-1}, \sigma_i) + \sqrt{2\epsilon_i}\, \bold{z}_i^m \qquad m = 1, 2, \dots, M,$$

where $\epsilon_i > 0$ is the step size, and $\bold{z}_i^m$ is standard normal. The above process is repeated for $i = N, N-1, \dots, 1$ in turn, with $\bold{x}_N^0 \sim \mathcal{N}(\bold{0}, \sigma_{\max}^2 \bold{I})$ and $\bold{x}_i^0 = \bold{x}_{i+1}^M$ when $i < N$.

Note: $i$ indexes the noise levels, and $M$ is the number of denoising steps at each noise level. Therefore, to generate a sample, we need to run the score function $N \times M$ times:

$$\bold{x}_N^0 \rightarrow \bold{x}_N^1 \rightarrow \dots \rightarrow \bold{x}_N^M \rightarrow \bold{x}_{N-1}^1 \rightarrow \dots \rightarrow \bold{x}_{N-1}^M \rightarrow \dots \rightarrow \bold{x}_1^M$$
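The whole annealed Langevin procedure, as a NumPy sketch; `score_fn` stands in for the trained $\bold{s}_{\theta^*}$, and the step-size rule $\epsilon_i = \epsilon \cdot \sigma_i^2 / \sigma_1^2$ is the common choice from Song & Ermon (2020), taken here as an assumption rather than the only option.

```python
import numpy as np

def annealed_langevin_sampling(score_fn, sigmas, shape, n_steps_each=100,
                               step_lr=2e-5, rng=None):
    """Run M Langevin steps at each noise level, from sigma_N down to sigma_1,
    warm-starting each level from the result of the previous one.
    sigmas is assumed sorted increasingly: sigma_1 = sigma_min, ..., sigma_N = sigma_max."""
    rng = np.random.default_rng() if rng is None else rng
    x = sigmas[-1] * rng.standard_normal(shape)      # x_N^0 ~ N(0, sigma_max^2 I)
    for sigma in sigmas[::-1]:                       # i = N, N-1, ..., 1
        eps = step_lr * (sigma / sigmas[0]) ** 2     # assumed step-size schedule
        for _ in range(n_steps_each):                # m = 1, ..., M
            z = rng.standard_normal(shape)
            x = x + eps * score_fn(x, sigma) + np.sqrt(2.0 * eps) * z
    return x
```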

3.2 Denoising Diffusion Probabilistic Models (DDPM)

Consider a sequence of positive noise scales $0 < \beta_1, \beta_2, \dots, \beta_N < 1$. For each training data point $\bold{x}_0 \sim p_{data}(\bold{x})$, construct a discrete Markov chain $\{\bold{x}_0, \bold{x}_1, \dots, \bold{x}_N\}$. The single step and the overall step of this chain are:

Single step (by design): $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \sqrt{1 - \beta_i}\, \bold{x}_{i-1}, \beta_i \bold{I})$

Overall step (derived): $p_{\alpha_i}(\bold{x}_i | \bold{x}_0) = \mathcal{N}(\bold{x}_i; \sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$

Here $\alpha_i \coloneqq \Pi_{j=1}^i (1 - \beta_j)$.

Proof of overall step:

Since $\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_i$ and $\bold{x}_{i+1} = \sqrt{1 - \beta_{i+1}}\, \bold{x}_i + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1}$, we have:

$$\begin{align*} \bold{x}_{i+1} &= \sqrt{1 - \beta_{i+1}}\left(\sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_i\right) + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{(1 - \beta_{i+1})\beta_i}\, \bold{z}_i + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{\beta_i + \beta_{i+1} - \beta_i\beta_{i+1}}\, \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{1 - (1 - \beta_{i+1})(1 - \beta_i)}\, \bold{z} \end{align*}$$

By induction, we can conclude that:

$$\begin{align*} \bold{x}_i &= \sqrt{\Pi_{j=1}^i (1 - \beta_j)}\, \bold{x}_0 + \sqrt{1 - \Pi_{j=1}^i (1 - \beta_j)}\, \bold{z} \\ &= \sqrt{\alpha_i}\, \bold{x}_0 + \sqrt{1 - \alpha_i}\, \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \end{align*}$$

Therefore, $p_{\alpha_i}(\bold{x}_i | \bold{x}_0) = \mathcal{N}(\bold{x}_i; \sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$.

End of proof.
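As a quick sanity check of this closed form, the NumPy snippet below composes the single steps over many chains and compares the empirical mean/variance with $\mathcal{N}(\sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$; the $\beta$ schedule is just an assumed toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed toy schedule
alphas = np.cumprod(1.0 - betas)          # alpha_i = prod_{j<=i} (1 - beta_j)
x0 = np.array([1.0, -2.0])

# Compose the single steps x_i = sqrt(1 - beta_i) x_{i-1} + sqrt(beta_i) z_i
# for the first i = 50 steps, over many independent chains ...
n_chains, i = 20000, 50
x = np.tile(x0, (n_chains, 1))
for beta in betas[:i]:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# ... and compare against the closed form N(sqrt(alpha_i) x0, (1 - alpha_i) I).
print(x.mean(axis=0), np.sqrt(alphas[i - 1]) * x0)        # means should agree
print(x.var(axis=0), (1.0 - alphas[i - 1]) * np.ones(2))  # variances should agree
```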

The noise scales are prescribed such that $\bold{x}_N$ is approximately distributed according to $\mathcal{N}(\bold{0}, \bold{I})$.

The denoising distribution derived from Bayes' formula is:

$$p_{\theta}(\bold{x}_{i-1} | \bold{x}_i) = \mathcal{N}\left(\bold{x}_{i-1}; \frac{1}{\sqrt{1 - \beta_i}}\left(\bold{x}_i + \beta_i\, \bold{s}_{\theta}(\bold{x}_i, i)\right), \beta_i \bold{I}\right)$$

and it can be trained with a re-weighted variant of the evidence lower bound (ELBO):

$$\theta^* = \argmin_{\theta} \sum_{i=1}^N (1 - \alpha_i)\, \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\alpha_i}(\tilde{\bold{x}}|\bold{x})}\left[ \|\bold{s}_{\theta}(\tilde{\bold{x}}, i) - \nabla_{\tilde{\bold{x}}} \log p_{\alpha_i}(\tilde{\bold{x}}|\bold{x})\|^2_2 \right]$$

After solving the above problem we get the optimal model $\bold{s}_{\theta^*}(\bold{x}_i, i)$; samples can then be generated by starting from $\bold{x}_N \sim \mathcal{N}(\bold{0}, \bold{I})$ and following the estimated reverse Markov chain as below:

$$\bold{x}_{i-1} = \frac{1}{\sqrt{1 - \beta_i}}\left(\bold{x}_i + \beta_i\, \bold{s}_{\theta^*}(\bold{x}_i, i)\right) + \sqrt{\beta_i}\, \bold{z}_i, \qquad i = N, N-1, \dots, 1.$$
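This reverse chain (ancestral sampling) is easy to write down as a NumPy sketch; `score_fn` again stands in for the trained $\bold{s}_{\theta^*}$.

```python
import numpy as np

def ddpm_ancestral_sampling(score_fn, betas, shape, rng=None):
    """Sample by running the estimated reverse Markov chain:
    x_{i-1} = (x_i + beta_i * s(x_i, i)) / sqrt(1 - beta_i) + sqrt(beta_i) z_i."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)              # x_N ~ N(0, I)
    for i in range(len(betas), 0, -1):          # i = N, N-1, ..., 1
        beta = betas[i - 1]
        z = rng.standard_normal(shape)
        x = (x + beta * score_fn(x, i)) / np.sqrt(1.0 - beta) + np.sqrt(beta) * z
    return x
```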

NOTE: the weights of the $i$-th summand in the SMLD and DDPM loss functions, namely $\sigma_i^2$ and $(1 - \alpha_i)$, are related to the corresponding perturbation kernels in the same functional form: $\sigma_i^2 \propto 1 / \mathbb{E}\left[\|\nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}}|\bold{x})\|^2_2\right]$ and $(1 - \alpha_i) \propto 1 / \mathbb{E}\left[\|\nabla_{\tilde{\bold{x}}} \log p_{\alpha_i}(\tilde{\bold{x}}|\bold{x})\|^2_2\right]$.

4. VE, VP SDEs: Derive SDEs from Conditional Distributions

The noise perturbations used in SMLD and DDPM can be regarded as discretizations of two different SDEs.

VE Model

When using a total of $N$ noise scales, each perturbation kernel $p_{\sigma_i}(\bold{x} | \bold{x}_0)$ of SMLD corresponds to the distribution of $\bold{x}_i$ in the following Markov chain:

$$\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\, \bold{z}_{i-1} \qquad i = 1, \dots, N,$$

where $\bold{z}_{i-1} \sim \mathcal{N}(\bold{0}, \bold{I})$, and we have introduced $\sigma_0 = 0$ to simplify the notation.

(Recall that the single-step perturbation of SMLD is $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \bold{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)\bold{I})$.)

When $N \rightarrow \infty$, $\{\sigma_i\}_{i=1}^N$ becomes a function $\sigma(t)$, $\bold{z}_i$ becomes $\bold{z}(t)$, and the Markov chain $\{\bold{x}_i\}_{i=1}^N$ becomes a continuous stochastic process $\{\bold{x}(t)\}_{t=0}^1$, where we have used a continuous time variable $t \in [0, 1]$. The process $\{\bold{x}(t)\}_{t=0}^1$ is given by the following SDE:

$$d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w}$$

VP Model

Similarly for DDPM, the perturbation kernels are $p(\bold{x}_i | \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \sqrt{1 - \beta_i}\, \bold{x}_{i-1}, \beta_i \bold{I})$. Then the discrete Markov chain is:

$$\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_{i-1} \qquad i = 1, \dots, N.$$

As $N \rightarrow \infty$, the Markov chain converges to the following SDE:

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$

Proof for SMLD(VE):

$$\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\, \bold{z}_{i-1} \qquad i = 1, \dots, N.$$

Define some notation first: $\bold{x}(\frac{i}{N}) = \bold{x}_i$, $\sigma(\frac{i}{N}) = \sigma_i$, and $\bold{z}(\frac{i}{N}) = \bold{z}_i$ for $i = 1, \dots, N$. We can rewrite the Markov chain as follows, with $\Delta t = \frac{1}{N}$ and $t \in \{0, \frac{1}{N}, \dots, \frac{N-1}{N}\}$:

$$\bold{x}(t + \Delta t) = \bold{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)}\, \bold{z}(t) \approx \bold{x}(t) + \sqrt{\frac{d[\sigma^2(t)]}{dt}\Delta t}\, \bold{z}(t)$$

The approximation is given by the definition of the derivative: $\frac{d[\sigma^2(t)]}{dt} = \lim_{\Delta t \rightarrow 0} \frac{\sigma^2(t + \Delta t) - \sigma^2(t)}{\Delta t}$.

As $\Delta t \rightarrow 0$, $\sqrt{\Delta t}\, \bold{z}(t) \rightarrow d\bold{w}$, where $\bold{w}$ is a Wiener process. This is because $\bold{w}(t + \Delta t) - \bold{w}(t) \sim \mathcal{N}(\bold{0}, \Delta t\, \bold{I})$.

End of Proof.

Proof for DDPM(VP):

$$\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_{i-1} \qquad i = 1, \dots, N.$$

Define an auxiliary set of noise scales $\{\bar{\beta_i} = N\beta_i\}_{i=1}^N$ and rewrite the Markov chain as below:

$$\bold{x}_i = \sqrt{1 - \frac{\bar{\beta_i}}{N}}\, \bold{x}_{i-1} + \sqrt{\frac{\bar{\beta_i}}{N}}\, \bold{z}_{i-1} \qquad i = 1, \dots, N$$

In the limit of $N \rightarrow \infty$, $\{\bar{\beta_i}\}_{i=1}^N$ becomes a function $\beta(t)$ indexed by $t \in [0, 1]$. Let $\beta(\frac{i}{N}) = \bar{\beta_i}$, $\bold{x}(\frac{i}{N}) = \bold{x}_i$, $\bold{z}(\frac{i}{N}) = \bold{z}_i$. We can rewrite the above Markov chain as follows, with $\Delta t = \frac{1}{N}$ and $t \in \{0, \frac{1}{N}, \dots, \frac{N-1}{N}\}$:

$$\begin{align*} \bold{x}(t + \Delta t) &= \sqrt{1 - \beta(t + \Delta t)\Delta t}\, \bold{x}(t) + \sqrt{\beta(t + \Delta t)\Delta t}\, \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2}\beta(t + \Delta t)\Delta t\, \bold{x}(t) + \sqrt{\beta(t + \Delta t)\Delta t}\, \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2}\beta(t)\Delta t\, \bold{x}(t) + \sqrt{\beta(t)\Delta t}\, \bold{z}(t) \end{align*}$$

The first approximation comes from the Taylor expansion $\sqrt{1 - x} = 1 - \frac{x}{2} - \frac{x^2}{8} - \dots \approx 1 - \frac{x}{2}$; the second uses $\beta(t + \Delta t) \approx \beta(t)$ as $\Delta t \rightarrow 0$.

Therefore, in the limit of $\Delta t \rightarrow 0$, the Markov chain converges to the following VP SDE:

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$
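A rough numerical illustration of this convergence: simulate the discrete DDPM chain and an Euler–Maruyama discretization of the VP SDE side by side and compare the marginal statistics at $t = 1$. The linear $\beta_i$ schedule and the interpolation used for $\beta(t)$ are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
betas = np.linspace(1e-4, 0.02, N)     # assumed discrete schedule beta_i
grid = np.arange(1, N + 1) / N         # t_i = i / N

def beta_bar(t):
    """beta(t) = N * beta_i, interpolated from the discrete schedule."""
    return N * np.interp(t, grid, betas)

x0, n_paths, dt = 1.0, 20000, 1.0 / N

# Discrete DDPM chain: x_i = sqrt(1 - beta_i) x_{i-1} + sqrt(beta_i) z_i.
x_disc = np.full(n_paths, x0)
for b in betas:
    x_disc = np.sqrt(1.0 - b) * x_disc + np.sqrt(b) * rng.standard_normal(n_paths)

# VP SDE, Euler-Maruyama: dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw.
x_sde = np.full(n_paths, x0)
for i in range(N):
    t = i * dt
    x_sde = x_sde - 0.5 * beta_bar(t) * x_sde * dt \
            + np.sqrt(beta_bar(t) * dt) * rng.standard_normal(n_paths)

print(x_disc.mean(), x_sde.mean())   # the two means should be close
print(x_disc.std(), x_sde.std())     # the two stds should be close
```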

5. How to solve the SDE?

Idea: solve for $\mathbb{E}[\bold{x}_t]$ and $\text{Var}(\bold{x}_t)$; then, under the Gaussian assumption, we know that $p_{0t}(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\cdot, \cdot)$.

Theorem 5.1 (simplified from Equation (5.50), Equation (5.51) in Applied Stochastic Differential Equations)

If the SDE takes the form:

$$d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$$

Then the expectation $\bold{m}(t)$ and the covariance matrix $\bold{P}(t)$ of $\bold{x}(t)$ satisfy:

$$\frac{d\bold{m}(t)}{dt} = \mathbb{E}[\bold{f}(\bold{x}, t)]$$
$$\frac{d\bold{P}(t)}{dt} = \mathbb{E}[\bold{f}(\bold{x}, t)(\bold{x} - \bold{m}(t))^T] + \mathbb{E}[(\bold{x} - \bold{m}(t))\bold{f}^T(\bold{x}, t)] + g^2(t)\,\bold{I}$$

Solution to VP SDE:

$$\text{VP Model}: \quad d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$

By Theorem 5.1, we have

$$\begin{align*} \frac{d\bold{m}}{dt} &= \mathbb{E}\left[-\frac{1}{2}\beta(t)\bold{x}\right] = -\frac{1}{2}\beta(t)\, \mathbb{E}[\bold{x}(t)] = -\frac{1}{2}\beta(t)\, \bold{m} \\ \Rightarrow \frac{d\bold{m}}{\bold{m}} &= -\frac{1}{2}\beta(t)\, dt \\ \ln \bold{m}(t) - \ln \bold{m}(0) &= -\frac{1}{2}\int_0^t \beta(s)\, ds \\ \Rightarrow \bold{m}(t) &= e^{\ln \bold{m}(0) - \frac{1}{2}\int_0^t \beta(s) ds} = \bold{m}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds} \end{align*}$$

Therefore,

$$\mathbb{E}[\bold{x}(t) | \bold{x}(0)] = \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds}$$

For the covariance matrix $P(t)$:

$$\begin{align*} \frac{dP(t)}{dt} &= \mathbb{E}\left[-\frac{1}{2}\beta(t)\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] + \mathbb{E}\left[-\frac{1}{2}\beta(t)(\bold{x}(t) - \bold{m}(t))\bold{x}(t)^T\right] + \beta(t)\,\bold{I} \\ &= -\beta(t)\, \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] + \beta(t)\,\bold{I} \end{align*}$$

Since $\mathbb{E}[\bold{x}(t) - \bold{m}(t)] = \bold{0}$, we have:

$$\begin{align*} \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] &= \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] - \bold{0} \\ &= \mathbb{E}[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T] - \bold{m}(t)\, \mathbb{E}[(\bold{x}(t) - \bold{m}(t))^T] \\ &= \mathbb{E}[(\bold{x}(t) - \bold{m}(t))(\bold{x}(t) - \bold{m}(t))^T] \\ &= P(t) \end{align*}$$
$$\begin{align*} \Rightarrow \frac{dP(t)}{dt} &= -\beta(t)P(t) + \beta(t)\,\bold{I} = \beta(t)(\bold{I} - P(t)) \\ \frac{dP(t)}{\bold{I} - P(t)} &= \beta(t)\, dt \\ -\ln(\bold{I} - P(t)) + \ln(\bold{I} - P(0)) &= \int_0^t \beta(s)\, ds \\ \bold{I} - P(t) &= \exp\left\{\ln(\bold{I} - P(0)) - \int_0^t \beta(s)\, ds\right\} \\ P(t) &= \bold{I} - (\bold{I} - P(0))\, e^{-\int_0^t \beta(s) ds} \end{align*}$$

Since $\text{Cov}(\bold{x}_0 | \bold{x}_0) = 0$, we have:

$$\text{Cov}(\bold{x}_t | \bold{x}_0) = \bold{I} - \bold{I}\, e^{-\int_0^t \beta(s) ds}$$

Therefore, the solution to the VP model is:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds},\ \bold{I} - \bold{I}\, e^{-\int_0^t \beta(s) ds}\right) \qquad (\text{VP SDE})$$
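In code, this perturbation kernel reduces to one integral of $\beta$; the sketch below assumes the commonly used linear schedule $\beta(t) = \beta_{\min} + t(\beta_{\max} - \beta_{\min})$ (the specific values are placeholders), so $\int_0^t \beta(s)\, ds$ has a closed form.

```python
import numpy as np

def vp_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """Mean and std of p_{0t}(x(t) | x(0)) for the VP SDE under an assumed
    linear schedule beta(t) = beta_min + t * (beta_max - beta_min)."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2  # \int_0^t beta(s) ds
    mean = np.exp(-0.5 * int_beta) * x0
    std = np.sqrt(1.0 - np.exp(-int_beta))
    return mean, std
```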

Solution to VE SDE:

$$\text{VE Model}: \quad d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w} \qquad (\bold{f}(\bold{x}, t) = \bold{0})$$

By Theorem 5.1:

$$\frac{d\bold{m}}{dt} = \bold{0} \Rightarrow \bold{m}(t) = C = \bold{m}(0) = \bold{x}_0$$
$$\begin{align*} \frac{dP(t)}{dt} &= \frac{d[\sigma^2(t)]}{dt}\, \bold{I} \\ dP(t) &= d[\sigma^2(t)]\, \bold{I} \\ P(t) - P(0) &= [\sigma^2(t) - \sigma^2(0)]\, \bold{I} \\ P(t) &= [\sigma^2(t) - \sigma^2(0)]\, \bold{I} \qquad (P(0) = 0) \end{align*}$$

Therefore, the solution to VE SDE is:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0),\ [\sigma^2(t) - \sigma^2(0)]\, \bold{I}\right) \qquad (\text{VE SDE})$$

6. Derive the mean and variance of the perturbation kernel from the sub-VP SDE

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)\left(1 - e^{-2\int_0^t \beta(s) ds}\right)}\, d\bold{w}$$

Why sub-VP SDE:

  1. Performs well on likelihood
  1. Its variance is bounded by that of the VP SDE

Since the VE, VP and sub-VP SDEs all have affine drift coefficients, their perturbation kernels $p_{0t}(\bold{x}(t) | \bold{x}(0))$ are all Gaussian and can be computed in closed form. This makes training with the score-matching loss tractable:

$$\theta^* = \argmin_{\theta} \mathbb{E}_t\left\{ \lambda(t)\, \mathbb{E}_{\bold{x}(0)} \mathbb{E}_{\bold{x}(t) | \bold{x}(0)}\left[ \|\bold{s}_{\theta}(\bold{x}(t), t) - \nabla_{\bold{x}(t)} \log p_{0t}(\bold{x}(t) | \bold{x}(0))\|^2_2 \right] \right\}$$
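Below is a minimal NumPy sketch of estimating this objective with the closed-form kernels: sample $t$, perturb $\bold{x}(0)$ using the kernel's mean/std, and regress the model output onto $\nabla_{\bold{x}(t)} \log p_{0t}(\bold{x}(t) | \bold{x}(0)) = -\bold{z}/\text{std}(t)$. The uniform sampling of $t$, the small cutoff `eps`, and the form of `lam` are assumptions; `kernel` can be, e.g., the `vp_kernel` sketch from Section 5.

```python
import numpy as np

def continuous_dsm_loss(score_fn, x0_batch, kernel, lam, n_t=128, eps=1e-5, rng=None):
    """Monte-Carlo estimate of the continuous-time score-matching objective.
    kernel(x0, t) -> (mean, std); lam(t) is the weighting lambda(t)."""
    rng = np.random.default_rng() if rng is None else rng
    loss = 0.0
    for _ in range(n_t):
        t = rng.uniform(eps, 1.0)             # avoid t = 0 where std = 0
        mean, std = kernel(x0_batch, t)
        z = rng.standard_normal(x0_batch.shape)
        x_t = mean + std * z                  # sample from p_{0t}(x(t) | x(0))
        target = -z / std                     # grad_{x(t)} log p_{0t}(x(t) | x(0))
        residual = score_fn(x_t, t) - target
        loss += lam(t) * np.mean(np.sum(residual ** 2, axis=-1))
    return loss / n_t
```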

Solution to sub-VP SDE:

Corollary 6.1:

Given an ODE of the form $y'(x) + p(x)y(x) = f(x)$, the solution is given by:

$$y(x) = \frac{1}{\mu(x)}\left( \int f(\xi)\mu(\xi)\, d\xi + C \right)$$

where $\mu(x) = \exp\left( \int^x p(\xi)\, d\xi \right)$.

Similar to the VP SDE, we have:

$$\mathbb{E}[\bold{x}(t) | \bold{x}(0)] = \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds}$$

By Theorem 5.1:

$$\begin{align*} \frac{dP(t)}{dt} &= -\beta(t)P(t) + \beta(t)\left(1 - \exp\left\{-2\int_0^t \beta(s)\, ds\right\}\right) \\ P'(t) + \beta(t)P(t) &= \beta(t)\left(1 - \exp\left\{-2\int_0^t \beta(s)\, ds\right\}\right) \end{align*}$$

By Corollary 6.1:

$$\begin{align*} P(t) &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \left( \int_0^t \beta(s)\left[1 - \exp\left\{-2\int_0^s \beta(\xi)\, d\xi\right\}\right]\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds + C \right) \\ &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \left( \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds - \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds + C \right) \end{align*}$$

Denote $\textcircled{1} = \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds$ and $\textcircled{2} = \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds$, and solve them separately.

$$\begin{align*} \textcircled{1} &= \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds \\ &= \left.\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\}\right|_{s=0}^{s=t} \\ &= \exp\left\{\int_0^t \beta(s)\, ds\right\} - 1 \end{align*}$$
$$\begin{align*} \textcircled{2} &= \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds \\ &= \left.-\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\}\right|_{s=0}^{s=t} \\ &= -\exp\left\{-\int_0^t \beta(s)\, ds\right\} + 1 \end{align*}$$

Therefore (absorbing the constant terms from $\textcircled{1}$ and $\textcircled{2}$ into $C$):

$$\begin{align*} P(t) &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\}\left[ \exp\left\{\int_0^t \beta(s)\, ds\right\} + \exp\left\{-\int_0^t \beta(s)\, ds\right\} + C \right] \\ &= \bold{I} \cdot \left[ 1 + \exp\left\{-2\int_0^t \beta(s)\, ds\right\} + \exp\left\{-\int_0^t \beta(s)\, ds\right\} \cdot C \right] \end{align*}$$

Plugging in $t = 0$, we have $P(0) = \bold{I}(2 + C) \Rightarrow C\bold{I} = P(0) - 2\bold{I}$. Then:

$$P(t) = \bold{I} + e^{-2\int_0^t \beta(s) ds}\, \bold{I} + e^{-\int_0^t \beta(s) ds}\, (P(0) - 2\bold{I})$$

Note: if $\lim_{t \rightarrow \infty} \int_0^t \beta(s)\, ds = \infty$, we can observe that $\lim_{t \rightarrow \infty} P(t) = \bold{I}$. This justifies the use of sub-VP SDEs for score-based generative modeling, since they can perturb any data distribution to a standard Gaussian under suitable conditions.

Since $P(0) = 0$, we have:

$$P(t) = \bold{I} + e^{-2\int_0^t \beta(s) ds}\, \bold{I} - 2e^{-\int_0^t \beta(s) ds}\, \bold{I} = \left[1 - e^{-\int_0^t \beta(s) ds}\right]^2 \bold{I}$$

Therefore, the solution to sub-VP SDEs is:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds},\ \left[1 - e^{-\int_0^t \beta(s) ds}\right]^2 \bold{I}\right) \qquad (\text{sub-VP SDE})$$
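The corresponding kernel sketch in NumPy, with a quick check of the "variance bounded by the VP SDE" remark from the start of this section; the linear $\beta(t)$ schedule is the same assumption as in the VP sketch above.

```python
import numpy as np

def subvp_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """Mean and std of p_{0t}(x(t) | x(0)) for the sub-VP SDE,
    under the same assumed linear beta(t) schedule as the VP sketch."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    mean = np.exp(-0.5 * int_beta) * x0
    std = 1.0 - np.exp(-int_beta)          # the covariance is std^2 * I
    return mean, std

# Sanity check: since 0 <= 1 - e^{-I} <= 1, the sub-VP std (1 - e^{-I}) never
# exceeds the VP std sqrt(1 - e^{-I}).
for t in (0.1, 0.5, 1.0):
    int_beta = 0.1 * t + 0.5 * (20.0 - 0.1) * t ** 2
    assert subvp_kernel(0.0, t)[1] <= np.sqrt(1.0 - np.exp(-int_beta))
```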

7. How to choose the noise scale

7.1 SMLD (VE SDEs)

In SMLD, the noise scales $\{\sigma_i\}_{i=1}^N$ are typically a geometric sequence, where $\sigma_{\text{min}} = 0.01$ and $\sigma_{\text{max}}$ is chosen according to Technique 1 in Song & Ermon (2020).

Technique 1: Choose σmax\sigma_{\text{max}} to be as large as the maximum Euclidean distance between all pairs of training data points. (Usually, SMLD models normalize image inputs to the range [0,1])

Since $\{\sigma_i\}_{i=1}^N$ is a geometric sequence, we have $\sigma(\frac{i}{N}) = \sigma_i = \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{\frac{i-1}{N-1}}$ for $i = 1, \dots, N$. When $N \rightarrow \infty$, $\sigma(t) = \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t$ for $t \in (0, 1]$.

Then the corresponding VE SDE is:

$$\begin{align*} d\bold{x} &= \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w} \\ &= \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t \sqrt{2\ln\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}}\, d\bold{w} \end{align*}$$

The perturbation kernel can be derived from the VE SDE solution in Section 5:

$$p_{0t}(\bold{x}(t) | \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0),\ \sigma^2_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{2t} \bold{I}\right), \qquad t \in (0, 1]$$
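As a small sketch, the geometric $\sigma(t)$ and the kernel standard deviation are one line of NumPy; $\sigma_{\text{max}} = 50$ below is only an illustrative value, since per Technique 1 it should come from the maximum pairwise distance in the training data.

```python
import numpy as np

def ve_kernel_std(t, sigma_min=0.01, sigma_max=50.0):
    """Std of the VE perturbation kernel with the geometric sigma(t);
    sigma_max is an assumed illustrative value (should follow Technique 1)."""
    return sigma_min * (sigma_max / sigma_min) ** np.asarray(t, dtype=float)

print(ve_kernel_std(1.0))   # at t = 1 the std reaches sigma_max (here 50.0)
```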