This is a learning note for this series of videos.
Paper Link: https://arxiv.org/abs/2011.13456
1. Why use an SDE to describe the diffusion process?

We want to perturb the data with multiple noise scales (perturbing data with multiple noise scales has been key to the success of previous score-based generative models). The idea is to use an SDE to provide an infinite number of noise scales that change continuously.
The SDE (which is continuous) can be used for theoretical analysis. In practice, we discretize the SDE for numerical computation.
Goal: construct a diffusion process $\{\bold{x}(t)\}_{t=0}^T$ indexed by a continuous time variable $t \in [0, T]$, such that $\bold{x}(0) \sim p_0$, for which we have a dataset of i.i.d. samples, and $\bold{x}(T) \sim p_T$, for which we have a tractable form to generate samples efficiently. This diffusion process can be modeled as the solution to an Itô SDE:
$$d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$$
$\bold{w}$: a Brownian motion whose increments follow a Gaussian distribution with variance that increases with time:
$$\bold{w}(t+\Delta t) - \bold{w}(t) \sim \mathcal{N}(\bold{0}, \Delta t\, \bold{I}), \qquad d\bold{w} = \sqrt{\Delta t}\, \epsilon \quad \text{where } \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

$\bold{f}(\cdot, t): \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a vector-valued function called the drift coefficient of $\bold{x}(t)$, and $g(\cdot): \mathbb{R} \rightarrow \mathbb{R}$ is a scalar function known as the diffusion coefficient of $\bold{x}(t)$.
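To make this concrete, here is a minimal Euler–Maruyama sketch in Python/NumPy that simulates the forward SDE; the drift and diffusion functions passed in at the bottom are illustrative placeholders, not the paper's specific choices.

```python
import numpy as np

def euler_maruyama_forward(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw with the Euler-Maruyama scheme."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        eps = rng.standard_normal(x.shape)
        # drift term f(x, t) dt plus diffusion term g(t) sqrt(dt) * eps
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * eps
    return x

# Illustrative example: a simple mean-reverting drift and constant diffusion.
x_T = euler_maruyama_forward(
    x0=np.zeros(2),
    f=lambda x, t: -0.5 * x,   # drift coefficient (placeholder)
    g=lambda t: 1.0,           # diffusion coefficient (placeholder)
)
print(x_T)
```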
Starting from samples of $\bold{x}(T) \sim p_T$ and reversing the process, we can obtain samples $\bold{x}(0) \sim p_0$. It has been proved that the reverse of a diffusion process is also a diffusion process, running backwards in time and given by the reverse-time SDE:
$$d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$$

where $\bar{\bold{w}}$ is a standard Wiener process when time flows backwards from $T$ to $0$, and $dt$ is an infinitesimal negative timestep. Once the score of each marginal distribution, $\nabla_{\bold{x}} \log p_t(\bold{x})$, is known for all $t$, we can derive the above reverse diffusion process and simulate it to sample from $p_0$.
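As a rough sketch of how this is used in practice: once an estimate of the score is available (here `score(x, t)` is a placeholder, e.g. a trained network), the reverse-time SDE can be integrated with Euler–Maruyama from $t = T$ down to $t = 0$.

```python
import numpy as np

def euler_maruyama_reverse(x_T, f, g, score, T=1.0, n_steps=1000, rng=None):
    """Integrate dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) dw_bar
    backwards in time, from t = T down to t = 0."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps                      # magnitude of the negative timestep
    x = np.array(x_T, dtype=float)
    for i in range(n_steps, 0, -1):
        t = i * dt
        eps = rng.standard_normal(x.shape)
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        # minus sign on the drift update because time runs backwards (dt < 0)
        x = x - drift * dt + g(t) * np.sqrt(dt) * eps
    return x
```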
2. How to derive the reverse-time SDE?

Forward SDE: $d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$
Reverse SDE: $d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$
Proof 1:
Important assumption: the diffusion coefficient is $g(t)$ rather than $g(\bold{x}, t)$.
Characteristics of the Brownian motion $\{w(t), t \geq 0\}$:
Gaussian increments: $w(t) - w(s) \sim \mathcal{N}(0, t-s)$; in particular $w(t) - w(0) \sim \mathcal{N}(0, t)$.
Independent increments: if $0 \leq u \leq s \leq t$, then $w(t) - w(s)$ and $w(s) - w(u)$ are independent.
Path continuity: $w(t)$ is a continuous function of $t$.
Discretization of the forward SDE gives:
$$\bold{x}_{t+\Delta t} - \bold{x}_t = \bold{f}(\bold{x}_t, t)\,\Delta t + g(t)\sqrt{\Delta t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

$$\Leftrightarrow \quad \bold{x}_{t+\Delta t} = \bold{x}_t + \bold{f}(\bold{x}_t, t)\,\Delta t + g(t)\sqrt{\Delta t}\,\epsilon$$

This indicates that:
$$\begin{align*} p(\bold{x}_{t+\Delta t} \mid \bold{x}_t) &= \mathcal{N}\big(\bold{x}_{t+\Delta t};\ \bold{x}_t + \bold{f}(\bold{x}_t, t)\,\Delta t,\ g(t)^2 \Delta t \cdot \bold{I}\big) \\ &\propto \exp\left(-\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\,\Delta t\|^2}{2 g(t)^2 \Delta t}\right) \end{align*}$$

According to Bayes' theorem:
$$\begin{align*} p(\bold{x}_t \mid \bold{x}_{t+\Delta t}) &= \frac{p(\bold{x}_{t+\Delta t} \mid \bold{x}_t)\, p(\bold{x}_t)}{p(\bold{x}_{t+\Delta t})} = p(\bold{x}_{t+\Delta t} \mid \bold{x}_t) \exp\big(\log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t})\big) \\ &\propto \exp\left(-\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\,\Delta t\|^2}{2 g(t)^2 \Delta t} + \log p(\bold{x}_t) - \log p(\bold{x}_{t+\Delta t})\right) \end{align*}$$

Note: the first-order Taylor expansion of $f(x)$ at $x_0$ is $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$.
Note: $\log p(\bold{x}_t)$ is actually a function of both $\bold{x}_t$ and $t$; therefore, when taking derivatives, both of them should be considered.
By Taylor Expansion, we have
$$\log p(\bold{x}_{t+\Delta t}) \approx \log p(\bold{x}_t) + (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) + \Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t)$$

Plugging this expansion into $p(\bold{x}_t \mid \bold{x}_{t+\Delta t})$:
$$p(\bold{x}_t \mid \bold{x}_{t+\Delta t}) \propto \exp\left(-\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\,\Delta t\|^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) - \Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t)\right)$$

Completing the square in the exponent:

$$\begin{align*} & -\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\,\Delta t\|^2}{2 g(t)^2 \Delta t} - (\bold{x}_{t+\Delta t} - \bold{x}_t) \cdot \nabla_{\bold{x}_t} \log p(\bold{x}_t) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\Big( \|\bold{x}_{t+\Delta t} - \bold{x}_t\|^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t)\,\bold{f}(\bold{x}_t, t)\,\Delta t + \bold{f}(\bold{x}_t, t)^2 \Delta t^2 + 2 g(t)^2 \Delta t\, (\bold{x}_{t+\Delta t} - \bold{x}_t)\,\nabla_{\bold{x}_t} \log p(\bold{x}_t)\Big) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\Big( \|\bold{x}_{t+\Delta t} - \bold{x}_t\|^2 - 2(\bold{x}_{t+\Delta t} - \bold{x}_t)\,[\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\,\Delta t + C(\Delta t)\Big) \\ &= -\frac{1}{2 g(t)^2 \Delta t}\Big( \|\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\,\Delta t\|^2 + C(\Delta t)\Big) \end{align*}$$

Here $C(\Delta t)$ collects terms of order $\Delta t^2$ and higher, so even after dividing by $2 g(t)^2 \Delta t$ it vanishes as $\Delta t \rightarrow 0$; the term $-\Delta t\, \frac{\partial}{\partial t} \log p(\bold{x}_t)$ likewise vanishes as $\Delta t \rightarrow 0$. Therefore,
$$\begin{align*} p(\bold{x}_t \mid \bold{x}_{t+\Delta t}) &\propto \exp\left(-\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - [\bold{f}(\bold{x}_t, t) - g(t)^2 \nabla_{\bold{x}_t} \log p(\bold{x}_t)]\,\Delta t\|^2}{2 g(t)^2 \Delta t}\right) \\ &\approx \exp\left(-\frac{\|\bold{x}_t - \bold{x}_{t+\Delta t} - [\bold{f}(\bold{x}_{t+\Delta t}, t+\Delta t) - g(t+\Delta t)^2 \nabla_{\bold{x}_{t+\Delta t}} \log p(\bold{x}_{t+\Delta t})]\,(-\Delta t)\|^2}{2 g(t+\Delta t)^2 \Delta t}\right) \end{align*}$$

(The second line replaces $t$ with $t + \Delta t$ in the drift, diffusion coefficient and score; the two expressions agree as $\Delta t \rightarrow 0$.) Previously, we derived
$$p(\bold{x}_{t+\Delta t} \mid \bold{x}_t) \propto \exp\left(-\frac{\|\bold{x}_{t+\Delta t} - \bold{x}_t - \bold{f}(\bold{x}_t, t)\,\Delta t\|^2}{2 g(t)^2 \Delta t}\right)$$
from $d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$. Since $p(\bold{x}_t \mid \bold{x}_{t+\Delta t})$ has the same Gaussian form (with time running backwards), we can conclude that the corresponding reverse-time SDE is:
$$d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})]\, dt + g(t)\, d\bar{\bold{w}}$$

where time flows backwards and $d\bar{\bold{w}}$ is the corresponding reverse-time Wiener process.
3. Core settings of two diffusion models: SMLD and DDPM

3.1 Denoising score matching with Langevin Dynamics (SMLD)

Let the perturbation kernel be $p_{\sigma}(\tilde{\bold{x}} \mid \bold{x}) \coloneqq \mathcal{N}(\tilde{\bold{x}}; \bold{x}, \sigma^2 \bold{I})$, and $p_{\sigma}(\tilde{\bold{x}}) \coloneqq \int p_{data}(\bold{x})\, p_{\sigma}(\tilde{\bold{x}} \mid \bold{x})\, d\bold{x}$, where $p_{data}(\bold{x})$ denotes the data distribution. Consider a sequence of positive noise scales $\sigma_{\min} = \sigma_1 < \sigma_2 < \dots < \sigma_N = \sigma_{\max}$. Usually $\sigma_{\min}$ is small enough that $p_{\sigma_{\min}}(\bold{x}) \approx p_{data}(\bold{x})$, and $\sigma_{\max}$ is large enough that $p_{\sigma_{\max}}(\bold{x}) \approx \mathcal{N}(\bold{x}; \bold{0}, \sigma_{\max}^2 \bold{I})$.
Overall step (derived from the single step): $p(\bold{x}_t \mid \bold{x}_0) = \mathcal{N}(\bold{x}_t; \bold{x}_0, \sigma_t^2 \bold{I})$
Single step (by design): $p(\bold{x}_t \mid \bold{x}_{t-1}) = \mathcal{N}(\bold{x}_t; \bold{x}_{t-1}, (\sigma_t^2 - \sigma_{t-1}^2)\bold{I})$
Previous work proposes to train a Noise Conditional Score Network $\bold{s}_{\theta}(\bold{x}, \sigma)$ with a weighted sum of denoising score matching objectives:
$$\theta^* = \argmin_{\theta} \sum_{i=1}^N \sigma_i^2\, \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\sigma_i}(\tilde{\bold{x}} \mid \bold{x})}\left[ \|\bold{s}_{\theta}(\tilde{\bold{x}}, \sigma_i) - \nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}} \mid \bold{x})\|_2^2 \right]$$

Here, $\nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}} \mid \bold{x}) = -\frac{\tilde{\bold{x}} - \bold{x}}{\sigma_i^2}$. Given sufficient data and model capacity, the optimal score-based model $\bold{s}_{\theta^*}(\bold{x}, \sigma)$ matches $\nabla_{\bold{x}} \log p_{\sigma}(\bold{x})$ almost everywhere for $\sigma \in \{\sigma_i\}_{i=1}^N$.
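A minimal NumPy sketch of a Monte Carlo estimate of this objective; `score_fn` stands in for the score network $\bold{s}_{\theta}$, and in practice the expectations are estimated on minibatches and minimized with SGD rather than evaluated like this.

```python
import numpy as np

def smld_dsm_loss(score_fn, x_batch, sigmas, rng=None):
    """Weighted denoising score matching loss, estimated by sampling one
    perturbed copy of each data point per noise scale."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for sigma in sigmas:
        eps = rng.standard_normal(x_batch.shape)
        x_tilde = x_batch + sigma * eps                # sample from p_sigma(x_tilde | x)
        target = -(x_tilde - x_batch) / sigma ** 2     # grad_{x_tilde} log p_sigma(x_tilde | x)
        residual = score_fn(x_tilde, sigma) - target
        # the sigma_i^2 weight balances the magnitude of the targets across scales
        total += sigma ** 2 * np.mean(np.sum(residual ** 2, axis=-1))
    return total
```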
For sampling, the work uses $M$ steps of Langevin MCMC to get a sample for each $p_{\sigma_i}(\bold{x})$ sequentially:
$$\bold{x}_i^m = \bold{x}_i^{m-1} + \epsilon_i\, \bold{s}_{\theta^*}(\bold{x}_i^{m-1}, \sigma_i) + \sqrt{2\epsilon_i}\, \bold{z}_i^m, \qquad m = 1, 2, \dots, M,$$

where $\epsilon_i > 0$ is the step size, and $\bold{z}_i^m$ is standard normal. The above process is repeated for $i = N, N-1, \dots, 1$ in turn, with $\bold{x}_N^0 \sim \mathcal{N}(\bold{x} \mid \bold{0}, \sigma_{\max}^2 \bold{I})$ and $\bold{x}_i^0 = \bold{x}_{i+1}^M$ when $i < N$.
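A sketch of this annealed Langevin dynamics procedure; `score_fn` stands in for $\bold{s}_{\theta^*}$, and the per-level step size $\epsilon_i \propto \sigma_i^2$ is one common heuristic, used here only for illustration.

```python
import numpy as np

def annealed_langevin_sampling(score_fn, sigmas, shape, M=10, eps_base=2e-5, rng=None):
    """Run M Langevin MCMC steps at each noise level, from sigma_max down to sigma_min."""
    rng = np.random.default_rng() if rng is None else rng
    sigmas = np.sort(np.asarray(sigmas))[::-1]        # largest noise level first
    x = sigmas[0] * rng.standard_normal(shape)        # x_N^0 ~ N(0, sigma_max^2 I)
    for sigma in sigmas:
        eps_i = eps_base * (sigma / sigmas[-1]) ** 2  # step-size heuristic (assumption)
        for _ in range(M):
            z = rng.standard_normal(shape)
            x = x + eps_i * score_fn(x, sigma) + np.sqrt(2 * eps_i) * z
    return x
```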
Note: $i$ indexes the noise levels, and $M$ is the number of denoising steps at each noise level. Therefore, to generate a sample, we need to run the score function $N \times M$ times:
$$\bold{x}_N^0 \rightarrow \bold{x}_N^1 \rightarrow \dots \rightarrow \bold{x}_N^M \rightarrow \bold{x}_{N-1}^1 \rightarrow \dots \rightarrow \bold{x}_{N-1}^M \rightarrow \dots \rightarrow \bold{x}_1^M$$

3.2 Denoising Diffusion Probabilistic Models (DDPM)

Consider a sequence of positive noise scales $0 < \beta_1, \beta_2, \dots, \beta_N < 1$. For each training data point $\bold{x}_0 \sim p_{data}(\bold{x})$, construct a discrete Markov chain $\{\bold{x}_0, \bold{x}_1, \dots, \bold{x}_N\}$. The single step and overall step of this chain are:
Single step (by design): $p(\bold{x}_i \mid \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \sqrt{1 - \beta_i}\, \bold{x}_{i-1}, \beta_i \bold{I})$
Overall step (derived): $p_{\alpha_i}(\bold{x}_i \mid \bold{x}_0) = \mathcal{N}(\bold{x}_i; \sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$
Here $\alpha_i \coloneqq \prod_{j=1}^i (1 - \beta_j)$.
Proof of overall step:
Since $\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_i$ and $\bold{x}_{i+1} = \sqrt{1 - \beta_{i+1}}\, \bold{x}_i + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1}$, we have:
$$\begin{align*} \bold{x}_{i+1} &= \sqrt{1 - \beta_{i+1}}\left(\sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_i\right) + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{(1 - \beta_{i+1})\beta_i}\, \bold{z}_i + \sqrt{\beta_{i+1}}\, \bold{z}_{i+1} \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{\beta_i + \beta_{i+1} - \beta_i \beta_{i+1}}\, \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \\ &= \sqrt{(1 - \beta_{i+1})(1 - \beta_i)}\, \bold{x}_{i-1} + \sqrt{1 - (1 - \beta_{i+1})(1 - \beta_i)}\, \bold{z} \end{align*}$$

By induction, we can conclude that:
$$\begin{align*} \bold{x}_i &= \sqrt{\prod_{j=1}^i (1 - \beta_j)}\, \bold{x}_0 + \sqrt{1 - \prod_{j=1}^i (1 - \beta_j)}\, \bold{z} \\ &= \sqrt{\alpha_i}\, \bold{x}_0 + \sqrt{1 - \alpha_i}\, \bold{z} \qquad (\bold{z} \sim \mathcal{N}(\bold{0}, \bold{I})) \end{align*}$$

Therefore, $p_{\alpha_i}(\bold{x}_i \mid \bold{x}_0) = \mathcal{N}(\bold{x}_i; \sqrt{\alpha_i}\, \bold{x}_0, (1 - \alpha_i)\bold{I})$.
End of proof.
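A quick Monte Carlo sanity check of this closed form, using an arbitrary (illustrative) beta schedule and a scalar data point:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 200_000
betas = np.linspace(1e-4, 0.2, N)        # illustrative beta schedule
x0 = 1.7                                  # a scalar "data point"

# Simulate the single-step chain x_i = sqrt(1 - beta_i) x_{i-1} + sqrt(beta_i) z_i
x = np.full(trials, x0)
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(trials)

alpha_N = np.prod(1 - betas)
print(x.mean(), np.sqrt(alpha_N) * x0)    # empirical mean vs. sqrt(alpha_N) * x0
print(x.var(), 1 - alpha_N)               # empirical variance vs. 1 - alpha_N
```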
The noise scales are prescribed such that $\bold{x}_N$ is approximately distributed according to $\mathcal{N}(\bold{0}, \bold{I})$.
The denoising distribution derived from Bayes' formula is:
$$p_{\theta}(\bold{x}_{i-1} \mid \bold{x}_i) = \mathcal{N}\left(\bold{x}_{i-1};\ \frac{1}{\sqrt{1 - \beta_i}}\big(\bold{x}_i + \beta_i\, \bold{s}_{\theta}(\bold{x}_i, i)\big),\ \beta_i \bold{I}\right)$$

and it can be trained with a re-weighted variant of the evidence lower bound (ELBO):
$$\theta^* = \argmin_{\theta} \sum_{i=1}^N (1 - \alpha_i)\, \mathbb{E}_{p_{data}(\bold{x})} \mathbb{E}_{p_{\alpha_i}(\tilde{\bold{x}} \mid \bold{x})}\left[ \|\bold{s}_{\theta}(\tilde{\bold{x}}, i) - \nabla_{\tilde{\bold{x}}} \log p_{\alpha_i}(\tilde{\bold{x}} \mid \bold{x})\|_2^2 \right]$$

After solving the above optimization we get the optimal model $\bold{s}_{\theta^*}(\bold{x}_i, i)$; samples can then be generated by starting from $\bold{x}_N \sim \mathcal{N}(\bold{0}, \bold{I})$ and following the estimated reverse Markov chain as below:
$$\bold{x}_{i-1} = \frac{1}{\sqrt{1 - \beta_i}}\big(\bold{x}_i + \beta_i\, \bold{s}_{\theta^*}(\bold{x}_i, i)\big) + \sqrt{\beta_i}\, \bold{z}_i, \qquad i = N, N-1, \dots, 1.$$

NOTE: the weights of the $i$-th summand in the SMLD and DDPM loss functions, namely $\sigma_i^2$ and $(1 - \alpha_i)$, are related to the corresponding perturbation kernels in the same functional form: $\sigma_i^2 \propto 1 / \mathbb{E}\left[\|\nabla_{\tilde{\bold{x}}} \log p_{\sigma_i}(\tilde{\bold{x}} \mid \bold{x})\|_2^2\right]$ and $(1 - \alpha_i) \propto 1 / \mathbb{E}\left[\|\nabla_{\tilde{\bold{x}}} \log p_{\alpha_i}(\tilde{\bold{x}} \mid \bold{x})\|_2^2\right]$.
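Returning to the sampling iteration above, here is a minimal sketch of the reverse Markov chain; `score_fn` again stands in for the trained model $\bold{s}_{\theta^*}$.

```python
import numpy as np

def ddpm_ancestral_sampling(score_fn, betas, shape, rng=None):
    """Start from x_N ~ N(0, I) and apply
    x_{i-1} = (x_i + beta_i * s(x_i, i)) / sqrt(1 - beta_i) + sqrt(beta_i) * z_i
    for i = N, ..., 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)                   # x_N ~ N(0, I)
    for i in range(len(betas), 0, -1):
        beta = betas[i - 1]
        z = rng.standard_normal(shape)
        x = (x + beta * score_fn(x, i)) / np.sqrt(1 - beta) + np.sqrt(beta) * z
    return x
```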
4. VE, VP SDEs: Derive SDEs from Conditional Distributions

The noise perturbations used in SMLD and DDPM can be regarded as discretizations of two different SDEs.
VE Model

When using a total of $N$ noise scales, each perturbation kernel $p_{\sigma_i}(\bold{x} \mid \bold{x}_0)$ of SMLD corresponds to the distribution of $\bold{x}_i$ in the following Markov chain:
$$\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\, \bold{z}_{i-1}, \qquad i = 1, \dots, N,$$

where $\bold{z}_{i-1} \sim \mathcal{N}(\bold{0}, \bold{I})$, and we have introduced $\sigma_0 = 0$ to simplify the notation.
(Recall that the single-step perturbation of SMLD is $p(\bold{x}_i \mid \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \bold{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)\bold{I})$.)
When $N \rightarrow \infty$, $\{\sigma_i\}_{i=1}^N$ becomes a function $\sigma(t)$, $\bold{z}_i$ becomes $\bold{z}(t)$, and the Markov chain $\{\bold{x}_i\}_{i=1}^N$ becomes a continuous stochastic process $\{\bold{x}(t)\}_{t=0}^1$, where we have used a continuous time variable $t \in [0, 1]$. The process $\{\bold{x}(t)\}_{t=0}^1$ is given by the following SDE:
$$d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w}$$

VP Model

Similarly for DDPM, the perturbation kernels are $p(\bold{x}_i \mid \bold{x}_{i-1}) = \mathcal{N}(\bold{x}_i; \sqrt{1 - \beta_i}\, \bold{x}_{i-1}, \beta_i \bold{I})$. Then the discrete Markov chain is:
$$\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_{i-1}, \qquad i = 1, \dots, N.$$

As $N \rightarrow \infty$, the Markov chain converges to the following SDE:
$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$

Proof for SMLD (VE):
$$\bold{x}_i = \bold{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\, \bold{z}_{i-1}, \qquad i = 1, \dots, N.$$

Define some notation first: $\bold{x}(\frac{i}{N}) = \bold{x}_i$, $\sigma(\frac{i}{N}) = \sigma_i$, and $\bold{z}(\frac{i}{N}) = \bold{z}_i$ for $i = 1, \dots, N$. We can rewrite the Markov chain as follows with $\Delta t = \frac{1}{N}$ and $t \in \{0, \frac{1}{N}, \dots, \frac{N-1}{N}\}$:
$$\bold{x}(t + \Delta t) = \bold{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)}\, \bold{z}(t) \approx \bold{x}(t) + \sqrt{\frac{d[\sigma^2(t)]}{dt}\Delta t}\, \bold{z}(t)$$

The approximation comes from the definition of the derivative: $\frac{d[\sigma^2(t)]}{dt} = \lim_{\Delta t \rightarrow 0} \frac{\sigma^2(t + \Delta t) - \sigma^2(t)}{\Delta t}$.
As $\Delta t \rightarrow 0$, $\sqrt{\Delta t}\, \bold{z}(t)$ becomes $d\bold{w}$, where $\bold{w}$ is a Wiener process. This is because $\bold{w}(t + \Delta t) - \bold{w}(t) \sim \mathcal{N}(\bold{0}, \Delta t\, \bold{I})$.
End of Proof.
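A quick numerical check (with an arbitrary, illustrative sigma schedule) that the discrete SMLD chain's variance telescopes to $\sigma^2(t) - \sigma^2(0)$, consistent with the VE SDE above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 200_000
sigmas = np.linspace(0.0, 5.0, N + 1)    # sigma_0 = 0, illustrative schedule

# Simulate x_i = x_{i-1} + sqrt(sigma_i^2 - sigma_{i-1}^2) z_{i-1} from x_0 = 0
x = np.zeros(trials)
for i in range(1, N + 1):
    x += np.sqrt(sigmas[i] ** 2 - sigmas[i - 1] ** 2) * rng.standard_normal(trials)

# Independent Gaussian increments: the variances telescope to sigma_N^2 - sigma_0^2
print(x.var(), sigmas[-1] ** 2 - sigmas[0] ** 2)
```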
Proof for DDPM (VP):
$$\bold{x}_i = \sqrt{1 - \beta_i}\, \bold{x}_{i-1} + \sqrt{\beta_i}\, \bold{z}_{i-1}, \qquad i = 1, \dots, N.$$

Define an auxiliary set of noise scales $\{\bar{\beta}_i = N \beta_i\}_{i=1}^N$ and rewrite the Markov chain as below:
$$\bold{x}_i = \sqrt{1 - \frac{\bar{\beta}_i}{N}}\, \bold{x}_{i-1} + \sqrt{\frac{\bar{\beta}_i}{N}}\, \bold{z}_{i-1}, \qquad i = 1, \dots, N$$

In the limit of $N \rightarrow \infty$, $\{\bar{\beta}_i\}_{i=1}^N$ becomes a function $\beta(t)$ indexed by $t \in [0, 1]$. Let $\beta(\frac{i}{N}) = \bar{\beta}_i$, $\bold{x}(\frac{i}{N}) = \bold{x}_i$, and $\bold{z}(\frac{i}{N}) = \bold{z}_i$. We can rewrite the above Markov chain as follows, with $\Delta t = \frac{1}{N}$ and $t \in \{0, \frac{1}{N}, \dots, \frac{N-1}{N}\}$:
$$\begin{align*} \bold{x}(t + \Delta t) &= \sqrt{1 - \beta(t + \Delta t)\Delta t}\, \bold{x}(t) + \sqrt{\beta(t + \Delta t)\Delta t}\, \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2}\beta(t + \Delta t)\Delta t\, \bold{x}(t) + \sqrt{\beta(t + \Delta t)\Delta t}\, \bold{z}(t) \\ &\approx \bold{x}(t) - \frac{1}{2}\beta(t)\Delta t\, \bold{x}(t) + \sqrt{\beta(t)\Delta t}\, \bold{z}(t) \end{align*}$$

The first approximation comes from the Taylor expansion $\sqrt{1 - x} = 1 - \frac{x}{2} - \frac{x^2}{8} - \dots \approx 1 - \frac{x}{2}$.
Therefore, in the limit of Δ t → 0 \Delta t \rightarrow 0 Δ t → 0 , the Markov chain converges to the following VP SDE:
$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$
5. How to solve the SDE?

Idea: solve for $\mathbb{E}[\bold{x}_t]$ and $\text{Var}(\bold{x}_t)$; then, under the Gaussian assumption, we know that $p_{0t}(\bold{x}_t \mid \bold{x}_0) = \mathcal{N}(\cdot, \cdot)$.
Theorem 5.1 (simplified from Equations (5.50) and (5.51) in Applied Stochastic Differential Equations)
If the SDE takes the form:
$$d\bold{x} = \bold{f}(\bold{x}, t)\, dt + g(t)\, d\bold{w}$$

then the expectation $\bold{m}(t)$ and covariance matrix $\bold{P}(t)$ of $\bold{x}(t)$ satisfy:
$$\frac{d\bold{m}(t)}{dt} = \mathbb{E}[\bold{f}(\bold{x}, t)]$$

$$\frac{d\bold{P}(t)}{dt} = \mathbb{E}\left[\bold{f}(\bold{x}, t)(\bold{x} - \bold{m}(t))^T\right] + \mathbb{E}\left[(\bold{x} - \bold{m}(t))\bold{f}^T(\bold{x}, t)\right] + g^2(t)\bold{I}$$

Solution to the VP SDE:

$$\text{VP Model}: \quad d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)}\, d\bold{w}$$

By Theorem 5.1, we have
$$\begin{align*} \frac{d\bold{m}}{dt} &= \mathbb{E}\left[-\frac{1}{2}\beta(t)\bold{x}\right] = -\frac{1}{2}\beta(t)\mathbb{E}[\bold{x}(t)] = -\frac{1}{2}\beta(t)\bold{m} \\ \Rightarrow \frac{d\bold{m}}{\bold{m}} &= -\frac{1}{2}\beta(t)\, dt \\ \ln \bold{m}(t) - \ln \bold{m}(0) &= -\frac{1}{2}\int_0^t \beta(s)\, ds \\ \Rightarrow \bold{m}(t) &= e^{\ln \bold{m}(0) - \frac{1}{2}\int_0^t \beta(s) ds} = \bold{m}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds} \end{align*}$$

Therefore,
$$\mathbb{E}[\bold{x}(t) \mid \bold{x}(0)] = \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds}$$

For the covariance matrix $P(t)$:
$$\begin{align*} \frac{dP(t)}{dt} &= \mathbb{E}\left[-\frac{1}{2}\beta(t)\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] + \mathbb{E}\left[-\frac{1}{2}\beta(t)(\bold{x}(t) - \bold{m}(t))\bold{x}(t)^T\right] + \beta(t)\bold{I} \\ &= -\beta(t)\, \mathbb{E}\left[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] + \beta(t)\bold{I} \end{align*}$$

Since $\mathbb{E}[\bold{x}(t) - \bold{m}(t)] = \bold{0}$, we have:
$$\begin{align*} \mathbb{E}\left[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] &= \mathbb{E}\left[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] - \bold{0} \\ &= \mathbb{E}\left[\bold{x}(t)(\bold{x}(t) - \bold{m}(t))^T\right] - \bold{m}(t)\, \mathbb{E}\left[(\bold{x}(t) - \bold{m}(t))^T\right] \\ &= \mathbb{E}\left[(\bold{x}(t) - \bold{m}(t))(\bold{x}(t) - \bold{m}(t))^T\right] \\ &= P(t) \end{align*}$$

$$\begin{align*} \Rightarrow \frac{dP(t)}{dt} &= -\beta(t)P(t) + \beta(t)\bold{I} = \beta(t)(\bold{I} - P(t)) \\ \frac{dP(t)}{\bold{I} - P(t)} &= \beta(t)\, dt \\ -\ln(\bold{I} - P(t)) + \ln(\bold{I} - P(0)) &= \int_0^t \beta(s)\, ds \\ \bold{I} - P(t) &= \exp\left\{\ln(\bold{I} - P(0)) - \int_0^t \beta(s)\, ds\right\} \\ P(t) &= \bold{I} - (\bold{I} - P(0))\, e^{-\int_0^t \beta(s) ds} \end{align*}$$

Since $\mathrm{Cov}(\bold{x}_0 \mid \bold{x}_0) = \bold{0}$, we have:
$$\mathrm{Cov}(\bold{x}_t \mid \bold{x}_0) = \bold{I} - \bold{I}\, e^{-\int_0^t \beta(s) ds}$$

Therefore, the solution to the VP model is:
$$p_{0t}(\bold{x}(t) \mid \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds},\ \bold{I} - \bold{I}\, e^{-\int_0^t \beta(s) ds}\right) \qquad (\text{VP SDE})$$
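As a consistency check, the discrete DDPM marginal $\alpha_N = \prod_i (1 - \beta_i)$ should approach $e^{-\int_0^1 \beta(s) ds}$ for large $N$; the sketch below verifies this numerically for an illustrative linear $\beta(t)$ schedule.

```python
import numpy as np

beta_min, beta_max, N = 0.1, 20.0, 1000
t = 1.0

# Continuous VP kernel: integral of beta(s) = beta_min + (beta_max - beta_min) s over [0, t]
B = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
mean_coef_cont = np.exp(-0.5 * B)        # coefficient of x(0) in the mean
var_cont = 1.0 - np.exp(-B)              # variance of the perturbation kernel

# Discrete DDPM: beta_i = beta(i/N) / N, alpha_N = prod(1 - beta_i)
i = np.arange(1, N + 1)
betas = (beta_min + (beta_max - beta_min) * i / N) / N
alpha_N = np.prod(1.0 - betas)

print(mean_coef_cont, np.sqrt(alpha_N))  # should be close
print(var_cont, 1.0 - alpha_N)           # should be close
```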
Solution to the VE SDE:

$$\text{VE Model}: \quad d\bold{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w} \qquad (\bold{f}(\bold{x}, t) = \bold{0})$$

By Theorem 5.1:
$$\frac{d\bold{m}}{dt} = \bold{0} \ \Rightarrow\ \bold{m}(t) = C = \bold{m}(0) = \bold{x}_0$$

$$\begin{align*} \frac{dP(t)}{dt} &= g^2(t)\bold{I} = \frac{d[\sigma^2(t)]}{dt}\bold{I} \\ dP(t) &= d[\sigma^2(t)]\, \bold{I} \\ P(t) - P(0) &= \left[\sigma^2(t) - \sigma^2(0)\right]\bold{I} \\ P(t) &= \left[\sigma^2(t) - \sigma^2(0)\right]\bold{I} \qquad (\text{using } P(0) = \mathrm{Cov}(\bold{x}_0 \mid \bold{x}_0) = \bold{0}) \end{align*}$$

Therefore, the solution to the VE SDE is:
$$p_{0t}(\bold{x}(t) \mid \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0),\ \left[\sigma^2(t) - \sigma^2(0)\right]\bold{I}\right) \qquad (\text{VE SDE})$$
6. Derive mean and variance of the perturbation kernel from the sub-VP SDE

$$d\bold{x} = -\frac{1}{2}\beta(t)\, \bold{x}\, dt + \sqrt{\beta(t)\left(1 - e^{-2\int_0^t \beta(s) ds}\right)}\, d\bold{w}$$

Why the sub-VP SDE:
It performs well on likelihoods.
Its variance is bounded by that of the VP SDE.
Since the VE, VP, and sub-VP SDEs all have affine drift coefficients, their perturbation kernels $p_{0t}(\bold{x}(t) \mid \bold{x}(0))$ are all Gaussian and can be computed in closed form. This makes training with the following score-matching loss efficient:
$$\theta^* = \argmin_{\theta} \mathbb{E}_t\left\{ \lambda(t)\, \mathbb{E}_{\bold{x}(0)} \mathbb{E}_{\bold{x}(t) \mid \bold{x}(0)}\left[ \|\bold{s}_{\theta}(\bold{x}(t), t) - \nabla_{\bold{x}(t)} \log p_{0t}(\bold{x}(t) \mid \bold{x}(0))\|_2^2 \right] \right\}$$
Solution to the sub-VP SDE:

Corollary 6.1:
Given an ODE of the form $y'(x) + p(x)y(x) = f(x)$, the solution is given by:
$$y(x) = \frac{1}{\mu(x)}\left(\int f(\xi)\mu(\xi)\, d\xi + C\right)$$

where $\mu(x) = \exp\left(\int^x p(\xi)\, d\xi\right)$ is the integrating factor.
Similar to VP SDE, we have:
$$\mathbb{E}[\bold{x}(t) \mid \bold{x}(0)] = \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds}$$

By Theorem 5.1:
$$\begin{align*} \frac{dP(t)}{dt} &= -\beta(t)P(t) + \beta(t)\left(1 - \exp\left\{-2\int_0^t \beta(s)\, ds\right\}\right)\bold{I} \\ P'(t) + \beta(t)P(t) &= \beta(t)\left(1 - \exp\left\{-2\int_0^t \beta(s)\, ds\right\}\right)\bold{I} \end{align*}$$

By Corollary 6.1:
$$\begin{align*} P(t) &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \cdot \left(\int_0^t \beta(s)\left[1 - \exp\left\{-2\int_0^s \beta(\xi)\, d\xi\right\}\right]\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds + C\right) \\ &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \cdot \left(\int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds - \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds + C\right) \end{align*}$$

Denote $\textcircled{1} = \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds$ and $\textcircled{2} = \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds$, and solve them separately.
$$\textcircled{1} = \int_0^t \beta(s)\exp\left\{\int_0^s \beta(\xi)\, d\xi\right\} ds = \exp\left\{\int_0^s \beta(\xi)\, d\xi\right\}\Big|_{s=0}^{s=t} = \exp\left\{\int_0^t \beta(s)\, ds\right\} - 1$$

$$\textcircled{2} = \int_0^t \beta(s)\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\} ds = -\exp\left\{-\int_0^s \beta(\xi)\, d\xi\right\}\Big|_{s=0}^{s=t} = -\exp\left\{-\int_0^t \beta(s)\, ds\right\} + 1$$

Therefore:
$$\begin{align*} P(t) &= \bold{I} \cdot \exp\left\{-\int_0^t \beta(s)\, ds\right\} \cdot \left[\exp\left\{\int_0^t \beta(s)\, ds\right\} + \exp\left\{-\int_0^t \beta(s)\, ds\right\} + C\right] \\ &= \bold{I} \cdot \left[1 + \exp\left\{-2\int_0^t \beta(s)\, ds\right\} + \exp\left\{-\int_0^t \beta(s)\, ds\right\} \cdot C\right] \end{align*}$$

(the constant $-2$ arising from $\textcircled{1} - \textcircled{2}$ has been absorbed into $C$). Plugging in $t = 0$, we have $P(0) = \bold{I}(2 + C) \Rightarrow C\bold{I} = P(0) - 2\bold{I}$. Then:
$$P(t) = \bold{I} + e^{-2\int_0^t \beta(s) ds}\, \bold{I} + e^{-\int_0^t \beta(s) ds}\left(P(0) - 2\bold{I}\right)$$

Note: if $\lim_{t \rightarrow \infty} \int_0^t \beta(s)\, ds = \infty$, we can observe that $\lim_{t \rightarrow \infty} P(t) = \bold{I}$. This justifies the use of sub-VP SDEs for score-based generative modeling, since they can perturb any data distribution into a standard Gaussian under suitable conditions.
Since $P(0) = \bold{0}$, we have:
$$P(t) = \bold{I} + e^{-2\int_0^t \beta(s) ds}\, \bold{I} - 2 e^{-\int_0^t \beta(s) ds}\, \bold{I} = \left[1 - e^{-\int_0^t \beta(s) ds}\right]^2 \bold{I}$$

Therefore, the solution to the sub-VP SDE is:
$$p_{0t}(\bold{x}(t) \mid \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0)\, e^{-\frac{1}{2}\int_0^t \beta(s) ds},\ \left[1 - e^{-\int_0^t \beta(s) ds}\right]^2 \bold{I}\right) \qquad (\text{sub-VP SDE})$$
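A tiny numerical illustration of the claim above that the sub-VP variance is bounded by the VP variance: writing $B(t) = \int_0^t \beta(s)\, ds$, the VP kernel variance is $1 - e^{-B}$ while the sub-VP one is $(1 - e^{-B})^2$.

```python
import numpy as np

B = np.linspace(0.0, 10.0, 6)            # illustrative values of the integral of beta
var_vp = 1.0 - np.exp(-B)
var_subvp = (1.0 - np.exp(-B)) ** 2

print(np.all(var_subvp <= var_vp))       # True: sub-VP variance never exceeds VP variance
print(var_vp[-1], var_subvp[-1])         # both approach 1 as the integral grows
```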
7. How to choose the noise scales

7.1 SMLD (VE SDEs)

In SMLD, the noise scales $\{\sigma_i\}_{i=1}^N$ typically form a geometric sequence, where $\sigma_{\text{min}} = 0.01$ and $\sigma_{\text{max}}$ is chosen according to Technique 1 in Song & Ermon (2020).
Technique 1: choose $\sigma_{\text{max}}$ to be as large as the maximum Euclidean distance between all pairs of training data points. (Usually, SMLD models normalize image inputs to the range $[0, 1]$.)
Since $\{\sigma_i\}_{i=1}^N$ is a geometric sequence, we have $\sigma(\frac{i}{N}) = \sigma_i = \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{\frac{i-1}{N-1}}$ for $i = 1, \dots, N$. When $N \rightarrow \infty$, $\sigma(t) = \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t$ for $t \in (0, 1]$.
Then the corresponding VE SDE is:
$$\begin{align*} d\bold{x} &= \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bold{w} \\ &= \sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t \sqrt{2\ln\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}}\, d\bold{w} \end{align*}$$

The perturbation kernel can be derived according to the VE SDE solution in Section 5:
$$p_{0t}(\bold{x}(t) \mid \bold{x}(0)) = \mathcal{N}\left(\bold{x}(t);\ \bold{x}(0),\ \sigma_{\text{min}}^2\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{2t}\bold{I}\right), \qquad t \in (0, 1]$$
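A short sketch of this geometric schedule and of sampling directly from the corresponding perturbation kernel; the $\sigma_{\text{min}}$ and $\sigma_{\text{max}}$ values are illustrative.

```python
import numpy as np

sigma_min, sigma_max = 0.01, 50.0        # illustrative choices

def sigma(t):
    """Geometric schedule sigma(t) = sigma_min * (sigma_max / sigma_min)^t, t in (0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t

def perturb(x0, t, rng=None):
    """Sample x(t) ~ N(x(0), sigma(t)^2 I) directly from the VE perturbation kernel."""
    rng = np.random.default_rng() if rng is None else rng
    return x0 + sigma(t) * rng.standard_normal(np.shape(x0))

print(sigma(0.5))                         # noise level at t = 0.5
print(perturb(np.zeros(3), t=0.5))
```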