Goal: generate (sample) data $\{x_1,\dots,x_T\}$ from a distribution $p(x)$. If $p(x)$ is known, plug $\nabla_x \log p(x)$ into Langevin dynamics to generate samples. If $p(x)$ is unknown, train a neural network $s_\theta(x) \approx \nabla_x \log p(x)$ to approximate the score.
1. Langevin Dynamics is gradient descent
Definition 1.1: The Langevin dynamics for sampling from a known distribution $p(x)$ is the iterative procedure, for $t = 1,\dots,T$:
$$x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau}\, z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
where $\tau$ is the step size which users can control, and $x_0$ is white noise.
Remark: Without the noise term $\sqrt{2\tau}\,z_t$, Langevin dynamics is gradient descent.
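As a sanity check, the update in Definition 1.1 can be run directly when $p(x)$ is a known 1-D Gaussian, whose score has the closed form $-(x-\mu)/\sigma^2$. A minimal NumPy sketch (the target parameters, step size, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x) = N(mu, sigma^2); its score d/dx log p(x) is known in closed form.
mu, sigma = 2.0, 0.5
def score(x):
    return -(x - mu) / sigma**2

tau = 1e-3                          # step size
x = rng.standard_normal(10_000)     # x0: white noise; 10,000 independent chains
for _ in range(2_000):
    z = rng.standard_normal(x.shape)
    x = x + tau * score(x) + np.sqrt(2 * tau) * z   # Langevin update

print(x.mean(), x.std())            # should be close to (mu, sigma)
```

After enough iterations the chains forget the white-noise initialization, and the empirical mean and standard deviation match the target distribution.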
The goal of sampling can loosely be viewed as solving the optimization:
$$x^* = \operatorname*{argmax}_{x} \log p(x).$$
This optimization can be solved by gradient descent. One gradient descent step (ascent on $\log p$) is:
$$x_{t+1} = x_t + \tau \nabla_x \log p(x_t).$$
2. Langevin Dynamics is not only gradient descent, but also stochastic gradient descent
Why stochastic? Because we are more interested in sampling than in optimization: the noise term $\sqrt{2\tau}\,z_t$ keeps the iterates from collapsing onto a single mode, so the trajectory explores the distribution instead of converging to its maximum.
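The difference between the two regimes is easy to see numerically: run the same iteration with and without the noise term (a small sketch, with a 1-D Gaussian target assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, tau = 2.0, 0.5, 1e-3     # illustrative 1-D Gaussian target
def score(x):
    return -(x - mu) / sigma**2

x_gd = rng.standard_normal(5_000)   # pure gradient ascent on log p(x)
x_ld = x_gd.copy()                  # Langevin dynamics
for _ in range(3_000):
    x_gd = x_gd + tau * score(x_gd)                 # no noise: optimization
    z = rng.standard_normal(x_ld.shape)
    x_ld = x_ld + tau * score(x_ld) + np.sqrt(2 * tau) * z  # noise: sampling

print(x_gd.std())   # ~0: every chain collapses onto the mode mu
print(x_ld.std())   # ~sigma: chains spread out and cover p(x)
```

Without noise every chain lands on the same point (the mode); with noise the chains form a sample from $p(x)$.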
3. Langevin Dynamics from a physics perspective
Relationship between force $F$, mass $m$, and velocity $v(t)$ (Newton's second law):
$$F = m\,\frac{dv(t)}{dt}.$$
Relationship between force $F$ and the potential energy $U(x)$:
$$F = -\nabla_x U(x).$$
The randomness of Langevin dynamics comes from Brownian motion. Suppose there is a bag of molecules moving around in a fluid. Their motion can be described according to the Brownian motion model: the force needed to drag a molecule at velocity $v(t)$ is the viscous drag minus a random collision force $\eta$,
$$F = \frac{1}{\lambda}\, v(t) - \eta,$$
where $\lambda$ is the mobility of the medium. According to the above three equations (in the overdamped regime, where the inertia term $m\,dv/dt$ is negligible), we have:
$$\frac{1}{\lambda}\, v(t) - \eta = -\nabla_x U(x) \quad\Longrightarrow\quad v(t) = \lambda\big({-\nabla_x U(x)} + \eta\big).$$
Since $\frac{dx}{dt} = v(t)$ and $\eta \sim \mathcal{N}(0, \sigma^2 I)$, we have
$$\frac{dx}{dt} = -\lambda\, \nabla_x U(x) + \lambda\, \eta.$$
If we let $\tau = \lambda\, dt$ and discretize the above differential equation, we will obtain:
$$x_{t+1} = x_t - \tau\, \nabla_x U(x_t) + \tau\, \eta.$$
A lazy choice to determine the energy potential is using the Boltzmann distribution, with the form:
$$p(x) = \frac{1}{Z} \exp\big({-U(x)}\big), \qquad \text{so that} \qquad \nabla_x U(x) = -\nabla_x \log p(x).$$
Therefore,
$$x_{t+1} = x_t + \tau\, \nabla_x \log p(x_t) + \tau\, \eta.$$
If we choose $\sigma = \sqrt{2/\tau}$, then $\tau\eta \sim \mathcal{N}(0, 2\tau I)$, i.e., $\tau\eta = \sqrt{2\tau}\, z$ with $z \sim \mathcal{N}(0, I)$, and we will obtain:
$$x_{t+1} = x_t + \tau\, \nabla_x \log p(x_t) + \sqrt{2\tau}\, z,$$
which is the Langevin dynamics.
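The final scaling step can be checked numerically. Assuming the noise level of $\eta$ works out to $\sigma = \sqrt{2/\tau}$, the discrete noise term $\tau\eta$ has standard deviation $\sqrt{2\tau}$, which is exactly the noise term of Langevin dynamics. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
tau = 0.01
sigma = np.sqrt(2 / tau)                    # assumed noise scale of eta
eta = sigma * rng.standard_normal(1_000_000)
noise = tau * eta                           # discrete-time noise term tau * eta
print(noise.std(), np.sqrt(2 * tau))        # both ≈ sqrt(2 * tau)
```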
4. Stein’s Score Function
Definition 4.1 (Stein's score function):
$$s(x) := \nabla_x \log p(x).$$
Distinguish it from the ordinary (Fisher) score function, $\nabla_\theta \log p_\theta(x)$, which differentiates with respect to the parameters $\theta$; Stein's score differentiates with respect to the data $x$.
Example 4.1: If $p(x)$ is a Gaussian with $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, then
$$s(x) = \nabla_x \log p(x) = -\frac{x-\mu}{\sigma^2}.$$
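Example 4.1 can be verified with a finite-difference check: the numerical derivative of $\log p(x)$ should match $-(x-\mu)/\sigma^2$. A small sketch (the particular $\mu$, $\sigma$ are arbitrary):

```python
import numpy as np

mu, sigma = 1.0, 2.0
def logp(x):
    # log of the Gaussian density from Example 4.1
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def stein_score(x):
    return -(x - mu) / sigma**2     # closed form from Example 4.1

x = np.linspace(-3.0, 5.0, 9)
h = 1e-5
fd = (logp(x + h) - logp(x - h)) / (2 * h)   # central-difference derivative
print(np.max(np.abs(fd - stein_score(x))))   # ≈ 0
```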
5. Score-Matching Techniques
Problem: We don't know $p(x)$, so we cannot compute the true score $\nabla_x \log p(x)$ that $s_\theta(x)$ is supposed to match.
Goal: Learn $\nabla_x \log p(x)$ without knowing the true distribution $p(x)$.
5.1 Explicit Score-Matching
Use $q(x)$ as an approximation of the true data distribution $p(x)$. We can use classical kernel density estimation to obtain $q(x)$:
$$q(x) = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{h}\, K\!\left(\frac{x - x_m}{h}\right),$$
where $h$ is the bandwidth hyperparameter of the kernel function $K(\cdot)$, and $x_m$ is the $m$-th sample in the training set.
Since $q(x)$ is an approximation of $p(x)$, we can learn $s_\theta(x)$ based on $q(x)$. This leads to the explicit score-matching loss:
$$J_{\mathrm{ESM}}(\theta) = \mathbb{E}_{q(x)}\big[\|s_\theta(x) - \nabla_x \log q(x)\|^2\big].$$
By substituting the kernel density estimate, we can show that the loss is:
$$J_{\mathrm{ESM}}(\theta) = \int \|s_\theta(x) - \nabla_x \log q(x)\|^2 \cdot \frac{1}{M} \sum_{m=1}^{M} \frac{1}{h}\, K\!\left(\frac{x - x_m}{h}\right) dx.$$
Problems with ESM: when the sample size $M$ is large, evaluating $q(x)$ is expensive; and when the sample size is limited and the data live in a high-dimensional space, kernel density estimation performs poorly.
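To make this concrete, here is a small 1-D sketch (Gaussian data and a Gaussian kernel assumed for illustration): estimate $q(x)$ by KDE, differentiate $\log q$ numerically to get the ESM regression target, and compare it with the true score. It also hints at the bias a finite bandwidth $h$ introduces:

```python
import numpy as np

rng = np.random.default_rng(3)
M, h = 20_000, 0.3                  # sample size and kernel bandwidth
xs = rng.standard_normal(M)         # training samples from p(x) = N(0, 1)

def q(x):
    # Gaussian-kernel density estimate q(x) = (1/M) sum_m (1/h) K((x - x_m)/h)
    u = (x - xs) / h
    return np.mean(np.exp(-0.5 * u**2) / (h * np.sqrt(2 * np.pi)))

def score_q(x, eps=1e-4):
    # ESM regression target: d/dx log q(x), via central differences
    return (np.log(q(x + eps)) - np.log(q(x - eps))) / (2 * eps)

for x0 in (-1.0, 0.0, 1.0):
    print(x0, score_q(x0), -x0)     # KDE score vs true score -x
```

The KDE score is close to the true score $-x$ but slightly shrunk toward zero, because smoothing with bandwidth $h$ effectively inflates the variance of $q$ relative to $p$.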
5.2 Denoising Score Matching
The key difference is that we replace the distribution $q(x)$ by a conditional distribution $q(x \mid x')$:
$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{q(x, x')}\Big[\tfrac{1}{2}\big\|s_\theta(x) - \nabla_x \log q(x \mid x')\big\|^2\Big].$$
Specifically, we set $q(x \mid x') = \mathcal{N}(x \mid x', \sigma^2 I)$, so we can write $x = x' + \sigma z$ with $z \sim \mathcal{N}(0, I)$. This leads to:
$$\nabla_x \log q(x \mid x') = -\frac{x - x'}{\sigma^2} = -\frac{z}{\sigma}.$$
As a result, the loss function of denoising score matching becomes:
$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{q(x')}\, \mathbb{E}_{z \sim \mathcal{N}(0, I)}\Big[\tfrac{1}{2}\big\|s_\theta(x' + \sigma z) + \tfrac{z}{\sigma}\big\|^2\Big].$$
Replacing the dummy variable $x'$ by $x$, and noting that sampling from $q(x)$ can be replaced by sampling from $p(x)$ given a training dataset, we conclude with the denoising score-matching loss:
$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{p(x)}\, \mathbb{E}_{z \sim \mathcal{N}(0, I)}\Big[\tfrac{1}{2}\big\|s_\theta(x + \sigma z) + \tfrac{z}{\sigma}\big\|^2\Big].$$
Remark: The above loss function is highly interpretable. The quantity $x + \sigma z$ is a clean image $x$ corrupted by noise $\sigma z$. The score network takes this noisy image and is trained to predict $-z/\sigma$, i.e., (up to scaling) the noise that was added, so learning the score amounts to learning to denoise.
The training procedure can be summarized as follows. Given a training dataset $\{x^{(l)}\}_{l=1}^{L}$, we train a network $\theta$ with the goal to
$$\theta^* = \operatorname*{argmin}_{\theta}\, \frac{1}{L} \sum_{l=1}^{L} \mathbb{E}_{z \sim \mathcal{N}(0, I)}\Big[\tfrac{1}{2}\big\|s_\theta\big(x^{(l)} + \sigma z\big) + \tfrac{z}{\sigma}\big\|^2\Big].$$
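A minimal end-to-end sketch of this training loop (all modeling choices here are illustrative assumptions: 1-D Gaussian data and a linear score model $s_\theta(x) = a x + b$ trained by SGD on the DSM loss). For Gaussian data the noisy marginal $x + \sigma z$ is again Gaussian, so the learned linear model should approach its true score, with slope $-1/(s_d^2 + \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_d, s_d, sigma = 1.0, 1.0, 0.5    # data distribution N(mu_d, s_d^2); noise level sigma
a, b = 0.0, 0.0                     # linear score model s_theta(x) = a*x + b
lr = 0.05

for _ in range(5_000):
    x = rng.normal(mu_d, s_d, 512)          # minibatch of clean samples
    z = rng.standard_normal(512)
    xt = x + sigma * z                      # noisy samples x + sigma*z
    resid = a * xt + b + z / sigma          # s_theta(x + sigma*z) + z/sigma
    # SGD step on the DSM loss 0.5 * resid^2
    a -= lr * np.mean(resid * xt)
    b -= lr * np.mean(resid)

# The noisy marginal is N(mu_d, s_d^2 + sigma^2), whose score is linear in x:
print(a, -1 / (s_d**2 + sigma**2))          # learned slope vs analytic target
print(b, mu_d / (s_d**2 + sigma**2))        # learned offset vs analytic target
```

Once trained, $s_\theta$ can be plugged into the Langevin update of Definition 1.1 in place of the unknown $\nabla_x \log p(x)$, closing the loop between score matching and sampling.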