# Score-Matching Langevin Dynamics (SMLD)

This is a learning note taken while following a series of videos on this topic.

Link to the paper: https://arxiv.org/abs/2403.18103

**Core components**: 1. Langevin Dynamics; 2. Stein score function; 3. score-matching loss

**Goal**: generate (sample) data $\{ x_1, \dots, x_T \}$ from a distribution $p(x)$. If $p(x)$ is known, plug $\nabla_x \log p(x)$ into Langevin dynamics to generate samples. If $p(x)$ is unknown, train a neural network $s_{\theta}(x) \approx \nabla_x \log p(x)$ to approximate the score.

# 1. Langevin Dynamics is gradient descent

**Definition 1.1**: The Langevin dynamics for sampling from a known distribution $p(x)$ is an iterative procedure for $t = 1, \dots , T$:

$$
x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau}\, z_t, \qquad z_t \sim \mathcal{N}(0, \bold{I}),
$$

where $\tau$ is the step size which users can control, and $x_0$ is white noise.
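As a sanity check of Definition 1.1, here is a minimal NumPy sketch (not from the paper) that runs Langevin dynamics on a 1-D Gaussian target whose score is available in closed form. The target parameters, step size, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0           # parameters of the known target p(x) = N(mu, sigma^2)

def score(x):
    # Stein score of the Gaussian target: d/dx log p(x) = -(x - mu) / sigma^2
    return -(x - mu) / sigma**2

tau = 0.01                     # step size (user-chosen)
x = rng.standard_normal(5000)  # x_0: white noise, one independent chain per entry

for _ in range(2000):
    # Langevin update: x <- x + tau * score(x) + sqrt(2 tau) * z
    z = rng.standard_normal(x.shape)
    x = x + tau * score(x) + np.sqrt(2 * tau) * z

print(x.mean(), x.std())       # should be close to mu = 2 and sigma = 1
```

Dropping the `np.sqrt(2 * tau) * z` term turns this loop into plain gradient ascent on $\log p(x)$: every chain then collapses to the mode $\mu$ instead of spreading out as samples from $p(x)$.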

**Remark**: Without the noise term $\sqrt{2\tau} z$, Langevin dynamics is **gradient descent**.

The goal of sampling is equivalent to solving the optimization:

$$
x^* = \underset{x}{\arg\max} \; \log p(x).
$$

This optimization can be solved by gradient descent. One gradient descent step is:

$$
x_{t+1} = x_t + \tau \nabla_x \log p(x_t).
$$

# 2. Langevin Dynamics is not only gradient descent, but also stochastic gradient descent

Why stochastic? Because we are more interested in *sampling* than in *optimization*: the injected noise $\sqrt{2\tau} z$ keeps the iterates from collapsing onto the mode of $p(x)$, so the trajectory explores the distribution instead of converging to a single maximizer.

# 3. Langevin Dynamics from a physics perspective

Relationship between force $\bold{F}$, mass $m$, and velocity $\bold{v}(t)$ (Newton's second law):

$$
\bold{F} = m \frac{d \bold{v}(t)}{dt}.
$$

Relationship between force $\bold{F}$ and the potential energy $U(x)$:

$$
\bold{F} = -\nabla_x U(x).
$$

The randomness of Langevin dynamics comes from Brownian motion. Suppose there is a bag of molecules moving around. According to the Brownian motion model, each molecule feels a drag force $-\lambda \bold{v}(t)$ proportional to its velocity and a random collision force $\bold{\eta}$, in addition to the potential force $\bold{F}$; in the overdamped regime the inertia term $m \frac{d\bold{v}(t)}{dt}$ is negligible, so the forces balance:

$$
\lambda \bold{v}(t) = \bold{F} + \bold{\eta}.
$$

According to the above three equations (substituting $\bold{F} = -\nabla_x U(x)$), we have:

$$
\lambda \bold{v}(t) = -\nabla_x U(x) + \bold{\eta}.
$$

Since $\frac{dx}{dt} = \bold{v}(t)$ and $\bold{\eta} \sim \mathcal{N} (0, \sigma^2 \bold{I})$, we have

$$
\lambda \frac{dx}{dt} = -\nabla_x U(x) + \bold{\eta}.
$$

If we let $\tau = \frac{dt}{\lambda}$ and discretize the above differential equation, we will obtain:

$$
x_{t+1} = x_t - \tau \nabla_x U(x_t) + \tau \bold{\eta}_t.
$$

A lazy choice to determine the energy potential is using the Boltzmann distribution with the form:

$$
p(x) = \frac{1}{Z} \exp\left( -U(x) \right), \qquad \text{so that} \qquad \nabla_x U(x) = -\nabla_x \log p(x).
$$

Therefore,

$$
x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \tau \bold{\eta}_t.
$$

If we choose $\sigma = \sqrt{\frac{2}{\tau}}$, then $\tau \bold{\eta}_t \sim \mathcal{N}(0, \tau^2 \sigma^2 \bold{I}) = \mathcal{N}(0, 2\tau \bold{I})$, i.e. $\tau \bold{\eta}_t = \sqrt{2\tau}\, \bold{z}_t$ with $\bold{z}_t \sim \mathcal{N}(0, \bold{I})$, and we will obtain:

$$
x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau}\, \bold{z}_t,
$$

which is the Langevin dynamics.

# 4. Stein’s Score Function

**Definition 4.1** (Stein's score function):

$$
\bold{s}(\bold{x}) \overset{\text{def}}{=} \nabla_{\bold{x}} \log p(\bold{x}).
$$

Distinguish it from the **ordinary score function** in statistics, which differentiates with respect to the model parameters rather than the data:

$$
\nabla_{\theta} \log p_{\theta}(\bold{x}).
$$

**Example 4.1**: If $p(x)$ is a Gaussian with $p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{- \frac{(x-\mu)^2}{2\sigma^2}}$, then

$$
\nabla_x \log p(x) = -\frac{x - \mu}{\sigma^2}.
$$
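Example 4.1 can be checked numerically. The following sketch (with arbitrary $\mu$, $\sigma$, and test points) compares a central finite difference of $\log p(x)$ against the closed-form score $-\frac{x - \mu}{\sigma^2}$:

```python
import numpy as np

mu, sigma = 1.0, 0.5   # arbitrary Gaussian parameters for the check

def log_p(x):
    # log density of N(mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

x = np.linspace(-1.0, 3.0, 9)   # arbitrary test points
eps = 1e-5
fd_score = (log_p(x + eps) - log_p(x - eps)) / (2 * eps)   # finite difference
analytic = -(x - mu) / sigma**2                            # Stein score, Example 4.1

print(np.max(np.abs(fd_score - analytic)))   # tiny: the two agree
```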

# 5. Score-Matching Techniques

**Problem**: We don’t know $p(\bold{x})$, so we cannot compute the true score $\nabla_{\bold{x}} \log p(\bold{x})$ that $\bold{s}_{\theta}(\bold{x})$ is supposed to match.

**Goal**: Approximate $\nabla_{\bold{x}} \log p(\bold{x})$ without knowing the real distribution $p(\bold{x})$.

## 5.1 Explicit Score-Matching

Use $q(\bold{x})$ as an approximation of the true data distribution $p(\bold{x})$. We can use classical kernel density estimation to obtain $q(\bold{x})$:

$$
q(\bold{x}) = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{h} K\left( \frac{\bold{x} - \bold{x}_m}{h} \right),
$$

where $h$ is the bandwidth hyperparameter of the kernel function $K(\cdot)$, and $\bold{x}_m$ is the $m$-th sample in the training set.
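To make the ESM target concrete, here is a small sketch (assuming a Gaussian kernel and a hand-picked bandwidth `h`, both illustrative choices) that builds the kernel density estimate $q(x)$ from 1-D samples and evaluates its score $\nabla_x \log q(x)$, the quantity $\bold{s}_{\theta}(\bold{x})$ would be regressed onto:

```python
import numpy as np

rng = np.random.default_rng(0)
xm = rng.normal(2.0, 1.0, size=200)   # M = 200 training samples x_m from N(2, 1)
h = 0.5                                # kernel bandwidth (hyperparameter)

def q(x):
    # kernel density estimate: q(x) = (1/M) sum_m (1/h) K((x - x_m) / h),
    # with a standard Gaussian kernel K
    u = (x[:, None] - xm[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi), axis=1) / h

def score_q(x, eps=1e-4):
    # finite-difference score of the KDE: d/dx log q(x)
    return (np.log(q(x + eps)) - np.log(q(x - eps))) / (2 * eps)

pts = np.array([1.0, 2.0, 3.0])
print(score_q(pts))   # positive left of the data mean, negative right of it
```

Note the cost: every evaluation of $q(\bold{x})$ touches all $M$ samples, which is exactly the scaling problem ESM runs into for large datasets.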

Since $q(\bold{x})$ is an approximation of $p(\bold{x})$, we can learn $\bold{s}_{\theta}(\bold{x})$ based on $q(\bold{x})$. This leads to the **explicit score matching** loss:

$$
J_{\text{ESM}}(\theta) \overset{\text{def}}{=} \mathbb{E}_{q(\bold{x})} \left\| \bold{s}_{\theta}(\bold{x}) - \nabla_{\bold{x}} \log q(\bold{x}) \right\|^2.
$$

By substituting the kernel density estimate, we can show that the loss is:

$$
J_{\text{ESM}}(\theta) = \int \frac{1}{M} \sum_{m=1}^{M} \frac{1}{h} K\left( \frac{\bold{x} - \bold{x}_m}{h} \right) \left\| \bold{s}_{\theta}(\bold{x}) - \nabla_{\bold{x}} \log q(\bold{x}) \right\|^2 d\bold{x}.
$$

**Problem with ESM**: When the sample size $M$ is large, computing $q(\bold{x})$ is expensive; and when the sample size is limited and the data live in a high-dimensional space, kernel density estimation performs poorly.

## 5.2 Denoising Score Matching

The key difference is that we replace the distribution $q(\bold{x})$ by a conditional distribution $q(\bold{x} | \bold{x}')$:

$$
J_{\text{DSM}}(\theta) \overset{\text{def}}{=} \mathbb{E}_{q(\bold{x}, \bold{x}')} \left[ \frac{1}{2} \left\| \bold{s}_{\theta}(\bold{x}) - \nabla_{\bold{x}} \log q(\bold{x} | \bold{x}') \right\|^2 \right].
$$

Specifically, we set $q(\bold{x} | \bold{x}') = \mathcal{N}(\bold{x} \mid \bold{x}', \sigma^2)$, so that we can write $\bold{x} = \bold{x}' + \sigma \bold{z}$ with $\bold{z} \sim \mathcal{N}(0, \bold{I})$. This leads to:

$$
\nabla_{\bold{x}} \log q(\bold{x} | \bold{x}') = -\frac{\bold{x} - \bold{x}'}{\sigma^2} = -\frac{\bold{z}}{\sigma}.
$$

As a result, the loss function of the denoising score matching becomes:

$$
J_{\text{DSM}}(\theta) = \mathbb{E}_{q(\bold{x}')} \mathbb{E}_{\bold{z}} \left[ \frac{1}{2} \left\| \bold{s}_{\theta}(\bold{x}' + \sigma \bold{z}) + \frac{\bold{z}}{\sigma} \right\|^2 \right].
$$

Replace the dummy variable $\bold{x}'$ by $\bold{x}$, and note that sampling from $q(\bold{x})$ can be replaced by sampling from $p(\bold{x})$ given a training dataset. Then we can conclude the **denoising score matching** loss function:

$$
J_{\text{DSM}}(\theta) = \mathbb{E}_{p(\bold{x})} \mathbb{E}_{\bold{z}} \left[ \frac{1}{2} \left\| \bold{s}_{\theta}(\bold{x} + \sigma \bold{z}) + \frac{\bold{z}}{\sigma} \right\|^2 \right].
$$

**Remark**: The above loss function is highly interpretable. The quantity $\bold{x} + \sigma \bold{z}$ is effectively adding noise $\sigma \bold{z}$ to a clean image $\bold{x}$. The score function is supposed to take this noisy image and predict the scaled, negated noise $-\frac{\bold{z}}{\sigma}$.

The training step can be described simply as follows. Given a training dataset $\{ \bold{x}^{(l)} \}^L_{l=1}$, we train a network $\theta$ with the goal to minimize the empirical denoising score matching loss:

$$
\theta^* = \underset{\theta}{\arg\min} \; \frac{1}{L} \sum_{l=1}^{L} \mathbb{E}_{\bold{z}} \left[ \frac{1}{2} \left\| \bold{s}_{\theta}(\bold{x}^{(l)} + \sigma \bold{z}) + \frac{\bold{z}}{\sigma} \right\|^2 \right].
$$
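As an illustration of this training objective, the following sketch (entirely illustrative: 1-D Gaussian data, a linear score model, and hand-picked $\sigma$, learning rate, and batch size) minimizes the empirical DSM loss by minibatch SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(2.0, 1.0, size=1000)   # clean 1-D data from p(x) = N(2, 1)
sigma = 0.5                                  # noise level of q(x | x') = N(x', sigma^2)

# Linear score model s_theta(x) = a*x + b, with theta = (a, b). A linear model
# suffices here because the score of a Gaussian is linear in x.
a, b = 0.0, 0.0
lr = 0.02

for step in range(6000):
    xb = rng.choice(x_train, size=128)       # minibatch of clean samples x
    z = rng.standard_normal(128)             # noise z ~ N(0, I)
    xn = xb + sigma * z                      # noisy input x + sigma*z
    resid = (a * xn + b) + z / sigma         # s_theta(x + sigma*z) + z/sigma
    # gradient of the DSM loss 0.5 * mean(resid^2) w.r.t. (a, b)
    a -= lr * np.mean(resid * xn)
    b -= lr * np.mean(resid)

print(a, b)
```

Because the model only ever sees noisy inputs $\bold{x} + \sigma \bold{z}$, the learned score matches the Gaussian-smoothed distribution $p * \mathcal{N}(0, \sigma^2)$ (here $\mathcal{N}(2, 1 + \sigma^2)$, so $a \approx -1/1.25$ and $b \approx 2/1.25$), not $p$ itself; in SMLD this is handled by annealing $\sigma$ toward small values.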