Flow Matching For Generative Modeling

This is a learning note from this series of videos.

Link to the paper: https://arxiv.org/abs/2210.02747

Supplementary papers:

Step-by-Step Diffusion: An Elementary Tutorial

Glow: Generative Flow with Invertible 1x1 Convolutions

1. What is a Flow? Understanding vector fields and ODEs.

Generative Modeling: With known samples $x_1 \sim q_1$ (an unknown distribution), we want to estimate this unknown distribution.

Method: $\underbrace{q_0}_{\text{a known distribution}} \xrightarrow{\phi} \underbrace{q_1}_{\text{unknown distribution to be estimated}}$. This mapping is denoted as $\phi$.

How to solve for $\phi$: 1. Normalizing Flow; 2. Flow Matching (ODE)

Keypoints of Flow Matching:

  1. Let $q_0$ and $q_1$ be the distributions at the initial and final times of the ODE.
  1. Use a neural network to fit the vector field (the drift term) of the ODE.
  1. Solve the ODE (a minimal integration sketch follows this list).
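As an illustration of keypoint 3, here is a minimal sketch (my own, not from the paper) of integrating a flow ODE with fixed-step Euler updates. `run_flow` and `velocity_field` are hypothetical names; the velocity field can later be replaced by a trained neural network.

```python
import numpy as np

def run_flow(velocity_field, x_init, t_init=1.0, t_final=0.0, n_steps=100):
    """Integrate dx/dt = velocity_field(t, x) from t_init to t_final with Euler steps."""
    x = np.asarray(x_init, dtype=float)
    dt = (t_final - t_init) / n_steps
    t = t_init
    for _ in range(n_steps):
        x = x + dt * velocity_field(t, x)
        t = t + dt
    return x

# Toy example: a constant velocity field shifts every point by the same offset.
x0 = run_flow(lambda t, x: np.array([1.0, -2.0]), x_init=[0.0, 0.0])
print(x0)  # ~[-1.0, 2.0] when integrating from t=1 down to t=0
```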

Preliminaries

ODE → Flow → Normalizing Flow → Continuous Normalizing Flow

Flow

Definition 1. (Flow): A flow is a collection of time-indexed vector fields $v = \{ v_t \}_{t \in [0, 1]}$.

Any flow defines a trajectory taking initial points $x_1$ to final points $x_0$ by transporting the initial point along the velocity fields $\{ v_t \}$. It is equivalent to a transfer between two distributions.

Formally, for velocity field $v$ and initial point $x_1$, consider the ODE

\begin{equation} \frac{d x_t}{dt} = -v_t(x_t) \end{equation}

with initial condition $x_1$ at time $t=1$. We write

$$x_t \coloneqq \text{RunFlow}(v, x_1, t)$$

to denote the solution to the flow ODE at time $t$, terminating at final point $x_0$. That is, RunFlow is the result of transporting point $x_1$ along the flow $v$ up to time $t$.

Flows also define transports between entire distributions by pushing forward points from the source distribution along their trajectories. If $p_1$ is a distribution on initial points, then applying the flow $v$ yields the distribution on final points:

$$p_0 = \{ \text{RunFlow}(v, x_1, t=0) \}_{x_1 \sim p_1}$$

This process is denoted as $p_1 \xrightarrow{v} p_0$, meaning the flow $v$ transports the initial distribution $p_1$ to the final distribution $p_0$.

THE ULTIMATE GOAL OF FLOW MATCHING is to learn a velocity field $v^*$ which transports $p_1 \xrightarrow{v^*} p_0$, where $p_0$ is the target distribution and $p_1$ is some easy-to-sample base distribution (such as a Gaussian).

Continuous Normalizing Flows

Let $\mathbb{R}^d$ denote the data space with data points $x = (x^1, \dots, x^d) \in \mathbb{R}^d$.

The probability density path $p: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}_{>0}$ is a time-dependent probability density function, i.e., $\int p_t(x) dx = 1$.

$v: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a time-dependent vector field.

A vector field $v_t$ can be used to construct a time-dependent diffeomorphic map, called a flow, $\phi: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$, defined via the ordinary differential equation (ODE):

\begin{align} \frac{d}{dt} \phi_t(x) &= v_t (\phi_t(x)) \\ \phi_0(x) &= x \end{align}

Here $\phi_t(x)$ is a solution to the ODE, and we call it a flow. We can model the vector field $v_t$ with a neural network $v_t(x ; \theta)$, where $\theta \in \mathbb{R}^p$ are its learnable parameters.
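For concreteness, a time-conditioned vector field $v_t(x;\theta)$ can be a small network that takes $(x, t)$ and outputs a velocity with the same dimension as $x$. Below is a minimal sketch assuming PyTorch; `VectorFieldNet` is my own illustrative choice, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class VectorFieldNet(nn.Module):
    """Minimal time-conditioned vector field v_t(x; theta): input (x, t), output dx/dt."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # x: (batch, dim), t: (batch,) with values in [0, 1]; append t as an extra input feature.
        return self.net(torch.cat([x, t[:, None]], dim=-1))

v_theta = VectorFieldNet(dim=2)
x = torch.randn(16, 2)
t = torch.rand(16)
print(v_theta(x, t).shape)  # torch.Size([16, 2])
```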

2. The Continuity Equation and Fokker-Planck Equation

Continuity Equation

How to test if a vector field $v_t$ generates a probability path $p_t$?

📖

Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.

The following PDE provides a necessary and sufficient condition to ensure that a vector field $v_t$ generates a probability path $p_t$:

\begin{equation} \frac{d}{dt} p_t(x) + \text{div} (p_t(x)v_t(x)) = 0 \end{equation}

Here the divergence operator $\text{div}$ is defined with respect to the spatial variable $x = (x^1, \dots, x^d)$, i.e., $\text{div} = \sum_{i=1}^d \frac{\partial}{\partial x^i}$.
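A quick sanity check of Equation (4): for a pure translation flow with constant velocity $a \in \mathbb{R}^d$, the density is simply shifted along $a$, and the continuity equation holds exactly:

$$v_t(x) = a, \quad p_t(x) = p_0(x - ta) \;\Rightarrow\; \frac{d}{dt} p_t(x) = -a \cdot \nabla p_0(x - ta) = -\text{div}(p_t(x)\, a)$$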

Conditional VFs for Fokker-Planck probability paths

Consider a Stochastic Differential Equation (SDE) of the standard form:

\begin{equation} dy = f_t dt + g_t dw \end{equation}

with time parameter $t$, drift $f_t$, and diffusion coefficient $g_t$, where $dw$ denotes the increment of the Wiener process.

The solution to the SDE is $y_t$, a stochastic process (a continuous time-dependent random variable). Its probability density $p_t(y_t)$ is characterized by the Fokker-Planck equation:

\begin{equation} \frac{dp_t}{dt} = -\text{div}(f_tp_t) + \frac{g_t^2}{2} \Delta p_t \end{equation}

where $\Delta$ represents the Laplace operator (in $y$), namely $\text{div}\,\nabla$, where $\nabla$ is the gradient operator. We can rewrite the above equation in the form of the continuity equation:

\begin{equation} \begin{aligned} \frac{dp_t}{dt} &= -\text{div}(f_t p_t) + \text{div}\left(\frac{g_t^2}{2}\nabla p_t\right) \\ &= -\text{div} \left(f_t p_t - \frac{g_t^2}{2}\frac{\nabla p_t}{p_t} p_t\right) \\ &= -\text{div}\left(\left(f_t - \frac{g_t^2}{2} \nabla \log p_t\right)p_t\right) \\ &= -\text{div}(w_t p_t) \end{aligned} \end{equation}

where the vector field

$$w_t = f_t - \frac{g_t^2}{2} \nabla \log p_t$$

satisfies the continuity equation with the probability path $p_t$, and therefore generates $p_t$.
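For example, taking the standard VP-type drift and diffusion $f_t(y) = -\frac{1}{2}\beta(t)\, y$ and $g_t = \sqrt{\beta(t)}$ (a common choice in diffusion models, used here only as an illustration), the formula above gives the deterministic vector field

$$w_t(y) = -\frac{1}{2}\beta(t)\left[ y + \nabla \log p_t(y) \right],$$

which generates the same marginals $p_t$ as the SDE; this is the usual probability flow ODE.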

3. Continuous Normalizing Flow (CNF)

A CNF is used to reshape a simple prior density $p_0$ (e.g., pure noise) into a more complicated one, $p_1$, via the push-forward equation:

\begin{equation} p_t = [\phi_t]_* p_0 \end{equation}

The push-forward (or change of variable) operator * is defined by:

\begin{equation} [\phi_t]_* p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right] \end{equation}
📖

Change of variables in the probability density function

Suppose $\bold{x}$ is an $n$-dimensional random variable with joint density $f$. If $\bold{y} = G(\bold{x})$, where $G$ is a bijective, differentiable function, then $\bold{y}$ has density $p_{\bold{Y}}$:

$$p_{\bold{Y}}(\bold{y}) = f\left( G^{-1}(\bold{y}) \right) \left| \det \left[ \frac{d G^{-1}(\bold{y})}{d \bold{y}} \right] \right|$$
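A one-dimensional example: if $x \sim \mathcal{N}(0, 1)$ and $y = G(x) = \sigma x + \mu$ with $\sigma > 0$, then $G^{-1}(y) = (y-\mu)/\sigma$ and $\left|\det\left[\frac{dG^{-1}}{dy}\right]\right| = 1/\sigma$, so

$$p_Y(y) = \frac{1}{\sigma}\, \mathcal{N}\!\left(\frac{y-\mu}{\sigma} \,\Big|\, 0, 1\right) = \mathcal{N}(y \mid \mu, \sigma^2),$$

which is exactly the computation reused later when pushing the standard Gaussian through the flow $\psi_t$.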

Relationships between Flow, Vector Field, and Probability Density

  1. Equations (2) and (3) describe the relationship between the flow $\phi_t(x)$ and the vector field $v_t(x)$.
  1. Equations (8) and (9) describe how the flow $\phi_t(x)$ changes the density from $p_0$ to $p_t$.
  1. Equation (4) (the continuity equation) gives a necessary and sufficient condition to test whether a vector field $v_t$ generates a probability path $p_t$.

4. Flow Matching Objective

Notations:

  1. $x_1$: a random variable distributed according to an unknown distribution $q(x_1)$. We have access to samples from $q(x_1)$, but not to the density function itself.
  1. $p_t$: a probability path such that $p_0 = p$ is a simple distribution, e.g., the standard normal $p(x) = \mathcal{N}(x | 0, I)$.
  1. $p_1$: approximately equal in distribution to $q$.

The Flow Matching objective is designed to match this target probability path, which will allow us to flow from $p_0$ to $p_1$.

Given a target probability density path $p_t(x)$ and a corresponding vector field $u_t(x)$ which generates $p_t(x)$, the Flow Matching (FM) objective is defined as:

\begin{equation} \mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t, p_t(x)}||v_t(x) - u_t(x)||^2 \end{equation}

where $\theta$ denotes the learnable parameters of the CNF vector field $v_t$ (here, a neural network), $t \sim \mathcal{U}[0,1]$, and $x \sim p_t(x)$.

Problem: the above objective assumes that the density $p_t(x)$ and the ground-truth vector field $u_t(x)$ are known, but we do not know them.

Solution: use conditional probability densities for the unknown quantities, and verify that this is valid.

5. From $p_t(x|x_1)$ and $u_t(x|x_1)$ to $p_t(x)$ and $u_t(x)$

Construct a target probability path via a mixture of simpler probability paths:

Given a sample $x_1$, let $p_t(x | x_1)$ denote a conditional probability path such that $p_0(x|x_1) = p(x)$ at time $t=0$, and such that $p_1(x|x_1)$ at $t=1$ is a distribution concentrated around $x=x_1$, e.g., $p_1(x|x_1) = \mathcal{N}(x|x_1, \sigma^2 I)$ where $\sigma$ is sufficiently small.

The marginal probability path can be given by:

\begin{equation} p_t(x) = \int p_t(x | x_1) q(x_1) dx_1 \end{equation}

In particular, at time $t=1$, the marginal probability $p_1$ is a mixture distribution that closely approximates the data distribution $q$:

\begin{equation} p_1(x) = \int p_1(x|x_1) q(x_1) dx_1 \approx q(x) \end{equation}

Why $p_1(x)$ approximates $q(x)$.

The Dirac delta function ($\delta$ distribution, unit impulse) is defined as:

$$\delta(x) = \begin{cases} 0, & x \neq 0 \\ \infty, & x=0 \end{cases}$$

such that

$$\int_{-\infty}^{\infty} \delta(x) dx = 1$$

The Dirac delta function has the sifting property:

\begin{align*} \int_{-\infty}^{\infty}\delta(x-a)f(x)dx &= \int_{-\infty}^{\infty} \delta(x-a)f(a) dx \\ &= f(a)\int_{-\infty}^{\infty}\delta(x-a) dx = f(a) \end{align*}

The conditional distribution $p_1(x|x_1) = \mathcal{N}(x|x_1, \sigma^2 I)$ with sufficiently small $\sigma$ is similar to the Dirac delta function, and therefore approximately has the sifting property:

\begin{align*} p_1(x) &= \int p_1(x | x_1) q(x_1) dx_1 \\ &= \int f(x - x_1) q(x_1) dx_1 && \text{(by Gaussian density)} \\ &\approx q(x) && \text{(by the 'sifting property')} \end{align*}

Also, we can define the marginal vector field by marginalizing over the conditional vector fields:

\begin{equation} u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1) q(x_1)}{p_t(x)} dx_1 \end{equation}

where $u_t(\cdot|x_1):\mathbb{R}^d \rightarrow \mathbb{R}^d$ is a conditional vector field that generates $p_t(\cdot|x_1)$.

Next, we want to prove that:

The vector field $u_t(x)$ in Equation (13) generates the probability path $p_t(x)$ in Equation (11).

6. The form of $u_t(x)$

Theorem 1. Given conditional vector fields $u_t(x|x_1)$ that generate the conditional probability paths $p_t(x|x_1)$, for any distribution $q(x_1)$, the marginal vector field $u_t$ in Equation (13) generates the marginal probability path $p_t$ in Equation (11), i.e., $u_t$ and $p_t$ satisfy the continuity equation (Equation (4)).

Proof:

To prove the theorem, we need to check that $p_t$ and $u_t$ satisfy the continuity equation.

Since $u_t(x|x_1)$ generates $p_t(x|x_1)$, by the continuity equation we have:

$$\frac{d}{dt} p_t(x|x_1) = -\text{div} (p_t(x|x_1) u_t(x|x_1))$$

Then we inspect $\frac{d}{dt} p_t(x)$:

\begin{align*} \frac{d}{dt} p_t(x) &= \int \left(\frac{d}{dt} p_t(x | x_1)\right) q(x_1) dx_1 \\ &= -\int \text{div} \left(p_t(x|x_1) u_t(x|x_1) \right) q(x_1) dx_1 \\ &= - \text{div} \left(\int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1 \right) \\ &= - \text{div} \left( p_t(x) \frac{\int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1}{p_t(x)} \right) \\ &= -\text{div} (u_t(x) p_t(x)) \end{align*}

This shows that $u_t(x)$ and $p_t(x)$ satisfy the continuity equation.

End of proof.

7. Conditional Flow Matching (Objectives are consistent)

There is still a problem: the probability path $p_t(x)$ in Equation (11) and the vector field $u_t(x)$ in Equation (13) remain intractable because of the integrals over $x_1$.

To solve this problem, we introduce a new objective $\mathcal{L}_{\text{CFM}}(\theta)$, defined as:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x) - u_t(x|x_1)||^2 \right]$$

to avoid $u_t(x)$. The following theorem states that the new objective $\mathcal{L}_{\text{CFM}}(\theta)$ equals $\mathcal{L}_{\text{FM}}(\theta)$ up to a constant difference w.r.t. the model parameters $\theta$.

The new objective is tractable as long as we can sample from $p_t(x|x_1)$ and compute $u_t(x|x_1)$. Consequently, this allows us to train a CNF to generate the marginal probability path $p_t$, which approximates the unknown data distribution $q$ at $t=1$. We do not need access to the marginal probability path or the marginal vector field; we simply need to design suitable conditional probability paths and vector fields.
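A schematic training step under these assumptions might look like the sketch below (my own illustration, not code from the paper); `sample_conditional_path` and `conditional_velocity` are hypothetical placeholders for the designed $p_t(x|x_1)$ and $u_t(x|x_1)$, and `v_theta(x, t)` is the learned vector field (assuming PyTorch).

```python
import torch

def cfm_training_step(v_theta, optimizer, x1_batch,
                      sample_conditional_path, conditional_velocity):
    """One gradient step on the CFM objective."""
    batch = x1_batch.shape[0]
    t = torch.rand(batch)                            # t ~ U[0, 1]
    x = sample_conditional_path(t, x1_batch)         # x ~ p_t(x | x1)
    target = conditional_velocity(t, x, x1_batch)    # u_t(x | x1)
    loss = ((v_theta(x, t) - target) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```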

Theorem 2. Assuming that $p_t(x) > 0$ for all $x \in \mathbb{R}^d$ and $t \in [0,1]$, then, up to a constant independent of $\theta$, $\mathcal{L}_{\text{CFM}}$ and $\mathcal{L}_{\text{FM}}$ are equal. Hence, $\nabla_{\theta} \mathcal{L}_{\text{CFM}} = \nabla_{\theta} \mathcal{L}_{\text{FM}}$.

Proof:

Some assumptions to ensure the existence of the integrals and the change of integration order (by Fubini's Theorem):

  1. $q(x)$ and $p_t(x|x_1)$ decrease to zero at a sufficient speed as $||x|| \rightarrow \infty$.
  1. $u_t$, $v_t$, and $\nabla_{\theta} v_t$ are bounded.

First, we rewrite the two objectives:

\begin{align*} \mathcal{L}_{\text{FM}}(\theta) &= \mathbb{E}_{t, p_t(x)} \left[ ||v_t(x) - u_t(x)||^2 \right] \\ &= \mathbb{E}_{t, p_t(x)} \left[ ||v_t(x)||^2 - 2\langle v_t(x), u_t(x)\rangle + ||u_t(x)||^2\right] \end{align*}

\begin{align*} \mathcal{L}_{\text{CFM}}(\theta) &= \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x) - u_t(x|x_1)||^2 \right] \\ &= \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x)||^2 - 2\langle v_t(x), u_t(x|x_1)\rangle + ||u_t(x|x_1)||^2 \right] \end{align*}

Note that $||u_t(x)||^2$ and $||u_t(x|x_1)||^2$ are both constant w.r.t. the parameters $\theta$. Now we only need to prove that the expectations of the first two terms are equal.

\begin{align*} \mathbb{E}_{p_t(x)} ||v_t(x)||^2 &= \int ||v_t(x)||^2 p_t(x) dx \\ &= \int \int ||v_t(x)||^2 p_t(x|x_1) q(x_1) dx_1 dx \\ &= \mathbb{E}_{q(x_1), p_t(x|x_1)} [||v_t(x)||^2] \end{align*}

Therefore, the first terms of the two objectives are equal.

\begin{align*} \mathbb{E}_{p_t(x)} \langle v_t(x), u_t(x)\rangle &= \int \langle v_t(x), u_t(x)\rangle p_t(x) dx \\ &= \int \left\langle v_t(x), \frac{1}{p_t(x)} \int u_t(x|x_1) p_t(x|x_1) q(x_1) dx_1\right\rangle p_t(x) dx \\ &= \int \left\langle v_t(x), \int u_t(x|x_1)p_t(x|x_1) q(x_1) dx_1\right\rangle dx \\ &= \int \int \langle v_t(x), u_t(x|x_1) p_t(x|x_1) q(x_1)\rangle dx_1 dx \\ &= \int \int \langle v_t(x), u_t(x|x_1) \rangle p_t(x|x_1) q(x_1) dx_1 dx \\ &= \mathbb{E}_{q(x_1), p_t(x|x_1)} [\langle v_t(x), u_t(x|x_1)\rangle] \end{align*}

Brief Explanation

Equalities 3 and 5 use the linearity of the inner product:

$$k \langle a,b\rangle = \langle ka, b\rangle; \qquad k\in \mathbb{R},\ a,b \in \mathbb{R}^d$$

where $p_t(x) \in \mathbb{R}$ and $p_t(x|x_1) q(x_1) \in \mathbb{R}$.

Equality 4 changes the order of the inner product and the integral:

\begin{align*} \left\langle a(x), \int b(x_1) dx_1\right\rangle &= \sum_{i=1}^d a_i(x) \int b_i(x_1) dx_1 \\ &= \sum_{i=1}^d \int a_i(x) b_i(x_1) dx_1 \\ &= \int \left[ \sum_{i=1}^d a_i(x) b_i(x_1) \right] dx_1 \\ &= \int \langle a(x), b(x_1)\rangle dx_1 \end{align*}

Therefore, the second terms of the two objectives are also equal. These two results lead to:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathcal{L}_{\text{CFM}}(\theta) + C \Rightarrow \nabla_{\theta}\mathcal{L}_{\text{FM}}(\theta) = \nabla_{\theta}\mathcal{L}_{\text{CFM}}(\theta)$$

End of Proof.

8. Derive Conditional Vector Field from Conditional Probability Path (Gaussian)

The Conditional Flow Matching objective works with any choice of conditional probability path and conditional vector fields.

Here we consider a family of Gaussian conditional probability paths:

\begin{equation} p_t(x | x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I) \end{equation}

where $\mu: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ is the time-dependent mean of the Gaussian distribution, and $\sigma: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}_{>0}$ is a time-dependent scalar standard deviation (std).

When $t=0$, $\mu_0(x_1)=0$ and $\sigma_0(x_1)=1$, so that all conditional probability paths converge to the same standard Gaussian noise distribution.

When $t=1$, $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\text{min}}$, so that $p_1(x|x_1)$ is a concentrated Gaussian distribution centered at $x_1$.

We choose a simple form of flow (conditioned on $x_1$):

\begin{equation} \psi_t(x) = \sigma_t(x_1) x + \mu_t(x_1) \end{equation}

If we inspect Equation (9), we can verify that $\psi_t$ pushes the noise distribution $p_0(x|x_1) = p(x)$ to $p_t(x|x_1)$.

Prove that $[\psi_t]_* p_0(\bold{x}) = p_t(\bold{x})$.

Proof:

Since $\psi_t(\bold{x}) = \sigma_t(\bold{x_1}) \bold{x} + \mu_t(\bold{x_1})$, we can derive its inverse function:

$$\psi_t^{-1}(\bold{x}) = \frac{\bold{x} - \mu_t(\bold{x_1})}{\sigma_t(\bold{x_1})}$$

When $t=0$, the probability density function is:

\begin{align*} p_0(\bold{x}) &= (2\pi)^{-d/2} \det(I)^{-1/2}\exp \left( -\frac{1}{2} (\bold{x} - \bold{0})^T I^{-1} (\bold{x} - \bold{0}) \right) \\ &= (2\pi)^{-d/2} \exp \left( -\frac{1}{2} \bold{x}^T \bold{x} \right) \end{align*}

According to Equation (9),

\begin{align*} [\psi_t]_* p_0(\bold{x}) &= p_0(\psi_t^{-1}(\bold{x})) \det \left[ \frac{\partial \psi_t^{-1}}{\partial \bold{x}}(\bold{x}) \right] \\ &= (2\pi)^{-d/2} \exp \left( -\frac{1}{2} \frac{1}{\sigma_t(\bold{x_1})^2} (\bold{x} - \mu_t(\bold{x_1}))^T (\bold{x} - \mu_t(\bold{x_1})) \right) \det\left[\frac{1}{\sigma_t(\bold{x_1})} I\right] \\ &= (2\pi)^{-d/2} [\sigma_t(\bold{x_1})^{2d}]^{-1/2} \exp \left( -\frac{1}{2} (\bold{x} - \mu_t(\bold{x_1}))^T \frac{1}{\sigma_t(\bold{x_1})^2} (\bold{x} - \mu_t(\bold{x_1})) \right) \\ &= (2\pi)^{-d/2} \det(\sigma_t(\bold{x_1})^2 I)^{-1/2}\exp \left( -\frac{1}{2} (\bold{x} - \mu_t(\bold{x_1}))^T (\sigma_t(\bold{x_1})^2 I)^{-1} (\bold{x} - \mu_t(\bold{x_1})) \right) \\ &= \mathcal{N}(\bold{x} \,|\, \mu_t(\bold{x_1}), \sigma_t(\bold{x_1})^2 I) = p_t(\bold{x}) \end{align*}

End of Proof.

This flow $\psi_t$ provides a vector field that generates the conditional probability path:

\begin{equation} \frac{d}{dt} \psi_t(x) = u_t(\psi_t(x) | x_1) \end{equation}

Reparameterizing $p_t(x|x_1)$ in terms of just $x_0$ and plugging Equation (16) into the CFM loss, we get:

\begin{equation} \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - \frac{d}{dt} \psi_t(x_0) \right\|^2 \end{equation}

Derivation of Equation (17)

The original CFM loss is:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x) - u_t(x|x_1)||^2 \right]$$

According to Equation (14), sampling from $p_t(x|x_1)$ is equivalent to sampling $x_0$ from a standard Gaussian and reparameterizing it as

$$x = \sigma_t(x_1) x_0 + \mu_t(x_1) = \psi_t(x_0) \quad (\text{conditioned on } x_1)$$

Therefore,

\begin{align*} \mathcal{L}_{\text{CFM}}(\theta) &= \mathbb{E}_{t, q(x_1), p(x_0)} ||v_t(\psi_t(x_0)) - u_t(\psi_t(x_0)|x_1)||^2 \\ &= \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - \frac{d}{dt} \psi_t(x_0) \right\|^2 \end{align*}

Theorem 3. Let $p_t(x|x_1)$ be a Gaussian probability path as in Equation (14), and $\psi_t$ its corresponding flow map as in Equation (15). Then, the unique vector field that defines $\psi_t$ has the form:

$$u_t(x|x_1) = \frac{\sigma_t' (x_1)}{\sigma_t(x_1)} (x - \mu_t(x_1)) + \mu_t'(x_1)$$

Consequently, $u_t(x|x_1)$ generates the Gaussian path $p_t(x|x_1)$.

Prove Theorem 3.

Proof:

For notational simplicity, we denote $w_t(x) = u_t(x|x_1)$. Consider Equation (2) and plug Equation (15) into it:

\begin{align*} \frac{d}{dt} \psi_t(x) &= w_t(\psi_t(x)) \\ &= \frac{d}{dt} [\sigma_t(x_1) x + \mu_t(x_1)] \\ &= \sigma_t'(x_1) x + \mu_t' (x_1) \end{align*}

Set $x = \psi_t^{-1} (y) = \frac{1}{\sigma_t(x_1)}(y - \mu_t(x_1))$; then we have:

\begin{align*} w_t(y) = w_t(\psi_t(\psi_t^{-1}(y))) &= \sigma_t'(x_1)\, \psi_t^{-1}(y) + \mu_t'(x_1) \\ &= \frac{\sigma_t' (x_1)}{\sigma_t(x_1)} (y - \mu_t(x_1)) + \mu_t'(x_1) \end{align*}

Therefore,

$$u_t(x|x_1) = w_t(x) = \frac{\sigma_t' (x_1)}{\sigma_t(x_1)} (x - \mu_t(x_1)) + \mu_t'(x_1)$$

End of Proof.

9. Instances of Gaussian Conditional Probability Paths

The above formulation is general and works for arbitrary functions $\mu_t(x_1)$ and $\sigma_t(x_1)$; they can be set to any differentiable functions satisfying the desired boundary conditions.
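As a quick numerical sanity check (my own, with an arbitrary smooth choice of $\mu_t$ and $\sigma_t$ satisfying the boundary conditions), the finite-difference derivative of $\psi_t(x_0)$ matches the Theorem 3 vector field evaluated at $\psi_t(x_0)$:

```python
import numpy as np

sigma_min = 0.01
x1, x0, t = 1.7, -0.3, 0.6   # arbitrary scalars for the check

# An arbitrary smooth choice: mu(0)=0, mu(1)=x1, sigma(0)=1, sigma(1)=sigma_min.
mu = lambda t: np.sin(0.5 * np.pi * t) * x1
sigma = lambda t: 1.0 - (1.0 - sigma_min) * t**2

psi = lambda t: sigma(t) * x0 + mu(t)                 # Equation (15)

eps = 1e-5
dpsi_dt = (psi(t + eps) - psi(t - eps)) / (2 * eps)   # d/dt psi_t(x0) by finite differences

dmu_dt = (mu(t + eps) - mu(t - eps)) / (2 * eps)
dsigma_dt = (sigma(t + eps) - sigma(t - eps)) / (2 * eps)
u = dsigma_dt / sigma(t) * (psi(t) - mu(t)) + dmu_dt  # Theorem 3 evaluated at x = psi_t(x0)

print(dpsi_dt, u)  # the two values agree up to finite-difference error
```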

📖

Recap on the solution to the first order linear ODE

A first order linear ODE takes the form:

$$\frac{dy}{dt} + p(t)y = g(t)$$

The solution is given by:

$$y(t) = \frac{\int \mu(t) g(t) dt + c}{\mu(t)}$$

where $\mu(t) = e^{\int p(t) dt}$ is the integrating factor and the constant $c$ is determined by the boundary condition.
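A tiny worked example of this recipe: for $y' + 2y = 4$ we have $\mu(t) = e^{2t}$ and $\int \mu(t) g(t) dt = 2e^{2t} + c_2$, hence

$$y(t) = \frac{2e^{2t} + c}{e^{2t}} = 2 + ce^{-2t},$$

with $c$ fixed by the boundary condition. The VE and VP derivations below follow exactly this pattern.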

Example 1. Diffusion Conditional VFs.

Diffusion models start from data points and gradually add noise until the result approximates pure noise. This process can be formulated as a stochastic process, resulting in Gaussian conditional probability paths $p_t(x|x_1)$ with specific choices of the mean $\mu_t(x_1)$ and std $\sigma_t(x_1)$.

The Variance Exploding (VE) path has the form:

\begin{equation} p_t(x|x_1) = \mathcal{N}(x | x_1, \sigma_{1-t}^2 I) \end{equation}

where $\sigma_t$ is an increasing function, $\sigma_0 = 0$ and $\sigma_1 \gg 1$. In this case, $\mu_t(x_1) = x_1$ and $\sigma_t(x_1) = \sigma_{1-t}$. By Theorem 3 we have:

\begin{equation} u_t(x|x_1) = -\frac{\sigma_{1-t}'}{\sigma_{1-t}} (x - x_1) \end{equation}

Derive the flow for the VE path

According to Equation (2),

\begin{align*} \frac{d \phi_t(x)}{dt} &= -\frac{\sigma_{1-t}'}{\sigma_{1-t}}(\phi_t(x) - x_1) \\ \frac{d \phi_t(x)}{dt} + \frac{\sigma_{1-t}'}{\sigma_{1-t}} \phi_t(x) &= \frac{\sigma_{1-t}'}{\sigma_{1-t}} x_1 \end{align*}

This is a first-order linear ODE with $p(t) = \frac{\sigma_{1-t}'}{\sigma_{1-t}}$ and $g(t) = \frac{\sigma_{1-t}'}{\sigma_{1-t}} x_1$ (both depend on $t$). Since $\frac{d}{dt} \log \sigma_{1-t} = -\frac{\sigma_{1-t}'}{\sigma_{1-t}}$, the integrating factor is

$$\mu(t) = e^{\int \frac{\sigma_{1-t}'}{\sigma_{1-t}} dt} = e^{-\log \sigma_{1-t}} = \frac{1}{\sigma_{1-t}}$$

(up to a multiplicative constant that cancels in the final result). Using $\frac{d}{dt} \frac{1}{\sigma_{1-t}} = \frac{\sigma_{1-t}'}{\sigma_{1-t}^2}$, we get

$$\int \mu(t) g(t) dt = \int \frac{\sigma_{1-t}'}{\sigma_{1-t}^2}\, x_1\, dt = \frac{x_1}{\sigma_{1-t}} + c_2$$

\begin{align*} \phi_t(x) &= \frac{\int \mu(t) g(t) dt + c}{\mu(t)} = \sigma_{1-t} \left( \frac{x_1}{\sigma_{1-t}} + c \right) \\ &= x_1 + c\, \sigma_{1-t} \end{align*}

(absorbing $c_2$ into $c$). According to the boundary condition $\phi_0(x) = x$, we can solve $c = \frac{x - x_1}{\sigma_1}$. Therefore,

$$\phi_t(x) = x_1 + \frac{\sigma_{1-t}}{\sigma_1}(x - x_1)$$

The Variance Preserving (VP) diffusion path has the form:

$$p_t(x|x_1) = \mathcal{N}(x|\alpha_{1-t} x_1, (1 - \alpha_{1-t}^2)I)$$

where $\alpha_t = e^{-\frac{1}{2}T(t)}$, $T(t) = \int_0^t \beta(s) ds$, and $\beta$ is the noise scale function. This provides the choices $\mu_t(x_1) = \alpha_{1-t}x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$. Plugging them into Theorem 3 gives the corresponding vector field.

Derive the vector field for VP paths.

Since $\mu_t(x_1) = \alpha_{1-t} x_1$, we have $\mu_t'(x_1) = -\alpha_{1-t}' x_1$.

Since $\sigma_t(x_1) = (1 - \alpha_{1-t}^2)^{1/2}$,

\begin{align*} \sigma_t'(x_1) &= \frac{1}{2}(1 - \alpha_{1-t}^2)^{-1/2} \cdot (-2 \alpha_{1-t}) \cdot \alpha_{1-t}' \cdot (-1)\\ &= (1 - \alpha_{1-t}^2)^{-1/2} \alpha_{1-t} \alpha_{1-t}' \end{align*}

Since $\alpha_{1-t} = e^{-\frac{1}{2} T(1-t)}$, we have $\alpha_{1-t}' = -\frac{1}{2} T'(1-t)e^{-\frac{1}{2}T(1-t)}$.

Then, according to Theorem 3,

\begin{align*} u_t(x|x_1) &= \frac{\alpha_{1-t} \alpha_{1-t}'}{1 - \alpha_{1-t}^2}(x - \alpha_{1-t} x_1) - \alpha_{1-t}' x_1 \\ &= \frac{\alpha_{1-t}'}{1 - \alpha_{1-t}^2}(\alpha_{1-t}x - x_1)\\ &= \frac{-\frac{1}{2} T'(1-t) e^{-\frac{1}{2}T(1-t)} }{1 - e^{-T(1-t)}} (e^{-\frac{1}{2}T(1-t)} x - x_1) \\ &= -\frac{T'(1-t)}{2} \left[ \frac{e^{-T(1-t)}x - e^{-\frac{1}{2}T(1-t)}x_1}{1 - e^{-T(1-t)}} \right] \end{align*}

Example 2. Optimal Transport Conditional VFs.

Another natural choice for conditional probability paths is to define the mean and std to change linearly in time:

\begin{equation} \mu_t(x) = tx_1, \text{ and } \sigma_t(x) = 1 - (1 - \sigma_{\text{min}})t. \end{equation}

According to Theorem 3,

\begin{equation} \begin{aligned} u_t(x|x_1) &= \frac{-(1 - \sigma_{\text{min}})}{1 - (1 - \sigma_{\text{min}})t}(x - tx_1) + x_1 \\ &= \frac{x_1 - (1 - \sigma_{\text{min}}) x}{1 - (1 - \sigma_{\text{min}})t} \end{aligned} \end{equation}

In contrast to the diffusion conditional VFs (VE and VP), this vector field is defined for all $t \in [0, 1]$. The conditional flow that corresponds to $u_t(x | x_1)$ is

$$\psi_t(x) = (1 - (1 - \sigma_{\text{min}})t)x + tx_1$$
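Concretely, sampling $x \sim p_t(x|x_1)$ for this path amounts to drawing $x_0 \sim \mathcal{N}(0, I)$ and setting $x = \psi_t(x_0)$, and the regression target from Equation (17) is simply $\frac{d}{dt}\psi_t(x_0) = x_1 - (1 - \sigma_{\text{min}})x_0$. A minimal sketch of the resulting loss (my own illustration, assuming PyTorch, a model `v_theta(x, t)` as above, and a data batch `x1` of shape (batch, dim)):

```python
import torch

sigma_min = 1e-4  # illustrative value

def ot_cfm_loss(v_theta, x1):
    """CFM loss with the OT conditional path psi_t(x0) = (1-(1-sigma_min)t) x0 + t x1."""
    x0 = torch.randn_like(x1)                                         # x0 ~ N(0, I)
    t = torch.rand(x1.shape[0])                                       # t ~ U[0, 1]
    xt = (1 - (1 - sigma_min) * t[:, None]) * x0 + t[:, None] * x1    # psi_t(x0)
    target = x1 - (1 - sigma_min) * x0                                # d/dt psi_t(x0)
    return ((v_theta(xt, t) - target) ** 2).sum(dim=-1).mean()
```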

Derive the conditional flow for Optimal Transport

According to Equation (2),

\begin{align*} \frac{d}{dt} \psi_t(x) &= \frac{x_1 - (1 - \sigma_{\text{min}}) \psi_t(x)}{1 - (1 - \sigma_{\text{min}})t} \\ \frac{d}{dt} \psi_t(x) + \frac{1 - \sigma_{\text{min}} }{1 - (1 - \sigma_{\text{min}}) t} \psi_t(x) &= \frac{x_1}{1 - (1 - \sigma_{\text{min}})t} \end{align*}