Flow Matching For Generative Modeling

This is a learning note from this series of videos.

Link to the paper: https://arxiv.org/abs/2210.02747

Supplementary papers:

Step-by-Step Diffusion: An Elementary Tutorial

Glow: Generative Flow with Invertible 1x1 Convolutions

1. What is a Flow? Understanding vector fields and ODEs.

Generative Modeling: With known samples $x_1 \sim q_1$ (an unknown distribution), we want to estimate this unknown distribution.

Method: $\underbrace{q_0}_{\text{a known distribution}} \xrightarrow{\phi} \underbrace{q_1}_{\text{unknown distribution to be estimated}}$. This mapping is denoted as $\phi$.

How to solve for $\phi$: 1. Normalizing Flow; 2. Flow Matching (ODE)

Keypoints of Flow Matching:

  1. Let $q_0$ and $q_1$ be the distributions at the initial and final times of the ODE.
  1. Use a neural network to fit the vector field (the drift term) of the ODE.
  1. Solve the ODE (a minimal integration sketch follows this list).
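As an illustration of keypoint 3, here is a minimal sketch (my own, not from the paper) of integrating a flow ODE with fixed-step Euler updates. `run_flow` and `velocity_field` are hypothetical names; the velocity field can later be replaced by a trained neural network.

```python
import numpy as np

def run_flow(velocity_field, x_init, t_init=1.0, t_final=0.0, n_steps=100):
    """Integrate dx/dt = velocity_field(t, x) from t_init to t_final with Euler steps."""
    x = np.asarray(x_init, dtype=float)
    dt = (t_final - t_init) / n_steps
    t = t_init
    for _ in range(n_steps):
        x = x + dt * velocity_field(t, x)
        t = t + dt
    return x

# Toy example: a constant velocity field shifts every point by the same offset.
x0 = run_flow(lambda t, x: np.array([1.0, -2.0]), x_init=[0.0, 0.0])
print(x0)  # ~[-1.0, 2.0] when integrating from t=1 down to t=0
```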

Preliminaries

ODE → Flow → Normalizing Flow → Continuous Normalizing Flow

Flow

Definition 1. (Flow): A flow is a collection of time-indexed vector fields $v = \{ v_t \}_{t \in [0, 1]}$.

Any flow defines a trajectory taking initial points $x_1$ to final points $x_0$ by transporting the initial point along the velocity fields $\{ v_t \}$. It is equivalent to a transfer between two distributions.

Formally, for velocity field $v$ and initial point $x_1$, consider the ODE

\begin{equation} \frac{d x_t}{dt} = -v_t(x_t) \end{equation}

with initial condition $x_1$ at time $t=1$. We write

$$x_t \coloneqq \text{RunFlow}(v, x_1, t)$$

to denote the solution to the flow ODE at time $t$, terminating at final point $x_0$. That is, RunFlow is the result of transporting point $x_1$ along the flow $v$ up to time $t$.

Flows also define transports between entire distributions by pushing forward points from the source distribution along their trajectories. If $p_1$ is a distribution on initial points, then applying the flow $v$ yields the distribution on final points:

$$p_0 = \{ \text{RunFlow}(v, x_1, t=0) \}_{x_1 \sim p_1}$$

This process is denoted as $p_1 \xrightarrow{v} p_0$, meaning the flow $v$ transports the initial distribution $p_1$ to the final distribution $p_0$.

THE ULTIMATE GOAL OF FLOW MATCHING is to learn a velocity field $v^*$ which transports $p_1 \xrightarrow{v^*} p_0$, where $p_0$ is the target distribution and $p_1$ is some easy-to-sample base distribution (such as a Gaussian).

Continuous Normalizing Flows

Let $\mathbb{R}^d$ denote the data space with data points $x = (x^1, \dots, x^d) \in \mathbb{R}^d$.

The probability density path $p: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}_{>0}$ is a time-dependent probability density function, i.e., $\int p_t(x) dx = 1$.

$v: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a time-dependent vector field.

A vector field $v_t$ can be used to construct a time-dependent diffeomorphic map, called a flow, $\phi: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$, defined via the ordinary differential equation (ODE):

\begin{align} \frac{d}{dt} \phi_t(x) &= v_t (\phi_t(x)) \\ \phi_0(x) &= x \end{align}

Here $\phi_t(x)$ is a solution to the ODE, and we call it a flow. We can model the vector field $v_t$ with a neural network $v_t(x ; \theta)$, where $\theta \in \mathbb{R}^p$ are its learnable parameters.
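For concreteness, a time-conditioned vector field $v_t(x;\theta)$ can be a small network that takes $(x, t)$ and outputs a velocity with the same dimension as $x$. Below is a minimal sketch assuming PyTorch; `VectorFieldNet` is my own illustrative choice, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class VectorFieldNet(nn.Module):
    """Minimal time-conditioned vector field v_t(x; theta): input (x, t), output dx/dt."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # x: (batch, dim), t: (batch,) with values in [0, 1]; append t as an extra input feature.
        return self.net(torch.cat([x, t[:, None]], dim=-1))

v_theta = VectorFieldNet(dim=2)
x = torch.randn(16, 2)
t = torch.rand(16)
print(v_theta(x, t).shape)  # torch.Size([16, 2])
```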

2. The Continuity Equation and Fokker-Planck Equation

Continuity Equation

How to test if a vector field $v_t$ generates a probability path $p_t$?

📖

Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.

The following PDE provides a necessary and sufficient condition to ensure that a vector field $v_t$ generates a probability path $p_t$:

\begin{equation} \frac{d}{dt} p_t(x) + \text{div} (p_t(x)v_t(x)) = 0 \end{equation}

Here the divergence operator $\text{div}$ is defined with respect to the spatial variable $x = (x^1, \dots, x^d)$, i.e., $\text{div} = \sum_{i=1}^d \frac{\partial}{\partial x^i}$.
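A quick sanity check of Equation (4): for a pure translation flow with constant velocity $a \in \mathbb{R}^d$, the density is simply shifted along $a$, and the continuity equation holds exactly:

$$v_t(x) = a, \quad p_t(x) = p_0(x - ta) \;\Rightarrow\; \frac{d}{dt} p_t(x) = -a \cdot \nabla p_0(x - ta) = -\text{div}(p_t(x)\, a)$$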

Conditional VFs for Fokker-Planck probability paths

Consider a Stochastic Differential Equation (SDE) of the standard form:

\begin{equation} dy = f_t dt + g_t dw \end{equation}

with time parameter $t$, drift $f_t$, and diffusion coefficient $g_t$, where $dw$ denotes the increment of the Wiener process.

The solution to the SDE is $y_t$, a stochastic process (a continuous time-dependent random variable). Its probability density $p_t(y_t)$ is characterized by the Fokker-Planck equation:

\begin{equation} \frac{dp_t}{dt} = -\text{div}(f_tp_t) + \frac{g_t^2}{2} \Delta p_t \end{equation}

where $\Delta$ represents the Laplace operator (in $y$), namely $\text{div}\,\nabla$, where $\nabla$ is the gradient operator. We can rewrite the above equation in the form of the continuity equation:

\begin{equation} \begin{aligned} \frac{dp_t}{dt} &= -\text{div}(f_t p_t) + \text{div}\left(\frac{g_t^2}{2}\nabla p_t\right) \\ &= -\text{div} \left(f_t p_t - \frac{g_t^2}{2}\frac{\nabla p_t}{p_t} p_t\right) \\ &= -\text{div}\left(\left(f_t - \frac{g_t^2}{2} \nabla \log p_t\right)p_t\right) \\ &= -\text{div}(w_t p_t) \end{aligned} \end{equation}

where the vector field

$$w_t = f_t - \frac{g_t^2}{2} \nabla \log p_t$$

satisfies the continuity equation with the probability path $p_t$, and therefore generates $p_t$.
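For example, taking the standard VP-type drift and diffusion $f_t(y) = -\frac{1}{2}\beta(t)\, y$ and $g_t = \sqrt{\beta(t)}$ (a common choice in diffusion models, used here only as an illustration), the formula above gives the deterministic vector field

$$w_t(y) = -\frac{1}{2}\beta(t)\left[ y + \nabla \log p_t(y) \right],$$

which generates the same marginals $p_t$ as the SDE; this is the usual probability flow ODE.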

3. Continuous Normalizing Flow (CNF)

A CNF is used to reshape a simple prior density $p_0$ (e.g., pure noise) into a more complicated one, $p_1$, via the push-forward equation:

\begin{equation} p_t = [\phi_t]_* p_0 \end{equation}

The push-forward (or change of variable) operator * is defined by:

\begin{equation} [\phi_t]_* p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right] \end{equation}
📖

Change of variables in the probability density function

Suppose $\bold{x}$ is an $n$-dimensional random variable with joint density $f$. If $\bold{y} = G(\bold{x})$, where $G$ is a bijective, differentiable function, then $\bold{y}$ has density $p_{\bold{Y}}$:

$$p_{\bold{Y}}(\bold{y}) = f\left( G^{-1}(\bold{y}) \right) \left| \det \left[ \frac{d G^{-1}(\bold{y})}{d \bold{y}} \right] \right|$$
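A one-dimensional example: if $x \sim \mathcal{N}(0, 1)$ and $y = G(x) = \sigma x + \mu$ with $\sigma > 0$, then $G^{-1}(y) = (y-\mu)/\sigma$ and $\left|\det\left[\frac{dG^{-1}}{dy}\right]\right| = 1/\sigma$, so

$$p_Y(y) = \frac{1}{\sigma}\, \mathcal{N}\!\left(\frac{y-\mu}{\sigma} \,\Big|\, 0, 1\right) = \mathcal{N}(y \mid \mu, \sigma^2),$$

which is exactly the computation reused later when pushing the standard Gaussian through the flow $\psi_t$.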

Relationships between Flow, Vector Field, and Probability Density

  1. Equations (2) and (3) describe the relationship between the flow $\phi_t(x)$ and the vector field $v_t(x)$.
  1. Equations (8) and (9) describe how the flow $\phi_t(x)$ changes the density from $p_0$ to $p_t$.
  1. Equation (4) (the continuity equation) gives a necessary and sufficient condition to test whether a vector field $v_t$ generates a probability path $p_t$.

4. Flow Matching Objective

Notations:

  1. $x_1$: a random variable distributed according to an unknown distribution $q(x_1)$. We have access to samples from $q(x_1)$, but not to the density function itself.
  1. $p_t$: a probability path such that $p_0 = p$ is a simple distribution, e.g., the standard normal $p(x) = \mathcal{N}(x | 0, I)$.
  1. $p_1$: approximately equal in distribution to $q$.

The Flow Matching objective is designed to match this target probability path, which will allow us to flow from $p_0$ to $p_1$.

Given a target probability density path $p_t(x)$ and a corresponding vector field $u_t(x)$ which generates $p_t(x)$, the Flow Matching (FM) objective is defined as:

\begin{equation} \mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t, p_t(x)}||v_t(x) - u_t(x)||^2 \end{equation}

where $\theta$ denotes the learnable parameters of the CNF vector field $v_t$ (here, a neural network), $t \sim \mathcal{U}[0,1]$, and $x \sim p_t(x)$.

Problem: the above objective assumes that the density $p_t(x)$ and the ground-truth vector field $u_t(x)$ are known, but we do not know them.

Solution: use conditional probability densities for the unknown quantities, and verify that this is valid.

5. From $p_t(x|x_1)$ and $u_t(x|x_1)$ to $p_t(x)$ and $u_t(x)$

Construct a target probability path via a mixture of simpler probability paths:

Given a sample $x_1$, let $p_t(x | x_1)$ denote a conditional probability path such that $p_0(x|x_1) = p(x)$ at time $t=0$, and such that $p_1(x|x_1)$ at $t=1$ is a distribution concentrated around $x=x_1$, e.g., $p_1(x|x_1) = \mathcal{N}(x|x_1, \sigma^2 I)$ where $\sigma$ is sufficiently small.

The marginal probability path can be given by:

\begin{equation} p_t(x) = \int p_t(x | x_1) q(x_1) dx_1 \end{equation}

In particular, at time $t=1$, the marginal probability $p_1$ is a mixture distribution that closely approximates the data distribution $q$:

\begin{equation} p_1(x) = \int p_1(x|x_1) q(x_1) dx_1 \approx q(x) \end{equation}

Why $p_1(x)$ approximates $q(x)$.

The Dirac delta function ($\delta$ distribution, unit impulse) is defined as:

$$\delta(x) = \begin{cases} 0, & x \neq 0 \\ \infty, & x=0 \end{cases}$$

such that

$$\int_{-\infty}^{\infty} \delta(x) dx = 1$$

The Dirac delta function has the sifting property:

\begin{align*} \int_{-\infty}^{\infty}\delta(x-a)f(x)dx &= \int_{-\infty}^{\infty} \delta(x-a)f(a) dx \\ &= f(a)\int_{-\infty}^{\infty}\delta(x-a) dx = f(a) \end{align*}

The conditional distribution $p_1(x|x_1) = \mathcal{N}(x|x_1, \sigma^2 I)$ with sufficiently small $\sigma$ is similar to the Dirac delta function, and therefore approximately has the sifting property:

\begin{align*} p_1(x) &= \int p_1(x | x_1) q(x_1) dx_1 \\ &= \int f(x - x_1) q(x_1) dx_1 && \text{(by Gaussian density)} \\ &\approx q(x) && \text{(by the 'sifting property')} \end{align*}

Also, we can define the marginal vector field by marginalizing over the conditional vector fields:

\begin{equation} u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1) q(x_1)}{p_t(x)} dx_1 \end{equation}

where $u_t(\cdot|x_1):\mathbb{R}^d \rightarrow \mathbb{R}^d$ is a conditional vector field that generates $p_t(\cdot|x_1)$.

Next, we want to prove that:

The vector field $u_t(x)$ in Equation (13) generates the probability path $p_t(x)$ in Equation (11).

6. The form of $u_t(x)$

Theorem 1. Given conditional vector fields $u_t(x|x_1)$ that generate the conditional probability paths $p_t(x|x_1)$, for any distribution $q(x_1)$, the marginal vector field $u_t$ in Equation (13) generates the marginal probability path $p_t$ in Equation (11), i.e., $u_t$ and $p_t$ satisfy the continuity equation (Equation (4)).

Proof:

To prove the theorem, we need to check that $p_t$ and $u_t$ satisfy the continuity equation.

Since $u_t(x|x_1)$ generates $p_t(x|x_1)$, by the continuity equation we have:

$$\frac{d}{dt} p_t(x|x_1) = -\text{div} (p_t(x|x_1) u_t(x|x_1))$$

Then we inspect $\frac{d}{dt} p_t(x)$:

\begin{align*} \frac{d}{dt} p_t(x) &= \int \left(\frac{d}{dt} p_t(x | x_1)\right) q(x_1) dx_1 \\ &= -\int \text{div} \left(p_t(x|x_1) u_t(x|x_1) \right) q(x_1) dx_1 \\ &= - \text{div} \left(\int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1 \right) \\ &= - \text{div} \left( p_t(x) \frac{\int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1}{p_t(x)} \right) \\ &= -\text{div} (u_t(x) p_t(x)) \end{align*}

This shows that $u_t(x)$ and $p_t(x)$ satisfy the continuity equation.

End of proof.

7. Conditional Flow Matching (Objectives are consistent)

There is still a problem: the probability path $p_t(x)$ in Equation (11) and the vector field $u_t(x)$ in Equation (13) remain intractable because of the integrals over $x_1$.

To solve this problem, we introduce a new objective $\mathcal{L}_{\text{CFM}}(\theta)$, defined as:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x) - u_t(x|x_1)||^2 \right]$$

to avoid $u_t(x)$. The following theorem states that the new objective $\mathcal{L}_{\text{CFM}}(\theta)$ equals $\mathcal{L}_{\text{FM}}(\theta)$ up to a constant difference w.r.t. the model parameters $\theta$.

The new objective is tractable as long as we can sample from $p_t(x|x_1)$ and compute $u_t(x|x_1)$. Consequently, this allows us to train a CNF to generate the marginal probability path $p_t$, which approximates the unknown data distribution $q$ at $t=1$. We do not need access to the marginal probability path or the marginal vector field; we simply need to design suitable conditional probability paths and vector fields.
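A schematic training step under these assumptions might look like the sketch below (my own illustration, not code from the paper); `sample_conditional_path` and `conditional_velocity` are hypothetical placeholders for the designed $p_t(x|x_1)$ and $u_t(x|x_1)$, and `v_theta(x, t)` is the learned vector field (assuming PyTorch).

```python
import torch

def cfm_training_step(v_theta, optimizer, x1_batch,
                      sample_conditional_path, conditional_velocity):
    """One gradient step on the CFM objective."""
    batch = x1_batch.shape[0]
    t = torch.rand(batch)                            # t ~ U[0, 1]
    x = sample_conditional_path(t, x1_batch)         # x ~ p_t(x | x1)
    target = conditional_velocity(t, x, x1_batch)    # u_t(x | x1)
    loss = ((v_theta(x, t) - target) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```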

Theorem 2. Assuming that $p_t(x) > 0$ for all $x \in \mathbb{R}^d$ and $t \in [0,1]$, then, up to a constant independent of $\theta$, $\mathcal{L}_{\text{CFM}}$ and $\mathcal{L}_{\text{FM}}$ are equal. Hence, $\nabla_{\theta} \mathcal{L}_{\text{CFM}} = \nabla_{\theta} \mathcal{L}_{\text{FM}}$.

Proof:

Some assumptions to ensure the existence of the integrals and the change of integration order (by Fubini's Theorem):

  1. $q(x)$ and $p_t(x|x_1)$ decrease to zero at a sufficient speed as $||x|| \rightarrow \infty$.
  1. $u_t$, $v_t$, and $\nabla_{\theta} v_t$ are bounded.

First, we rewrite the two objectives:

\begin{align*} \mathcal{L}_{\text{FM}}(\theta) &= \mathbb{E}_{t, p_t(x)} \left[ ||v_t(x) - u_t(x)||^2 \right] \\ &= \mathbb{E}_{t, p_t(x)} \left[ ||v_t(x)||^2 - 2\langle v_t(x), u_t(x)\rangle + ||u_t(x)||^2\right] \end{align*}

\begin{align*} \mathcal{L}_{\text{CFM}}(\theta) &= \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x) - u_t(x|x_1)||^2 \right] \\ &= \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x)||^2 - 2\langle v_t(x), u_t(x|x_1)\rangle + ||u_t(x|x_1)||^2 \right] \end{align*}

Note that $||u_t(x)||^2$ and $||u_t(x|x_1)||^2$ are both constant w.r.t. the parameters $\theta$. Now we only need to prove that the expectations of the first two terms are equal.

\begin{align*} \mathbb{E}_{p_t(x)} ||v_t(x)||^2 &= \int ||v_t(x)||^2 p_t(x) dx \\ &= \int \int ||v_t(x)||^2 p_t(x|x_1) q(x_1) dx_1 dx \\ &= \mathbb{E}_{q(x_1), p_t(x|x_1)} [||v_t(x)||^2] \end{align*}

Therefore, the first terms of the two objectives are equal.

\begin{align*} \mathbb{E}_{p_t(x)} \langle v_t(x), u_t(x)\rangle &= \int \langle v_t(x), u_t(x)\rangle p_t(x) dx \\ &= \int \left\langle v_t(x), \frac{1}{p_t(x)} \int u_t(x|x_1) p_t(x|x_1) q(x_1) dx_1\right\rangle p_t(x) dx \\ &= \int \left\langle v_t(x), \int u_t(x|x_1)p_t(x|x_1) q(x_1) dx_1\right\rangle dx \\ &= \int \int \langle v_t(x), u_t(x|x_1) p_t(x|x_1) q(x_1)\rangle dx_1 dx \\ &= \int \int \langle v_t(x), u_t(x|x_1) \rangle p_t(x|x_1) q(x_1) dx_1 dx \\ &= \mathbb{E}_{q(x_1), p_t(x|x_1)} [\langle v_t(x), u_t(x|x_1)\rangle] \end{align*}

Brief Explanation

Equalities 3 and 5 use the linearity of the inner product:

$$k \langle a,b\rangle = \langle ka, b\rangle; \qquad k\in \mathbb{R},\ a,b \in \mathbb{R}^d$$

where $p_t(x) \in \mathbb{R}$ and $p_t(x|x_1) q(x_1) \in \mathbb{R}$.

Equality 4 changes the order of the inner product and the integral:

\begin{align*} \left\langle a(x), \int b(x_1) dx_1\right\rangle &= \sum_{i=1}^d a_i(x) \int b_i(x_1) dx_1 \\ &= \sum_{i=1}^d \int a_i(x) b_i(x_1) dx_1 \\ &= \int \left[ \sum_{i=1}^d a_i(x) b_i(x_1) \right] dx_1 \\ &= \int \langle a(x), b(x_1)\rangle dx_1 \end{align*}

Therefore, the second terms of the two objectives are also equal. These two results lead to:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathcal{L}_{\text{CFM}}(\theta) + C \Rightarrow \nabla_{\theta}\mathcal{L}_{\text{FM}}(\theta) = \nabla_{\theta}\mathcal{L}_{\text{CFM}}(\theta)$$

End of Proof.

8. Derive Conditional Vector Field from Conditional Probability Path (Gaussian)

The Conditional Flow Matching objective works with any choice of conditional probability path and conditional vector fields.

Here we consider a family of Gaussian conditional probability paths:

\begin{equation} p_t(x | x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I) \end{equation}

where $\mu: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ is the time-dependent mean of the Gaussian distribution, and $\sigma: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}_{>0}$ is a time-dependent scalar standard deviation (std).

When $t=0$, $\mu_0(x_1)=0$ and $\sigma_0(x_1)=1$, so that all conditional probability paths converge to the same standard Gaussian noise distribution.

When $t=1$, $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\text{min}}$, so that $p_1(x|x_1)$ is a concentrated Gaussian distribution centered at $x_1$.

We choose a simple form of flow (conditioned on $x_1$):

\begin{equation} \psi_t(x) = \sigma_t(x_1) x + \mu_t(x_1) \end{equation}

If we inspect Equation (9), we can verify that $\psi_t$ pushes the noise distribution $p_0(x|x_1) = p(x)$ to $p_t(x|x_1)$.

Prove that $[\psi_t]_* p_0(\bold{x}) = p_t(\bold{x})$.

Proof:

Since $\psi_t(\bold{x}) = \sigma_t(\bold{x_1}) \bold{x} + \mu_t(\bold{x_1})$, we can derive its inverse function:

$$\psi_t^{-1}(\bold{x}) = \frac{\bold{x} - \mu_t(\bold{x_1})}{\sigma_t(\bold{x_1})}$$

When $t=0$, the probability density function is:

\begin{align*} p_0(\bold{x}) &= (2\pi)^{-d/2} \det(I)^{-1/2}\exp \left( -\frac{1}{2} (\bold{x} - \bold{0})^T I^{-1} (\bold{x} - \bold{0}) \right) \\ &= (2\pi)^{-d/2} \exp \left( -\frac{1}{2} \bold{x}^T \bold{x} \right) \end{align*}

According to Equation (9),

\begin{align*} [\psi_t]_* p_0(\bold{x}) &= p_0(\psi_t^{-1}(\bold{x})) \det \left[ \frac{\partial \psi_t^{-1}}{\partial \bold{x}}(\bold{x}) \right] \\ &= (2\pi)^{-d/2} \exp \left( -\frac{1}{2} \frac{1}{\sigma_t(\bold{x_1})^2} (\bold{x} - \mu_t(\bold{x_1}))^T (\bold{x} - \mu_t(\bold{x_1})) \right) \det\left[\frac{1}{\sigma_t(\bold{x_1})} I\right] \\ &= (2\pi)^{-d/2} [\sigma_t(\bold{x_1})^{2d}]^{-1/2} \exp \left( -\frac{1}{2} (\bold{x} - \mu_t(\bold{x_1}))^T \frac{1}{\sigma_t(\bold{x_1})^2} (\bold{x} - \mu_t(\bold{x_1})) \right) \\ &= (2\pi)^{-d/2} \det(\sigma_t(\bold{x_1})^2 I)^{-1/2}\exp \left( -\frac{1}{2} (\bold{x} - \mu_t(\bold{x_1}))^T (\sigma_t(\bold{x_1})^2 I)^{-1} (\bold{x} - \mu_t(\bold{x_1})) \right) \\ &= \mathcal{N}(\bold{x} \,|\, \mu_t(\bold{x_1}), \sigma_t(\bold{x_1})^2 I) = p_t(\bold{x}) \end{align*}

End of Proof.

This flow $\psi_t$ provides a vector field that generates the conditional probability path:

\begin{equation} \frac{d}{dt} \psi_t(x) = u_t(\psi_t(x) | x_1) \end{equation}

Reparameterizing $p_t(x|x_1)$ in terms of just $x_0$ and plugging Equation (16) into the CFM loss, we get:

\begin{equation} \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - \frac{d}{dt} \psi_t(x_0) \right\|^2 \end{equation}

Derivation of Equation (17)

The original CFM loss is:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ ||v_t(x) - u_t(x|x_1)||^2 \right]$$

According to Equation (14), sampling from $p_t(x|x_1)$ is equivalent to sampling $x_0$ from a standard Gaussian and reparameterizing it as

$$x = \sigma_t(x_1) x_0 + \mu_t(x_1) = \psi_t(x_0) \quad (\text{conditioned on } x_1)$$

Therefore,

\begin{align*} \mathcal{L}_{\text{CFM}}(\theta) &= \mathbb{E}_{t, q(x_1), p(x_0)} ||v_t(\psi_t(x_0)) - u_t(\psi_t(x_0)|x_1)||^2 \\ &= \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - \frac{d}{dt} \psi_t(x_0) \right\|^2 \end{align*}

Theorem 3. Let $p_t(x|x_1)$ be a Gaussian probability path as in Equation (14), and $\psi_t$ its corresponding flow map as in Equation (15). Then, the unique vector field that defines $\psi_t$ has the form:

$$u_t(x|x_1) = \frac{\sigma_t' (x_1)}{\sigma_t(x_1)} (x - \mu_t(x_1)) + \mu_t'(x_1)$$

Consequently, $u_t(x|x_1)$ generates the Gaussian path $p_t(x|x_1)$.

Prove Theorem 3.

Proof:

For notational simplicity, we denote $w_t(x) = u_t(x|x_1)$. Consider Equation (2) and plug Equation (15) into it:

\begin{align*} \frac{d}{dt} \psi_t(x) &= w_t(\psi_t(x)) \\ &= \frac{d}{dt} [\sigma_t(x_1) x + \mu_t(x_1)] \\ &= \sigma_t'(x_1) x + \mu_t' (x_1) \end{align*}

Set $x = \psi_t^{-1} (y) = \frac{1}{\sigma_t(x_1)}(y - \mu_t(x_1))$; then we have:

\begin{align*} w_t(y) = w_t(\psi_t(\psi_t^{-1}(y))) &= \sigma_t'(x_1)\, \psi_t^{-1}(y) + \mu_t'(x_1) \\ &= \frac{\sigma_t' (x_1)}{\sigma_t(x_1)} (y - \mu_t(x_1)) + \mu_t'(x_1) \end{align*}

Therefore,

$$u_t(x|x_1) = w_t(x) = \frac{\sigma_t' (x_1)}{\sigma_t(x_1)} (x - \mu_t(x_1)) + \mu_t'(x_1)$$

End of Proof.

9. Instances of Gaussian Conditional Probability Paths

The above formulation is general and works for arbitrary functions $\mu_t(x_1)$ and $\sigma_t(x_1)$; they can be set to any differentiable functions satisfying the desired boundary conditions.
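As a quick numerical sanity check (my own, with an arbitrary smooth choice of $\mu_t$ and $\sigma_t$ satisfying the boundary conditions), the finite-difference derivative of $\psi_t(x_0)$ matches the Theorem 3 vector field evaluated at $\psi_t(x_0)$:

```python
import numpy as np

sigma_min = 0.01
x1, x0, t = 1.7, -0.3, 0.6   # arbitrary scalars for the check

# An arbitrary smooth choice: mu(0)=0, mu(1)=x1, sigma(0)=1, sigma(1)=sigma_min.
mu = lambda t: np.sin(0.5 * np.pi * t) * x1
sigma = lambda t: 1.0 - (1.0 - sigma_min) * t**2

psi = lambda t: sigma(t) * x0 + mu(t)                 # Equation (15)

eps = 1e-5
dpsi_dt = (psi(t + eps) - psi(t - eps)) / (2 * eps)   # d/dt psi_t(x0) by finite differences

dmu_dt = (mu(t + eps) - mu(t - eps)) / (2 * eps)
dsigma_dt = (sigma(t + eps) - sigma(t - eps)) / (2 * eps)
u = dsigma_dt / sigma(t) * (psi(t) - mu(t)) + dmu_dt  # Theorem 3 evaluated at x = psi_t(x0)

print(dpsi_dt, u)  # the two values agree up to finite-difference error
```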

📖

Recap on the solution to the first order linear ODE

A first order linear ODE takes the form:

$$\frac{dy}{dt} + p(t)y = g(t)$$

The solution is given by:

$$y(t) = \frac{\int \mu(t) g(t) dt + c}{\mu(t)}$$

where $\mu(t) = e^{\int p(t) dt}$ is the integrating factor and the constant $c$ is determined by the boundary condition.
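A tiny worked example of this recipe: for $y' + 2y = 4$ we have $\mu(t) = e^{2t}$ and $\int \mu(t) g(t) dt = 2e^{2t} + c_2$, hence

$$y(t) = \frac{2e^{2t} + c}{e^{2t}} = 2 + ce^{-2t},$$

with $c$ fixed by the boundary condition. The VE and VP derivations below follow exactly this pattern.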

Example 1. Diffusion Conditional VFs.

Diffusion models start from data points and gradually add noise until the result approximates pure noise. This process can be formulated as a stochastic process, resulting in Gaussian conditional probability paths $p_t(x|x_1)$ with specific choices of the mean $\mu_t(x_1)$ and std $\sigma_t(x_1)$.

The Variance Exploding (VE) path has the form:

\begin{equation} p_t(x|x_1) = \mathcal{N}(x | x_1, \sigma_{1-t}^2 I) \end{equation}

where $\sigma_t$ is an increasing function, $\sigma_0 = 0$ and $\sigma_1 \gg 1$. In this case, $\mu_t(x_1) = x_1$ and $\sigma_t(x_1) = \sigma_{1-t}$. By Theorem 3 we have:

\begin{equation} u_t(x|x_1) = -\frac{\sigma_{1-t}'}{\sigma_{1-t}} (x - x_1) \end{equation}

Derive the flow for the VE path

According to Equation (2),

\begin{align*} \frac{d \phi_t(x)}{dt} &= -\frac{\sigma_{1-t}'}{\sigma_{1-t}}(\phi_t(x) - x_1) \\ \frac{d \phi_t(x)}{dt} + \frac{\sigma_{1-t}'}{\sigma_{1-t}} \phi_t(x) &= \frac{\sigma_{1-t}'}{\sigma_{1-t}} x_1 \end{align*}

This is a first-order linear ODE with $p(t) = \frac{\sigma_{1-t}'}{\sigma_{1-t}}$ and $g(t) = \frac{\sigma_{1-t}'}{\sigma_{1-t}} x_1$ (both depend on $t$). Since $\frac{d}{dt} \log \sigma_{1-t} = -\frac{\sigma_{1-t}'}{\sigma_{1-t}}$, the integrating factor is

$$\mu(t) = e^{\int \frac{\sigma_{1-t}'}{\sigma_{1-t}} dt} = e^{-\log \sigma_{1-t}} = \frac{1}{\sigma_{1-t}}$$

(up to a multiplicative constant that cancels in the final result). Using $\frac{d}{dt} \frac{1}{\sigma_{1-t}} = \frac{\sigma_{1-t}'}{\sigma_{1-t}^2}$, we get

$$\int \mu(t) g(t) dt = \int \frac{\sigma_{1-t}'}{\sigma_{1-t}^2}\, x_1\, dt = \frac{x_1}{\sigma_{1-t}} + c_2$$

\begin{align*} \phi_t(x) &= \frac{\int \mu(t) g(t) dt + c}{\mu(t)} = \sigma_{1-t} \left( \frac{x_1}{\sigma_{1-t}} + c \right) \\ &= x_1 + c\, \sigma_{1-t} \end{align*}

(absorbing $c_2$ into $c$). According to the boundary condition $\phi_0(x) = x$, we can solve $c = \frac{x - x_1}{\sigma_1}$. Therefore,

$$\phi_t(x) = x_1 + \frac{\sigma_{1-t}}{\sigma_1}(x - x_1)$$

The Variance Preserving (VP) diffusion path has the form:

$$p_t(x|x_1) = \mathcal{N}(x|\alpha_{1-t} x_1, (1 - \alpha_{1-t}^2)I)$$

where $\alpha_t = e^{-\frac{1}{2}T(t)}$, $T(t) = \int_0^t \beta(s) ds$, and $\beta$ is the noise scale function. This provides the choices $\mu_t(x_1) = \alpha_{1-t}x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$. Plugging them into Theorem 3 gives the corresponding vector field.

Derive the vector field for VP paths.

Since $\mu_t(x_1) = \alpha_{1-t} x_1$, we have $\mu_t'(x_1) = -\alpha_{1-t}' x_1$.

Since $\sigma_t(x_1) = (1 - \alpha_{1-t}^2)^{1/2}$,

\begin{align*} \sigma_t'(x_1) &= \frac{1}{2}(1 - \alpha_{1-t}^2)^{-1/2} \cdot (-2 \alpha_{1-t}) \cdot \alpha_{1-t}' \cdot (-1)\\ &= (1 - \alpha_{1-t}^2)^{-1/2} \alpha_{1-t} \alpha_{1-t}' \end{align*}

Since $\alpha_{1-t} = e^{-\frac{1}{2} T(1-t)}$, we have $\alpha_{1-t}' = -\frac{1}{2} T'(1-t)e^{-\frac{1}{2}T(1-t)}$.

Then, according to Theorem 3,

\begin{align*} u_t(x|x_1) &= \frac{\alpha_{1-t} \alpha_{1-t}'}{1 - \alpha_{1-t}^2}(x - \alpha_{1-t} x_1) - \alpha_{1-t}' x_1 \\ &= \frac{\alpha_{1-t}'}{1 - \alpha_{1-t}^2}(\alpha_{1-t}x - x_1)\\ &= \frac{-\frac{1}{2} T'(1-t) e^{-\frac{1}{2}T(1-t)} }{1 - e^{-T(1-t)}} (e^{-\frac{1}{2}T(1-t)} x - x_1) \\ &= -\frac{T'(1-t)}{2} \left[ \frac{e^{-T(1-t)}x - e^{-\frac{1}{2}T(1-t)}x_1}{1 - e^{-T(1-t)}} \right] \end{align*}

Example 2. Optimal Transport Conditional VFs.

Another natural choice for conditional probability paths is to define the mean and std to change linearly in time:

\begin{equation} \mu_t(x) = tx_1, \text{ and } \sigma_t(x) = 1 - (1 - \sigma_{\text{min}})t. \end{equation}

According to Theorem 3,

\begin{equation} \begin{aligned} u_t(x|x_1) &= \frac{-(1 - \sigma_{\text{min}})}{1 - (1 - \sigma_{\text{min}})t}(x - tx_1) + x_1 \\ &= \frac{x_1 - (1 - \sigma_{\text{min}}) x}{1 - (1 - \sigma_{\text{min}})t} \end{aligned} \end{equation}

In contrast to the diffusion conditional VFs (VE and VP), this vector field is defined for all $t \in [0, 1]$. The conditional flow that corresponds to $u_t(x | x_1)$ is

$$\psi_t(x) = (1 - (1 - \sigma_{\text{min}})t)x + tx_1$$
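Concretely, sampling $x \sim p_t(x|x_1)$ for this path amounts to drawing $x_0 \sim \mathcal{N}(0, I)$ and setting $x = \psi_t(x_0)$, and the regression target from Equation (17) is simply $\frac{d}{dt}\psi_t(x_0) = x_1 - (1 - \sigma_{\text{min}})x_0$. A minimal sketch of the resulting loss (my own illustration, assuming PyTorch, a model `v_theta(x, t)` as above, and a data batch `x1` of shape (batch, dim)):

```python
import torch

sigma_min = 1e-4  # illustrative value

def ot_cfm_loss(v_theta, x1):
    """CFM loss with the OT conditional path psi_t(x0) = (1-(1-sigma_min)t) x0 + t x1."""
    x0 = torch.randn_like(x1)                                         # x0 ~ N(0, I)
    t = torch.rand(x1.shape[0])                                       # t ~ U[0, 1]
    xt = (1 - (1 - sigma_min) * t[:, None]) * x0 + t[:, None] * x1    # psi_t(x0)
    target = x1 - (1 - sigma_min) * x0                                # d/dt psi_t(x0)
    return ((v_theta(xt, t) - target) ** 2).sum(dim=-1).mean()
```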

Derive the conditional flow for Optimal Transport

According to Equation (2),

\begin{align*} \frac{d}{dt} \psi_t(x) &= \frac{x_1 - (1 - \sigma_{\text{min}}) \psi_t(x)}{1 - (1 - \sigma_{\text{min}})t} \\ \frac{d}{dt} \psi_t(x) + \frac{1 - \sigma_{\text{min}} }{1 - (1 - \sigma_{\text{min}}) t} \psi_t(x) &= \frac{x_1}{1 - (1 - \sigma_{\text{min}})t} \end{align*}