Diffusion Maps

[@Lafon2004] uses the language of functional analysis.

Symbols

$\Gamma$: a set, either finite (graph) or continuous (manifold).
$\mathrm{d}x$: the Lebesgue measure ($\mathbb{R}^n$); the Riemannian volume (Riemannian manifold).
$(\Gamma, \mathrm{d}\mu)$: a measure space with point set $\Gamma$ and measure $\mathrm{d}\mu$.
$f$, $F$: a function (restricted to) a manifold $\Gamma$; a function (extended to) the ambient space $\mathbb{R}^n$;
$L^2(\Gamma, \mathrm{d}\mu)$: the space of square integrable functions on $\Gamma$. $f \in L^2(\Gamma, \mathrm{d}\mu) \Leftrightarrow \int_\Gamma |f(x)|^2~\mathrm{d}\mu(x) < \infty$.
$H_s(\Gamma, \mathrm{d}\mu)$: a Sobolev space of functions on $\Gamma$. $f \in H_s(\Gamma, \mathrm{d}\mu) \Leftrightarrow \forall |r| \le s, \partial^r f \in L^2(\Gamma,\mathrm{d}\mu)$. Here $|r| = \sum_i r_i, r_i \in \mathbb{N}$.
$\langle f, g \rangle_\Gamma$: inner product of functions in $L^2(\Gamma, \mathrm{d}\mu)$. $\langle f,g \rangle_\Gamma = \int_\Gamma \overline{f(x)} g(x)~\mathrm{d}\mu(x)$.
$\|f\|_s$: norm of a function in $H_s(\Gamma, \mathrm{d}\mu)$. $\|f\|_s^2 = \sum_{|r| \le s} \int_\Gamma |\partial^r f(x)|^2~\mathrm{d}\mu(x)$.
$k(x, y)$, $p_t(x, y)$: an integral kernel; the Neumann heat kernel on a manifold.
$v^2(x)$: integral transform of 1 with kernel $k(x, y)$.
$\tilde{a}(x, y)$: a Markov kernel, transition probability of a diffusion process.
$a(x, y)$: a diffusion kernel, the symmetric conjugate to $\tilde{a}$ by $v(x)$.
$K$, $\tilde{A}$, $A$: an integral operator; a Markov operator; a diffusion operator.
$\Delta$: the Laplacian ($\mathbb{R}^n$); the Laplace-Beltrami operator (Riemannian manifold). Signs are defined so that $\Delta$ is positive semi-definite.
$\{\phi_i\}$, $\{\nu_i^2\}$: eigenfunctions and eigenvalues of $\Delta$, i.e. $\Delta \phi_i = \nu_i^2 \phi_i$.
$\Phi$, $\Phi_k$: diffusion map; diffusion map truncated to dimension $k$.

Concepts

The geometry of a set $Γ$ is a set of rules that describe the relationship between its elements. The geometry is intrinsic if it does not refer to a superset or its structures. The geometry is extrinsic if it is induced from the geometry of a superset.

A dual perspective in functional analysis, from inverse problems such as inverse scattering, potential theory, and spectral geometry: The geometry of a set $Γ$ can be understood through the geometry of the space of functions defined on $Γ$ (and operators on these functions).

(Manifold learning.)

Graph Laplacian

Spectral graph theory analyzes the spectrum (eigenvalues and eigenvectors) of a matrix representing a graph. Spectral clustering/partitioning: construct a matrix representation of the graph; eigendecompose the matrix and embed the vertices to a Euclidean space using one or more eigenvectors; group the points based on their coordinates.

The Laplacian matrix of a graph, aka graph Laplacian, is its degree matrix minus its adjacency matrix, $L = D - A$, where $D = \mathrm{diag}\{d_i\}_{i=1}^n$. The Laplacian matrix is a discrete analog of the Laplacian operator $Δ$ in multivariable calculus. The Laplacian matrix is symmetric, so its eigenvectors can form an orthogonal basis; additionally, all eigenvalues are non-negative. The (symmetric) normalized Laplacian is $L' = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$.

Diffusion Maps

Diffusion processes (or random walks, Markov processes) can be useful for the geometric descriptions of data sets.

An integral kernel $k(x,y)$ defines an integral transform/operator $K$ for functions on measure space $(\Gamma, \mathrm{d}\mu)$, such that $K f(x) = \int_\Gamma k(x, y) f(y)~\mathrm{d}\mu(y)$. Here we call an integral kernel admissible, if it is symmetric, positivity-preserving, and positive semi-definite: $k(x, y) = k(y, x)$; $k \ge 0$; $\langle f, K f \rangle \ge 0$. Admissible kernels can be designed to represent a local metric (e.g. degree of similarity) or exhibit a local behavior.

A Markov/stochastic kernel $\tilde{a}(x, y)$ defines a Markov process (an averaging operator) $\tilde{A}$, where $\tilde{a}(x, \cdot)$ is a probability measure. Every positivity-preserving integral kernel can be transformed into a Markov kernel $\tilde{a}(x, y)= k(x, y) / v^2(x)$, where $v^2(x) = K 1$. Assume the Markov kernel has eigendecomposition $\tilde{a}(x, y) = \sum_{i \in \mathbb{N}} \lambda_i \psi_i(x) \psi_i(y)$. The eigenvectors $\{\psi_i\}$ need not be orthogonal.

A diffusion kernel $a(x, y)$ is the symmetric conjugate of a Markov kernel $\tilde{a}$ by $v$, $a(x, y) = v(x) \tilde{a}(x, y) / v(y)$. The diffusion operator $A$ with kernel $a$ is positive semi-definite, bounded, with the largest eigenvalue 1 and a corresponding eigenfunction $v(x)$. Assume that $A$ is compact (not just bounded), then its spectrum is discrete and its eigendecomposition can be written as $a(x, y) = \sum_{i \in \mathbb{N}} \lambda_i \phi_i(x) \phi_i(y)$, where $A \phi_i(x) = \lambda_i \phi_i(x)$. The diffusion operator $A$ and the Markov operator $\tilde{A}$ have the same spectrum $\lambda$, and their eigenfunctions are related by conjugation by $v$: $\psi_i = \mathrm{diag}(v^{-1}) \phi_i$. The t-step diffusion operator $A^t$ aggregates the local information, and its kernel $a^{(t)}$ has eigendecomposition $a^{(t)}(x, y) = \sum_{i \in \mathbb{N}} \lambda_i^t \phi_i(x) \phi_i(y)$.

The diffusion metric/distance $D_t(x, y)$ at time step $t$ is defined as $D_t^2 (x, y) \equiv a^{(t)}(x, x) + a^{(t)}(y, y) - 2 a^{(t)}(x, y)$, which is equivalent to $D_t^2 (x, y) = \sum_{i \in \mathbb{N}} \lambda_i^t [\phi_i(x) - \phi_i(y)]^2$ and $D_t^2(x,y) = \| a^{(t/2)}(x,\cdot) - a^{(t/2)}(y,\cdot) \|^2$. Thus $D_t$ is a semi-metric on the set, which becomes a metric if the diffusion kernel $a$ is positive definite. A diffusion map $\Phi: \Gamma \to l^2(\mathbb{N})$ is a mapping of the data to a Euclidean space, based on the eigenfunctions of the diffusion kernel: $\Phi(x) = \left( \phi_i(x) \right)_{i \in \mathbb{N}}$. Diffusion map $\Phi$ defines an embedding of the data that preserves the diffusion distance $D_t$ via a weighted Euclidean metric $D_t^2(x,y) = \langle \Phi(x), \Lambda^t \Phi(y) \rangle$. A truncated diffusion map $\Phi_k: \Gamma \to \mathbb{R}^k$ with $\Phi_k(x) = \left( \phi_i(x) \right)_{i=1}^k$ preserves the diffusion distance with certain accuracy. The dimension $k$ of the new representation depends on the diffusion process, and is in general greater than the dimension $d$ of the set/manifold.

(Inverse mapping.)

Heat Diffusion

Given $\Gamma$ is a compact smooth submanifold of dimension $d$ in $\mathbb{R}^n$, i.e. a Riemannian manifold with Riemannian metric induced from the ambient space $\mathbb{R}^n$. Let $\mu$ be a measure on $\Gamma$ and $\mathrm{d}x$ be the Riemannian measure on $\Gamma$, then density $p(x)$ is given by $p(x) = \mathrm{d}\mu(x) / \mathrm{d}x$. Let $\{e_i\}_{i=1}^d$ be an orthonormal basis of the tangent plane $T_x$ to $Γ$ at $x$, and the exponential map $\mathrm{exp}_x$ of the coordinates of the tangent plane forms a chart on $Γ$ around $x$, which provides normal coordinates $(s_i)_{i=1}^d$. The Laplace-Beltrami operator $\Delta$ on the Riemannian manifold can be written as $\Delta f = - \sum_{i=1}^d \partial^2 f / \partial s_i^2$, where $f \in C^\infty(\Gamma)$ is a smooth function on the manifold. The Neumann heat operator on the manifold (heat diffusion on the manifold with the Neumann boundary condition) is $e^{−t \Delta}$, which corresponds to the Neumann heat kernel $p_t (x, y)$ when seen as an integral operator.

Consider rotation-invariant kernels $k_\varepsilon (x, y) = h(\|x-y\|^2 / \varepsilon)$ with scale parameter $\varepsilon$, where $h$ is smooth and decays exponentially, and $\|\cdot\|$ is the Euclidean metric. As $\varepsilon$ goes to 0, the local geometry specified by $k_\varepsilon$ coincide with that of the manifold. Given an averaging operator $\tilde{A}_\varepsilon f(x) = K_\varepsilon f(x) / v_\varepsilon^2 (x)$ and $v_\varepsilon^2 (x) = K_\varepsilon 1$, the weighted graph Laplacian operator is $\tilde{\Delta}_\varepsilon = (I - \tilde{A}_\varepsilon) / \varepsilon$. On certain subspaces $E_K$ of $L^2(\Gamma)$, as $\varepsilon$ goes to 0, the limit operator of $\tilde{\Delta}_\varepsilon$ is $H f = c (\Delta f + 2 \langle \nabla p / p , \nabla f \rangle)$, where $c = m_2 / (2 m_0)$, $m_0 = \int_{\mathbb{R}^d} h(\|u\|^2)~\mathrm{d}u$, and $m_2 = \int_{\mathbb{R}^d} u_i^2 h(\|u\|^2)~\mathrm{d}u$. Under a conjugation by the density $p$, we get $p H(g/p) = c (\Delta g - g \nabla p / p)$, where $g = p f$. Thus, when the density $p$ is uniform, the weighted graph Laplacian $\tilde{\Delta}_\varepsilon$ converges to (a multiple of) the Laplace-Beltrami operator $\Delta$ on the manifold [@Belkin2003]; but when the density is non-uniform, such reconstruction does not hold.

To separate the distribution $p$ on the manifold from its intrinsic geometry $k_\varepsilon$, use kernel $\tilde{k}_\varepsilon(x, y) = k_\varepsilon(x, y)/(p_\varepsilon(x) p_\varepsilon(y))$ where $p_\varepsilon(x) = K_\varepsilon 1$. This leads to an averaging operator $\bar{A}_\varepsilon f(x) = \tilde{K}_\varepsilon f(x) / \tilde{v}_\varepsilon^2 (x)$ where $\tilde{v}_\varepsilon^2 (x) = \tilde{K}_\varepsilon 1$. Define a Laplace operator $\Delta_\varepsilon = (I - \bar{A}_\varepsilon) / \varepsilon$ acting on the subspaces $E_K$ of $L^2(\Gamma)$, its limit operator $\lim_{\varepsilon \to 0} \Delta_\varepsilon \equiv \Delta_0 = \Delta / c$ is a multiple of the Laplace-Beltrami operator. The limit operator of $\bar{A}_\varepsilon^{t/\varepsilon}$ is the Neumann heat operator $e^{−t \Delta_0}$; in other words, averaging kernel $\bar{a}_\varepsilon^{(t/\varepsilon)}(x, y)$ with small $\varepsilon$, close to a fine scale Gaussian kernel, approximates the Neumann heat kernel $p_t (x, y)$. The averaging operator $\bar{A}_\varepsilon$ is compact and its eigendecomposition can be written as $\bar{A}_\varepsilon = \sum_{i \in \mathbb{N}} \lambda_{\varepsilon, i} P_{\varepsilon, i}$, where $P_{\varepsilon, i}$ is the orthogonal projector on the eigenspace corresponding to eigenvalue $\lambda_{\varepsilon, i}$. If the eigendecomposition of the Laplace-Beltrami operator is $\Delta = \sum_{i \in \mathbb{N}} \nu_i^2 P_i$, where $P_i$ is the orthogonal projector on the eigenspace corresponding to eigenvalue $\nu_i^2$. Then the eigenfunctions and eigenvalues of $\bar{A}_\varepsilon^{t/\varepsilon}$ converge to those of the heat operator $e^{-t\Delta}$: $\lim_{\varepsilon \to 0} \lambda_{\varepsilon, i}^{t/\varepsilon} = e^{-t\nu_i^2}$, $\lim_{\varepsilon \to 0} P_{\varepsilon, i} = P_i$. Let $\{\phi_i\}_{i \in \mathbb{N}}$ be the eigenfunctions of the Laplace-Beltrami operator (also of the heat kernel) on the manifold $\Gamma$, then the eigendecomposition of the heat kernel is $p_t(x, y) = \sum_{i \in \mathbb{N}} e^{-t \nu_i^2} \phi_i(x) \phi_i(y)$. Although $\{\phi_i\}_{i \in \mathbb{N}}$ is usually viewed as a Hilbert basis of $L^2(Γ)$, it also forms a set of coordinates on the submanifold $Γ$ via diffusion maps $\Phi$, which accurately preserves the heat diffusion distance with dimension $k \ge d$.

Numerical procedure for approximate Neumann heat diffusion:

Choose scale parameter $\varepsilon$ to be (on the order of) the mean square distance within the set: $\varepsilon = \sum_x \min_{y \ne x} \|x - y\|^2 m(x) / N$, where $m(x)$ is the number of observations identical to $x$ in the data.
Construct operator $[K_\varepsilon]$ with entries $[K_\varepsilon]_{xy} = k_\varepsilon(x, y)$, where $k_\varepsilon(x, y) = e^{-\|x-y\|^2/\varepsilon}$ is a Gaussian kernel.
Compute approximate density $p_\varepsilon = [K_\varepsilon] \mathbf{1}$ by substituting true density $p(x)$ with the empirical measure.
Compute density-neutralizing operator $[\tilde{K}_\varepsilon] = \mathrm{diag}(p_\varepsilon^{-1}) [K_\varepsilon] \mathrm{diag}(p_\varepsilon^{-1})$. It can be implemented as $[K_\varepsilon] / [p_\varepsilon p_\varepsilon^T]$, where $/$ is element-wise division.
Compute normalizing vector $\tilde{v}_\varepsilon^2 = [\tilde{K}_\varepsilon] \mathbf{1}$. Take a square root to get $\tilde{v}_\varepsilon$.
Compute diffusion operator $[A_\varepsilon] = \mathrm{diag}(\tilde{v}_\varepsilon^{-1}) [\tilde{K}_\varepsilon] \mathrm{diag}(\tilde{v}_\varepsilon^{-1})$. It can be implemented as $[\tilde{K}_\varepsilon] / [\tilde{v}_\varepsilon \tilde{v}_\varepsilon^T]$.
Eigendecompose $[A_\varepsilon] = \Phi \Lambda_\varepsilon \Phi^T$. The Neumann heat operator has approximate spectrum $\lambda_\varepsilon$ and eigenvectors $\{\phi_i / \phi_1\}_{i=2}^N$. Note that $A_\varepsilon \tilde{v}_\varepsilon = \tilde{v}_\varepsilon$, so $\phi_1$ and $\tilde{v}_\varepsilon$ are proportional.

Anisotropic Diffusion

(for differential and dynamical systems)

Geometric Harmonics

Geometric harmonics is a set of functions that allows out-of-sample extension of empirical functions on the data set.

Intrinsic geometry, Fourier analysis of the manifold (eigenfunctions of the Laplace-Beltrami operator). Extrinsic geometry, Fourier analysis of the ambient space.

Restriction.

Extension (a version of the Heisenberg uncertainty principle): extend the eigenfunctions to a band-limited function of band $\mathcal{O}(\nu_i)$, localized in a tube of radius $\mathcal{O}(1/\nu_i)$ around the manifold.

(Extension of f = 1 for discription of the manifold?)

Multiscale extension: empirical functions are decomposed into frequency bands, and each band is extended to a certain distance so that it satisfies some version of the Heisenberg principle.

(Intrinsic and extrinsic diffusion.)