Probabilistic approaches to machine learning, or probabilistic learning [@Ghahramani2015], include probabilistic modeling, defining likelihoods, parameter estimation using likelihood and Bayesian techniques, probabilistic approaches to classification, clustering, and regression, and related topics such as model selection and bias/variance tradeoffs.

Probabilistic learning on manifolds concerns a probability distribution concentrated on a (measure-zero) subset of the Euclidean space, which is viewed as a manifold.

Move points toward the "principal curve" [@Hastie, Stuetzle, 1989]; A point is on the "density ridge" if the maximum negative curvature of the PDF at the point is perpendicular to the gradient of the PDF at the point [@Scott1991b].

Other references (directional statistics): Statistical Analysis of Spherical Data [@Fisher, Lewis, Embleton, 1987]; Statistics on Spheres [@Watson1985];

Distribution on Manifold

Probability density estimation and approximation on Riemannian manifolds. Considerations: computational cost, convergence rate, assumption applicability. See [@Brigant2019] for a survey.

Parametric families of probability densities on manifolds: (MISE is usually $O(N^{-1})$ by parametric estimation)

  1. heat kernels: wrapped Gaussian for 1-sphere;
  2. maximum entropy distributions with moment constraints [@Pennec2006]:
    • von Mises-Fisher (vMF) [@Fisher1953] for n-spheres, hyperbolic vMF [@Barndorff-Nielsen1978], matrix vMF [@Chikuse2003] for Stiefel manifolds;
    • group-invariant natural exponential families on symplectic manifolds;
  3. based on geodesic distance:
    • Gaussian-like distribution $f(p) \propto \exp[- d^2(p, \mu) / (2 \sigma^2)]$ on symmetric spaces [@Said2018], e.g. positive definite matrix manifold [@Said2017];

Wrapped Gaussian distribution on the 1-sphere is its heat kernel, which "wraps" the Gaussian distribution around the unit circle: $f_{WN} (\theta; \mu, \sigma) = \sum_{k \in \mathbb{Z}} \exp[-(\theta-\mu+2\pi k)^2/(2\sigma^2)] / \sqrt{2 \pi \sigma^2}$. Heat kernels of complete Riemannian manifolds do not have closed forms in general, see Geometric Diffusion.

von Mises distribution on the 1-shpere is a closed-form approximation to the wrapped Gaussian distribution; von Mises–Fisher distribution [@Fisher1953] generalizes the von Mises distribution to n-spheres, and is the most commonly used distribution in directional/circular/spherical statistics: $f(x; a, b) \propto \exp(b \cos(x-a))$, where $x, a \in [0, 2\pi), b \ge 0$.

Non-parametric density estimation:

  1. projection of the density function onto eigenfunctions of the Laplace–Beltrami operator [@Hendriks1990]: $\text{MISE} = O(Q^{-s} + N^{-1} Q^{d/2})$, optimal case $O(N^{-2s/(2s+d)})$, where $s$ is smoothness class of the density;
  2. kernel density estimation (local linearization) [@Pelletier2005]: function of geodesic distance, support smaller than injectivity radius, volume density bounded away from 0; $\text{MISE} = O(r^4 + N^{-1} r^{-d})$, optimal case $O(N^{-4/(4+d)})$.

Quantization: approximate a random variable by a quantized/discretized version; optimal quantization minimizes the $L^p$ distance to the true measure from a discrete measure of $n$ supporting points; any quantization defines a clustering by Voronoï cells; the optimal centers are asymptotically distributed as $h^{d/(d+p)}$; Competitive Learning Vector Quantization is a classical algorithm for quadratic ($p = 2$) vector quantization.

Manifold Sampling

Sampling density estimates on (high dimensional) manifolds defined by limited data, done by a synthesis of methods [@Soize2016]:

  1. (Implicit): Kernel density estimation (KDE) for the probability distribution of the sample matrix;
  2. Diffusion maps for the "local" geometric structure, aka manifold, of the dataset: top eigenvectors of the diffusion map is used for a reduced-order representation of the sample matrix;
  3. Markov chain Monte Carlo (MCMC) method based on Ito stochastic differential equation (ISDE) for generating realizations of the sample matrix;

Preliminary: Sampling a Gaussian KDE

The Gaussian KDE of a $v×N$ sample matrix $[η]$, modified as in [@Soize2015]: $$p_H(η) = 1/N \sum_i π(η; η^i s'/s, s')$$

  • $π(η; m, σ) = \exp\{ - \|(η - m)/σ\|^2 / 2 \} / (\sqrt{2 π} σ)^v$ is the Gaussian kernel;
  • $s = (4 / ((v + 2)N))^{1 / (v + 4)}$ is the optimal Silverman bandwidth;
  • $s' = s / \sqrt{s^2 + N / (N − 1)}$;

Sampling the Gaussian KDE of the random vector by solving an ISDE [@Soize1994, pp. 211-216, Thm. 4-7]. The following Markov stochastic process of a nonlinear second-order dissipative Hamiltonian dynamical system has a unique invariant measure and a unique solution that is a second-order diffusion stochastic process, which is stationary, ergodic, and $U(t) \sim p_{H}(η)$: $$\begin{cases} dU = V dt \\ dV = L(U) dt − 1/2 f_0 V dt + \sqrt{f_0} dW \\ U(0) = H;\quad V(0) = N \end{cases}$$

  • $L(u) = -∇ν(u)$ is the conservative force;
  • $ν(u) = -\text{LogSumExp}\{ -\|(u - η^i s'/s) / s'\|^2 / 2\}$ is the potential (Hamiltonian);
  • $f_0$ is a dissipation parameter such that the transient response of the ISDE are rapidly killed;
  • $W$ is the v-dimensional normalized Wiener processes (increments are standard Gaussian);
  • $H \sim p_{H}(η)$ is a random vector with realizations $[η]$;
  • $N$ is the v-dimensional normalized Gaussian vector;

Procedure: Sampling a manifold-reduced Gaussian KDE

  1. Shift and scale the data $$, a matrix of $p$ attributes by $N$ observations, to $(ϵ, 1)$;
  2. Normalize the data $[x_0]$ by principal component analysis (PCA): $[η] = \mathrm{diag}(μ)^{−1/2} [φ]^T [x_0]$
    • $μ$ are the $v \le p$ positive eigenvalues of the empirical covariance matrix $[c] = [x_0] [x_0]^T /(N-1)$;
    • $φ$ the corresponding $v$ orthonormal eigenvectors;
  3. Characterize the manifold using a diffusion maps basis: $[g] = [P]^κ [ψ] = [ψ] [\Lambda]^κ$
    • $[P] = \mathrm{diag}\{[K] 1\}^{−1} [K]$ is the diffusion map (a transition matrix), $\{ψ\}$ is the right eigenvectors of $[P]$, $[\Lambda]$ is the diagonal matrix of the corresponding eigenvalues, and $κ$ is the "analysis scale" of the local geometric structure of the dataset;
    • $[K]_{ij} = k_ε(η^i, η^j)$ are transition likelihood, $k_ε(x,y)=\exp\{− \|x − y\|^2 / (4ε)\}$ is the Gaussian kernel with smoothing parameter $ε$, the kernel may be set to any decreasing function of a metric $f(d(x, y))$ with $f(0) = 1, f(\infty) = 0$;
    • $[η] = [z] [g]^T$, where $[z] = [η] [a]$ and $[a] = [g] ([g]^T [g])^{-1}$, because full projection $P_{[g]} = [g] ([g]^T [g])^{-1} [g]^T = I$;
    • $[η](m) = [η] P_{[g](m)}$ is a reduced-order representation of $[η]$ that projects $[η]$ on $[g](m)$, the first $m$ vectors of $[g]$;
  4. Sample the reduced-order sample matrix by solving an ISDE: $[η](t) = [Z](t) [g](m)^T$, $t = l M_0 Δt$, $l = 1, 2, ...$
    • $m$ satisfies mean-square convergence criterion $\|[c](m) - [c]\|_F < ε_0 \|[c]\|_F$ for some $ε_0 = O(10^{-3})$;
    • $[Z](t)$ satisfies the following ISDE where $[L](u) = (-∇ν(u^j))_j$: $$\begin{cases} d[Z] = [Y] dt \\ d[Y] = [L]([Z] [g](m)^T) [a](m) dt − 1/2 f_0 [Y] dt + \sqrt{f_0} d[W] [a](m) \\ [Z](0) = [H] [a](m);\quad [Y](0) = [N] [a](m) \end{cases}$$ If exists $[L'] = \nabla \log \rho'$ such that $[L']([Z]) = [L]([Z] [g](m)^T) [a](m)$, then $[Z] \sim \rho'$.
    • The Störmer–Verlet discretization scheme of the ISDE (preserves energy for non-dissipative Hamiltonian dynamical systems): $$\begin{cases} [Z_{l+1/2}] = [Z_l] + [Y_l] Δt/2 \\ [Y_{l+1}] = ((1-b) [Y_l] + [L_{l+1/2}] [a](m) Δt + \sqrt{f_0} [ΔW_{l+1}] [a](m)) / (1+b)\\ [Z_{l+1}] = [Z_{l+1/2}] + [Y_{l+1}] Δt/2 \end{cases}$$ where $[L_{l+1/2}] = [L]([Z_{l+1/2}] [g](m)^T)$ and $b = f_0 Δt/4$.
    • $Δt = 2πs' / \text{Fac}$ is the sampling step of the integration scheme (oversampled if Fac>1);
    • $M_0 Δt > 4 / f_0$, the relaxation time of the dynamical system, so samples are approximately independent, e.g. $M_0 > 2 \log(100) \text{Fac} / (\pi f_0 s')$;

Parameters: $(ε, κ = 1, ε_0, f_0 = 1.5, Δt, M_0 = 110)$, or replace $Δt$ with Fac.

The computational cost is no greater than the direct MCMC in the Preliminary section. But the main advantage is a probability distribution concentrated on manifold.

Misc

Extensions:

  • Inference;
  • polynomial chaos representation [@Soize2017];
  • stochastic non-convex optimization [@Ghanem2018]: well placement [@Ghanem2018]; design optimization [@Ghanem2019];
  • entropy‐based closure [@Soize2019];
  • physical or statistical constraints [@Soize2020];

Applications: optimization (cheap sample on unknown feasible set, a submanifold): stochastic optimization [@Kingma2015]; maximum likelihood; Bayesian inference [@ChenYC2020]; dynamical system on manifolds [@Talmon2013; @Talmon2015];


🏷 Category=Manifold