Probabilistic approaches to machine learning, or probabilistic learning [@Ghahramani2015],
include probabilistic modeling, defining likelihoods,
parameter estimation using likelihood and Bayesian techniques,
probabilistic approaches to classification, clustering, and regression,
and related topics such as model selection and bias/variance tradeoffs.
Probabilistic learning on manifolds concerns
a probability distribution concentrated on a (measurezero) subset of the Euclidean space,
which is viewed as a manifold.
Move points toward the "principal curve" [@Hastie, Stuetzle, 1989];
A point is on the "density ridge" if the maximum negative curvature of the PDF at the point
is perpendicular to the gradient of the PDF at the point [@Scott1991b].
Other references (directional statistics):
Statistical Analysis of Spherical Data [@Fisher, Lewis, Embleton, 1987];
Statistics on Spheres [@Watson1985];
Density estimation and approximation on Manifold
Probability density estimation and approximation on manifolds.
Considerations: computational cost, convergence rate, assumption applicability.
See [@Brigant2019] for a survey.
Parametric families of probability densities on manifolds:
MISE is usually $\mathcal{O}(N^{1})$ by parametric estimation.
 heat kernels: wrapped Gaussian for 1sphere;
 maximum entropy distributions with moment constraints [@Pennec2006]:

matrix manifolds:
von MisesFisher (vMF) for nspheres, matrix vMF for Stiefel manifolds, etc.;
 hyperbolic spaces: hyperbolic vMF;
 symplectic manifolds: groupinvariant natural exponential families;
 based on geodesic distance:
 Riemannian symmetric spaces:
Gaussianlike distribution $f(p) \propto \exp[ d^2(p, \mu) / (2 \sigma^2)]$ [@Said2018],
e.g. positive definite matrices [@Said2017] with a certain structure,
e.g. complex, Toeplitz, or blockToeplitz;
Nonparametric density estimation:
 projection of the density function onto
eigenfunctions of the Laplace–Beltrami operator [@Hendriks1990]:
$\text{MISE} = \mathcal{O}(Q^{s} + N^{1} Q^{d/2})$, optimal case $\mathcal{O}(N^{2s/(2s+d)})$,
where $s$ is smoothness class of the density;
 kernel density estimation (local linearization) [@Pelletier2005]:
function of geodesic distance, support smaller than injectivity radius,
volume density bounded away from 0;
$\text{MISE} = \mathcal{O}(r^4 + N^{1} r^{d})$, optimal case $\mathcal{O}(N^{4/(4+d)})$.
Quantization: approximate a random variable by a quantized/discretized version;
optimal quantization minimizes the $L^p$ distance to the true measure
from a discrete measure of $n$ supporting points;
any quantization defines a clustering by Voronoï cells;
the optimal centers are asymptotically distributed as $h^{d/(d+p)}$;
Competitive Learning Vector Quantization is a classical algorithm
for quadratic ($p = 2$) vector quantization.
Manifold Sampling
Sampling density estimates on (high dimensional) manifolds defined by limited data,
done by a synthesis of methods [@Soize2016]:
 (Implicit): Kernel density estimation (KDE) for the probability distribution of the sample matrix;
 Diffusion maps for the "local" geometric structure, aka manifold, of the dataset:
top eigenvectors of the diffusion map is used
for a reducedorder representation of the sample matrix;
 Markov chain Monte Carlo (MCMC) method based on Ito stochastic differential equation (ISDE)
for generating realizations of the sample matrix;
Preliminary: Sampling a Gaussian KDE
The Gaussian KDE of a $v×N$ sample matrix $[η]$, modified as in [@Soize2015]:
$$p_H(η) = 1/N \sum_i π(η; η^i s'/s, s')$$
 $π(η; m, σ) = \exp\{  \(η  m)/σ\^2 / 2 \} / (\sqrt{2 π} σ)^v$ is the Gaussian kernel;
 $s = (4 / ((v + 2)N))^{1 / (v + 4)}$ is the optimal Silverman bandwidth;
 $s' = s / \sqrt{s^2 + N / (N − 1)}$;
Sampling the Gaussian KDE of the random vector by solving an ISDE [@Soize1994, pp. 211216, Thm. 47].
The following Markov stochastic process of a nonlinear secondorder dissipative Hamiltonian
dynamical system has a unique invariant measure
and a unique solution that is a secondorder diffusion stochastic process,
which is stationary, ergodic, and $U(t) \sim p_{H}(η)$:
$$\begin{cases}
dU = V dt \\
dV = L(U) dt − 1/2 f_0 V dt + \sqrt{f_0} dW \\
U(0) = H;\quad V(0) = N
\end{cases}$$
 $L(u) = ∇ν(u)$ is the conservative force;
 $ν(u) = \text{LogSumExp}\{ \(u  η^i s'/s) / s'\^2 / 2\}$ is the potential (Hamiltonian);
 $f_0$ is a dissipation parameter such that the transient response of the ISDE are rapidly killed;
 $W$ is the vdimensional normalized Wiener processes (increments are standard Gaussian);
 $H \sim p_{H}(η)$ is a random vector with realizations $[η]$;
 $N$ is the vdimensional normalized Gaussian vector;
Procedure: Sampling a manifoldreduced Gaussian KDE

Shift and scale the data $[ x ]$, a matrix of $p$ attributes by $N$ observations, to $(ϵ, 1)$;
 Normalize the data $[x_0]$ by principal component analysis (PCA):
$[η] = \mathrm{diag}(μ)^{−1/2} [φ]^T [x_0]$
 $μ$ are the $v \le p$ positive eigenvalues of the empirical covariance matrix
$[c] = [x_0] [x_0]^T /(N1)$;
 $φ$ the corresponding $v$ orthonormal eigenvectors;
 Characterize the manifold using a diffusion maps basis: $[g] = [P]^κ [ψ] = [ψ] [\Lambda]^κ$
 $[P] = \mathrm{diag}\{[K] 1\}^{−1} [K]$ is the diffusion map (a transition matrix),
$\{ψ\}$ is the right eigenvectors of $[P]$,
$[\Lambda]$ is the diagonal matrix of the corresponding eigenvalues, and
$κ$ is the "analysis scale" of the local geometric structure of the dataset;
 $[K]_{ij} = k_ε(η^i, η^j)$ are transition likelihood,
$k_ε(x,y)=\exp\{− \x − y\^2 / (4ε)\}$ is the Gaussian kernel with smoothing parameter $ε$,
the kernel may be set to any decreasing function of a metric $f(d(x, y))$
with $f(0) = 1, f(\infty) = 0$;
 $[η] = [z] [g]^T$, where $[z] = [η] [a]$ and $[a] = [g] ([g]^T [g])^{1}$,
because full projection $P_{[g]} = [g] ([g]^T [g])^{1} [g]^T = I$;
 $[η](m) = [η] P_{[g](m)}$ is a reducedorder representation of $[η]$
that projects $[η]$ on $[g](m)$, the first $m$ vectors of $[g]$;
 Sample the reducedorder sample matrix by solving an ISDE:
$[η](t) = [Z](t) [g](m)^T$, $t = l M_0 Δt$, $l = 1, 2, ...$
 $m$ satisfies meansquare convergence criterion $\[c](m)  [c]\_F < ε_0 \[c]\_F$
for some $ε_0 = \mathcal{O}(10^{3})$;
 $[Z](t)$ satisfies the following ISDE where $[L](u) = (∇ν(u^j))_j$:
$$\begin{cases}
d[Z] = [Y] dt \\
d[Y] = [L]([Z] [g](m)^T) [a](m) dt − 1/2 f_0 [Y] dt + \sqrt{f_0} d[W] [a](m) \\
[Z](0) = [H] [a](m);\quad [Y](0) = [N] [a](m)
\end{cases}$$
If exists $[L'] = \nabla \log \rho'$ such that $[L']([Z]) = [L]([Z] [g](m)^T) [a](m)$,
then $[Z] \sim \rho'$.
 The Störmer–Verlet discretization scheme of the ISDE
(preserves energy for nondissipative Hamiltonian dynamical systems):
$$\begin{cases}
[Z_{l+1/2}] = [Z_l] + [Y_l] Δt/2 \\
[Y_{l+1}] = ((1b) [Y_l] + [L_{l+1/2}] [a](m) Δt + \sqrt{f_0} [ΔW_{l+1}] [a](m)) / (1+b)\\
[Z_{l+1}] = [Z_{l+1/2}] + [Y_{l+1}] Δt/2
\end{cases}$$
where $[L_{l+1/2}] = [L]([Z_{l+1/2}] [g](m)^T)$ and $b = f_0 Δt/4$.
 $Δt = 2πs' / \text{Fac}$ is the sampling step of the integration scheme (oversampled if Fac>1);
 $M_0 Δt > 4 / f_0$, the relaxation time of the dynamical system,
so samples are approximately independent, e.g. $M_0 > 2 \log(100) \text{Fac} / (\pi f_0 s')$;
Parameters: $(ε, κ = 1, ε_0, f_0 = 1.5, Δt, M_0 = 110)$, or replace $Δt$ with Fac.
The computational cost is no greater than the direct MCMC in the Preliminary section.
But the main advantage is a probability distribution concentrated on manifold.
Misc
Extensions:
 Polynomial chaos expansion (PCE) [@Soize2017] around a manifold;
 Inference: estimate geological structural properties using well logs [@Thimmisetty2018];
 Stochastic optimization [@Ghanem2018] using PLoM as a surrogate:
 well placement [@Ghanem2018a] in a reservoir for maximum oil production;
 scramjet design [@Ghanem2019], maximize the combustion efficiency,
subject to 3 inequality constratins (on the burned equivalence ratio,
the stagnation pressure loss across the combustor, and the maximum pressure RMS),
with 5 control variables (global equivalence ratio, primarysecondary ratio,
primary/secondary injector locations, primary injector angle);
 additive manufacturing [@Marmarelis2020], infer the relation between build parameters and
material properties, minimize the chance of failure and learn the Pareto frontier;
 Maximumentropy selection of diffusion kernel bandwidth [@Soize2019]:
applications to simulations of a ScramJet engine/combustion and a climate model;
 Physical or statistical constraints [@Soize2020], applied to a stochastic (elliptic) PDE;
 Inverse problems: computational mechanics;
ultrasonic wave propagation to identify elasticity field in human bones [@Soize2020a];
Applications:
optimization (cheap sample on unknown feasible set, a submanifold):
stochastic optimization [@Kingma2015]; maximum likelihood;
Bayesian inference [@ChenYC2020];
dynamical system on manifolds [@Talmon2013; @Talmon2015];
🏷 Category=Manifold