Probabilistic learning on manifold.

A probability distribution concentrated on a subset of the Euclidean space, which is viewed as a manifold.

Probability Density Estimation and Sampling

A synthesis of methods: {Soize2016}

  1. Kernel density estimation (KDE) for the probability distribution of the sample matrix;
  2. Diffusion maps for the "local" geometric structure (manifold) of the dataset;
    • reduced-order representation of the sample matrix by top eigenvectors of the diffusion map;
  3. Markov chain Monte Carlo (MCMC) method based on Itô stochastic differential equation (ISDE) for generating realizations of the sample matrix;

(Applications: data security/anonymization) (TODO: KDE for discrete distributions.) (Diffusion maps have difficulty scaling to large-N data sets.)

Preliminary

The Gaussian KDE of a random vector \(H\), built from \(N\) samples \(η^i \in \mathbb{R}^v\), is \(p_H(η) = 1/N \sum_i π(η; η^i s'/s, s')\) (a minimal evaluation sketch follows the list):

  • \(π(η; m, σ) = \exp\{ - \|(η - m)/σ\|^2 / 2 \} / (\sqrt{2 π} σ)^v \) is the Gaussian kernel;
  • \(s = (4 / ((v + 2)N))^{1 / (v + 4)}\) is the optimal Silverman bandwidth;
  • \(s' = s / \sqrt{s^2 + (N − 1)/N}\) is the modified bandwidth; shrinking the kernel centers by \(s'/s\) compensates for the variance added by the kernels, so the covariance of \(p_H\) stays consistent with the empirical covariance of the normalized samples;
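
A minimal NumPy sketch of evaluating this KDE, assuming the normalized samples \(η^i\) are the columns of an array `eta`; the function names are illustrative, not from {Soize2016}:

```python
import numpy as np
from scipy.special import logsumexp

def silverman_bandwidths(nu, N):
    """Silverman bandwidth s and modified bandwidth s' (see the list above)."""
    s = (4.0 / ((nu + 2.0) * N)) ** (1.0 / (nu + 4.0))
    s_mod = s / np.sqrt(s**2 + (N - 1.0) / N)
    return s, s_mod

def kde_logpdf(query, eta, s, s_mod):
    """log p_H(query) for the shrunk-center Gaussian KDE.

    eta: (nu, N) normalized samples as columns; query: (nu,) evaluation point.
    Kernel centers are (s'/s) * eta^i, all with width s'.
    """
    nu, N = eta.shape
    centers = (s_mod / s) * eta
    sq = np.sum((query[:, None] - centers) ** 2, axis=0) / s_mod**2
    log_kernel_norm = -nu * np.log(np.sqrt(2.0 * np.pi) * s_mod)
    return logsumexp(-0.5 * sq) - np.log(N) + log_kernel_norm
```

This evaluates the density at a single point; vectorizing over a batch of query points is straightforward.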

(Given a manifold, scattering in certain directions becomes less likely.) (Better to do a homogenized scatter on the manifold: data + scatter along local tangent-space/PCA/displacements.) (The manifold is described by an atlas of charts.) (Local information is summarized to second order and reconstructed as a Gaussian distribution.)

Sampling the Gaussian KDE of the random vector: the following Markov stochastic process, a nonlinear second-order dissipative Hamiltonian dynamical system, has a unique solution that is a second-order diffusion stochastic process, stationary and ergodic, with a unique invariant measure whose first marginal is \(p_{H}(η)\), so that \(U \sim p_{H}(η)\) in steady state (a small integration sketch follows the symbol list): \[\begin{cases} dU = V dt \\ dV = L(U) dt − f_0/2 V dt + \sqrt{f_0} dW \\ U(0) = H;\quad V(0) = N \end{cases}\]

  • \(L(u) = -∇ν(u)\) is the conservative force;
  • \(ν(u) = -\text{LogSumExp}_i\{ -\|(u - η^i s'/s) / s'\|^2 / 2\}\) is the potential, i.e. \(-\log p_H(u)\) up to an additive constant;
  • \(f_0\) is a dissipation parameter chosen so that the transient response of the ISDE is rapidly damped;
  • \(W\) is a v-dimensional normalized Wiener process (increments are standard Gaussian);
  • \(N\) is a v-dimensional normalized Gaussian random vector;
  • \(H \sim p_{H}(η)\) is a random vector whose realizations are the columns \(η^i\) of \([η]\);
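
A minimal sketch of this sampler, reusing the Störmer–Verlet discretization spelled out in the Procedure section below; `eta`, the default `f0` and `fac` values, and the function names are illustrative assumptions:

```python
import numpy as np
from scipy.special import softmax

def conservative_force(u, eta, s, s_mod):
    """L(u) = -grad nu(u) for the LogSumExp potential of the Gaussian KDE.

    u: (nu,) state; eta: (nu, N) normalized samples as columns.
    Equals (softmax-weighted mean of the kernel centers - u) / s'^2.
    """
    centers = (s_mod / s) * eta
    logits = -0.5 * np.sum((u[:, None] - centers) ** 2, axis=0) / s_mod**2
    w = softmax(logits)
    return (centers @ w - u) / s_mod**2

def sample_kde_isde(eta, s, s_mod, f0=1.5, fac=20, n_steps=2000, rng=None):
    """Integrate dU = V dt, dV = L(U) dt - (f0/2) V dt + sqrt(f0) dW
    with a Stoermer-Verlet step; returns the trajectory of U."""
    rng = np.random.default_rng(rng)
    nu, N = eta.shape
    dt = 2.0 * np.pi * s_mod / fac       # sampling step (see the Procedure section)
    b = f0 * dt / 4.0
    u = eta[:, rng.integers(N)].copy()   # U(0) = H, taken as one of the data points
    v = rng.standard_normal(nu)          # V(0) = N, a normalized Gaussian vector
    traj = np.empty((n_steps, nu))
    for l in range(n_steps):
        u_half = u + 0.5 * dt * v
        dw = np.sqrt(dt) * rng.standard_normal(nu)
        v = ((1.0 - b) * v
             + dt * conservative_force(u_half, eta, s, s_mod)
             + np.sqrt(f0) * dw) / (1.0 + b)
        u = u_half + 0.5 * dt * v
        traj[l] = u
    return traj
```

After discarding the transient (time of order \(4/f_0\)), retaining every \(M_0\)-th state gives approximately independent draws from \(p_H\), as in the Procedure section.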

(Why not use bootstrap plus Gaussian noise?)

Procedure

  1. Scale the data to (ϵ, 1): \([x_0]\), a matrix of n attributes by N observations;
  2. Normalize the data by principal component analysis (PCA): \([η] = [μ]^{−1/2} [φ]^T [x_0]\)
    • μ are the v positive eigenvalues of the covariance matrix \([c] = [x_0] [x_0]^T /(N-1)\);
    • φ are the corresponding orthonormal eigenvectors;
  3. Characterize the manifold using a diffusion maps basis: \([η] = [z] [g]^T\)
    • \([g] = [P]^κ [ψ]\) is the "diffusion maps basis", where κ is the analysis scale of the local geometric structure of the dataset;
    • \([P] = \mathrm{diag}\{[K] 1\}^{−1} [K]\) is a transition matrix with right eigenvectors \(\{ψ\}\);
    • \([K]_{ij} = k_ε(η^i, η^j)\) are transition likelihoods, where \(k_ε(x, y) = \exp\{− \|x − y\|^2 / (4ε)\}\) is the Gaussian kernel (or any symmetric non-negative function) with a smoothing parameter ε;
    • \([z] = [η] [a]\), \([a] = [g] ([g]^T [g])^{-1} \), so that \([z] [g]^T = [η] P_{[g]}\);
    • \([η](m) = [η] P_{[g](m)}\) is a reduced-order representation that projects \([η]\) on the first m diffusion maps basis vectors \([g](m)\);
  4. Sample the reduced-order sample matrix by solving an Itô stochastic differential equation (ISDE): \([η](t) = [Z](t) [g](m)^T\), \(t = l M_0 Δt\), \(l = 1, 2, ...\) (an end-to-end sketch of steps 1–4 follows this list)
    • m satisfies the mean-square convergence criterion \(\|[c](m) - [c]\|_F < ε_0 \|[c]\|_F\) for some \(ε_0 = O(10^{-3})\), where \([c](m)\) is the empirical covariance of \([η](m)\);
    • \([Z](t)\) satisfies the following ISDE such that \([Z] [g](m)^T \sim p_{[H](m)}(η)\): \[\begin{cases} d[Z] = [Y] dt \\ d[Y] = [L]([Z] [g](m)^T) [a](m) dt − f_0/2 [Y] dt + \sqrt{f_0} d[W] [a](m) \\ [Z](0) = [H] [a](m);\quad [Y](0) = [N] [a](m) \end{cases}\] which can be solved by the Störmer–Verlet discretization scheme (preserves energy for non-dissipative Hamiltonian dynamical systems): \[\begin{cases} [Z_{l+1/2}] = [Z_l] + [Y_l] Δt/2 \\ [Y_{l+1}] = \left((1-b) [Y_l] + [L_{l+1/2}] [a](m) Δt + \sqrt{f_0} [ΔW_{l+1}] [a](m)\right) / (1+b)\\ [Z_{l+1}] = [Z_{l+1/2}] + [Y_{l+1}] Δt/2 \end{cases}\] where \([L_{l+1/2}] = [L]([Z_{l+1/2}] [g](m)^T)\), \([L](u) = (-∇ν(u^j))_j\), and \(b = f_0 Δt/4\).
    • \(Δt = 2 π s' / \text{Fac}\) is the sampling step of the integration scheme (oversampled if Fac > 1);
    • \(M_0 Δt > 4 / f_0\), the relaxation time of the dynamical system (e.g. \(M_0 > 2 \log(100) \text{Fac} / (\pi f_0 s')\)), so that the retained states, one every \(M_0\) steps, are approximately independent;
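
An end-to-end sketch of steps 1–4 under simplifying assumptions: the (ϵ, 1) scaling of step 1 is omitted, and ε and m are supplied by the user instead of being selected by the convergence criterion; all function names are illustrative.

```python
import numpy as np
from scipy.special import softmax

def pca_normalize(x, tol=1e-9):
    """Step 2: [eta] = [mu]^{-1/2} [phi]^T [x0] from the centered data [x0] (n x N)."""
    x0 = x - x.mean(axis=1, keepdims=True)
    c = x0 @ x0.T / (x.shape[1] - 1)
    mu, phi = np.linalg.eigh(c)
    keep = mu > tol * mu.max()                      # keep the v positive eigenvalues
    mu, phi = mu[keep], phi[:, keep]
    eta = (phi / np.sqrt(mu)).T @ x0                # (v, N)
    return eta, mu, phi

def diffusion_maps_basis(eta, eps, kappa=1, m=None):
    """Step 3: [g] = [P]^kappa [psi], with [P] built from the Gaussian kernel k_eps."""
    d2 = np.sum((eta[:, :, None] - eta[:, None, :]) ** 2, axis=0)
    K = np.exp(-d2 / (4.0 * eps))                   # [K]_ij = k_eps(eta^i, eta^j)
    P = K / K.sum(axis=1, keepdims=True)            # diag{[K]1}^{-1} [K]
    lam, psi = np.linalg.eig(P)                     # right eigenvectors of [P]
    order = np.argsort(-lam.real)
    lam, psi = lam.real[order], psi.real[:, order]
    g = psi * lam**kappa                            # [P]^kappa psi_a = lam_a^kappa psi_a
    return g if m is None else g[:, :m]

def sample_plom(x, eps, m, f0=1.5, fac=20, M0=110, n_extra=10, kappa=1, rng=None):
    """Step 4: reduced-order ISDE solved with the Stoermer-Verlet scheme.

    Returns n_extra additional sample matrices, each of the same shape as x.
    """
    rng = np.random.default_rng(rng)
    x_mean = x.mean(axis=1, keepdims=True)
    eta, mu, phi = pca_normalize(x)
    nu, N = eta.shape
    s = (4.0 / ((nu + 2.0) * N)) ** (1.0 / (nu + 4.0))
    s_mod = s / np.sqrt(s**2 + (N - 1.0) / N)
    g = diffusion_maps_basis(eta, eps, kappa, m)    # (N, m)
    a = g @ np.linalg.inv(g.T @ g)                  # [a] = [g]([g]^T[g])^{-1}
    centers = (s_mod / s) * eta

    def L(u):                                       # [L](u): column j is -grad nu(u^j)
        out = np.empty_like(u)
        for j in range(u.shape[1]):
            logits = -0.5 * np.sum((u[:, [j]] - centers) ** 2, axis=0) / s_mod**2
            out[:, j] = (centers @ softmax(logits) - u[:, j]) / s_mod**2
        return out

    dt = 2.0 * np.pi * s_mod / fac
    b = f0 * dt / 4.0
    Z = eta @ a                                     # [Z](0) = [H][a](m), [H] = [eta]
    Y = rng.standard_normal((nu, N)) @ a            # [Y](0) = [N][a](m)
    extras = []
    for l in range(1, n_extra * M0 + 1):
        Zh = Z + 0.5 * dt * Y
        dW = np.sqrt(dt) * rng.standard_normal((nu, N))
        Y = ((1 - b) * Y + dt * L(Zh @ g.T) @ a + np.sqrt(f0) * dW @ a) / (1 + b)
        Z = Zh + 0.5 * dt * Y
        if l % M0 == 0:                             # retain one matrix every M0 steps
            eta_new = Z @ g.T                       # [eta](t) = [Z](t) [g](m)^T
            extras.append(phi @ (np.sqrt(mu)[:, None] * eta_new) + x_mean)
    return extras
```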

Parameters: \((ε, κ = 1, ε_0, f_0 = 1.5, Δt, M_0 = 110)\), or replace Δt with Fac.
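
A hypothetical call to the `sample_plom` sketch above with these defaults; the data matrix and the values of ε and m are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: n = 3 attributes, N = 200 observations near a circle (a 1-D manifold).
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
x = np.vstack([np.cos(theta), np.sin(theta), 0.1 * rng.standard_normal(200)])

# kappa = 1, f0 = 1.5, M0 = 110 as listed above; eps and m are ad hoc choices here
# (m would normally come from the mean-square criterion with eps_0 ~ 1e-3).
extra_samples = sample_plom(x, eps=1.0, m=4, f0=1.5, fac=20, M0=110, n_extra=5)
```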

The computational cost is no greater than that of the direct MCMC in the Preliminary section, but the main advantage is that the generated samples follow a probability distribution concentrated on the manifold.

Polynomial Chaos representation

{Soize2017}

Probabilistic non-convex optimization

{Ghanem2018}