Statistical Manifold

This article is about the geometric structure of statistical models.

Concepts

Statistical manifold is a set of probability distributions endowed with a Riemannian manifold structure. Usually the set is parameterized, and the parameter space is often a regular domain of a Euclidean space; but this is not always the case.

Fisher information metric or Fisher–Rao metric (attributed to [@Rao1945]) of a parametric family of probability distributions is the Fisher information of the family, regarded as a Riemannian metric on the family. Recall that the Fisher information of a vector parameter is the covariance matrix of the score, which defines an inner product at that point of the parameter space. More generally, the Fisher information defines a Riemannian metric on any smooth manifold of parameters. Information geometry [@Amari1985; @Amari2000; @Amari2016] is a field that applies differential geometry to statistics. In particular, it adopts the Fisher information metric and α-connections, a one-parameter family of affine connections.

If the Fisher information has no closed form, some related quantities can be useful. For example, the empirical Fisher information is the finite-sample estimate of the (expected) Fisher information. The observed Fisher information [@Efron1978] is the negative Hessian of the log PDF evaluated at the maximum likelihood estimate, $-\nabla_\theta^2\log p(\mathbf{x}; \theta^{ML})$, which need not be positive definite.
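As an illustration of the three quantities above, here is a minimal sketch for an i.i.d. Bernoulli model, chosen only because its score and Hessian are easy to write down. The function name `bernoulli_fisher_quantities` is hypothetical, and the observed information is normalized per observation (an assumption; some texts use the sum over the sample instead).

```python
import numpy as np

def bernoulli_fisher_quantities(x):
    """Return (expected, empirical, observed) Fisher information, per
    observation and evaluated at the MLE, for i.i.d. Bernoulli(theta)
    samples x in {0, 1}. Requires 0 < mean(x) < 1."""
    x = np.asarray(x, dtype=float)
    theta = x.mean()  # maximum likelihood estimate

    # Expected Fisher information: 1 / (theta (1 - theta)).
    expected = 1.0 / (theta * (1.0 - theta))

    # Score of each observation: d/dtheta log p(x_i; theta).
    score = x / theta - (1.0 - x) / (1.0 - theta)
    # Empirical Fisher information: sample average of squared scores.
    empirical = np.mean(score ** 2)

    # Observed Fisher information: negative second derivative of the
    # average log-likelihood at the MLE.
    observed = np.mean(x / theta ** 2 + (1.0 - x) / (1.0 - theta) ** 2)
    return expected, empirical, observed
```

For this particular model the three values coincide at the maximum likelihood estimate; in general they differ, and the observed information may fail to be positive definite.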

Hessian metric for exponential families is a Riemannian metric defined by the Hessian of a convex real-valued function. Such a manifold naturally inherits two flat affine connections, as well as a canonical Bregman divergence.
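A minimal sketch of a Hessian metric and its Bregman divergence, assuming the one-dimensional Bernoulli family in its natural parameter $\eta$ with convex potential $\psi(\eta) = \log(1 + e^\eta)$. All function names are hypothetical. In this example the Bregman divergence of $\psi$ reproduces the Kullback–Leibler divergence with the arguments swapped.

```python
import numpy as np

def log_partition(eta):
    """Convex potential psi(eta) = log(1 + exp(eta))."""
    return np.logaddexp(0.0, eta)

def mean_param(eta):
    """Gradient of psi: the mean parameter mu = sigmoid(eta)."""
    return 1.0 / (1.0 + np.exp(-eta))

def hessian_metric(eta):
    """Hessian of psi, i.e. the metric coefficient mu (1 - mu)."""
    mu = mean_param(eta)
    return mu * (1.0 - mu)

def bregman(eta1, eta2):
    """Bregman divergence psi(eta1) - psi(eta2) - psi'(eta2) (eta1 - eta2)."""
    return (log_partition(eta1) - log_partition(eta2)
            - mean_param(eta2) * (eta1 - eta2))

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
```

Under these assumptions, `bregman(eta1, eta2)` agrees with `kl_bernoulli(mean_param(eta2), mean_param(eta1))`.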

Local Riemannian metrics have also been proposed, such as the preferred point metric [@Critchley1993].

Applications of the geometric perspective of probability models:

diffusion kernel from heat equation for support vector machines [@Lafferty2005], which can be efficiently approximated for multinomial distributions (a simplex endowed with the Fisher information metric);
efficient MCMC sampling in high dimensions [@Girolami2011].

Fisher information geometry

Gaussian distributions

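Riemannian metric on the covariance parameter (mean held fixed), up to a constant factor of 1/2: $g_C(T_1, T_2) = \text{tr}(T_1 C^{-1} T_2 C^{-1})$, where $T_1, T_2$ are symmetric tangent matrices at the covariance $C$. This is the same as the natural metric on the positive-definite manifold $\mathcal{S}_+(n)$.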
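The metric above can be evaluated directly. The sketch below (with the hypothetical name `fisher_metric`) uses linear solves instead of explicit inverses, and relies on the cyclic property of the trace. A property worth checking numerically is invariance under congruence $C \mapsto A C A^T$, $T \mapsto A T A^T$ for invertible $A$.

```python
import numpy as np

def fisher_metric(C, T1, T2):
    """Evaluate g_C(T1, T2) = tr(T1 C^{-1} T2 C^{-1}) for a symmetric
    positive-definite C and symmetric tangent matrices T1, T2."""
    Cinv_T1 = np.linalg.solve(C, T1)  # C^{-1} T1
    Cinv_T2 = np.linalg.solve(C, T2)  # C^{-1} T2
    # tr(C^{-1} T1 C^{-1} T2) equals tr(T1 C^{-1} T2 C^{-1}) by cyclicity.
    return np.trace(Cinv_T1 @ Cinv_T2)
```

At $C = I$ this reduces to the Frobenius inner product $\text{tr}(T_1 T_2)$.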

For n = 1, this manifold can be regarded as the upper half plane with the hyperbolic metric [@Amari1985].
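The n = 1 case can be written out explicitly. The following is the standard computation for the family $N(\mu, \sigma^2)$ in the coordinates $(\mu, \sigma)$; the rescaled coordinate $u$ is introduced here only for illustration.

$$
ds^2 = \frac{d\mu^2 + 2\, d\sigma^2}{\sigma^2},
\qquad
\mu = \sqrt{2}\, u
\;\Longrightarrow\;
ds^2 = 2\, \frac{du^2 + d\sigma^2}{\sigma^2}.
$$

After rescaling the mean, the Fisher metric is therefore twice the Poincaré metric on the upper half plane $\{(u, \sigma) : \sigma > 0\}$, and has constant curvature $-1/2$.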

$L^2$-Wasserstein geometry

$L^2$-Wasserstein space $(\mathcal{P}_2(M), W_2)$, over a complete separable metric space $(M, d_M)$, is the metric space consisting of the set of Borel probability measures on M with finite second moments, and the $L^2$-Wasserstein metric (see $L^p$-Wasserstein metric): $\mathcal{P}_2(M) = \{ \mu : \Sigma \to [0, 1] \mid \mathbb{E}_{x \sim \mu}\, d_M^2(x, x_0) < \infty \}$, $W_2^2(\mu, \nu) = \inf_\pi \int_{M^2} d_M^2(x, y)~\pi(d(x,y))$. Here, $\Sigma$ is the Borel σ-algebra of M, $x_0 \in M$ is an arbitrary fixed point, and the infimum is over all transport plans $\pi$ between $\mu$ and $\nu$, i.e. joint probability measures with marginals $\mu$ and $\nu$. It can be shown that $W_2$ is indeed a metric on $\mathcal{P}_2(M)$, see e.g. [@Villani2009]. An $L^2$-Wasserstein space is an Alexandrov space of non-negative curvature if and only if its underlying space M is [@Sturm2006]. For the $L^2$-Wasserstein space over a compact Riemannian manifold, [@Lott2008] computed the Levi-Civita connection and curvature.
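To make the infimum over transport plans concrete, consider the simplest case: $M = \mathbb{R}$ with $d_M(x, y) = |x - y|$, and two empirical measures with the same number of equally weighted atoms. In that setting the optimal plan matches the sorted samples, so no optimization is needed. The function name `w2_squared_1d` is hypothetical, and this is a sketch for that special case only.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared L2-Wasserstein distance between two uniform empirical
    measures on the real line with equally many atoms.

    In one dimension the optimal transport plan is the monotone
    matching, i.e. the i-th smallest x is paired with the i-th smallest y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    return float(np.mean((x - y) ** 2))
```

For measures with unequal weights, or on higher-dimensional spaces, a linear-programming or other optimal-transport solver would be needed instead.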

$L^2$-Wasserstein geometry of PDFs $(\mathcal{P}^\text{ac}, g)$ is the infinite-dimensional Riemannian manifold consisting of the set of absolutely continuous probability measures on the Euclidean n-space, and the Riemannian metric introduced in [@Otto2001]. Here, the set of probability density functions (PDFs) on the Euclidean n-space is: $\mathcal{P}^\text{ac} = \{\rho \in L^1(\mathbb{R}^n) : \rho \ge 0, \|\rho\|_1 = 1 \}$, which is the infinite-dimensional simplex $\Delta^{\mathbb{R}^n}$. The tangent space at a point is $T_\rho \mathcal{P}^\text{ac} = \{s \in L^1(\mathbb{R}^n) : \int s = 0 \}$, because a perturbation of a density must preserve total mass. It can be identified with the quotient set $C^2(\mathbb{R}^n) / \sim$, where functions that differ by an additive constant are equivalent. The identification is defined via the elliptic equation $s = - \nabla \cdot (\rho \nabla p)$. With this identification, the Riemannian metric, which we call the $L^2$-Wasserstein metric, can be written as $g_\rho(s_1, s_2) = \int \rho \nabla p_1 \cdot \nabla p_2 = \int s_1 p_2$. This manifold can be induced from the flat Riemannian space of all diffeomorphisms of $\mathbb{R}^n$ via a Riemannian submersion (Sec 4.1). The Riemannian distance function is defined for measures with finite second moments, where it equals the $L^2$-Wasserstein metric. In other words, we have a complete Riemannian manifold $(\mathcal{P}^\text{ac} \cap \mathcal{P}_2(\mathbb{R}^n), g, W_2)$.

Properties of the Wasserstein geometry of PDFs: (1) Exponential map, $\exp_\rho(s) = [\nabla(|x|^2/2 + p)] \#\rho$, where p is identified with s as above, and $\Phi\#\rho$ denotes the pushforward of the density ρ under the transformation $\Phi$ of $\mathbb{R}^n$. (2) Riemannian distance function, $d^2(\rho_0, \rho_1) = \inf_{\rho_1 = \Phi \# \rho_0} \int \rho_0 |x - \Phi|^2$, which equals the $L^2$-Wasserstein metric between probability measures, $d^2(\mu_0, \mu_1) = \inf_{\mu \in P(\mu_0, \mu_1)} \int |x_1 - x_0|^2 \mu(dx_0 dx_1)$, where $P(\mu_0, \mu_1)$ is the set of joint measures with marginals $\mu_0$ and $\mu_1$; it is finite if both measures have finite second moments. The $L^2$-Wasserstein metric metrizes the topology of weak-∗ convergence (up to second moments) of probability measures; in particular, convergence in the $L^2$-Wasserstein metric implies weak-∗ convergence. (3) Sectional curvature (eq. 88), which is non-negative. The manifold is flat if n = 1 and non-flat if n > 1. Per the Radon–Nikodym theorem, we can identify $\mathcal{P}^\text{ac}_2(\mathbb{R}^n) \cong \Delta^{\mathbb{R}^n} \cap L^1_{x^2}(\mathbb{R}^n)$, which is the intersection of an infinite-dimensional simplex and a weighted $L^1$ space (a Banach space). Therefore, the $L^2$-Wasserstein space $\mathcal{P}^\text{ac}_2(\mathbb{R}^n)$ now has the geometry of a Riemannian submanifold.
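The pushforward form of the exponential map can be illustrated on samples. The sketch below assumes a quadratic potential $p(x) = x^T T x / 2$ with symmetric $T$, so that $\nabla(|x|^2/2 + p)(x) = (I + T) x$; the density ρ is represented by a sample matrix `X` whose rows are points. All names are hypothetical, and the map is an optimal transport map only when $I + T$ is positive semidefinite.

```python
import numpy as np

def push_forward_quadratic(X, T):
    """Push samples X (rows are points) through Phi(x) = x + grad p(x),
    where p(x) = x^T T x / 2 with T symmetric, so grad p(x) = T x."""
    T = np.asarray(T, dtype=float)
    # Row-wise (I + T) x; valid because T is assumed symmetric.
    return X + X @ T

def second_moment(X):
    """Uncentered second-moment matrix (1/N) sum_i x_i x_i^T."""
    return X.T @ X / X.shape[0]

def transport_cost(X, Y):
    """Average squared displacement (1/N) sum_i |x_i - y_i|^2."""
    return float(np.mean(np.sum((X - Y) ** 2, axis=1)))
```

With `M = second_moment(X)`, the pushed samples have second-moment matrix $(I + T) M (I + T)$ and the transport cost equals $\text{tr}(T M T)$, both exactly at the sample level.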

Gaussian distributions

Preliminaries. The $L^2$-Wasserstein metric between any two Gaussian measures $P = N(m, \Sigma)$ and $\tilde P = N(\tilde m, \tilde \Sigma)$ is: $W_2^2(P, \tilde P) = |m - \tilde m|^2 + \text{tr}(\Sigma) + \text{tr}(\tilde \Sigma) - 2~\text{tr}\left(\sqrt{\Sigma^{1/2} \tilde \Sigma \Sigma^{1/2}}\right)$.
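The closed form above translates directly into code. This is a minimal NumPy sketch in which the matrix square root is computed by an eigendecomposition, which assumes symmetric positive semidefinite inputs. The helper names `sqrtm_psd` and `gaussian_w2_squared` are hypothetical.

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    return (V * np.sqrt(w)) @ V.T

def gaussian_w2_squared(m1, S1, m2, S2):
    """Squared L2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    r = sqrtm_psd(S1)
    cross = sqrtm_psd(r @ S2 @ r)  # sqrt(S1^{1/2} S2 S1^{1/2})
    return float(np.sum((m1 - m2) ** 2)
                 + np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross))
```

When the two covariances commute, the covariance part reduces to the sum of squared differences between the square roots of matching eigenvalues, which gives a convenient sanity check.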

The space $\mathcal{N}^n$ of (regular) Gaussian measures on the Euclidean n-space is a geodesically convex (and therefore totally geodesic) submanifold of the infinite-dimensional Riemannian manifold $(\mathcal{P}^\text{ac}_2(\mathbb{R}^n), g)$ (see e.g. [@McCann1997, Ex. 1.7]). It is (Riemannian) isometric to a product Riemannian manifold: $(\mathcal{N}^n, g) \cong (\mathbb{R}^n \times \mathcal{S}_+(n), \langle \cdot, \cdot \rangle \times g)$, where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product and $(\mathcal{S}_+(n), g)$ is the positive-definite manifold with the $L^2$-Wasserstein geometry, with the isometry $f(P) = (m, \Sigma)$ (see e.g. [@Takatsu2011]). Therefore we can focus on the space $\mathcal{N}^n_0$ of zero-mean (regular) Gaussian measures. The tangent space at a point is $T_C \mathcal{N}^n_0 \cong \mathcal{S}(n)$.

Properties of the Wasserstein geometry of zero-mean (regular) Gaussian measures: (0) Riemannian metric, $g_C(T_1, T_2) = \text{tr}(T_1 C T_2)$. (1) Exponential map, $\exp_C(T) = (I + T) C (I + T)$. (2) Riemannian distance function, which equals the $L^2$-Wasserstein metric (see above). (3) Logarithm map, $\log_C(\tilde{C}) = \tilde{C}^{1/2} (\tilde{C}^{1/2} C \tilde{C}^{1/2})^{-1/2} \tilde{C}^{1/2} - I$, see e.g. [@Dowson1982; @Olkin1982]; it satisfies $\log_C(\tilde{C}) + I = (\log_{\tilde{C}}(C) + I)^{-1}$. (4) Sectional curvature, $K_C(X, Y) = \frac{3}{4}~\text{tr}\{ ([Y,X] - S) C ([Y,X] - S)^T \} / q_C(X, Y)$, where $[X, Y] = X Y - Y X$ is the commutator, the symmetric matrix S is the one that makes $([Y,X] - S) C$ anti-symmetric, and $q_C(X, Y) = g_C(X, X) g_C(Y, Y) - g_C^2(X, Y)$; it depends only on the eigenvalues of C [@Takatsu2010]. The two versions of the $L^2$-Wasserstein metric, on the tangent space $\mathcal{S}(n)$ and on the horizontal tangent space $M_{k,n}$ (see [@Massart2020]), are equivalent: with some abuse of notation, $g_C(\log_C \tilde{C}, \log_C \hat{C}) = \bar{g}_Y(\log_Y \tilde{Y}, \log_Y \hat{Y})$, where $C = Y Y^T$.
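The formulas for the metric, exponential map and logarithm map can be checked against each other numerically. The sketch below is a direct transcription under the assumption that `C` and `C_tilde` are symmetric positive definite; the helper names are hypothetical, and the matrix square roots are computed by eigendecomposition.

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def inv_sqrtm_pd(A):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V / np.sqrt(w)) @ V.T

def metric(C, T1, T2):
    """g_C(T1, T2) = tr(T1 C T2) for symmetric tangent matrices T1, T2."""
    return float(np.trace(T1 @ C @ T2))

def exp_map(C, T):
    """exp_C(T) = (I + T) C (I + T)."""
    I = np.eye(C.shape[0])
    return (I + T) @ C @ (I + T)

def log_map(C, C_tilde):
    """log_C(C_tilde) = A - I, where A is the symmetric matrix
    C_tilde^{1/2} (C_tilde^{1/2} C C_tilde^{1/2})^{-1/2} C_tilde^{1/2},
    which satisfies A C A = C_tilde."""
    r = sqrtm_psd(C_tilde)
    A = r @ inv_sqrtm_pd(r @ C @ r) @ r
    return A - np.eye(C.shape[0])
```

With `T = log_map(C, C_tilde)`, one expects `exp_map(C, T)` to recover `C_tilde` and `metric(C, T, T)` to equal the squared $L^2$-Wasserstein distance between the two zero-mean Gaussians, up to round-off.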

The completion $\overline{\mathcal{N}^n_0}$ of the metric space $(\mathcal{N}^n_0, W_2)$, i.e. completion in the sense of weak convergence of probability measures, can be identified with $\mathcal{S}_{\ge 0}(n)$. Properties: (1) It is an Alexandrov space of non-negative curvature (see Metric-Space). (2) It has a finite stratification into topological manifolds, $\{\mathcal{S}_{\ge 0}(k, n)\}_{k = 0}^n$, which can be assigned a consistent set of Riemannian metrics equivalent to the $L^2$-Wasserstein metric on $\mathcal{N}^n_0$ (see [@Massart2020] and Matrix-Manifold). (3) Tangent cones: let $C = Q~\text{diag}\{(\lambda_i)_{i=1}^k, 0_{n-k}\} Q^T$, $\lambda_i > 0$, be an eigenvalue decomposition. The tangent cone at the point C is $T_C \mathcal{S}_{\ge 0}(n) = \mathcal{S}(n, P_{n-k}) := \{S \in \mathcal{S}(n) : Q_{n-k}^T S Q_{n-k} \in \mathcal{S}_{\ge 0}(n-k) \}$, where $Q_{n-k}$ consists of the last n-k columns of Q and $P_{n-k} = Q_{n-k} Q_{n-k}^T$ is a symmetric projection matrix. The tangent cone at C is isometric to that at $Q C Q^T$ for every Q in O(n). (4) It has a cone structure: it is isometric to the tangent cone at zero, $T_0 \mathcal{S}_{\ge 0}(n) = (\mathcal{S}_{\ge 0}(n), W_2)$.
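The tangent cone condition in (3) can be tested numerically. The following is a sketch with a hypothetical function name and a hypothetical tolerance `tol` for deciding which eigenvalues of C count as zero; S is assumed symmetric. The kernel eigenvectors are selected by a mask, so their position in the eigendecomposition does not matter.

```python
import numpy as np

def in_tangent_cone(C, S, tol=1e-10):
    """Check whether the symmetric matrix S lies in the tangent cone of the
    positive semidefinite cone at C, i.e. whether Q_{n-k}^T S Q_{n-k} is
    positive semidefinite, where Q_{n-k} spans the kernel of C."""
    C = np.asarray(C, dtype=float)
    S = np.asarray(S, dtype=float)
    w, Q = np.linalg.eigh(C)
    # Columns of Q whose eigenvalues are numerically zero span ker(C).
    kernel = Q[:, w <= tol * max(1.0, w.max())]
    if kernel.shape[1] == 0:
        return True  # C has full rank: every symmetric S is tangent
    block = kernel.T @ S @ kernel
    return bool(np.linalg.eigvalsh(block).min() >= -tol)
```

For example, at `C = diag(1, 2, 0)` the direction `diag(-1, -1, 1)` is admissible, while `diag(1, 1, -1)` is not, because it would push the zero eigenvalue below zero.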
