This article is about the geometric structure of statistical models.

Concepts

A statistical manifold is a set of probability distributions endowed with a Riemannian manifold structure. Usually the set is parameterized, and the parameter space is often a regular domain of a Euclidean space, but this is not always the case.

The Fisher information metric or Fisher–Rao metric (attributed to [@Rao1945]) of a parametric family of probability distributions is the Fisher information of the family, regarded as a Riemannian metric on the family. Recall that the Fisher information of a vector parameter is a covariance matrix, which defines an inner product at that point of the parameter space. More generally, the Fisher information defines a Riemannian metric on any smooth manifold of parameters. Information geometry [@Amari1985; @Amari2000; @Amari2016] is a field that applies differential geometry to statistics. In particular, it adopts the Fisher information metric and the α-connections, a one-parameter family of affine connections.

If the Fisher information has no closed form, some related quantities can be useful. For example, the empirical Fisher information is the finite-sample estimate of the (expected) Fisher information. The observed Fisher information [@Efron1978] is the negative Hessian of the log-likelihood evaluated at the maximum likelihood estimate, $-\nabla_\theta^2\log p(\mathbf{x}; \theta^{ML})$, which need not be positive definite.
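The following sketch compares these quantities on a model where everything can be checked by hand. It assumes an i.i.d. exponential model with rate `lam` (chosen only for illustration), takes the empirical Fisher information to be the sample average of the squared score, and approximates the Hessian by a central finite difference. All function names are hypothetical.

```python
import math
import random

def log_likelihood(lam, xs):
    """Total log-likelihood of i.i.d. Exponential(lam) data."""
    return len(xs) * math.log(lam) - lam * sum(xs)

def observed_information(xs, lam, h=1e-3):
    """Negative second derivative of the log-likelihood (central difference)."""
    f = lambda l: log_likelihood(l, xs)
    return -(f(lam + h) - 2.0 * f(lam) + f(lam - h)) / h**2

def empirical_fisher(xs, lam):
    """Sample mean of the squared score d/d(lam) log p(x; lam) = 1/lam - x."""
    return sum((1.0 / lam - x) ** 2 for x in xs) / len(xs)

random.seed(0)
xs = [random.expovariate(2.0) for _ in range(5000)]
lam_hat = len(xs) / sum(xs)                 # maximum likelihood estimate
expected = 1.0 / lam_hat**2                 # expected Fisher information per sample
observed = observed_information(xs, lam_hat) / len(xs)
empirical = empirical_fisher(xs, lam_hat)
print(expected, observed, empirical)
```

For this model the observed information per sample coincides with the expected one at the estimate, while the empirical version is the sample variance of the data and agrees only up to sampling error.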

A Hessian metric for exponential families is a Riemannian metric defined by the Hessian of a convex real-valued function. Such a manifold naturally inherits two flat affine connections, as well as a canonical Bregman divergence.
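As a small illustration, the sketch below takes the Bernoulli family in its natural parameter and assumes the convex function is the log-partition function $\psi(\theta) = \log(1 + e^\theta)$. Under that assumption the Hessian is the Fisher information, and the Bregman divergence of $\psi$ reproduces the Kullback-Leibler divergence with its arguments swapped. The function names are hypothetical.

```python
import math

def log_partition(theta):
    """psi(theta) = log(1 + exp(theta)), convex in theta."""
    return math.log1p(math.exp(theta))

def mean_param(theta):
    """First derivative of psi: the success probability."""
    return 1.0 / (1.0 + math.exp(-theta))

def hessian_metric(theta):
    """Second derivative of psi: the metric coefficient at theta."""
    p = mean_param(theta)
    return p * (1.0 - p)

def bregman(theta_a, theta_b):
    """Bregman divergence psi(a) - psi(b) - psi'(b) * (a - b)."""
    return (log_partition(theta_a) - log_partition(theta_b)
            - mean_param(theta_b) * (theta_a - theta_b))

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

theta_a, theta_b = 0.7, -0.4
lhs = bregman(theta_a, theta_b)
rhs = kl_bernoulli(mean_param(theta_b), mean_param(theta_a))
print(lhs, rhs)
```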

Local Riemannian metrics have also been proposed, such as the preferred point metric [@Critchley1993].

Applications of the geometric perspective of probability models:

  • diffusion kernel from heat equation for support vector machines [@Lafferty2005], which can be efficiently approximated for multinomial distributions (a simplex endowed with the Fisher information metric);
  • efficient MCMC sampling in high dimensions [@Girolami2011].

Fisher information geometry

Gaussian distributions

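Riemannian metric on the covariance component: $g_V(X, Y) = \text{tr}(X V^{-1} Y V^{-1})$, where $V$ is the covariance matrix and $X, Y$ are symmetric matrices tangent at $V$ (the Fisher information itself carries an additional constant factor of 1/2). See also the natural metric on the positive-definite manifold $\mathcal{S}_+(n)$.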

For n = 1, this manifold can be regarded as the upper half plane with the hyperbolic metric [@Amari1985].
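A minimal NumPy sketch of the trace form $\text{tr}(X V^{-1} Y V^{-1})$ follows. It also checks numerically that the value is unchanged under a congruence $V \mapsto A V A^T$, $X \mapsto A X A^T$, $Y \mapsto A Y A^T$ with invertible $A$. The function name and the test matrices are illustrative assumptions.

```python
import numpy as np

def fisher_metric(V, X, Y):
    """tr(X V^{-1} Y V^{-1}), computed with linear solves instead of an inverse."""
    Vinv_X = np.linalg.solve(V, X)   # V^{-1} X
    Vinv_Y = np.linalg.solve(V, Y)   # V^{-1} Y
    # tr(V^{-1} X V^{-1} Y) equals tr(X V^{-1} Y V^{-1}) by cyclicity of the trace.
    return np.trace(Vinv_X @ Vinv_Y)

V = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive-definite base point
X = np.array([[1.0, 0.2], [0.2, -1.0]])  # symmetric tangent vectors
Y = np.array([[0.0, 1.0], [1.0, 3.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])   # any invertible matrix

g = fisher_metric(V, X, Y)
g_congruent = fisher_metric(A @ V @ A.T, A @ X @ A.T, A @ Y @ A.T)
print(g, g_congruent)
```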

$L^2$-Wasserstein geometry

The $L^2$-Wasserstein space $(\mathcal{P}_2(M), W_2)$ over a complete separable metric space $(M, d_M)$ is the metric space consisting of the set of Borel probability measures on $M$ with finite second moments, equipped with the $L^2$-Wasserstein metric: $\mathcal{P}_2(M) = \{ \mu : \Sigma \to [0, 1] \mid \mathbb{E}_{x \sim \mu}\, d_M^2(x, x_0) < \infty \}$, $W_2^2(\mu, \nu) = \inf_\pi \int_{M^2} d_M^2(x, y)~\pi(d(x,y))$. Here $\Sigma$ is the Borel σ-algebra of $M$, $x_0$ is an arbitrary fixed point of $M$, and the infimum runs over all transport plans $\pi$ between $\mu$ and $\nu$, i.e. joint probability measures with marginals $\mu$ and $\nu$. It can be shown that $W_2$ is indeed a metric on $\mathcal{P}_2(M)$, see e.g. [@Villani2009]. An $L^2$-Wasserstein space is an Alexandrov space of non-negative curvature if and only if its underlying space $M$ is [@Sturm2006].
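To make the infimum over transport plans concrete, the sketch below treats the simplest case: $M = \mathbb{R}$ with two empirical measures that put equal mass on the same number of points. In that setting (an assumption of this example, not of the general definition) the optimal plan matches the sorted samples, which the code compares against a brute-force search over all one-to-one matchings. The function names are hypothetical.

```python
from itertools import permutations

def w2_squared_sorted(xs, ys):
    """Squared W_2 between uniform empirical measures on the line, via sorting."""
    assert len(xs) == len(ys)
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)

def w2_squared_bruteforce(xs, ys):
    """Same quantity by minimizing the cost over every one-to-one matching."""
    n = len(xs)
    return min(
        sum((xs[i] - ys[j]) ** 2 for i, j in enumerate(perm)) / n
        for perm in permutations(range(n))
    )

xs = [0.3, -1.2, 2.5, 0.9]
ys = [1.0, 0.1, -0.7, 3.2]
print(w2_squared_sorted(xs, ys), w2_squared_bruteforce(xs, ys))
```

The brute-force version grows factorially with the number of points and is included only as a check of the sorted matching.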

For $L^2$-Wasserstein space over a compact Riemannian manifold, [@Lott2008] computed the Levi-Civita connection and curvature.

For the set of absolutely continuous probability measures on the Euclidean n-space, [@Otto2001] introduced a Riemannian metric such that the Riemannian distance function is the $L^2$-Wasserstein metric, provided both measures have finite second moments. In particular, Otto defined an infinite-dimensional Riemannian manifold $(M, g)$, where M is the set of probability density functions (PDFs) on the Euclidean n-space: $M = \{\rho \in L^1(\mathbb{R}^n) : \rho \ge 0, \|\rho\|_1 = 1 \}$, which is an infinite-dimensional simplex $\Delta^{\mathbb{R}^n}$. The tangent space at a point is $T_\rho M = \{s \in L^1(\mathbb{R}^n) : \int s = 0 \}$, which can be identified with the quotient set $C^2(\mathbb{R}^n) / \sim$, where functions that differ by an additive constant are equivalent. The identification is defined via the elliptic equation $s = - \nabla \cdot (\rho \nabla p)$. With this identification, the Riemannian metric is $g_\rho(s_1, s_2) = \int \rho \nabla p_1 \cdot \nabla p_2 = \int s_1 p_2$, where the second equality follows from integration by parts. This manifold can be induced by a flat Riemannian space of all diffeomorphisms of $\mathbb{R}^n$ via a Riemannian submersion (Sec 4.1).

Properties of the Wasserstein geometry of PDFs: (1) Exponential map, $\exp_\rho(s) = [\nabla(|x|^2/2 + p)] \#\rho$, where p is identified with s as above, and $\Phi\#\rho$ denotes the pushforward of the density ρ under the transformation $\Phi$ of $\mathbb{R}^n$. (2) Riemannian distance function, $d^2(\rho_0, \rho_1) = \inf_{\rho_1 = \Phi \# \rho_0} \int \rho_0 |x - \Phi|^2$, which equals the $L^2$-Wasserstein metric between probability measures, $d^2(\mu_0, \mu_1) = \inf_{\mu \in P(\mu_0, \mu_1)} \int |x_1 - x_0|^2 \mu(dx_0 dx_1)$, where $P(\mu_0, \mu_1)$ is the set of joint measures with marginals $\mu_0$ and $\mu_1$; it is finite if both measures have finite second moments. The $L^2$-Wasserstein metric metrizes the topology of weak-∗ convergence (up to second moments) of probability measures. (3) Sectional curvature (eq. 88), which is non-negative. The manifold is flat if n = 1 and non-flat if n > 1. By the Radon-Nikodym theorem, we can identify $\mathcal{P}^\text{ac}_2(\mathbb{R}^n) \cong \Delta^{\mathbb{R}^n} \cap L^1_{x^2}(\mathbb{R}^n)$, which is the intersection of an infinite-dimensional simplex and the weighted space $L^1_{x^2}(\mathbb{R}^n)$ of functions with finite second moment. Therefore, the $L^2$-Wasserstein space $\mathcal{P}^\text{ac}_2(\mathbb{R}^n)$ has the geometry of a Riemannian submanifold.

Gaussian distributions

Preliminaries. The $L^2$-Wasserstein metric between any two Gaussian measures $P$ and $\tilde P$, with means $m, \tilde m$ and covariance matrices $\Sigma, \tilde \Sigma$, is $W_2^2(P, \tilde P) = |m - \tilde m|^2 + \text{tr}(\Sigma) + \text{tr}(\tilde \Sigma) - 2~\text{tr}\left(\sqrt{\Sigma^{1/2} \tilde \Sigma \Sigma^{1/2}}\right)$.
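The closed form can be evaluated directly with NumPy. The sketch below is one possible implementation; it computes matrix square roots through a symmetric eigendecomposition, and the helper names `spd_sqrt` and `gaussian_w2_squared` are hypothetical.

```python
import numpy as np

def spd_sqrt(A):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    w, Q = np.linalg.eigh((A + A.T) / 2.0)
    w = np.clip(w, 0.0, None)            # guard against tiny negative round-off
    return (Q * np.sqrt(w)) @ Q.T

def gaussian_w2_squared(m1, S1, m2, S2):
    """Squared L2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    r = spd_sqrt(S1)
    cross = spd_sqrt(r @ S2 @ r)         # sqrt(S1^{1/2} S2 S1^{1/2})
    return (np.sum((m1 - m2) ** 2) + np.trace(S1) + np.trace(S2)
            - 2.0 * np.trace(cross))

S1 = np.diag([4.0, 1.0])
S2 = np.diag([1.0, 9.0])
d2 = gaussian_w2_squared([0.0, 0.0], S1, [0.0, 0.0], S2)
print(d2)
```

When the two covariance matrices commute, as in this diagonal example, the covariance part reduces to the squared Euclidean distance between the vectors of standard deviations.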

The space $\mathcal{N}^n$ of (regular) Gaussian measures on the Euclidean n-space is a geodesically convex (and therefore totally geodesic) submanifold of the infinite-dimensional Riemannian manifold $(\mathcal{P}^\text{ac}_2(\mathbb{R}^n), g)$ (see e.g. [@McCann1997, Ex. 1.7]). It is (Riemannian) isometric to a product Riemannian manifold: $(\mathcal{N}^n, g) \cong (\mathbb{R}^n \times \mathcal{S}_+(n), (\cdot, \cdot) \times g)$, with the isometry $f(P) = (m, \Sigma)$ (see e.g. [@Takatsu2011]). Therefore we can focus on the space $\mathcal{N}^n_0$ of zero-mean (regular) Gaussian measures. The tangent space at a point is $T_V \mathcal{N}^n_0 = \mathcal{S}(n)$.

Properties of the Wasserstein geometry of zero-mean (regular) Gaussian measures: (0) Riemannian metric, $g_V(X, Y) = \text{tr}(X V Y)$. (1) Exponential map, $\exp_V(X) = (I + X) V (I + X)$. (2) The Riemannian distance function equals the $L^2$-Wasserstein metric (see the closed form for Gaussian measures above). (3) Logarithm map, $\log_V(\tilde{V}) = \tilde{V}^{1/2} (\tilde{V}^{1/2} V \tilde{V}^{1/2})^{-1/2} \tilde{V}^{1/2} - I$, see e.g. [@Dowson1982; @Olkin1982]. (4) Sectional curvature, $K_V(X, Y) = \tfrac{3}{4}~\text{tr}\{ ([Y,X] - S) V ([Y,X] - S)^T \} / q_V(X, Y)$, where $[X, Y] = X Y - Y X$ is the matrix commutator, $S$ is the symmetric matrix that makes $([Y,X] - S) V$ anti-symmetric, and $q_V(X, Y) = g_V(X, X) g_V(Y, Y) - g_V^2(X, Y)$; it depends only on the eigenvalues of V [@Takatsu2011].
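The exponential and logarithm maps can be checked against each other numerically. The sketch below, with hypothetical helper names and randomly generated covariance matrices, verifies that $\exp_V(\log_V(\tilde V)) = \tilde V$ and that $g_V(X, X)$ at $X = \log_V(\tilde V)$ reproduces the squared $L^2$-Wasserstein distance between the two zero-mean Gaussian measures.

```python
import numpy as np

def spd_power(A, p):
    """A**p for a symmetric positive-definite matrix, via eigh."""
    w, Q = np.linalg.eigh((A + A.T) / 2.0)
    return (Q * w**p) @ Q.T

def metric(V, X, Y):
    """g_V(X, Y) = tr(X V Y)."""
    return np.trace(X @ V @ Y)

def exp_map(V, X):
    """exp_V(X) = (I + X) V (I + X)."""
    I = np.eye(V.shape[0])
    return (I + X) @ V @ (I + X)

def log_map(V, W):
    """log_V(W) = W^{1/2} (W^{1/2} V W^{1/2})^{-1/2} W^{1/2} - I."""
    r = spd_power(W, 0.5)
    return r @ spd_power(r @ V @ r, -0.5) @ r - np.eye(V.shape[0])

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
V = A @ A.T + np.eye(3)                  # two positive-definite covariances
W = B @ B.T + np.eye(3)

X = log_map(V, W)
Vh = spd_power(V, 0.5)
w2_squared = np.trace(V) + np.trace(W) - 2.0 * np.trace(spd_power(Vh @ W @ Vh, 0.5))
print(np.allclose(exp_map(V, X), W), metric(V, X, X), w2_squared)
```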

The completion of $\mathcal{N}^n_0$ as a metric space, $\overline{\mathcal{N}^n_0}$, can be identified with $\mathcal{S}_{\ge 0}(n)$. Properties: (1) It is an Alexandrov space of non-negative curvature. (2) It has a finite stratification $\{\mathcal{S}_{\ge 0}(k, n)\}_{k = 0}^n$.

It has a cone structure. Let $V = Q~\text{diag}\{(\lambda_i)_{i=1}^k, 0_{n-k}\} Q^T$, with $\lambda_i > 0$, be an eigenvalue decomposition (EVD) of a point $V$ of rank $k$. The tangent cone at $V$ is $T_V \mathcal{S}_{\ge 0}(n) = \mathcal{S}(n, P_{n-k}) := \{S \in \mathcal{S}(n) : Q_{n-k}^T S Q_{n-k} \in \mathcal{S}_{\ge 0}(n-k) \}$, where $Q_{n-k}$ consists of the last n-k columns of Q and $P_{n-k} = Q_{n-k} Q_{n-k}^T$ is the corresponding symmetric projection matrix. The tangent cone at zero is isometric to the space itself: $T_0 \mathcal{S}_{\ge 0}(n) = (\mathcal{S}_{\ge 0}(n), W_2)$.
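The membership condition for the tangent cone can be tested numerically. In the sketch below, the function name, the absolute tolerance `tol` used to decide which eigenvalues count as zero, and the example matrices are all assumptions made for illustration; `S` is assumed to be symmetric.

```python
import numpy as np

def in_tangent_cone(V, S, tol=1e-10):
    """Check that Q_null^T S Q_null is positive semi-definite,
    where the columns of Q_null span the null space of V."""
    w, Q = np.linalg.eigh(V)
    Q_null = Q[:, w <= tol]              # eigenvectors with (numerically) zero eigenvalue
    if Q_null.shape[1] == 0:
        return True                      # V has full rank: every symmetric S qualifies
    block = Q_null.T @ S @ Q_null
    return bool(np.all(np.linalg.eigvalsh(block) >= -tol))

V = np.diag([1.0, 0.0])                          # rank-one boundary point
S_in = np.array([[-5.0, 2.0], [2.0, 1.0]])       # non-negative on the null direction
S_out = np.array([[1.0, 0.0], [0.0, -1.0]])      # negative on the null direction
print(in_tangent_cone(V, S_in), in_tangent_cone(V, S_out))
```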

