Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) methods sample from an arbitrary target distribution approximately, by simulating a Markov chain whose stationary distribution coincides with the target distribution. MCMC methods can also be used to sample from posterior distributions in Bayesian inference.

Metropolis adjustment.
Langevin dynamics.
Hamiltonian/hybrid Monte Carlo (HMC).

Detailed Balance

A Markov chain $P$ satisfies detailed balance w.r.t. to a distribution $\pi$ if, starting with $\pi$, the net direct flow between any pair of nodes is zero: $\forall i, j \in n$, $\pi_i p_{ij} = \pi_j p_{ji}$. Alternatively, we say the Markov chain is reversible w.r.t. to $\pi$, because it is statistically indistinguishable running forward or backward in time. The transition matrix of a reversible Markov chain is similar to a symmetric matrix, and therefore has $n$ real eigenvalues and eigenvectors. The spectral decomposition gives insight into the timescales associated with the Markov chain. For example, the second-largest eigenvalue in absolute value affects the time for a chain to reach equilibrium.

If a Markov chain satisfies detailed balance w.r.t. to a distribution, then the distribution is a stationary distribution of the chain. Therefore, detailed balance gives a tractable way to find a stationary distribution.

Given a discrete, not necessarily normalized distribution, the Metropolis-Hastings algorithm constructs a Markov chain whose stationary distribution is the given distribution.

Metropolis-Hasting algorithm for target distribution $\pi$:

Proposal state $Y$ given state $X_n$ satisfies $P(Y = j | X_n = i) = H_{ij}$,
- for discrete distribution, proposal matrix $H$ is any irreducible transition matrix;
- for continuous distribution, proposal kernel $h(y|x)$ is a conditional probability density;
Acceptance probability of a proposal step (Metropolis-Hastings formula): $a_{ij} = \min\{1, (\pi_j H_{ji}) / (\pi_i H_{ij})\}$;

The final induced Markov chain has transition probabilities $P_{ij} = H_{ij} a_{ij}$, $j \ne i$, and $P_{ii} = 1 - \sum_{j \ne i} P_{ij}$. Because $P$ satisfies detailed balance w.r.t. $\pi$, the chain has stationary distribution $\pi$.

Your proposal steps need to be large enough to explore the whole space quickly, but not so large that they are often rejected. A rule of thumb is to achieve a desired average acceptance ratio about 25-50%, depending on the dimension.

The number of runs before a chain actually reaches its stationary distribution is sometimes referred to as the burn-in time $T_0$. The total number of runs should be much larger: $T \gg T_0$.

Because the generated points are correlated even when equilibrium is reached, the number of mutually independent points is less than that of all generated points. When computing a statistic $f(\mathbf{x})$, the effective sample size should be devided by the correlation time: $n_{\text{eff}} = n / \tau_f$.

Diffusion-based MCMC

Diffusion-based MCMC samples from a target distribution by simulating a diffusion process whose stochastic differential follows an autonomous Ito stochastic differential equation (ISDE): $\text{d} \mathbf{x} = a(\mathbf{x})~\text{d} t + b(\mathbf{x})~\text{d} \mathbf{w}$, such that $\mathbf{x}_t \overset{d}{\to} \rho(x)$. In first-order Langevin dynamics (LD), $a(x) = - \nabla U(x)$ and $b = \sqrt{2} I_n$, where potential $U(x) = -\log f(x)$ and $f(x) \propto \rho(x)$ is an unnormalized density. In second-order Langevin dynamics, aka Hamiltonian dyanmics, $\text{d} \mathbf{x} = \mathbf{v}~\text{d} t$, $\text{d} \mathbf{v} = (- \nabla U(x) - D \mathbf{v})~\text{d} t + \sqrt{2D} I_n~\text{d} \mathbf{w}$, where dissipation parameter $D > 0$.

Numerical integration schemes for this ISDE approximate the transition function of the continuous-time diffusion with repeated application of a transition function of a small time step: given $P^t = e^{t \mathcal{L}}$, $t \in T$, design $P_h$, $h \in T$, such that $P_h^l \mathbf{x} \approx P^{l h} \mathbf{x}$. We denote the numerical solution as $\mathbf{x}^n_t$, e.g. at the l-th step $\mathbf{x}^n_{l h} = P_h^l \mathbf{x}_0$; we further simplify this notation by $\mathbf{x}_l$ when it causes no confusion. Kth-order local integrator is a numerical integrator $P_h$ whose action on every smooth bounded function approximates that of the diffusion to the $K$-th order: $\forall f \in (C^2(X), \|\cdot\|_\infty)$, $P_h f = P^h f + \mathcal{O}(h^{K+1})$.

The simpliest numerical integrator is the the Euler integrator, which is of order one. The Euler integrator for the first-order Langevin dynamics is: $\mathbf{x}_l = \mathbf{x}_{l-1} - \nabla U(\mathbf{x}_{l-1}) h + \sqrt{2h} \mathbf{z}$ where $\mathbf{z}$ is the n-dimensional standard Gaussian vector. The Euler integrator for the first-order Hamiltonian dynamics is: $\mathbf{v}_l = \mathbf{v}_{l-1} - \nabla U(\mathbf{x}_{l-1}) h - D \mathbf{v}_{l-1} h + \sqrt{2Dh} \mathbf{z}$ and $\mathbf{x}_l = \mathbf{x}_{l-1} + \mathbf{v}_l h$. Symmetric splitting integrator (SSI) is a second-order integrator, which splits the local generator into a symmetric sequence of sub-generators that can be solved analytically [@ChenCY2015]; it also applies to stochastic gradient MCMC, where a stochastic version of the full gradient is used.

MCMC on Riemann manifolds:

constrained HMC (CHMC) [@Brubaker2012], has inner iteration;
geodesic Monte Carlo (GMC) [@Byrne2013], on manifolds with known geodesic flow, e.g. Stiefel manifolds, incl. spheres;

Stochastic gradient MCMC (SG-MCMC):

Langevin dynamics (SGLD) [@ChenTQ2014a];
HMC (SGHMC) [@ChenTQ2014b];
Nosé-Hoover thermostats (SGNHT) [@DingN2014];

SG-MCMC on Riemann manifold:

Langevin (SGRLD) [@Patterson2013];
Hamiltonian (SGRHMC) [@MaYA2015];
Nosé-Hoover thermostats (gSGNHT) [@LiuC2016];
geodesic (SGGMC) [@LiuC2016];

RATTLE scheme

RATTLE [@Andersen1983], a variant of the SHAKE method [@Ryckaert1977], is a time discretization scheme of constrained Hamiltonian dynamics, with applications in molecular dynamics that models constraints such as bond length and angle. RATTLE is symmetric, symplectic, and convergent of order two [@Hairer2006, VII.1.4 Thm 1.3]. Let $V: \mathbb{R}^n \mapsto \mathbb{R}$ be a potential function and $F: \mathbb{R}^n \mapsto \mathbb{R}^c$ be a constraint function. A constrained Hamiltonian dynamics can be written as:

$$\begin{cases} \text{d} q = p \text{d} t \\ \text{d} p = -\nabla V(q) \text{d} t + \nabla F(q) \text{d} \lambda \\ F(q) = 0 \end{cases}$$

RATTLE scheme:

$$\begin{cases} p^{n+1/2} = p^n - \nabla V(q^n) \Delta t/2 + \nabla F(q^n) \lambda^{n+1/2} \\ q^{n+1} = q^n + p^{n+1/2} \Delta t \\ F(q^{n+1}) = 0 & (C_q) \\ p^{n+1} = p^{n+1/2} - \nabla V(q^{n+1}) \Delta t/2 + \nabla F(q^{n+1}) \lambda^{n+1} \\ (p^{n+1})^\text{T} \nabla F(q^{n+1}) = 0 & (C_p) \end{cases}$$

The position constraint $C_q$ is maintained by oblique projection, i.e. orthographic retraction. Note that, combining the first three equations gives $q^{n+1} = q^n + \Delta \tau^n + \Delta \nu^n$, where $\Delta \tau^n = (p^n - \nabla V(q^n) \Delta t / 2) \Delta t$ is a tangent step, and $\Delta \nu^n = \nabla F(q^n) \lambda^{n+1/2} \Delta t$ is a correction step in the normal directions $\nabla F(q^n)$ of the current position. Lagrange multiplier $\lambda^{n+1/2}$ is calculated so that the new position is on the constraint manifold.

Momentum constraint $C_p$ is maintained by orthogonal projection onto the tangent space at the new position. Lagrange multiplier $\lambda^{n+1}$ is calculated so that the new momentum is orthogonal to the normal directions $\nabla F(q^{n+1})$.

Time Reversibility

For sufficiently small timesteps, RATTLE is reversible w.r.t. momentum reversal, aka symmetric or time-reversible [@Hairer2006, Sec V.1 and VII.1.4 Thm 1.3]: let $\Phi_{\Delta t}(q^{n}, p^{n}) = (q^{n+1}, p^{n+1})$ denote the RATTLE scheme and let $S(q, p) = (q, -p)$ be momentum reversal, then $S \circ \Phi_{\Delta t} = \Phi_{\Delta t}^{-1} \circ S$.

For finite timesteps though, reverse projection check is necessary. [@Zappa2018] uses Metropolis adjustment for reversible random walks on constraint manifolds. Generalizing this work, [@Lelievre2019] proposes RATTLE dynamics with momentum reversal and reverse projection check, $\Psi_{\Delta t}^{\text{rev}}$.