Markov chain Monte Carlo (MCMC) methods sample from an arbitrary target distribution approximately, by simulating a Markov chain whose stationary distribution coincides with the target distribution. MCMC methods can also be used to sample from posterior distributions in Bayesian inference.

  1. Metropolis adjustment.
  2. Langevin dynamics.
  3. Hamiltonian/hybrid Monte Carlo (HMC).

Detailed Balance

A Markov chain $P$ satisfies detailed balance w.r.t. to a distribution $\pi$ if, starting with $\pi$, the net direct flow between any pair of nodes is zero: $\forall i, j \in n$, $\pi_i p_{ij} = \pi_j p_{ji}$. Alternatively, we say the Markov chain is reversible w.r.t. to $\pi$, because it is statistically indistinguishable running forward or backward in time. The transition matrix of a reversible Markov chain is similar to a symmetric matrix, and therefore has $n$ real eigenvalues and eigenvectors. The spectral decomposition gives insight into the timescales associated with the Markov chain. For example, the second-largest eigenvalue in absolute value affects the time for a chain to reach equilibrium.

If a Markov chain satisfies detailed balance w.r.t. to a distribution, then the distribution is a stationary distribution of the chain. Therefore, detailed balance gives a tractable way to find a stationary distribution.

Given a discrete, not necessarily normalized distribution, the Metropolis-Hastings algorithm constructs a Markov chain whose stationary distribution is the given distribution.

Metropolis-Hasting algorithm for target distribution $\pi$:

  1. Proposal state $Y$ given state $X_n$ satisfies $P(Y = j | X_n = i) = H_{ij}$,
    • for discrete distribution, proposal matrix $H$ is any irreducible transition matrix;
    • for continuous distribution, proposal kernel $h(y|x)$ is a conditional probability density;
  2. Acceptance probability of a proposal step (Metropolis-Hastings formula): $a_{ij} = \min\{1, (\pi_j H_{ji}) / (\pi_i H_{ij})\}$;

The final induced Markov chain has transition probabilities $P_{ij} = H_{ij} a_{ij}$, $j \ne i$, and $P_{ii} = 1 - \sum_{j \ne i} P_{ij}$. Because $P$ satisfies detailed balance w.r.t. $\pi$, the chain has stationary distribution $\pi$.

Your proposal steps need to be large enough to explore the whole space quickly, but not so large that they are often rejected. A rule of thumb is to achieve a desired average acceptance ratio about 25-50%, depending on the dimension.

The number of runs before a chain actually reaches its stationary distribution is sometimes referred to as the burn-in time $T_0$. The total number of runs should be much larger: $T \gg T_0$.

Because the generated points are correlated even when equilibrium is reached, the number of mutually independent points is less than that of all generated points. When computing a statistic $f(\mathbf{x})$, the effective sample size should be devided by the correlation time: $n_{\text{eff}} = n / \tau_f$.

Diffusion-based MCMC

Diffusion-based MCMC samples from a target distribution by simulating a diffusion process whose stochastic differential follows an autonomous Ito stochastic differential equation (ISDE): $\text{d} \mathbf{x} = a(\mathbf{x})~\text{d} t + b(\mathbf{x})~\text{d} \mathbf{w}$, such that $\mathbf{x}_t \overset{d}{\to} \rho(x)$. In first-order Langevin dynamics (LD), $a(x) = - \nabla U(x)$ and $b = \sqrt{2} I_n$, where potential $U(x) = -\log f(x)$ and $f(x) \propto \rho(x)$ is an unnormalized density. In second-order Langevin dynamics, aka Hamiltonian dyanmics, $\text{d} \mathbf{x} = \mathbf{v}~\text{d} t$, $\text{d} \mathbf{v} = (- \nabla U(x) - D \mathbf{v})~\text{d} t + \sqrt{2D} I_n~\text{d} \mathbf{w}$, where dissipation parameter $D > 0$.

Numerical integration schemes for this ISDE approximate the transition function of the continuous-time diffusion with repeated application of a transition function of a small time step: given $P^t = e^{t \mathcal{L}}$, $t \in T$, design $P_h$, $h \in T$, such that $P_h^l \mathbf{x} \approx P^{l h} \mathbf{x}$. We denote the numerical solution as $\mathbf{x}^n_t$, e.g. at the l-th step $\mathbf{x}^n_{l h} = P_h^l \mathbf{x}_0$; we further simplify this notation by $\mathbf{x}_l$ when it causes no confusion. Kth-order local integrator is a numerical integrator $P_h$ whose action on every smooth bounded function approximates that of the diffusion to the $K$-th order: $\forall f \in (C^2(X), \|\cdot\|_\infty)$, $P_h f = P^h f + \mathcal{O}(h^{K+1})$.

The simpliest numerical integrator is the the Euler integrator, which is of order one. The Euler integrator for the first-order Langevin dynamics is: $\mathbf{x}_l = \mathbf{x}_{l-1} - \nabla U(\mathbf{x}_{l-1}) h + \sqrt{2h} \mathbf{z}$ where $\mathbf{z}$ is the n-dimensional standard Gaussian vector. The Euler integrator for the first-order Hamiltonian dynamics is: $\mathbf{v}_l = \mathbf{v}_{l-1} - \nabla U(\mathbf{x}_{l-1}) h - D \mathbf{v}_{l-1} h + \sqrt{2Dh} \mathbf{z}$ and $\mathbf{x}_l = \mathbf{x}_{l-1} + \mathbf{v}_l h$. Symmetric splitting integrator (SSI) is a second-order integrator, which splits the local generator into a symmetric sequence of sub-generators that can be solved analytically [@ChenCY2015]; it also applies to stochastic gradient MCMC, where a stochastic version of the full gradient is used.

MCMC on Riemann manifolds:

  • constrained HMC (CHMC) [@Brubaker2012], has inner iteration;
  • geodesic Monte Carlo (GMC) [@Byrne2013], on manifolds with known geodesic flow, e.g. Stiefel manifolds, incl. spheres;

Stochastic gradient MCMC (SG-MCMC):

  • Langevin dynamics (SGLD) [@ChenTQ2014a];
  • HMC (SGHMC) [@ChenTQ2014b];
  • Nosé-Hoover thermostats (SGNHT) [@DingN2014];

SG-MCMC on Riemann manifold:

  • Langevin (SGRLD) [@Patterson2013];
  • Hamiltonian (SGRHMC) [@MaYA2015];
  • Nosé-Hoover thermostats (gSGNHT) [@LiuC2016];
  • geodesic (SGGMC) [@LiuC2016];

RATTLE scheme

RATTLE [@Andersen1983], a variant of the SHAKE method [@Ryckaert1977], is a time discretization scheme of constrained Hamiltonian dynamics, with applications in molecular dynamics that models constraints such as bond length and angle. RATTLE is symmetric, symplectic, and convergent of order two [@Hairer2006, VII.1.4 Thm 1.3]. Let $V: \mathbb{R}^n \mapsto \mathbb{R}$ be a potential function and $F: \mathbb{R}^n \mapsto \mathbb{R}^c$ be a constraint function. A constrained Hamiltonian dynamics can be written as:

$$\begin{cases} \text{d} q = p \text{d} t \\ \text{d} p = -\nabla V(q) \text{d} t + \nabla F(q) \text{d} \lambda \\ F(q) = 0 \end{cases}$$

RATTLE scheme:

$$\begin{cases} p^{n+1/2} = p^n - \nabla V(q^n) \Delta t/2 + \nabla F(q^n) \lambda^{n+1/2} \\ q^{n+1} = q^n + p^{n+1/2} \Delta t \\ F(q^{n+1}) = 0 & (C_q) \\ p^{n+1} = p^{n+1/2} - \nabla V(q^{n+1}) \Delta t/2 + \nabla F(q^{n+1}) \lambda^{n+1} \\ (p^{n+1})^\text{T} \nabla F(q^{n+1}) = 0 & (C_p) \end{cases}$$

The position constraint $C_q$ is maintained by oblique projection, i.e. orthographic retraction. Note that, combining the first three equations gives $q^{n+1} = q^n + \Delta \tau^n + \Delta \nu^n$, where $\Delta \tau^n = (p^n - \nabla V(q^n) \Delta t / 2) \Delta t$ is a tangent step, and $\Delta \nu^n = \nabla F(q^n) \lambda^{n+1/2} \Delta t$ is a correction step in the normal directions $\nabla F(q^n)$ of the current position. Lagrange multiplier $\lambda^{n+1/2}$ is calculated so that the new position is on the constraint manifold.

Momentum constraint $C_p$ is maintained by orthogonal projection onto the tangent space at the new position. Lagrange multiplier $\lambda^{n+1}$ is calculated so that the new momentum is orthogonal to the normal directions $\nabla F(q^{n+1})$.

Time Reversibility

For sufficiently small timesteps, RATTLE is reversible w.r.t. momentum reversal, aka symmetric or time-reversible [@Hairer2006, Sec V.1 and VII.1.4 Thm 1.3]: let $\Phi_{\Delta t}(q^{n}, p^{n}) = (q^{n+1}, p^{n+1})$ denote the RATTLE scheme and let $S(q, p) = (q, -p)$ be momentum reversal, then $S \circ \Phi_{\Delta t} = \Phi_{\Delta t}^{-1} \circ S$.

For finite timesteps though, reverse projection check is necessary. [@Zappa2018] uses Metropolis adjustment for reversible random walks on constraint manifolds. Generalizing this work, [@Lelievre2019] proposes RATTLE dynamics with momentum reversal and reverse projection check, $\Psi_{\Delta t}^{\text{rev}}$.