Monte Carlo methods are methods that repeatedly draw random samples from an artificial population to solve a mathematical or statistical problem. The term "Monte Carlo" was used by von Neumann and Ulam during World War II as a code name for their secret work at Los Alamos within the Manhattan Project, which involved simulating random neutron diffusion in nuclear materials. Monte Carlo methods use true random numbers or pseudo-random numbers; quasi-Monte Carlo methods use quasi-random sequences.

Random Number Generation

Although random number generation is not part of the Monte Carlo methodology, Monte Carlo methods depend on it and have helped drive its development. Types of random numbers, in decreasing order of randomness:

  1. True random numbers: numbers generated from stochastic processes;
  2. Pseudorandom numbers: deterministic integer sequences with long period that pass some statistical tests of randomness when given a random initial state (pioneered by von Neumann in 1946 for the Manhattan Project);
  3. Low-discrepancy sequences (quasi-random sequences): every initial subsequence has low discrepancy;

Randomness tests are statistical tests of whether a binary sequence is patternless; a widely used collection of tests is the Diehard Battery of Tests [@Marsaglia1996]. TestU01 [@L’Ecuyer and Simard, 2007] is a popular randomness test suite written in ANSI C.

True Random Number Generator

True random number generator (TRNG) or hardware random number generator is a physical device that generates random numbers from a physical stochastic process, e.g. thermal noise, the photoelectric effect involving a beam splitter, and other quantum phenomena.

Random numbers can also be generated from macroscopic processes, as justified by unstable dynamical systems and chaos theory. Common practices use tools such as yarrow stalks, coins, dice, roulette wheels, and lottery machines.

Hardware RNGs produce a limited number of random bits per second. As a trade-off between randomness and speed, TRNGs typically serve as the entropy source (seed provider) for CSPRNGs or PRNGs.

Pseudo-Random Number Generator

Pseudo-random number generator (PRNG) or deterministic random bit generator (DRBG) is an algorithm using deterministic polynomial-time computable functions to generate numbers computationally indistinguishable from true random numbers: generator $\{G_k\}_{k \in \mathbb{N}_+} \subset \mathbf{PTIME}$, $G_k : \{0,1\}^k \mapsto \{0,1\}^{p(k)}$, where $k$ is the state length and $p \in \mathbb{N}_+[k]$ is the output block length, a polynomial in $k$; usually $G_k(s_i) = (s_{i+1}, y_i)$, where $s_{i+1}$ is the new state of the generator and $y_i$ is the generated pseudo-random number; let $\mathbf{x} \sim U\{0,1\}^k$ and $\mathbf{r} \sim U\{0,1\}^{p(k)}$, the generator must satisfy: $\forall A: \{0,1\}^{p(k)} \mapsto \{0,1\}$, $A \in \mathbf{PTIME}$, there exists a negligible $\mu(k) \ll 1$ such that $|\Pr\{A(G_k(\mathbf{x})) = 1\} - \Pr\{A(\mathbf{r}) = 1\}| < \mu(k)$.

Classes of PRNGs in increasing order of quality:

  1. Linear congruential generators (LCG, 线性同余): defined by the recurrence relation $N_{n+1} = (a N_{n} + c) \bmod m$; in case increment $c=0$, the LCG is called a multiplicative congruential generator (MCG) or Lehmer RNG, such as Wichmann-Hill (a minimal LCG sketch follows this list);
  2. Lagged Fibonacci generator (LFG): defined by the recurrence relation $S_{n} = S_{n-j} \star S_{n-k} \pmod m$ where $0<j<k$ and $\star$ is a general binary operator; an example LFG is subtract-with-carry [@Marsaglia and Zaman, 1991];
  3. Linear recurrences: linear feedback shift registers (LFSR), Mersenne-Twister, Well Equidistributed Long-period Linear (WELL) [@Panneton, L'Ecuyer, Matsumoto, 2006], xorshift [@Marsaglia2003];
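
As a minimal sketch of the LCG recurrence above: the multiplier, increment, and modulus below are the classic Numerical Recipes constants, chosen purely for illustration.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator: N_{n+1} = (a * N_n + c) mod m.

    Yields floats in [0, 1) obtained by scaling the integer state.
    Parameters are the Numerical Recipes constants, for illustration only.
    """
    state = seed % m
    while True:
        state = (a * state + c) % m
        yield state / m

gen = lcg(seed=12345)
sample = [next(gen) for _ in range(5)]   # five pseudo-random uniforms
```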

The first fast, high-quality PRNG was the Mersenne Twister (MT, 梅森旋转) [@Matsumoto and Nishimura, 1998], a twisted generalised LFSR (GLFSR) in rational normal form whose period is a Mersenne prime. The Mersenne Twister is by far the most widely used general-purpose PRNG, and is the default PRNG in many software systems such as GLib, GMP, GSL, R and Python. MT19937 is the most commonly used version of the Mersenne Twister, with a 624-dimensional state space of 32-bit integers, a period of $M_{19937} = 2^{19937} - 1$, and equi-distribution in up to 623 dimensions of 32-bit integers. Disadvantages of MT19937: large state buffer (2496 bytes), slow, does not pass all randomness tests (linear complexity), very slow recovery from states with a large number of zeros.
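
Python's standard random module is built on MT19937, so the properties above can be inspected directly; a small sketch, assuming the CPython standard library (the 625 entries of the state are the 624 state words plus an index position):

```python
import random

rng = random.Random(2024)          # Python's Random wraps MT19937
version, state, gauss_next = rng.getstate()
print(len(state))                  # 625: 624 state words plus an index

rng2 = random.Random(2024)         # same seed -> identical stream
assert [rng.random() for _ in range(3)] == [rng2.random() for _ in range(3)]
```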

Cryptographically secure PRNG (CSPRNG) is one that is considered suitable for use in cryptography, such as key generation. Forward-secure PRNG is one such that given a random initial state, any finite-time sequence it generates is computationally indistinguishable from an equal-length sequence of true random numbers.

Quasi-Random Sequence

Discrepancy of a finite sequence in the unit n-interval is the uniform distance between the n-volume and the empirical measure defined by the sequence, which estimates the (relative) n-volume of an n-interval by the proportion of points in the sequence that are in the n-interval: $D(P) = \sup_{B \in \mathcal{J}} \left| \frac{|P \cap B|}{|P|} - \lambda(B) \right|$, where $P = (x_i)_{i = 1}^N$, $x_i \in I^n$, and $\mathcal{J} = \{[a, b] : a, b \in I^n\}$. Low-discrepancy sequence or quasi-random sequence is a sequence with low discrepancy, especially lower than that of uniformly distributed true or pseudo-random sequences.
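
As a sketch, the van der Corput sequence (the one-dimensional building block of the Halton sequence) is a simple low-discrepancy sequence: the radical inverse of the integer index in a given base.

```python
def van_der_corput(i, base=2):
    """Radical inverse of integer i in the given base: a low-discrepancy
    point in [0, 1). The sequence for i = 1, 2, ... fills the unit
    interval much more evenly than i.i.d. uniform draws."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x

points = [van_der_corput(i) for i in range(1, 9)]
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875, 0.0625]
```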

Sampling

Monte Carlo sampling...

Exact sampling draws independent samples from certain classes of probability distributions.

Direct Methods

Convolution method samples from a probability model as the sum of some other random variables, such that the convolution of their distributions equals the target distribution: given $f(x) = g_1 * \dots * g_m(x)$, sample by $\mathbf{x} = \sum_{i=1}^m \mathbf{x}_i$, where $\mathbf{x}_i \sim g_i$. Composition method samples from a mixture probability model by first sampling the component of the mixture and then sampling from that component's probability model: given $f(x) = \sum_{i=1}^m p_i g_i(x)$, sample $\mathbf{y}$ with $\Pr\{\mathbf{y} = i\} = p_i$, then sample by $\mathbf{x} = \sum_{i=1}^m \mathbf{x}_i 1(\mathbf{y}=i)$, where $\mathbf{x}_i \sim g_i$.
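
A minimal sketch of the composition method for a two-component Gaussian mixture (the weights, means, and standard deviations are illustrative): draw the component index first, then draw from that component.

```python
import random

def sample_mixture(p=(0.3, 0.7), means=(0.0, 4.0), sds=(1.0, 0.5)):
    """Composition method: f(x) = sum_i p_i g_i(x) with Gaussian components."""
    y = random.choices(range(len(p)), weights=p)[0]  # sample the component index
    return random.gauss(means[y], sds[y])            # sample from that component

draws = [sample_mixture() for _ in range(1000)]
```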

Inverse transform method samples a real random variable by sampling from the standard uniform distribution and transforming it with the inverse of the cumulative distribution function: given $\mathbf{x} \sim F$ with $F^{-1}$ easy to compute, sample by $\mathbf{x} = F^{-1}(\mathbf{u})$, where $\mathbf{u} \sim U[0,1]$. Flow-based probabilistic models, aka normalizing flows, generalize the inverse transform method to random vectors, expressing a continuous probability distribution as a sequence of simple, invertible, differentiable transformations of a simple base distribution.
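
A sketch of the inverse transform method for the exponential distribution, whose inverse CDF is $F^{-1}(u) = -\ln(1-u)/\lambda$:

```python
import math
import random

def sample_exponential(rate):
    """Inverse transform: u ~ U[0,1], return F^{-1}(u) = -ln(1 - u) / rate."""
    u = random.random()
    return -math.log1p(-u) / rate   # log1p(-u) = ln(1 - u), numerically stable

draws = [sample_exponential(rate=2.0) for _ in range(1000)]
```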

Indirect Method

Acceptance-rejection method or rejection sampling [@vonNeumann1951] samples from a probability density function using a proposal PDF that is easy to sample from and, after scaling, majorizes the target PDF: given $C \ge 1$ with $C g(x) \ge f(x)$, sample $\mathbf{x} \sim g$ and $\mathbf{u} \sim U[0, 1]$, and return $\mathbf{x}$ if $\mathbf{u} < \frac{f(\mathbf{x})}{C g(\mathbf{x})}$; otherwise repeat. The acceptance probability is the reciprocal of the scaling constant, so the acceptance-rejection method is most efficient when the scaling constant is small (close to one).
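
A sketch of acceptance-rejection sampling for the Beta(2, 2) density $f(x) = 6x(1-x)$ on $[0,1]$, with the uniform proposal $g = U[0,1]$ and scaling constant $C = 1.5 = \max f$ (the target distribution is chosen purely for illustration):

```python
import random

def beta22_pdf(x):
    return 6.0 * x * (1.0 - x)           # Beta(2, 2) density on [0, 1]

def sample_beta22():
    """Rejection sampling with proposal g = U[0,1] and C = 1.5 = max f."""
    while True:
        x = random.random()               # proposal draw x ~ g
        u = random.random()
        if u < beta22_pdf(x) / 1.5:       # accept with probability f(x) / (C g(x))
            return x

draws = [sample_beta22() for _ in range(1000)]   # acceptance rate is 1/C = 2/3
```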

Methods for Special Distributions

To sample from the standard Gaussian distribution given uniform pseudo-random numbers, several algorithms have been developed that are faster than the inverse transform method. Box–Muller transform [@Box and Muller, 1958; @Paley and Wiener, 1934]: sample $(\mathbf{u}_1, \mathbf{u}_2) \sim U[0,1]^2$, let $\mathbf{r} = \sqrt{-2 \ln \mathbf{u}_1}$ and $\boldsymbol{\theta} = 2 \pi \mathbf{u}_2$, then $(\mathbf{r} \cos \boldsymbol{\theta}, \mathbf{r} \sin \boldsymbol{\theta}) \sim N(0, I_2)$. Marsaglia polar method [@Marsaglia and Bray, 1964] is a variant of the Box–Muller transform that avoids the trigonometric functions but uses rejection sampling (efficiency $1/C = π/4$): sample $(\mathbf{u}_1, \mathbf{u}_2) \sim U(\mathbb{B}^2)$ by the acceptance-rejection method, let $\mathbf{s} = \mathbf{u}_1^2 + \mathbf{u}_2^2$, notice that $\mathbf{r} = \sqrt{-2 \ln \mathbf{s}}$, $\cos \boldsymbol{\theta} = \mathbf{u}_1 / \sqrt{\mathbf{s}}$, and $\sin \boldsymbol{\theta} = \mathbf{u}_2 / \sqrt{\mathbf{s}}$, we have $\sqrt{\frac{-2 \ln \mathbf{s}}{\mathbf{s}}} (\mathbf{u}_1 , \mathbf{u}_2) \sim N(0, I_2)$.
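
A sketch of the Marsaglia polar method as described above (rejection onto the unit disk, then the transform); each call returns a pair of independent standard normals.

```python
import math
import random

def polar_gaussian_pair():
    """Marsaglia polar method: returns two independent N(0, 1) variates."""
    while True:
        u1 = 2.0 * random.random() - 1.0      # uniform on [-1, 1]
        u2 = 2.0 * random.random() - 1.0
        s = u1 * u1 + u2 * u2
        if 0.0 < s < 1.0:                     # accept points inside the unit disk
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u1 * factor, u2 * factor

z1, z2 = polar_gaussian_pair()
```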

To sample from a discrete distribution, partition the unit interval into intervals of lengths equal to the probabilities, and then take uniform pseudo-random numbers and search for the index of the enclosing interval. Time complexity of search algorithms: linear search, $O(n)$; binary search, $O(\log n)$; indexed search (aka cutpoint method), roughly constant expected time. Alias method [@Walker1977] samples from a discrete distribution in constant time $O(1)$, by looking up pre-computed tables instead of searching, with table generation time $O(n)$, or $O(n \log n)$ for an optimized probability table: it creates a probability table of length $n$ with values close to 1, and an alias table containing alternative indices; at sampling, first draw a random index and a random fraction, return the index if the fraction is less than the corresponding probability, otherwise return the alternative index.
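
A sketch of the alias method using one common $O(n)$ table construction (often attributed to Vose); after the probability and alias tables are built once, each draw costs one random index, one random fraction, and one comparison.

```python
import random

def build_alias_tables(probs):
    """O(n) construction of the probability and alias tables (Vose's variant)."""
    n = len(probs)
    prob = [p * n for p in probs]           # scale so the average value is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                        # top up stratum s from stratum l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for leftover in small + large:          # guard against floating-point leftovers
        prob[leftover] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """O(1) sampling: random index plus random fraction, then one comparison."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_tables([0.1, 0.2, 0.3, 0.4])
draws = [alias_draw(prob, alias) for _ in range(1000)]
```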

To sample from a monotonic PDF, the ziggurat method (金字形神塔) [@Marsaglia and Tsang, 1984, 2000] covers the PDF with horizontal equal-area rectangles except for a tail of the same area, samples the inner rectangle within each covering shape by comparison, samples the remainder of the PDF in each covering rectangle by the rejection method, and samples the tail by some efficient method, e.g. [@Marsaglia1964] for the Gaussian tail. The ziggurat algorithm (32-bit version; non-increasing PDF on the non-negative half-line): find $(x_i)_{i=0}^{255}$ such that $x_0 = 0$, $x_{255} = r$, and $(f(x_{i-1}) - f(x_i)) x_i = v$, where $v$ is the common area of the covering pieces; store $(f_i)_{i=0}^{255}$, where $f_i = f(x_i)$; compute $(k_i)_{i=0}^{255}$, where $k_0 = \lfloor (r f(r) / v) 2^{32} \rfloor$ and $k_i = \lfloor (x_{i-1} / x_i) 2^{32} \rfloor$ otherwise; compute $(w_i)_{i=0}^{255}$, where $w_0 = 2^{-32} v/f(r)$ and $w_i = 2^{-32} x_i$ otherwise; at sampling, draw $\mathbf{j} \sim U\{0, \dots, 2^{32}-1\}$, let $\mathbf{i} = \mathbf{j} \bmod 2^8$ and $\mathbf{x} = \mathbf{j} w_{\mathbf{i}}$; if $\mathbf{j} < k_\mathbf{i}$, return $\mathbf{x}$; if $\mathbf{i} = 0$, return a sample from the tail $f(x) \mid x > x_{255}$; otherwise, sample $\mathbf{u} \sim U[0,1]$ and return $\mathbf{x}$ if $\mathbf{u} < \frac{f(\mathbf{x}) - f_\mathbf{i}}{f_{\mathbf{i}-1} - f_\mathbf{i}}$, else restart. The ziggurat method improves on other methods for sampling from, e.g., the Gaussian or exponential distribution, because in most cases (~99%) it avoids evaluating expensive math functions such as logarithm and square root, requiring only two table lookups, a floating-point comparison, and a floating-point multiplication.

Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) methods create correlated, approximate samples from an arbitrary distribution, e.g. the posterior distribution in (objective) Bayesian inference, by simulating a Markov chain whose stationary distribution coincides with the target distribution.

  • Metropolis-Hastings algorithm [@Hastings1970], requires a good proposal distribution (a sketch follows this list);
  • Gibbs sampling [@Geman1984], requires the conditional distribution;
  • Hamiltonian Monte Carlo (HMC; previously known as hybrid Monte Carlo) [@Duane1987];
  • Metropolis-adjusted Langevin algorithm (MALA) [@Besag1994];
  • Wang and Landau algorithm [@WangFG2001], slice sampling [@Neal2003];
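
A minimal sketch of the Metropolis-Hastings algorithm with a Gaussian random-walk proposal (an illustrative choice), targeting a density known only up to a normalizing constant; here the target is a standard normal, purely for illustration.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0):
    """Random-walk Metropolis-Hastings; the symmetric proposal makes the
    Hastings correction cancel, leaving the Metropolis acceptance ratio."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)
        log_alpha = log_target(proposal) - log_target(x)
        if random.random() < math.exp(min(0.0, log_alpha)):  # accept w.p. min(1, ratio)
            x = proposal
        samples.append(x)                                     # correlated draws from the target
    return samples

# Target: standard normal, up to an additive constant in the log density.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=10_000)
```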

Stochastic evolution equations.

Mean field particle Monte Carlo:

  • quantum Monte Carlo, diffusion Monte Carlo;
  • sequential Monte Carlo (particle filter);

Perfect sampling is an MCMC method that samples exactly from a target distribution, e.g. coupling from the past [@Propp and Wilson, 1996]. But perfect sampling is much more computationally intensive than simple MCMC, and is difficult or impossible to use for most continuous simulation systems.

Simulation

Monte Carlo simulation is a type of simulation (a fictitious representation of reality) that uses repeated Monte Carlo samples to assess the performance of a statistical method, e.g. sampling distribution, sensitivity, misspecification.

The typical setup in Monte Carlo simulation is as follows. (We use bold face to denote random entities.) $\mathbf{x}$ is a random vector which we can sample from, whose probability density function is $f$. $H(x)$ is a real-valued function on the outcomes, which we call the "performance function". Expectation of the performance $l = \mathbb{E}H(\mathbf{x})$ cannot be evaluated analytically. We can use simulation to estimate the expectation by sample mean: $\mathbf{\hat l} = N^{-1} \sum_{i=1}^N H(\mathbf{x}_i)$. Variance reduction techniques are methods to obtain more accurate estimators of an expectation using known information about the random model.
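
In this notation, a crude Monte Carlo sketch: estimate $l = \mathbb{E}H(\mathbf{x})$ with the illustrative choices $\mathbf{x} \sim N(0,1)$ and $H(x) = x^2$, so that the true value is 1.

```python
import random

def crude_monte_carlo(H, sample_x, N=100_000):
    """Estimate E[H(x)] by the sample mean over N i.i.d. draws of x."""
    return sum(H(sample_x()) for _ in range(N)) / N

# Illustrative model: x ~ N(0, 1), H(x) = x^2, so E[H(x)] = 1.
l_hat = crude_monte_carlo(H=lambda x: x * x,
                          sample_x=lambda: random.gauss(0.0, 1.0))
```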

Importance sampling samples from a distribution whose density relative to that of the original random vector is approximately proportional to the absolute value of the performance function. Importance sampling density, proposal density, or instrumental density $g$ is the density used in importance sampling. Likelihood ratio $W(x)$ is the ratio of the original density and the proposal density: $W(x) = f(x) / g(x)$. Importance sampling estimator or likelihood ratio estimator is an estimator of the expectation, defined by: $\mathbf{\hat l} = \frac{1}{N} \sum_{k=1}^N \frac{H(\mathbf{x}_k) f(\mathbf{x}_k)}{g(\mathbf{x}_k)}$, where $\mathbf{x}_k$ are distributed as the proposal density. An importance sampling estimator has the smallest variance when the proposal density is proportional to the original density multiplied by the absolute value of the performance function: $g^*(x) = \frac{|H(x)| f(x)}{\int |H(x)| f(x) dx}$; if the performance function is non-negative, importance sampling estimator with the optimal proposal density has zero variance.
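
A sketch of the importance sampling estimator for a rare-event expectation: $H(x) = 1(x > 4)$ with $\mathbf{x} \sim N(0,1)$, using the proposal $g = N(4, 1)$ shifted towards the rare region (the threshold and proposal are illustrative choices).

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling(N=100_000):
    """Estimate P(x > 4) for x ~ N(0, 1) using the proposal g = N(4, 1)."""
    total = 0.0
    for _ in range(N):
        x = random.gauss(4.0, 1.0)                       # draw from the proposal g
        if x > 4.0:                                      # H(x) = 1(x > 4)
            total += normal_pdf(x) / normal_pdf(x, 4.0)  # likelihood ratio W(x) = f/g
    return total / N

p_hat = importance_sampling()   # close to 1 - Phi(4), about 3.2e-5
```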

Stratified sampling partitions the sample space into disjoint regions of known probability (aka strata; singular: stratum) and samples from each. Assuming that we can sample the random vector conditional on a categorical random variable $\mathbf{y}$ with a known probability vector, the stratified sampling estimator is defined as: $\mathbf{\widehat{l^s}} = \sum_{i=1}^m \frac{p_i}{N_i} \sum_{j=1}^{N_i} H(\mathbf{x}_{ij})$, where $\mathbf{x}_{ij} \sim \mathbf{x}|_{y=i}$. The variance of the stratified sampling estimator is: $\text{Var}~\mathbf{\widehat{l^s}} = \sum_{i=1}^m \frac{p_i^2 \sigma_i^2}{N_i}$, where $\sigma_i^2 = \text{Var}(H(\mathbf{x}|_{y=i}))$. Systematic sampling is stratified sampling with equal probability and equal sample size in each stratum: $\mathbf{\widehat{l^s}} = \frac{1}{N} \sum_{i=1}^m \sum_{j=1}^{N/m} H(\mathbf{x}_{ij})$.
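
A sketch of the stratified sampling estimator with $m$ equal-probability strata of $U[0,1]$ (so $p_i = 1/m$ and $\mathbf{x}|_{y=i} \sim U[(i-1)/m, i/m]$), applied to the illustrative performance function $H(x) = e^x$:

```python
import math
import random

def stratified_estimate(H, m=10, n_per_stratum=1_000):
    """Stratified sampling over U[0,1]: strata [(i-1)/m, i/m], each with p_i = 1/m."""
    total = 0.0
    for i in range(m):
        lo = i / m
        stratum_sum = sum(H(lo + random.random() / m) for _ in range(n_per_stratum))
        total += (1.0 / m) * stratum_sum / n_per_stratum   # p_i times the stratum mean
    return total

l_hat = stratified_estimate(math.exp)   # estimates E[e^U] = e - 1, about 1.718
```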

Conditional Monte Carlo or Rao-Blackwellization translates the expectation into a conditional expectation that can be evaluated analytically, and estimates the expectation of the new function by the sample mean: $\mathbf{\widehat{l^c}} = \frac{1}{N} \sum_{i=1}^N \mathbb{E}[H(\mathbf{x})|\mathbf{y}_i]$. The variance of the conditional Monte Carlo estimator is $\frac{1}{N} \text{Var}~\mathbb{E}(H(\mathbf{x})|\mathbf{y})$. Because $\text{Var}~H(\mathbf{x}) = \mathbb{E} \text{Var}(H(\mathbf{x})|\mathbf{y}) + \text{Var}~\mathbb{E}(H(\mathbf{x})|\mathbf{y})$, the reduction in variance is good as long as the residual variance $\text{Var}~\mathbb{E}(H(\mathbf{x})|\mathbf{y})$ is relatively small.
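
A sketch of conditional Monte Carlo for $l = P(\mathbf{y} + \mathbf{z} > t)$ with independent $\mathbf{y}, \mathbf{z} \sim \text{Exp}(1)$ (an illustrative model): conditioning on $\mathbf{y}$, the inner expectation $\mathbb{E}[1(\mathbf{y}+\mathbf{z}>t) \mid \mathbf{y}] = e^{-\max(t - \mathbf{y}, 0)}$ is available in closed form, so only $\mathbf{y}$ needs to be simulated.

```python
import math
import random

def conditional_mc(t=5.0, N=100_000):
    """Estimate P(y + z > t), y, z ~ Exp(1) i.i.d., by averaging the
    conditional expectation E[1(y + z > t) | y] = exp(-max(t - y, 0))."""
    total = 0.0
    for _ in range(N):
        y = random.expovariate(1.0)
        total += math.exp(-max(t - y, 0.0))   # closed-form inner expectation
    return total / N

p_hat = conditional_mc()   # true value (1 + t) * exp(-t), about 0.040 for t = 5
```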

Splitting method constructs a random object via a sequence of intermediate objects, decomposing an estimation problem into a sequence of problems. The splitting method is particularly suitable for rare-event simulation. To estimate a rare-event probability $l = P(\mathbf{x} \in A)$, form a decreasing chain of events $A_1 \supset \dots \supset A_T = A$, then at each stage sample the random object while initializing on the successful subsample of the previous stage: $\mathbf{x}_1 \sim f$, and for $t \in \{1, \dots, T-1\}$, $\mathbf{x}_{t+1} \sim f \mid \mathbf{x}_t \cap A_t$; the estimator is $\mathbf{\hat l} = \frac{|\mathbf{X}_T \cap A|}{\prod_{t=1}^T |\mathbf{X}_t|}$. The splitting method has other applications, e.g. approximate sampling from a distribution, Monte Carlo counting, and randomized optimization.

Sensitivity analysis of a system is the evaluation of sensitivities (e.g. gradient, Hessian) of performance measures w.r.t. controllable system parameters for selecting parameters that optimize the performance measures.

Numerical Integration

Multivariate integration with irregular boundaries:

  • Recursive stratified sampling
  • Importance sampling: the Metropolis algorithm

In numerical integration, quasi-Monte Carlo methods typically have a faster order of convergence than Monte Carlo methods.

Stochastic Optimization

Stochastic programs are optimization problems where the objective function and some of the constraints are unknown and need to be obtained via simulation.


🏷 Category=Computation