Monte Carlo methods

Monte Carlo methods are methods that repeatly draw random samples from an artificial population to solve a mathematical or statistical problem. The term "Monte Carlo" was used by von Neumann and Ulam during World War II as a code word for the Manhattan Project at Los Alamos, which involved simulation of random neutron diffusion in nuclear materials. Monte Carlo methods use true random numbers or pseudo-random numbers; quasi-Monte Carlo methods use quasi-random sequences.

Random Number Generation

Although random number generation is not part of the Monte Carlo methodology, the latter feeds from and has helped the development of the former. Types of random numbers in decreasing order of randomness:

True random numbers: numbers generated from stochastic processes;
Pseudorandom numbers: deterministic integer sequences with long period that pass some statistical tests of random sequences when provided a random initial state (pioneered by von Neumann in 1946 for the Manhattan project);
Low-discrepancy sequences (quasi-random sequences): all subsequences have low discrepancy;

Randomness tests are statistical tests on whether a binary sequence is patternless; a widely used collection of tests is the Diehard Battery of Tests [@Marsaglia1996]. TestU01 [@L’Ecuyer and Simard, 2007] is a popular randomness test suite written in ANSI C.

True Random Number Generator

True random number generator (TRNG) or hardware random number generator is a physical device that generates random numbers from a physical stochastic process, e.g. thermal noise, the photoelectric effect involving a beam splitter, and other quantum phenomena.

Random number can also be generated from macroscopic processes, justified by unstable dynamic systems and chaos theory. Common practices use tools such as yarrow, coin, dice, roulette wheel, and lottery machines.

Hardware RNGs have limited random bits per second. As a trade-off between randomness and speed, TRNGs typically serve as the entropy source (seed provider) for CSPRNGs or PRNGs.

Pseudo-Random Number Generator

Pseudo-random number generator (PRNG) or deterministic random bit generator (DRBG) is an algorithm using deterministic polynomial-time computable functions to generate numbers computationally indistinguishable from true random numbers: generator $\{G_k\}_{k \in \mathbb{N_+}} \subset \mathbf{PTIME}$, $G_k : \{0,1\}^k \mapsto \{0,1\}^{p(k)}$, where $k$ is state length and $p \in \mathcal{N_+}[k]$ is output block length; usually $G_k(s_i) = (s_{i+1}, y_i)$, where $s_{i+1}$ is the new state of the generator and $y_i$ is the generated pseudo-random number; let $\mathbf{x} \sim U\{0,1\}^k$ and $\mathbf{r} \sim U\{0,1\}^{p(k)}$, the generator must satisfy $\forall A: \{0,1\}^{p(k)} \mapsto \{0,1\}$, $A \in \mathbf{PTIME}$, $\exists \mu(k) \ll 1$: $|\Pr\{A(G_k(\mathbf{x})) = 1\} - \Pr\{A(\mathbf{r}) = 1\}| < \mu(k)$.

Classes of PRNGs in increasing order of quality:

Linear congruential generators (LCG, 线性同余): defined by the recurrence relation $N_{n+1} = (a N_{n} + c) \bmod m$; in case increment $c=0$, the LCG is called a multiplicative congruential generator (MCG) or Lehmer RNG, such as Wichmann-Hill;
Lagged Fibonacci generator (LFG): defined by the recurrence relation $S_{n} = S_{n-j} \star S_{n-k} \pmod m$ where $0<j<k$ and $\star$ is a general binary operator; an example LFG is subtract-with-carry [@Marsaglia and Zaman, 1991];
Linear recurrences: linear feedback shift registers (LFSR), Mersenne-Twister, Well Equidistributed Long-period Linear (WELL) [@Panneton, L'Ecuyer, Matsumoto, 2006], xorshift [@Marsaglia2003];

The first fast and high-quality PRNG is Mersenne-Twister (MT, 梅森旋转) [@Matsumoto and Nishimura, 1998], which is a twisted generalised LFSR (GLFSR) of rational normal form with period of a Mersenne prime. Mersenne-Twister is by far the most widely used general-purpose PRNG, and is the default PRNG in may software systems such as GLib, GMP, GSL, R and Python. MT19937 is the most commonly used version of Mersenne Twister, which has a 624-dimensional state space of 32-bit integers, a period of $M_{19937} = 2^{19937} - 1$, and is equi-distributed in up to 623 dimensions of 32-bit integers. Disadvantages of MT19937: large state buffer (2496 Bytes), slow, does not pass all randomness tests (linear complexity), very slow recovery from state spaces with a large number of zeros.

Cryptographically secure PRNG (CSPRNG) is one that is considered suitable for use in cryptography, such as key generation. Forward-secure PRNG is one such that given a random initial state, any finite-time sequence it generates is computationally indistinguishable from an equal-length sequence of true random numbers.

Quasi-Random Sequence

Discrepancy of a finite sequence in the unit n-interval is the uniform distance between the n-volume and the empirical measure defined by the sequence, which estimates the (relative) n-volume of an n-interval by the proportion of points in the sequence that are in the n-interval: $D(P) = \sup_{B \in \mathcal{J}} \left| \frac{|P \cap B|}{|P|} - \lambda(B) \right|$, where $P = (x_i)_{i = 1}^N$, $x_i \in \mathbb{R}^n$, and $\mathcal{J} = \{[a, b] : a, b \in I^n\}$. Low-discrepancy sequence or quasi-random sequence is a sequence with low discrepancy, especially lower than that of uniformly distributed true or pseudo-random sequences.

Sampling

Monte Carlo sampling...

Exact sampling, independent samples from certain classes of probability distributions.

Direct Methods

Convolution method samples from a probability model as the sum of some other random variables, such that the convolution of those distributions equals the target distribution: given $f(x) = g_1 * \dots * g_m(x)$, sample by $\mathbf{x} = \sum_{i=1}^m \mathbf{x}_i$, where $\mathbf{x}_i \sim g_i$. Composition method samples from a mixture probability model by first sampling the component of the mixture and then sampling the component probability model: given $f(x) = \sum_{i=1}^m p_i g_i(x)$, sample by $\mathbf{x} = \sum_{i=1}^m \mathbf{x}_i 1(\mathbf{y}=i)$.

Inverse transform method samples a continuous random variable by sampling from the standard uniform distribution and transforming it with the inverse of the cumulative distribution function: given $\mathbf{x} \sim F$, assume $F^{-1}$ is easy to compute, sample by $\mathbf{x} = F^{-1}(\mathbf{u})$, where $\mathbf{u} \sim U[0,1]$. For random vectors, the components can be generated sequentially using the inverses of conditional CDFs: $\mathbf{x}_1 = F_1^{-1}(\mathbf{u}_1)$, and $\mathbf{x}_i = F_i^{-1}(\mathbf{u}_i \mid \mathbf{x}_1, \cdots, \mathbf{x}_{i-1})$. Flow-based probabilistic model, aka normalizing flows, is an application of the inverse transform method, which expresses a continuous probability distribution as a sequence of simple, invertible, differentiable transformations of a simple univariate distribution.

Acceptance-rejection Method

Acceptance-rejection method or rejection sampling [@vonNeumann1951] indirectly samples a bounded, compactly supported probability density function, using a proposal PDF that is easy to sample from and majorizes the target PDF after scaling: given $C \ge 1$, $C g(x) \ge f(x)$, sample $\mathbb{x} \sim g$, $\mathbb{u} \sim U[0, 1]$, return $\mathbb{x}$ if $\mathbb{u} < \frac{f(\mathbb{x})}{C g(\mathbb{x})}$. Acceptance probability is the reciprocal of the scaling constant, so the acceptance-rejection method is most efficient if the scaling constant is small. The acceptance-rejection method can be used to efficiently sample a Gamma distribution $\Gamma(\alpha, 1)$ using a transformed Gaussian sample (acceptance probability $1/C > 0.95$).

Methods for Special Distributions

To sample from the standard Gaussian distribution given uniform pseudo-random numbers, several algorithms have been developed that are faster than the inverse transform method. Box–Muller transform [@Box and Muller, 1958; @Paley and Wiener, 1934]: sample $(\mathbf{u}_1, \mathbf{u}_2) \sim U[0,1]^2$, let $\mathbf{r} = \sqrt{-2 \ln \mathbf{u}_1}$ and $\boldsymbol{\theta} = 2 \pi \mathbf{u}_2$, then $(\mathbf{r} \cos \boldsymbol{\theta}, \mathbf{r} \sin \boldsymbol{\theta}) \sim N(0, I_2)$. Marsaglia polar method [@Marsaglia and Bray, 1964] is a variant of the Box–Muller transform that avoids the trigonometric functions but uses rejection sampling (efficiency $1/C = π/4$): sample $(\mathbf{u}_1, \mathbf{u}_2) \sim U(\mathbb{B}^2)$ by the acceptance-rejection method, let $\mathbf{s} = \mathbf{u}_1^2 + \mathbf{u}_2^2$, notice that $\mathbf{r} = \sqrt{-2 \ln \mathbf{s}}$, $\cos \boldsymbol{\theta} = \mathbf{u}_1 / \sqrt{\mathbf{s}}$, and $\sin \boldsymbol{\theta} = \mathbf{u}_2 / \sqrt{\mathbf{s}}$, we have $\sqrt{\frac{-2 \ln \mathbf{s}}{\mathbf{s}}} (\mathbf{u}_1 , \mathbf{u}_2) \sim N(0, I_2)$.

To sample from a discrete distribution, partition the unit interval into intervals of lengths equal to the probabilities, and then take uniform pseudo-random numbers and search for the index of the enclosing interval. Time complexity of search algorithms: linear search, $\mathcal{O}(n)$; binary search, $\mathcal{O}(\log n)$; indexed search (aka cutpoint method). Alias method [@Walker1977] samples from a discrete distribution in constant time $\mathcal{O}(1)$, by looking up pre-computed tables instead of searching, with table generation time $\mathcal{O}(n)$, or $\mathcal{O}(n \log n)$ for optimized probability table: it creates a probability table of length $n$ with values close to 1, and an alias table containing alternative indices; at sampling, first draw a random index and a random fraction, return the index if the fraction is less than the corresponding probability, otherwise return the alternative index.

To sample from a monotonic PDF, ziggurat method (金字形神塔) [@Marsaglia and Tsang, 1984, 2000] covers the PDF with horizontal equal-area rectangles except for a tail of the same area, samples the inner rectangle w.r.t. each covering shape by comparison, samples the remainder of the PDF in each covering rectangle by the rejection method, and samples the tail by some efficient method, e.g. [@Marsaglia1964] for the Gaussian tail. The ziggurat algorithm (32-bit version; non-increasing PDF on the non-negative half-line): find $(x_i)_{i=0}^{255}$ such that $x_0 = 0$ and $(f(x_{i-1}) -f(x_i)) x_i = 2^{-32}$; store $(f_i)_{i=0}^{255}$, where $f_i = f(x_i)$; compute $(k_i)_{i=0}^{255}$, where $k_0 = \lfloor (r f(r) / v) 2^{32} \rfloor$ and $k_i = \lfloor (x_{i-1} / x_i) 2^{32} \rfloor$ otherwise; compute $(w_i)_{i=0}^{255}$, where $w_0 = 2^{-32} v/f(r)$ and $w_i = 2^{-32} x_i$ otherwise; sample $\mathbf{j} \sim U(2^{32})$, let $\mathbf{i} = \mathbf{j} \bmod 2^8$; if $\mathbf{j} < k_\mathbf{i}$, return $\mathbf{x} = \mathbf{j} w_{\mathbf{i}}$; if $\mathbf{i} \ne 0$, sample $\mathbf{u} \sim U[0,1]$, if $\mathbf{u} < \frac{f(\mathbf{x}) - f_i}{f_{i-1} - f_i}$, return $\mathbf{x}$; otherwise, sample from the tail $f(x) \mid \mathbf{x}>x_{255}$; The ziggurat method improves on other methods, e.g. those for sampling from the Gaussian or exponential distribution, because for most cases (~99%) it avoids evalutating expensive math functions e.g. logarithm and square root, and only requires two table lookups, a floating point comparison, and a floating point multiplication.

Mean Field Particle Methods

Mean field particle methods are Monte Carlo methods that use sequential interacting samples, i.e. a sequence of probability distributions each conditional on its predecessor, and every particle interacts with the empirical measures of the process.

quantum Monte Carlo, diffusion Monte Carlo.

Sequential importance sampling (SIS) or dynamic importance sampling @Hammersley and Morton, 1954; @Rosenbluth1955] is importance sampling in a sequential procedure, when it is easy to sample from the proposal density sequentially by the product rule of probability: given $g(x) = g_1(x_1) \prod_{i=2}^n g_i(x_i \mid x_{< i})$, where $x = (x_i)_{i=1}^n$; find auxiliary marginal PDFs $(\tilde f(x_{\le i}))_{i=1}^{n-1}$ and let $\tilde f(x) = f(x)$; write weight $W(x) = \frac{f(x)}{g(x)} = \prod_{i=1}^n u_i(x_{\le i})$, where incremental updating weights $u_1(x_1) = \frac{\tilde f(x_1)}{g_1(x_1)}$ and $\forall i \ne 1$, $u_i = \frac{\tilde f(x_{\le i})}{\tilde f(x_{< i}) g_i(x_i \mid x_{< i})}$; the SIS estimator is $\mathbf{\hat l} = \frac{1}{N} \sum_{k=1}^N H(\mathbf{x}_k) W(\mathbf{x}_k)$. Like importance sampling, the incremental updating weights can also have unknown scaling factors, in which case one should scale the estimate by the total weights, just like the weighted sample estimator.

Sequential Monte Carlo is a class of methods that use sequential sampling for Bayesian inference. Sequential importance resampling (SIR) [@Rubin1987] is sequential importance sampling with stepwise bootstrap resampling, which can avoid degeneracy of the weights but may increase estimator variance. Bootstrap filter or particle filter [@Gordon1993; @Kitagawa1993]. Filtering in engineering and statistics refers to the estimation of the current state of a stochastic dynamical system, often modelled as a Markov process, given a sequence of indirect observation, which is a known function of system state and turns it into a hidden Markov model. Filtering problems include object tracking, missing data imputation, etc.

Markov Chain Monte Carlo methods

**Markov chain Monte Carlo** (MCMC) methods create correlated, approximate sample from an arbitrary distribution, or the posterior distribution of objective Bayesian inference, by simulating a Markov chain whose stationary distribution coincides with the target distribution.

Metropolis-Hastings algorithm [@Hastings1970], requires a good proposal distribution;
Gibbs sampling [@Geman1984], requires the conditional distribution;
Hamiltonian Monte Carlo (HMC; previouly know as hybrid Monte Carlo) [@Duane1987];
Metropolis-adjusted Langevin algorithm (MALA) [@Besag1994];
Wang and Landau algorithm [@WangFG2001], slice sampling [@Neal2003];

Perfect sampling is an MCMC method that samples exactly from a target distribution, e.g. coupling from the past [@Propp and Wilson, 1966]. But perfect sampling is much more computationally intensive than simple MCMC, and is difficult or impossible to use for most continuous simulation systems.

Simulation

Monte Carlo simulation is a type of simulation (a fictitious representation of reality) that uses repeated Monte Carlo samples to assess the performance of a statistical method, e.g. sampling distribution, sensitivity, misspecification.

The typical setup in Monte Carlo simulation is as follows. (We use bold face to denote random entities.) $\mathbf{x}$ is a random vector which we can sample from, whose probability density function is $f$. $H(x)$ is a real-valued function on the outcomes, which we call the "performance function". Expectation of the performance $l = \mathbb{E}H(\mathbf{x})$ cannot be evaluated analytically. We can use simulation to estimate the expectation by sample mean: $\mathbf{\hat l} = N^{-1} \sum_{i=1}^N H(\mathbf{x}_i)$. Variance reduction techniques are methods to obtain more accurate estimators of an expectation using known information about the random model.

Sensitivity analysis (SA) of a system is the evaluation of the variation of performance measures w.r.t. controllable system parameters. It may serve different goals such as variable screening (selecting parameters that optimize the performance measures), inverse problems (estimating model parameters from response), and function approximation. Sensitivity analysis can be broadly categorized into two types: local and global.

Local sensitivity refers to partial derivatives and related scalar quantities: if the response is a scalar, it is simply the partial derivative, where the sign may be removed by taking the absolute value, $s_i(f, x) = \frac{\partial f(x)}{\partial x_i}$; if the response is a vector, the sensitivity can be summarized by a norm, $s_i(f, x) = \lim_{\epsilon \to 0} \frac{|f(x + \epsilon e_i) - f(x)|}{|\epsilon|}$. Local sensitivity can also be generalized to directional derivatives, considering linear combinations of the variables: $s(f, x, \xi) = \lim_{\epsilon \to 0} \frac{|f(x + \epsilon \xi) - f(x)|}{|\epsilon|}$. The maximum local sensitivity of a function at a point, aka absolute condition number, equals the induced norm of the Fréchet derivative: $s(f, x) = |Df(x)| = \lim_{\epsilon \to 0+} \sup_{|\Delta x| \le \epsilon} \frac{|f(x + \Delta x) - f(x)|}{|\Delta x|}$. Computation of local sensitivity requires either an analytic form of the function or numerical approximation.

Global sensitivity refers to any measure that assigns to each variable a value of importance. It is often used to apportion the output variability to variability of the input parameters; one may replace "variability" with "uncertainty" [@Saltelli2004]. Global sensitivity measures need not be summaries of local sensitivity over the input space. Given a linear regression model, the standardised regression coefficient (SRC) normalizes a regression coefficient by the standard deviations of the variable and the response: $\beta_x = b_x \sigma_x / \sigma_y$. First order effect, sensitivity index, importance measure, or correlation ratio is the main effect of a variable on the response, normalized by the total variance of the response: $S_i = \text{Var}(\mathbb{E}(Y | X_i)) / \text{Var}(Y)$. Variance-based global sensitivity measures are derived from decompositions of the response variance. They include Sobol' indices [@Sobol2001] and Fourier analysis sensitivity test (FAST). Density-based (aka moment-independent) indices are derived from probability density functions [@Borgonovo2007]. A global sensitivity measure of a Gaussian process model is the reciprocal of squared lengthscale of an input variable, $\beta_i^{-2}$, which is related to the complexity of approximating functions from the GP [@Loeppky2009].

Variance Reduction Techniques

Stratified sampling partitions the sample space into disjoint regions of know probability (aka strata; singular: stratum) and sample from each. Assume that we can sample the random vector conditional on a categorical random variable with a known probability vector, the stratified sampling estimator is defined as: $\mathbf{\widehat{l^s}} = \sum_{i=1}^m \frac{p_i}{N_i} \sum_{j=1}^{N_i} H(\mathbf{x}_{ij})$, where $\mathbf{x}_{ij} \sim \mathbf{x}|_{y=i}$. The variance of the stratified sampling estimator is: $\text{Var}~\mathbf{\widehat{l^s}} = \sum_{i=1}^m \frac{p_i^2 \sigma_i^2}{N_i}$, where $\sigma_i^2 = \text{Var}(H(\mathbf{x}|_{y=i}))$. Systematic sampling is stratified sampling with equal probability and equal sample size in each stratum: $\mathbf{\widehat{l^s}} = \frac{1}{N} \sum_{i=1}^m \sum_{j=1}^{N/m} H(\mathbf{x}_{ij})$.

Conditional Monte Carlo or Rao-Blackwell-ization translates the expectation into a conditional expectation that can be evaluated analytically, and estimate the expectation of the new function by sample mean: $\mathbf{\widehat{l^c}} = \frac{1}{N} \sum_{i=1}^N \mathbb{E}[H(\mathbf{x})|\mathbf{y}_i]$. The variance of the conditional Monte Carlo estimator is $\text{Var}~\mathbb{E}(H(\mathbf{x})|\mathbf{y})$. Because $\text{Var}~H(\mathbf{x}) = \mathbb{E} \text{Var}(H(\mathbf{x})|\mathbf{y}) + \text{Var}~\mathbb{E}(H(\mathbf{x})|\mathbf{y})$, the reduction in variance is good as long as the residual variance is relatively small.

Importance Sampling

Importance sampling [@Marshall1956] samples from a distribution whose density relative to that of the original random vector is approximately proportional to the absolute value of the performance function, so that the effective performance function is approximately constant, and thus the corresponding estimator has negligible variance. Importance sampling density, proposal density, or instrumental density $g(x)$ is the density used in importance sampling, which should be easy to sample from. Likelihood ratio $W(x)$ is the ratio of the original and the proposal densities: $W(x) = f(x) / g(x)$. Importance sampling estimator or likelihood ratio estimator is an estimator of the expectation, defined by: $\mathbf{\hat l} = \frac{1}{N} \sum_{k=1}^N H(\mathbf{x}_k) W(\mathbf{x}_k)$, where $\mathbf{x}_k$ are distributed as the proposal density. Every importance sampling estimator is unbiased. Optimal importance sampling density is one such that the importance sampling estimator has the smallest variance, which can be shown to be proportional to the original density multiplied by the absolute value of the performance function: $g^∗(x) = \frac{|H(x)| f(x)}{\int |H(x)| f(x) dx}$; if the performance function is non-negative, importance sampling estimator with the optimal proposal density has zero variance. The optimal importance sampling density is inaccessible because it requires the expectation of the absolute performance function; in practice, we estimate the optimal density using a sample and its performance values.

Weighted sample estimator is an alternative to the likelihood ratio estimator in case the likelihood ratio is only known up to a constant: because the likelihood ratio has expectation one, $\mathbb{E}_g W(\mathbf{x}) = 1$, plug-in estimator to the ratio of the target and mean likelihood is consistent to the target, and scaling does not change the ratio; given $w(x) \propto W(x)$, estimate by $\mathbf{\hat l_w} = \frac{\sum_{k=1}^N H(\mathbf{x}_k) w(\mathbf{x}_k)}{\sum_{k=1}^N w(\mathbf{x}_k)}$.

Splitting Method

Splitting method constructs a random object via a sequence of intermediate objects, decomposing an estimation problem into a sequence of problems. The splitting method is particularly suitable for rare-event simulation. To estimate rare-event probability $l = P(\mathbf{x} \in A)$, form a decreasing chain of events $A_1 \supset \dots \supset A_T = A$, then at each stage, sample the random object while initializing on the successful subsample of the previous stage: $\mathbf{x}_1 \sim f$, and $\forall t \in \{t\}_{t=1}^{T-1}$, $\mathbf{x}_{t+1} \sim f \mid \mathbf{x}_t \cap A_t$, the estimator $\mathbf{\hat l} = \frac{|\mathbf{X}_T \cap A|}{\prod_{t=1}^T |\mathbf{X}_t|}$. The splitting method has other applications, e.g. approximate sampling from a distribution, Monte Carlo counting, and randomized optimization.

Numerical Integration

Multivariate integration with irregular boundaries:

Recursive stratified sampling
Importance sampling: the Metropolis algorithm

In numerical integration, quasi-Monte Carlo methods typically have a faster order of convergence than Monte Carlo methods.

Stochastic Optimization

Stochastic programming are optimization problems where the objective function and some of the constraints are unknown and need to be obtained via simulation.

🏷 Category=Computation