Bootstrap

Asymptotic analysis applies when the sample size is large, and its results are limited to statistics that are analytically tractable. Bootstrap [@Efron1979] is an empirical, nonparametric alternative to asymptotic analysis: it estimates the sampling distribution of a statistic by resampling the original sample with replacement, i.e. by sampling from the empirical distribution function (EDF). As a nonparametric method, bootstrap should be the default inference procedure when we have no knowledge of the population or sampling distribution, or when the sample is too small for asymptotic approximations to be reliable.

Notation: $N$, sample size; $B$, number of bootstrap replications; $\beta_1$, coefficient of interest; $H_0: \beta_1 = \beta_1^0$, null hypothesis; $\hat{\beta}_1$, estimator; $\hat{\beta}_{1b}^*$, bootstrap estimator in replication $b$, which may be different from the standard estimator; $\hat{\beta}^R$, restricted estimator; $\mathbf{u}$, residuals; $s$, estimator of standard error; $w = (\hat{\beta}_1 - \beta_1^0) / s_{\hat{\beta}_1}$, Wald test statistic; $\alpha$, significance level.

Concepts

Resampling refers to any method that creates replicate datasets from available data, so that a given data analysis procedure can be repeated, and the collection of outcomes can be summarized to quantify uncertainty of the original outcome, without any analytical calculation. The data analysis procedure can be estimating a population parameter (jackknife, bootstrap), testing randomness (permutation tests), validating prediction models (cross validation), etc. Sampling or simulation, in comparison, is a process of gathering observations from an idealized population to estimate properties of the population, e.g. Monte Carlo methods.

The plug-in principle estimates a functional of the population distribution by evaluating the same functional at the empirical distribution of a sample: $g\{X\} \dot \sim g\{x_I\}$.
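As a small illustration of the plug-in principle, the sketch below evaluates three functionals at the empirical distribution, which puts mass $1/N$ on each observation. The simulated sample and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)  # hypothetical sample

# Plug-in estimates: the functional is evaluated at the empirical distribution.
plug_in_mean = x.mean()
plug_in_var = np.mean((x - plug_in_mean) ** 2)  # divides by N, not N - 1
plug_in_median = np.quantile(x, 0.5)
```

The plug-in variance divides by $N$, so it is slightly smaller than the usual unbiased sample variance.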

Asymptotic refinement refers to a convergence rate faster than using first-order asymptotic theory. To have an asymptotic refinement, a bootstrap needs to be applied to an asymptotically pivotal statistic, i.e. a statistic whose asymptotic distribution does not contain unknown parameters.

Smoothed bootstrap, or smooth bootstrap, adds random noise to each resampled observation. It is equivalent to sampling from a kernel density estimate of the data. Smoothed bootstrap only has second-order asymptotic refinement over the bootstrap for statistics that are differentiable functions of vector means [@Hall1989], but the improvement can still be substantial in small samples [@Efron1982]. First-order improvements are more likely for statistics of local properties of the PDF, e.g. mode [@Romano1988], quantiles [@Hall1989], and least absolute deviations regression [@Angelis1993].
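A minimal sketch of the smoothed bootstrap for a univariate sample, assuming a Gaussian kernel. The function name `smoothed_bootstrap` and the normal-reference bandwidth rule are choices made for this example, not something the cited papers prescribe.

```python
import numpy as np

def smoothed_bootstrap(x, statistic, B=2000, bandwidth=None, seed=0):
    """Bootstrap replicates of `statistic`, with Gaussian kernel noise added."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption for this sketch).
        bandwidth = 1.06 * x.std(ddof=1) * n ** (-1 / 5)
    reps = np.empty(B)
    for b in range(B):
        resample = rng.choice(x, size=n, replace=True)
        # Resample plus kernel noise = a draw from the kernel density estimate.
        reps[b] = statistic(resample + bandwidth * rng.standard_normal(n))
    return reps

x = np.random.default_rng(1).normal(size=50)   # hypothetical data
median_reps = smoothed_bootstrap(x, np.median)
se_median = median_reps.std(ddof=1)
```

The median is used here because quantiles are among the statistics for which smoothing is expected to help most.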

Random Sample

Bootstrap Sampling Distribution

bootstrap standard error;

bootstrap confidence interval: percentile bootstrap; bias-corrected and accelerated (BCa) bootstrap [@Efron1987];

Bootstrap confidence intervals are asymptotically more accurate than the standard confidence intervals obtained using the sample variance and a normality assumption [@DiCiccio1996].
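The bootstrap standard error and percentile interval can be sketched as follows. The helper `bootstrap_replicates` and the lognormal sample are assumptions for illustration.

```python
import numpy as np

def bootstrap_replicates(x, statistic, B=2000, seed=0):
    """Evaluate `statistic` on B resamples drawn with replacement from x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = x.shape[0]
    idx = rng.integers(0, n, size=(B, n))  # each row indexes one resample
    return np.array([statistic(x[i]) for i in idx])

x = np.random.default_rng(42).lognormal(size=100)  # hypothetical sample
reps = bootstrap_replicates(x, np.mean)

# Bootstrap standard error: standard deviation of the replicates.
boot_se = reps.std(ddof=1)

# Percentile interval: empirical quantiles of the replicates.
alpha = 0.05
ci_low, ci_high = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
```

The BCa interval additionally adjusts these quantiles for bias and skewness. SciPy's `scipy.stats.bootstrap` accepts `method='BCa'` for readers who prefer a library routine.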

Bootstrap in Regression

Parametric bootstrap: simulate from the estimated parametric model; (in principle the sampling distributions can be obtained analytically)
Pairs bootstrap, case bootstrap, or non-parametric bootstrap: regressors and regressand are always resampled together as pairs, only assumes independent observations;
Residual bootstrap: regressand is constructed from resampled sample residuals, assumes IID residuals, no assumption on residual distribution;
Wild bootstrap: regressand is constructed by flipping the sign of each sample residual with equal probability (Rademacher weights), applicable to heteroskedastic models [@WuCF1986];
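The three nonparametric schemes above differ only in how a bootstrap dataset is built, which the following sketch makes explicit. The helper names `ols` and `resample_regression` and the simulated heteroskedastic data are assumptions for illustration.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients by least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def resample_regression(X, y, scheme, rng):
    """Return one bootstrap dataset (X*, y*) under the given scheme."""
    n = len(y)
    fitted = X @ ols(X, y)
    resid = y - fitted
    if scheme == "pairs":      # resample rows (x_i, y_i) jointly
        idx = rng.integers(0, n, size=n)
        return X[idx], y[idx]
    if scheme == "residual":   # keep X fixed, resample residuals
        return X, fitted + rng.choice(resid, size=n, replace=True)
    if scheme == "wild":       # keep X fixed, flip residual signs
        return X, fitted + resid * rng.choice([-1.0, 1.0], size=n)
    raise ValueError(f"unknown scheme: {scheme}")

rng = np.random.default_rng(0)
n, B = 200, 999
x1 = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x1])
y = 1.0 + 0.5 * x1 + rng.normal(scale=0.5 + x1, size=n)  # heteroskedastic

slopes = {
    s: np.array([ols(*resample_regression(X, y, s, rng))[1] for _ in range(B)])
    for s in ("pairs", "residual", "wild")
}
boot_se = {s: v.std(ddof=1) for s, v in slopes.items()}
```

With heteroskedastic errors as simulated here, the residual bootstrap's IID assumption is violated, so its standard error need not agree with the pairs and wild versions.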

Pairs bootstrap is often acceptable if the data set is fairly large. But in regression problems, the explanatory variables are often fixed (i.e. no error), or at least observed with more control than the response variable. The range of the explanatory variables is also informative. Therefore, each pairs bootstrap resample loses some information.

Residual and wild bootstraps can impose the null hypothesis in resampling [@Davidson1999], where the bootstrap Wald statistics are centered on $\beta_1^0$ and the residuals bootstrapped are those from the restricted OLS estimator that imposes $H_0$.
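Written out for the wild bootstrap with the null imposed, replication $b$ takes the following form. The symbols $x_i$ (regressor vector of observation $i$), $\hat{u}_i^R$ (restricted residual) and $v_{ib}$ (Rademacher weight) are introduced here for illustration.

$$y_{ib}^* = x_i' \hat{\beta}^R + \hat{u}_i^R v_{ib}, \quad v_{ib} = \pm 1 \text{ with probability } 1/2 \text{ each}, \qquad w_b^* = \frac{\hat{\beta}_{1b}^* - \beta_1^0}{s_{\hat{\beta}_{1b}^*}}$$

Without the null imposed, the bootstrap sample is built from the unrestricted fit and residuals, and the statistic is centered on the sample estimate instead: $w_b^* = (\hat{\beta}_{1b}^* - \hat{\beta}_1) / s_{\hat{\beta}_{1b}^*}$.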

Bootstrap in Hypothesis Testing

Bootstrap-t (percentile-t) [@Efron1981]: compute the Wald statistic with the OLS standard error in the sample and in each resample, reject using quantiles of the bootstrap distribution of the statistic;
Bootstrap-se (standard error): use the bootstrap estimate of the standard error $\hat{\sigma}_{\hat{\beta}_1} = s_{\hat{\beta}_{1B}^*}$ in the Wald statistic, reject using standard normal critical values;

Bootstrap-t procedures provide asymptotic refinement, while bootstrap-se procedures do not.
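The two procedures can be contrasted in a short sketch based on the pairs bootstrap, without imposing the null, so the bootstrap Wald statistics are centered on $\hat{\beta}_1$. The helper `ols_slope_and_se`, the conventional homoskedastic standard error, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ols_slope_and_se(X, y):
    """OLS estimate and conventional standard error for column 1 of X."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    return beta[1], np.sqrt(sigma2 * XtX_inv[1, 1])

rng = np.random.default_rng(0)
n, B, alpha, beta1_null = 40, 999, 0.05, 0.0
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = 1.0 + 0.0 * x1 + rng.standard_t(df=5, size=n)  # H0 is true here

b1, s1 = ols_slope_and_se(X, y)
w = (b1 - beta1_null) / s1

b1_star = np.empty(B)
w_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)               # pairs resample
    b1_star[b], s_star = ols_slope_and_se(X[idx], y[idx])
    w_star[b] = (b1_star[b] - b1) / s_star         # centered on b1

# Bootstrap-t: compare w with quantiles of the bootstrap Wald statistics.
lo, hi = np.quantile(w_star, [alpha / 2, 1 - alpha / 2])
reject_t = (w < lo) or (w > hi)

# Bootstrap-se: bootstrap standard error, standard normal critical value.
w_se = (b1 - beta1_null) / b1_star.std(ddof=1)
reject_se = abs(w_se) > 1.959964
```

Only the bootstrap-t branch bootstraps an asymptotically pivotal statistic, which is why it is the one that can deliver asymptotic refinement.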

Clustered Data

A sample may contain clusters of observational units such that regression errors of the observations are independent across clusters but correlated within. Such correlation effectively reduces the sample size toward the number of clusters in statistical inference, where errors are assumed to be independent across observations. See [@Cameron2015] for a good review on inference with clustered data.

Number of clusters in sample, $G$; number of observations in cluster $g$, $N_g$; subsample of cluster $g$, $(y_g, X_g)$; covariance matrix of regression errors within cluster $g$, $\Sigma_g$; individual $i$ in cluster $g$ has subscript $ig$.

Covariance matrix of the OLS estimator on clustered data is: $$\text{Var}(\hat{\boldsymbol{\beta}} \mid \textbf{X}) = (X' X)^{-1} \left( \sum_{g = 1}^G X_g' \Sigma_g X_g \right) (X' X)^{-1}$$

The cluster-robust variance estimator (CRVE), $\widehat{\text{Var}}_\text{CR}(\hat{\boldsymbol{\beta}})$, replaces $\Sigma_g$ with the sample estimate $\tilde{u}_g \tilde{u}_g'$, where $\tilde{u}_g$ is the vector of corrected residuals in cluster $g$; the standard CRVE simply uses the OLS residuals $u_g$. [@Bell2002] proposed the correction $\tilde{u}_g = \sqrt{(G-1)/G} \, (I_{N_g} - H_{gg})^{-1} u_g$ with $H_{gg} = X_g (X' X)^{-1} X_g'$, which generalizes the HC3 measure in [@MacKinnon1985] and is equivalent to the jackknife estimator of $\text{Var}(\hat{\boldsymbol{\beta}} \mid \textbf{X})$. [@Cameron2008] referred to this correction as CR3.
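A minimal sketch of the standard CRVE (uncorrected OLS residuals), using the sandwich form with $\Sigma_g$ replaced by the outer product of the cluster residual vector. The function name `cluster_robust_vcov` and the simulated data are assumptions for illustration.

```python
import numpy as np

def cluster_robust_vcov(X, y, cluster):
    """OLS coefficients and the standard CRVE (no residual correction)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(cluster):
        mask = cluster == g
        score_g = X[mask].T @ resid[mask]    # X_g' u_g, a k-vector
        meat += np.outer(score_g, score_g)   # X_g' u_g u_g' X_g
    return beta, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(0)
G, n_g = 30, 10
cluster = np.repeat(np.arange(G), n_g)
x1 = rng.normal(size=G * n_g) + rng.normal(size=G)[cluster]
X = np.column_stack([np.ones(G * n_g), x1])
u = rng.normal(size=G)[cluster] + rng.normal(size=G * n_g)  # within-cluster correlation
y = 1.0 + 0.5 * x1 + u

beta_hat, vcov_cr = cluster_robust_vcov(X, y, cluster)
se_cr = np.sqrt(np.diag(vcov_cr))
```

Treating every observation as its own cluster reduces this estimator to the heteroskedasticity-robust HC0 form.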

Resampling methods:

Pairs cluster bootstrap
Residual cluster bootstrap
Wild cluster bootstrap

Residual cluster bootstrap requires balanced clusters.
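A sketch of the wild cluster bootstrap as a bootstrap-t test with the null imposed: one Rademacher weight is drawn per cluster and applied to all restricted residuals in that cluster. The function names, the requirement that cluster ids are integers `0..G-1`, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ols_cluster(X, y, cluster_ids):
    """OLS coefficients, residuals, and standard CRVE standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    G = cluster_ids.max() + 1                  # ids assumed to be 0..G-1
    scores = np.zeros((G, X.shape[1]))
    np.add.at(scores, cluster_ids, X * resid[:, None])  # rows X_g' u_g
    vcov = XtX_inv @ (scores.T @ scores) @ XtX_inv
    return beta, resid, np.sqrt(np.diag(vcov))

def wild_cluster_bootstrap_p(X, y, cluster_ids, j=1, beta_null=0.0, B=999, seed=0):
    """Two-sided bootstrap-t p-value for H0: beta_j = beta_null."""
    rng = np.random.default_rng(seed)
    G = cluster_ids.max() + 1
    beta, _, se = ols_cluster(X, y, cluster_ids)
    w = (beta[j] - beta_null) / se[j]
    # Restricted fit: fix beta_j at beta_null and estimate the rest.
    X_r = np.delete(X, j, axis=1)
    beta_r = np.linalg.lstsq(X_r, y - beta_null * X[:, j], rcond=None)[0]
    fitted_r = X_r @ beta_r + beta_null * X[:, j]
    resid_r = y - fitted_r
    w_star = np.empty(B)
    for b in range(B):
        v = rng.choice([-1.0, 1.0], size=G)[cluster_ids]  # one sign per cluster
        beta_b, _, se_b = ols_cluster(X, fitted_r + resid_r * v, cluster_ids)
        w_star[b] = (beta_b[j] - beta_null) / se_b[j]
    return np.mean(np.abs(w_star) >= abs(w))

rng = np.random.default_rng(0)
G, n_g = 12, 10
cluster_ids = np.repeat(np.arange(G), n_g)
x1 = rng.normal(size=G * n_g)
X = np.column_stack([np.ones(G * n_g), x1])
y = 1.0 + 2.0 * x1 + rng.normal(size=G)[cluster_ids] + rng.normal(size=G * n_g)
p_value = wild_cluster_bootstrap_p(X, y, cluster_ids)
```

The pairs cluster bootstrap would instead resample whole clusters $(y_g, X_g)$ with replacement. Because weights are per cluster, the wild version keeps the within-cluster dependence of the residuals intact.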

Bag of Little Bootstraps

Bootstrap becomes computationally prohibitive in settings such as multiple testing on genomes and big data. Bag of little bootstraps (BLB) [@Kleiner2014] extends the idea of bootstrap to large data sets.
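A rough sketch of the BLB idea for a standard error: draw several small subsets of size $b = N^\gamma$, emulate size-$N$ resamples of each subset through multinomial counts, and average the per-subset estimates. The function names, the default tuning values, and the weighted-statistic interface are assumptions for illustration.

```python
import numpy as np

def blb_standard_error(x, statistic, n_subsets=10, gamma=0.7,
                       n_resamples=100, seed=0):
    """BLB estimate of the standard error of a weighted statistic.

    `statistic(values, weights)` must accept integer counts as weights.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    b = int(n ** gamma)                       # subset size, much smaller than n
    subset_se = np.empty(n_subsets)
    for s in range(n_subsets):
        subset = rng.choice(x, size=b, replace=False)
        reps = np.empty(n_resamples)
        for r in range(n_resamples):
            # A size-n resample of the subset, stored as b multinomial counts.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            reps[r] = statistic(subset, counts)
        subset_se[s] = reps.std(ddof=1)
    return subset_se.mean()                   # average across subsets

def weighted_mean(values, weights):
    return np.average(values, weights=weights)

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=20_000)
se_blb = blb_standard_error(x, weighted_mean)
```

Each inner resample touches only $b$ distinct points, which is the source of the computational saving over a bootstrap on all $N$ observations.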

References

An introduction to the bootstrap. [@Efron1993]

Bootstrap Methods and their Application. [@Davison1997]

🏷 Category=Statistics