Probability density estimation infers, from a finite sample, the underlying probability density function (PDF). It is useful for assessing PDF structure (multimodality, skewness) {Silverman1986; Scott1992}, classification and discriminant analysis {Simonoff1996}, Monte Carlo methods (smoothed bootstrap, particle filters) {Doucet, De Freitas, Gordon, 2001}, summarizing Bayesian posteriors, conditioning, etc.

Currently, the most popular nonparametric density estimation method is kernel density estimation. Methods not covered in this article include (average shifted) histograms, frequency polygons, parametric distributions, splines, wavelets, and Fourier series.

Kernel Density Estimation

Kernel Density Estimation (KDE), aka the Parzen–Rosenblatt window {Rosenblatt1956; Parzen1962}, applies a smoothing kernel $K(x)$ with bandwidth (smoothing parameter) $h$ to each observation, such that \( K(x; x_i, h) = K((x - x_i) / h) / h \), and estimates the PDF by the average: \[ \hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^n K(x; x_i, h) \] In other words, KDE estimates the random variable $X$ by \(\hat X = x_I + h ε\), where $I$ is a discrete uniform random variable over the sample indices, and $ε$ has density $K$.
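As a minimal sketch of the average-of-kernels form (assuming a Gaussian kernel; the sample and evaluation grid below are made up for illustration):

```python
import numpy as np

def kde(x, data, h):
    """Gaussian KDE: average of scaled kernels centered on each observation."""
    u = (x[:, None] - data[None, :]) / h          # (len(x), n) standardized distances
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
    return K.mean(axis=1) / h                     # (1/n) sum_i K((x - x_i)/h) / h

rng = np.random.default_rng(0)
data = rng.normal(size=200)          # sample from N(0, 1)
grid = np.linspace(-4, 4, 81)
f_hat = kde(grid, data, h=0.4)       # density estimate on the grid
```

Since each kernel integrates to one, the estimate is itself a valid density (up to truncation of the grid).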

The choice of kernel function is not crucial to the accuracy of kernel density estimators. Common kernel functions (window functions) include uniform, triangular, Epanechnikov (parabolic), biweight (quartic), triweight, and Gaussian. The Epanechnikov kernel is optimal in a mean square error sense {Epanechnikov1969}, but the other common kernels have small losses of efficiency {Wand, Jones, 1995}.
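The common kernels can be written as plain functions and checked to be proper densities (a sketch; the normalizing constants are the standard ones for support $[-1, 1]$):

```python
import numpy as np

# Common kernels as densities; all vanish outside [-1, 1] except the Gaussian.
def uniform(u):      return np.where(np.abs(u) <= 1, 0.5, 0.0)
def triangular(u):   return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)
def epanechnikov(u): return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
def biweight(u):     return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2)**2, 0.0)
def gaussian(u):     return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

u = np.linspace(-5, 5, 2001)
du = u[1] - u[0]
for K in (uniform, triangular, epanechnikov, biweight, gaussian):
    assert abs(K(u).sum() * du - 1.0) < 5e-3  # each integrates to 1
```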

Bandwidth selection, or bandwidth matrix selection in multivariate KDE, is very important for the accuracy of kernel density estimators. The typical optimality criterion is the mean integrated squared error (MISE), i.e. the expected $L_2$ distance: \(\mathrm{MISE}(H) = \mathbb E_X \int (\hat f_H - f)^2 \,\mathrm{d}x\). As \(n \to \infty\), the MISE can be written as the asymptotic MISE (AMISE), which consists of variance and squared bias terms, plus a diminishing remainder: \[\mathrm{MISE}(H) = \mathrm{AMISE}(H) + o((n \sqrt{\det H})^{-1} + \mathrm{tr}\, H^2) \\ \mathrm{AMISE}(H) = (n \sqrt{\det H})^{-1} R(K) + (\mathrm{vec}\, H)' Ψ_4 (\mathrm{vec}\, H)\, σ_K^4 / 4\] Here, $\mathrm{vec}$ stacks the columns of a matrix into a single vector, \(Ψ_4 = \int (\mathrm{vec}\, ∇^2 f) (\mathrm{vec}\, ∇^2 f)' \,\mathrm{d}x \), and \(R(K) = \int K(x)^2 \,\mathrm{d}x\). A reasonable bandwidth selector should be $O(n^{-2/(d+4)})$ elementwise, so that the bias and variance terms of its MISE are balanced and converge at rate $O(n^{−4/(d+4)})$. Note that parametric estimators typically converge in MISE at rate $O(n^{-1})$.

The bandwidth matrix selector based on AMISE, \(H_{\mathrm{AMISE}} = \arg\min_H \mathrm{AMISE}(H)\), is not directly usable, because the true density function is unknown. Instead, selectors that minimize estimators of the AMISE are more useful. The plug-in selector {Sheather, Jones, 1991; Wand, Jones, 1994} replaces $Ψ_4$ in the AMISE with \(\hat Ψ_4(G) = n^{−2} ∑_{i,j} [(\mathrm{vec}\, ∇^2)(\mathrm{vec}\, ∇^2)' K_G](x_i - x_j)\), where $G$ is a pilot bandwidth matrix. The smoothed cross validation selector {Rudemo1982; Bowman1984; Hall, Marron, Park, 1992} replaces the second term in the AMISE with \(n^{−2} ∑_{i,j} (K_{2H+2G} − 2K_{H+2G} + K_{2G})(x_i − x_j)\). The plug-in selector $\hat H_{\mathrm{PI}}$ and the smoothed cross validation selector $\hat H_{\mathrm{SCV}}$ both converge in probability to $H_{\mathrm{AMISE}}$, at a relative rate of $O_p(n^{−2/(d+6)})$.
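An unsmoothed ancestor of these selectors, least-squares cross validation {Rudemo1982; Bowman1984}, minimizes an unbiased estimate of the integrated squared error and is simple enough to sketch in the univariate case (the Gaussian kernel and the grid search are illustrative assumptions):

```python
import numpy as np

def lscv(h, x):
    """LSCV score: integral of f_hat^2 minus (2/n) sum_i f_hat_{-i}(x_i)."""
    n = len(x)
    d = x[:, None] - x[None, :]
    # Closed form for the integral term: convolving two Gaussian kernels
    # of bandwidth h gives a Gaussian of variance 2h^2.
    term1 = np.exp(-d**2 / (4 * h**2)).sum() / (n**2 * 2 * h * np.sqrt(np.pi))
    # Leave-one-out density at each sample point (drop the diagonal K(0)).
    K = np.exp(-0.5 * (d / h)**2) / np.sqrt(2 * np.pi)
    loo = (K.sum(axis=1) - K[0, 0]) / ((n - 1) * h)
    return term1 - 2 * loo.mean()

rng = np.random.default_rng(1)
x = rng.normal(size=300)
hs = np.linspace(0.05, 1.5, 60)
h_cv = hs[np.argmin([lscv(h, x) for h in hs])]   # selected bandwidth
```

Like the smoothed variant, this criterion needs no pilot bandwidth, but it is known to be more variable than plug-in selection.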

If Gaussian KDE is used on a Gaussian random vector, Silverman's rule of thumb {Silverman1986} gives the AMISE-optimal diagonal bandwidth: \(H = \mathrm{diag}\{\hat σ_i ((p+2)n / 4)^{-1/(p+4)}\}\), where $\hat σ_i$ is the sample standard deviation of the $i$-th coordinate and $p$ is the dimension of the random vector.
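A direct transcription of the rule (assuming the sample is stored as an $n \times p$ array):

```python
import numpy as np

def silverman_bandwidth(X):
    """Rule-of-thumb bandwidths h_i = sigma_i * ((p + 2) n / 4)^(-1/(p+4))."""
    n, p = X.shape
    sigma = X.std(axis=0, ddof=1)                  # per-coordinate sample std
    return sigma * ((p + 2) * n / 4.0) ** (-1.0 / (p + 4))

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))   # bivariate standard normal sample
h = silverman_bandwidth(X)       # one bandwidth per coordinate
```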

Parameterizing over the full class of covariance matrices improves estimator accuracy {Wand, Jones, 1993}.

Variable KDE (adaptive KDE) adjusts the kernel bandwidth matrix over the domain of estimation, and comes in two types {Abramson1982; Terrell, Scott, 1992; Sain2002}:
1. Balloon estimators vary bandwidths with the location of the estimation point, e.g. inversely proportional to the probability density: \(h(x) = k (n \hat f(x))^{-1/p}\). This reproduces the k-nearest neighbour algorithm when using a uniform kernel. The nearest-neighbour estimator over-adapts to the tails in the univariate and bivariate cases, but performs better than the fixed kernel estimator when $p ≥ 5$, at least for Gaussian samples.
2. Sample smoothing estimators vary bandwidths with the location of the sample points.
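A univariate balloon-estimator sketch, with the bandwidth at each evaluation point set by the distance to the $k$-th nearest sample (using a Gaussian kernel here is an assumption; with a uniform kernel this reduces to the kNN estimator):

```python
import numpy as np

def balloon_kde(x, data, k=30):
    """Balloon KDE: bandwidth at each x is its distance to the k-th nearest sample."""
    d = np.abs(x[:, None] - data[None, :])   # (len(x), n) distances
    h = np.sort(d, axis=1)[:, k - 1]         # kNN distance -> local bandwidth h(x)
    u = d / h[:, None]
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h                # bandwidth varies with x, not with x_i

rng = np.random.default_rng(3)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 161)
f_hat = balloon_kde(grid, data)
```

A known caveat of balloon estimators: since the bandwidth depends on the evaluation point, the resulting estimate need not integrate to one.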

Characteristic function density estimator

Since the characteristic function (CF) \(φ(t) = \mathbb{E}[\exp(itX)]\) is the Fourier transform of the PDF, each estimate of the CF corresponds to an estimate of the PDF by Fourier inversion: \(\hat f(x) = \mathcal{F}^{-1} \hat φ(t)\).

If we estimate the CF by sample mean, \(\hat φ(t) = \frac{1}{n} \sum_{j=1}^n \exp(itx_j)\), we are effectively estimating the PDF with the empirical PDF \(\hat f(x) = \frac{1}{n} \sum_i δ(x-x_i)\).

If we prefer a smooth PDF estimate, we may filter out the high-frequency part of the CF estimator by multiplying by a damping function $ψ_h(t) = ψ(ht)$. In fact, this gives \(\hat f(x) = \frac{1}{nh} \sum_{i=1}^n K((x - x_i) / h)\), where \(K(x) = \mathcal{F}^{-1} ψ(t)\). Thus, kernel density estimators can also be seen as characteristic function density estimators.
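This equivalence can be checked numerically: damping the empirical CF with the Gaussian $ψ(t) = e^{-t^2/2}$ and inverting on a quadrature grid reproduces a Gaussian KDE with the same bandwidth (the grid limits below are assumptions chosen so truncation error is negligible):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
h, grid = 0.5, np.linspace(-3, 3, 61)

# Empirical CF, damped by psi(h t) with Gaussian psi(t) = exp(-t^2 / 2).
t = np.linspace(-30, 30, 4001)
dt = t[1] - t[0]
phi_hat = np.exp(1j * t[:, None] * x[None, :]).mean(axis=1)
damped = phi_hat * np.exp(-0.5 * (h * t)**2)

# Fourier inversion: f(x) = (1/2 pi) * integral of exp(-i t x) * damped CF.
f_cf = ((np.exp(-1j * np.outer(grid, t)) * damped).sum(axis=1) * dt).real / (2 * np.pi)

# Direct Gaussian KDE with the same bandwidth.
u = (grid[:, None] - x[None, :]) / h
f_kde = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

assert np.allclose(f_cf, f_kde, atol=1e-6)  # the two estimators coincide
```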

Kernel smoothing and regression

Kernel smoothing estimates a real-valued function as the weighted average of neighboring data: \[\hat y(x) = \sum_i y_i K(x; x_i, h_i) / \sum_i K(x; x_i, h_i) \] The nearest neighbor smoother uses equal weights on a fixed number of neighbors: $K(x) = 1/m$ with kernel radius \(h_i = d_{(m)}(x_i, \{x_j\})\), where $d_{(m)}$ denotes the distance to the $m$-th closest element. The kernel average smoother uses a fixed-radius kernel. In comparison, smoothing by locally weighted regression {Cleveland1979} weights the squared errors of a regression model and estimates by the fitted value: the estimator \(\min_{\beta(x)} \sum_i K(x; x_i, h_i) (y_i - x_i' \beta(x))^2\) gives the fitted value \(\hat y(x) = \mathbf{x}' (X' W X)^{-1} X' W \mathbf{y}\), where \(W(x) = \mathrm{diag}\{ K(x; x_i, h_i) \}\).
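A univariate local linear regression sketch (the Gaussian kernel and fixed bandwidth are simplifying assumptions; Cleveland's LOWESS additionally uses a tricube kernel and robustness iterations):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Fit y ~ b0 + b1 (x - x0) by weighted least squares; return b0 = y_hat(x0)."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # kernel weights K((x - x0)/h)
    X = np.column_stack([np.ones_like(x), x - x0])   # local design matrix
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0]                                   # intercept is the fitted value

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.2, size=200)
y_hat = np.array([local_linear(x0, x, y, h=0.4) for x0 in x])
```

Centering the design at $x_0$ makes the intercept the fitted value directly, which is why `beta[0]` is returned.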

Mixture models (cluster analysis)

https://en.wikipedia.org/wiki/Mixture_model

expectation maximization (EM)