Probability Density Estimation

Probability density estimation uses a finite sample to infer its underlying probability density function. It is useful for: visualizing and assessing PDF structure (skewness, modes) [@Silverman1986; @Scott1992]; parametric estimation (fit parametric model to nonparametric density estimate to improve robustness), goodness-of-fit test (parametric fit against smoothed probability estimators) [@Simonoff1996]; classification (nonparametric discriminant analysis) [@Hand1982]; survival analysis [@Watson, Leadbetter, 1964]; Monte Carlo methods e.g. smoothed bootstrap, particle filter [@Doucet2001]; summarizing Bayesian posteriors, conditioning, etc.

Currently, the most popular nonparametric density estimation method is kernel density estimation. Methods not covered in this article include estimation within a parametric family, (average shifted) histogram, frequency polygons, spline, wavelet and Fourier series.

Kernel Density Estimation

Kernel Density Estimation (KDE), aka Parzen–Rosenblatt window [@Rosenblatt1956, @Parzen1962], applies a smoothing kernel $K(x)$ with bandwidth (smoothing parameter) $h$ to each observation, such that $K(x; x_i, h) = K((x - x_i) / h) /h$, and estimate by the average PDF: $$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^n K(x; x_i, h)$$ In other words, KDE estimates random variable $X$ by $\hat X = x_I + h ε$, where $I$ is the discrete uniform random variable on sample index and $ε \sim K(x)$, such that $\hat f = \hat f_X * K_h$, where $\hat f_X$ is the empirical PDF. Because sample variance is consistent to population variance, kernel density estimates are inflated due to smoothing.

The choice of kernel function is not crucial to the accuracy of kernel density estimators. Common kernel functions (window functions) include uniform, triangular, Epanechnikov (parabolic), biweight (quartic), triweight, and Gaussian. The Epanechnikov kernel is optimal in a mean square error sense [@Epanechnikov1969], but the other common kernels have small losses of efficiency [@Wand, Jones, 1995].

Bandwidth selection, or bandwidth matrix selection in multivariate KDE, is very important for the accuracy of kernel density estimators. Estimator accuracy improves when parameterizing over the class of covariance matrices, rather than just diagonal or scalar matrices [@Wand, Jones, 1993]. The typical optimality criterion is mean integrated squared error (MISE), i.e. the expected $L_2$ distance: $$\mathrm{MISE}(H) = \mathbb E_X \int (\hat{f}_H - f)^2 \mathrm{d}x$$. As $n \to \infty$, the MISE can be written as the asymptotic MISE (AMISE), which consists of squared bias and variance, plus a diminishing term: $$\mathrm{MISE}(H) = \mathrm{AMISE}(H) + o((n \sqrt{\det H})^{-1} + \mathrm{tr} H^2) \\ \mathrm{AMISE}(H) = (n \sqrt{\det H})^{-1} R(K) + (\mathrm{vec} H)' Ψ_4 (\mathrm{vec} H) σ_K^4 /4$$ Here, $\mathrm{vec}$ stacks the columns of a matrix into a single vector, $Ψ_4 = \int (\mathrm{vec} ∇^2f) (\mathrm{vec} ∇^2f)' \mathrm{d}x$, and $R(K) = \|K(x)\|_2^2$. A reasonable bandwidth selector should be $\mathcal{O}(n^{-2/(d+4)})$ elementwise, so that the bias and variance terms of its MISE are balanced and converge at rate $\mathcal{O}(n^{−4/(d+4)})$. Note that parametric estimators typically converge in MISE at rate $\mathcal{O}(n^{-1})$.

The bandwidth matrix selector based on AMISE $H_{\mathrm{AMISE}} = \arg\min_H \mathrm{AMISE}(H)$ is not directly usable, because the true density function is unknown. Instead, selectors that minimize estimators of the AMISE are more useful:

Plug-in selector [@Sheather, Jones, 1991; @Wand, Jones, 1994] replaces $Ψ^4$ in AMISE with $\hat Ψ^4(G) = n^{−2} ∑_{i,j} (\mathrm{vec} ∇^2)(\mathrm{vec} ∇^2)' K_G(x_i - x_j)$, where $G$ is a pilot bandwidth matrix.
Smoothed cross validation selector [@Rudemo1982; @Bowman1984; @Hall, Marron, Park, 1992] replaces the second term in AMISE with $n^{−2} ∑_{i,j} (K_{2H+2G} − 2K_{H+2G} + K_{2G})(x_i − x_j)$, The plug-in selector $\hat H_{\mathrm{PI}}$ and the smoothed cross validation selector $\hat H_{\mathrm{SCV}}$ both converge in probability to $H_{\mathrm{AMISE}}$ at at a relative rate of $O_p(n^{−2/(d+6)})$.

If Gaussian KDE is used on a Gaussian random vector, Silverman's rule of thumb [@Silverman1986] gives the MISE-optimal bandwidth: $H = (n (p+2) / 4)^{-1/(p+4)} \mathrm{diag} \{\hat σ_i\}$, where $\hat σ$ is sample standard deviation and $p$ is the dimension of the random vector.

Variable KDE (adaptive KDE) adjusts kernel bandwidth matrix over the domain of estimation, which comes in two types [@Abramson1982; @Terrell, Scott, 1992; @Sain2002]:

Balloon estimators vary bandwidths $H(x; \{x_i\})$ on the location of the estimate, e.g. inversely proportional to the probability density $h(x) = k (n \hat f(x))^{-1/p}$. This reproduces the k-nearest neighbour density estimator when using a uniform kernel. The nearest neighbor estimator over-adapts to the tails in the univariate and bivariate cases, but performs better than the fixed kernel estimator when $p ≥ 5$, at least for Gaussian samples.
Sample smoothing estimators (sample-point estimators) vary bandwidths $H(x_i)$ on the location of the sample points, which have preferable properties.

Kernel bandwidth and covariance can be selected by the leave-one-out maximum likelihood criterion. Some recent work on bandwidth selection for multivariate density estimators: [@Chacon and Duong, 2010, 2012], [@Panaretos and Konis 2012].

Characteristic Function Density Estimator

Since characteristic function (CF) $φ(t) = \mathbb{E}[\exp(itX)]$ is the Fourier transform of the PDF, each estimate of the CF corresponds to an estimate of the PDF by Fourier inversion: $\hat f(x) = \mathcal{F}^{-1} \hat φ(t)$.

If we estimate the CF by sample mean, $\hat φ(t) = \frac{1}{n} \sum_{j=1}^n \exp(itx_j)$, we are effectively estimating the PDF with the empirical PDF $\hat f_X(x) = \frac{1}{n} \sum_i δ(x-x_i)$.

If we prefer a smooth PDF estimate, we may filter high frequency parts of the CF estimator by multiplying a damping function $ψ_h(t) = ψ(ht)$, In fact, $\hat f(x) = \frac{1}{nh} \sum_{i=1}^n K((x - x_i) /h)$, where $K(x) = \mathcal{F}^{-1} ψ(t)$. Thus, kernel density estimators can also be seen as characteristic function density estimators.

Kernel Smoothing

Kernel smoothing estimates a real-valued function as the weighted average of neighboring data, similar to balloon KDE: $\hat y(x) = \sum_i y_i K(x; x_i, h_i) / \sum_i K(x; x_i, h_i)$. Nearest neighbor smoother uses equal weights on a fixed number of neighbors: $K(x) = 1/m$ with kernel radius $h_i = d_{(m)}(x_i, \{x_j\})$, where $d_{(m)}$ denotes distance to the m-th closest element. Kernel average smoother uses a fixed radius kernel.

Nonparametric Regression

Smoothing by locally weighted regression [@Cleveland1979] weighs on the square errors of a regression model, and estimate by the fitted value: estimator $\min_{\beta(x)} \sum_i K(x; x_i, h_i) (y_i - x_i' \beta(x))^2$ gives fitted value $\hat y(x) = \mathbf{x}(X' W X)^{-1} X' W \mathbf{y}$, where $W(x) = \mathrm{diag}\{ K(x; x_i, h_i) \}$.

spline estimators

Mixture Models (Clustering)

$$\hat f(x) = \sum_k w_k K(x; μ_k, Σ_k)$$

https://en.wikipedia.org/wiki/Mixture_model

expectation maximization (EM)

Maximum penalized likelihood estimators

(proposal) A non-parametric density estimator as the likelihood-weighted average of PDFs: $\hat f(x) = \int_{\mathcal L_1} f L(f | X) \mathrm{d}f / \int_{\mathcal L_1} L(f | X) \mathrm{d}f$