
Probability density estimation uses a finite sample to infer the underlying probability density function (PDF). It is useful for: visualizing and assessing PDF structure (skewness, modes) [@Silverman1986; @Scott1992]; parametric estimation (fitting a parametric model to a nonparametric density estimate to improve robustness) and goodness-of-fit testing (comparing a parametric fit against a smoothed probability estimator) [@Simonoff1996]; classification (nonparametric discriminant analysis) [@Hand1982]; survival analysis [@WatsonLeadbetter1964]; Monte Carlo methods, e.g. the smoothed bootstrap and particle filters [@Doucet2001]; summarizing Bayesian posteriors, conditioning, etc.

Currently, the most popular nonparametric density estimation method is kernel density estimation. Methods not covered in this article include estimation within a parametric family, (average shifted) histograms, frequency polygons, splines, wavelets, and Fourier series.

Kernel Density Estimation

Kernel Density Estimation (KDE), a.k.a. the Parzen–Rosenblatt window [@Rosenblatt1956; @Parzen1962], applies a smoothing kernel $K(x)$ with bandwidth (smoothing parameter) $h$ to each observation, such that $K(x; x_i, h) = K\big((x - x_i)/h\big)/h$, and estimates the PDF by the average: $$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^n K(x; x_i, h)$$ In other words, KDE estimates the random variable $X$ by $\hat{X} = x_I + h\varepsilon$, where $I$ is the discrete uniform random variable on the sample indices and $\varepsilon \sim K(x)$, so that $\hat{f} = \hat{f}_X * K_h$, where $\hat{f}_X$ is the empirical PDF. Because the sample variance is already consistent for the population variance, kernel density estimates have inflated variance due to smoothing: $\operatorname{Var}(\hat{X})$ exceeds the sample variance by $h^2 \operatorname{Var}(\varepsilon)$.
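
Below is a minimal sketch of univariate KDE with a Gaussian kernel, evaluating $\hat{f}_h$ on a grid; the function name `kde` and the bandwidth value are illustrative choices, not from any particular library.

```python
import numpy as np

def kde(x_grid, sample, h):
    """Evaluate f_h(x) = (1/n) sum_i K(x; x_i, h) with a Gaussian kernel."""
    # Scaled kernel K(x; x_i, h) = K((x - x_i)/h) / h, K = standard normal PDF.
    u = (x_grid[:, None] - sample[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
x_grid = np.linspace(-4, 4, 101)
f_hat = kde(x_grid, sample, h=0.4)
print(np.trapz(f_hat, x_grid))  # should be close to 1
```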

The choice of kernel function is not crucial to the accuracy of kernel density estimators. Common kernel functions (window functions) include the uniform, triangular, Epanechnikov (parabolic), biweight (quartic), triweight, and Gaussian kernels. The Epanechnikov kernel is optimal in a mean square error sense [@Epanechnikov1969], but the other common kernels lose little efficiency [@WandJones1995].
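
The common kernels above are easy to write down directly; a small sketch that also checks each integrates to one (the dictionary layout and grid are illustration choices):

```python
import numpy as np

kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":   lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
    "biweight":     lambda u: (15 / 16) * (1 - u**2)**2 * (np.abs(u) <= 1),
    "triweight":    lambda u: (35 / 32) * (1 - u**2)**3 * (np.abs(u) <= 1),
    "gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
}

u = np.linspace(-5, 5, 2001)
for name, K in kernels.items():
    print(name, np.trapz(K(u), u))  # each kernel integrates to ~1
```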

Bandwidth selection, or bandwidth matrix selection in multivariate KDE, is very important for the accuracy of kernel density estimators. Estimator accuracy improves when parameterizing over the class of full covariance matrices, rather than just diagonal or scalar matrices [@WandJones1993]. The typical optimality criterion is the mean integrated squared error (MISE), i.e. the expected $L^2$ distance: $\operatorname{MISE}(H) = \mathbb{E}_X \int (\hat{f}_H - f)^2 \, dx$. As $n \to \infty$, the MISE can be written as the asymptotic MISE (AMISE), which consists of squared bias and variance, plus a diminishing term: $$\operatorname{MISE}(H) = \operatorname{AMISE}(H) + o\big((n \det H)^{-1} + \operatorname{tr} H^2\big)$$ $$\operatorname{AMISE}(H) = (n \det H)^{-1} R(K) + (\operatorname{vec} H)^\top \Psi_4 (\operatorname{vec} H)\, \sigma_K^4 / 4$$ Here, $\operatorname{vec}$ stacks the columns of a matrix into a single vector, $\Psi_4 = \int (\operatorname{vec} \nabla^2 f)(\operatorname{vec} \nabla^2 f)^\top \, dx$, and $R(K) = \|K(x)\|_2^2$. A reasonable bandwidth selector should be $O(n^{-2/(d+4)})$ elementwise, so that the bias and variance terms of its MISE are balanced and converge at rate $O(n^{-4/(d+4)})$. Note that parametric estimators typically converge in MISE at rate $O(n^{-1})$.

The bandwidth matrix selector based on the AMISE, $H_{\operatorname{AMISE}} = \arg\min_H \operatorname{AMISE}(H)$, is not directly usable, because the true density function is unknown. Instead, selectors that minimize estimators of the AMISE are more useful:

  1. The plug-in selector [@SheatherJones1991; @WandJones1994] replaces $\Psi_4$ in the AMISE with $\hat{\Psi}_4(G) = n^{-2} \sum_{i,j} (\operatorname{vec} \nabla^2)(\operatorname{vec} \nabla^2)^\top K_G(x_i - x_j)$, where $G$ is a pilot bandwidth matrix.
  2. The smoothed cross validation selector [@Rudemo1982; @Bowman1984; @HallMarronPark1992] replaces the second term in the AMISE with $n^{-2} \sum_{i,j} (K_{2H+2G} - 2K_{H+2G} + K_{2G})(x_i - x_j)$. The plug-in selector $\hat{H}_{\operatorname{PI}}$ and the smoothed cross validation selector $\hat{H}_{\operatorname{SCV}}$ both converge in probability to $H_{\operatorname{AMISE}}$ at a relative rate of $O_p(n^{-2/(d+6)})$.

If a Gaussian KDE is used on a Gaussian random vector, Silverman's rule of thumb [@Silverman1986] gives the MISE-optimal bandwidth: $H = \big(n(p+2)/4\big)^{-1/(p+4)} \operatorname{diag}\{\hat{\sigma}_i\}$, where $\hat{\sigma}_i$ is the sample standard deviation of the $i$-th coordinate and $p$ is the dimension of the random vector.
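
A minimal sketch of this rule of thumb, assuming a data matrix with one row per observation; the function name `silverman_bandwidth` is illustrative:

```python
import numpy as np

def silverman_bandwidth(sample):
    """H = (n (p + 2) / 4)^(-1/(p+4)) * diag{sigma_i}."""
    n, p = sample.shape
    sigma = sample.std(axis=0, ddof=1)   # per-coordinate sample standard deviation
    factor = (n * (p + 2) / 4.0) ** (-1.0 / (p + 4))
    return np.diag(factor * sigma)

rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 2]], size=500)
print(silverman_bandwidth(sample))
```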

Variable KDE (adaptive KDE) adjusts the kernel bandwidth matrix over the domain of estimation, and comes in two types [@Abramson1982; @TerrellScott1992; @Sain2002]:

  1. Balloon estimators vary the bandwidth $H(x; \{x_i\})$ with the location of the estimate, e.g. inversely proportional to the probability density: $h(x) = k\,\big(n \hat{f}(x)\big)^{-1/p}$. This reproduces the $k$-nearest neighbour density estimator when using a uniform kernel (see the sketch after this list). The nearest neighbour estimator over-adapts to the tails in the univariate and bivariate cases, but performs better than the fixed kernel estimator when $p \geq 5$, at least for Gaussian samples.
  2. Sample smoothing estimators (sample-point estimators) vary the bandwidth $H(x_i)$ with the location of the sample points, which have preferable properties.
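
A sketch of the balloon estimator with a uniform kernel, i.e. the $k$-nearest neighbour density estimator $\hat{f}(x) = k / \big(n V_p\, r_k(x)^p\big)$, where $r_k(x)$ is the distance from $x$ to its $k$-th nearest sample point and $V_p$ is the volume of the unit $p$-ball; the helper name and the choice $k = 50$ are illustrative:

```python
import numpy as np
from scipy.special import gamma
from scipy.spatial import cKDTree

def knn_density(x_grid, sample, k):
    n, p = sample.shape
    tree = cKDTree(sample)
    r, _ = tree.query(x_grid, k=k)          # distances to the k nearest neighbours
    r_k = r[:, -1] if k > 1 else r           # keep the k-th distance
    unit_ball_volume = np.pi ** (p / 2) / gamma(p / 2 + 1)
    return k / (n * unit_ball_volume * r_k ** p)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 2))
x_grid = np.array([[0.0, 0.0], [2.0, 2.0]])
print(knn_density(x_grid, sample, k=50))
```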

Kernel bandwidth and covariance can be selected by the leave-one-out maximum likelihood criterion (see the sketch below). Some recent work on bandwidth selection for multivariate density estimators: [@ChaconDuong2010; @ChaconDuong2012; @PanaretosKonis2012].
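
A sketch of leave-one-out likelihood bandwidth selection, here using scikit-learn's `KernelDensity` (whose `score` is the held-out log-likelihood) with `LeaveOneOut` cross-validation; the candidate bandwidth grid is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
sample = rng.normal(size=(100, 1))

grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.linspace(0.1, 1.0, 19)},
    cv=LeaveOneOut(),   # score each bandwidth by the log-likelihood of the left-out point
)
grid.fit(sample)
print(grid.best_params_["bandwidth"])
```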

Characteristic Function Density Estimator

Since the characteristic function (CF) $\varphi(t) = \mathbb{E}[\exp(itX)]$ is the Fourier transform of the PDF, each estimate of the CF corresponds to an estimate of the PDF by Fourier inversion: $\hat{f}(x) = \mathcal{F}^{-1}\hat{\varphi}(t)$.

If we estimate the CF by the sample mean, $\hat{\varphi}(t) = \frac{1}{n} \sum_{j=1}^n \exp(i t x_j)$, we are effectively estimating the PDF with the empirical PDF $\hat{f}_X(x) = \frac{1}{n} \sum_i \delta(x - x_i)$.

If we prefer a smooth PDF estimate, we may filter out the high frequency parts of the CF estimator by multiplying by a damping function $\psi_h(t) = \psi(ht)$. In fact, $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\big((x - x_i)/h\big)$, where $K(x) = \mathcal{F}^{-1}\psi(t)$. Thus, kernel density estimators can also be seen as characteristic function density estimators.
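
A numerical sketch of this equivalence: estimate the CF by a sample mean, damp it with $\psi(ht) = \exp(-(ht)^2/2)$ (the Fourier transform of the Gaussian kernel), and invert numerically; the result should closely match a Gaussian KDE with the same bandwidth. The grids below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=300)
h = 0.4

t = np.linspace(-50, 50, 4001)                             # frequency grid
phi_hat = np.exp(1j * t[:, None] * sample).mean(axis=1)    # empirical CF
phi_smooth = phi_hat * np.exp(-0.5 * (h * t) ** 2)         # damped CF

x = np.linspace(-4, 4, 201)
# Fourier inversion: f(x) = (1/2pi) int exp(-i t x) phi(t) dt
f_cf = np.trapz(np.exp(-1j * t[:, None] * x) * phi_smooth[:, None], t, axis=0).real / (2 * np.pi)

# Direct Gaussian KDE for comparison
u = (x[:, None] - sample[None, :]) / h
f_kde = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).mean(axis=1) / h
print(np.max(np.abs(f_cf - f_kde)))   # should be tiny
```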

Kernel Smoothing

Kernel smoothing estimates a real-valued function as the weighted average of neighboring data, similar to balloon KDE: $\hat{y}(x) = \sum_i y_i K(x; x_i, h_i) \big/ \sum_i K(x; x_i, h_i)$. The nearest neighbor smoother uses equal weights on a fixed number of neighbors: $K(x) = 1/m$ with kernel radius $h_i = d_{(m)}(x_i, \{x_j\})$, where $d_{(m)}$ denotes the distance to the $m$-th closest element. The kernel average smoother uses a fixed-radius kernel.
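
A minimal sketch of a kernel average smoother in the weighted-average form above, assuming a Gaussian kernel with a fixed radius $h$; the names and test data are illustrative:

```python
import numpy as np

def kernel_smoother(x_grid, x, y, h):
    """y_hat(x) = sum_i y_i K(x; x_i, h) / sum_i K(x; x_i, h)."""
    w = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
x_grid = np.linspace(0, 2 * np.pi, 100)
y_hat = kernel_smoother(x_grid, x, y, h=0.3)
```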

Nonparametric Regression

Smoothing by locally weighted regression [@Cleveland1979] weights the squared errors of a regression model and estimates by the fitted value: the estimator $\min_{\beta(x)} \sum_i K(x; x_i, h_i)\big(y_i - x_i^\top \beta(x)\big)^2$ gives the fitted value $\hat{y}(x) = x^\top (X^\top W X)^{-1} X^\top W y$, where $W(x) = \operatorname{diag}\{K(x; x_i, h_i)\}$.
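
A sketch of a degree-one locally weighted fit at a single point $x_0$, solving the weighted least squares problem above with Gaussian weights (LOWESS proper uses a tricube weight and nearest-neighbour bandwidths); names and data are illustrative:

```python
import numpy as np

def local_linear_fit(x0, x, y, h):
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)         # kernel weights K(x0; x_i, h)
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x0]) @ beta              # fitted value at x0

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
print(local_linear_fit(np.pi / 2, x, y, h=0.3))    # close to sin(pi/2) = 1
```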

spline estimators

Mixture Models (Clustering)

$\hat{f}(x) = \sum_k w_k K(x; \mu_k, \Sigma_k)$

https://en.wikipedia.org/wiki/Mixture_model

expectation maximization (EM)
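
A sketch of mixture-model density estimation, fitting a two-component Gaussian mixture by EM via scikit-learn's `GaussianMixture`; the component count and the simulated data are illustration choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 1, size=(300, 1)),
                         rng.normal(3, 0.5, size=(200, 1))])

gmm = GaussianMixture(n_components=2).fit(sample)   # EM fit
x = np.linspace(-6, 6, 200).reshape(-1, 1)
f_hat = np.exp(gmm.score_samples(x))                # f(x) = sum_k w_k N(x; mu_k, Sigma_k)
print(gmm.weights_, gmm.means_.ravel())
```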

Maximum penalized likelihood estimators

(proposal) A non-parametric density estimator as the likelihood-weighted average of PDFs: $\hat{f}(x) = \int_{L^1} f \, L(f \mid X) \, df \Big/ \int_{L^1} L(f \mid X) \, df$