Maximum Likelihood

Likelihood Function

Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes.

In statistical inference, where a random sample has been realized with certain sample values, likelihood refers to the hypothetical probability that the random sample yields such observations, under some probabilistic model of the population.

A likelihood function is the probability (density) of a sample realization, viewed as a function of the parameter of an assumed probability density. For an i.i.d. sample: $$L(\theta) = P \{\mathbf{X}=\mathbf{x}|X \sim f(x;\theta) \} = \prod_{i=1}^n f(x_i;\theta)$$
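The product formula above can be evaluated directly. The following is a minimal sketch, assuming a normal model with known unit variance; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import math

def normal_log_pdf(x, theta, sigma=1.0):
    """Log-density of N(theta, sigma^2) evaluated at x."""
    z = (x - theta) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def likelihood(theta, sample, sigma=1.0):
    """L(theta) = prod_i f(x_i; theta) for an i.i.d. sample."""
    return math.prod(math.exp(normal_log_pdf(x, theta, sigma)) for x in sample)

def log_likelihood(theta, sample, sigma=1.0):
    """log L(theta) = sum_i log f(x_i; theta)."""
    return sum(normal_log_pdf(x, theta, sigma) for x in sample)

sample = [1.2, 0.7, 1.9, 1.1]  # hypothetical observations
for theta in (0.0, 1.0, 2.0):
    print(theta, likelihood(theta, sample), log_likelihood(theta, sample))
```

For larger samples the product underflows quickly, which is one practical reason to work with the sum of log-densities instead.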

Maximum Likelihood

Maximum likelihood is the procedure of searching through the parameter space for a probabilistic model that maximizes the likelihood of current observations.

The maximum likelihood estimator (MLE) of a model parameter is the maximizer of the likelihood function: $$\hat{\boldsymbol{\theta}}(\mathbf{X}) = \arg \sup_{\boldsymbol{\theta}} L( \boldsymbol{\theta} | \mathbf{X} )$$

The MLE is an optimization-based estimator: its objective function is the likelihood of the current observations as a function of the model parameters, and the feasible domain is the parameter space. From another perspective, the feasible domain is a space of parametric probabilistic models, and the objective is to search through that space for the model that best explains the current observations.

Difficulties of optimization:

finding and verifying the global maximum;
numerical sensitivity of the maximum.
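As an illustration of the MLE as an optimization problem, the sketch below maximizes a log-likelihood numerically and compares the result with a known closed form. It assumes an exponential model with unknown rate, whose log-likelihood is concave, so a golden-section search suffices; the helper names and the data are hypothetical. For likelihoods with several local maxima this kind of search gives no guarantee of finding the global maximum, which is the first difficulty listed above.

```python
import math

def exp_log_likelihood(rate, sample):
    """Log-likelihood of an i.i.d. Exponential(rate) sample."""
    return len(sample) * math.log(rate) - rate * sum(sample)

def maximize_unimodal(objective, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if objective(c) > objective(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0

sample = [0.8, 1.5, 0.3, 2.1, 0.9]  # hypothetical observations
numeric_mle = maximize_unimodal(lambda r: exp_log_likelihood(r, sample), 1e-6, 10.0)
closed_form_mle = len(sample) / sum(sample)  # 1 / sample mean
print(numeric_mle, closed_form_mle)
```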

Properties of MLEs

Invariance property

The induced likelihood function of a parameter that depends on the model parameter is the maximum likelihood of the observations within the family of models that share a given value of the induced parameter. Symbolically, given induced parameter $\eta = g(\boldsymbol{\theta})$, the induced likelihood function of $\eta$ is: $$L^{*}(\eta|\mathbf{x}) = \sup_{ \boldsymbol{\theta}: g(\boldsymbol{\theta}) = \eta } L(\boldsymbol{\theta}|\mathbf{x})$$

The MLE of the induced parameter is the value that maximizes the induced likelihood function: $$\hat{\eta}(\mathbf{X}) = \arg \sup_{\eta} L^{*}( \eta | \mathbf{X} )$$

Theorem: (Invariance property of MLEs) The MLE of any induced parameter equals the induced value of the MLE of the model parameter. $$\widehat{g(\boldsymbol{\theta})} = g(\hat{\boldsymbol{\theta}}), \forall g$$

Proof
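A sketch of the standard textbook argument, using only the definitions above (this may not be the proof the note originally intended). Let $\hat{\boldsymbol{\theta}}$ maximize $L(\boldsymbol{\theta}|\mathbf{x})$. Maximizing first within each level set of $g$ and then over the level sets is the same as maximizing over the whole parameter space: $$\sup_{\eta} L^{*}(\eta|\mathbf{x}) = \sup_{\eta} \sup_{ \boldsymbol{\theta}: g(\boldsymbol{\theta}) = \eta } L(\boldsymbol{\theta}|\mathbf{x}) = \sup_{\boldsymbol{\theta}} L(\boldsymbol{\theta}|\mathbf{x}) = L(\hat{\boldsymbol{\theta}}|\mathbf{x})$$

Since $\hat{\boldsymbol{\theta}}$ itself belongs to the level set $\{ \boldsymbol{\theta}: g(\boldsymbol{\theta}) = g(\hat{\boldsymbol{\theta}}) \}$ and attains the global supremum, $$L(\hat{\boldsymbol{\theta}}|\mathbf{x}) = \sup_{ \boldsymbol{\theta}: g(\boldsymbol{\theta}) = g(\hat{\boldsymbol{\theta}}) } L(\boldsymbol{\theta}|\mathbf{x}) = L^{*}( g(\hat{\boldsymbol{\theta}}) |\mathbf{x})$$

Hence $L^{*}$ attains its supremum at $\eta = g(\hat{\boldsymbol{\theta}})$, which is the claim.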

Consistency

MLEs are consistent under mild regularity conditions.

Theorem: The MLE of an induced parameter is consistent, pointwise in the parameter space, if the induced parameter is a continuous function of the parameter and the following assumptions hold:

The parameter is identifiable.
The densities in the parametric model have common support and are differentiable in the parameter.
The true parameter is an interior point in parameter space.

Symbolically, if $g(\boldsymbol{\theta}) \in C^0(\boldsymbol{\Theta},\mathbb{R})$, and

$\boldsymbol{\theta} \ne \boldsymbol{\theta}' \implies f(x|\boldsymbol{\theta}) \ne f(x|\boldsymbol{\theta}')$
$\forall x \in \Omega, \nabla_{\boldsymbol{\theta}} f(x;\boldsymbol{\theta})$ exists.
$\exists \varepsilon > 0: B_{\varepsilon}(\boldsymbol{\theta}_0) \subseteq \boldsymbol{\Theta}$

Then, $$g(\hat{\boldsymbol{\theta}}) \overset{p}{\to} g(\boldsymbol{\theta}), \forall \boldsymbol{\theta} \in \boldsymbol{\Theta}$$
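Consistency can be observed empirically. The sketch below is a simulation under assumed settings, not a proof: it uses a Bernoulli model with a hypothetical true value $p = 0.3$, takes the odds $g(p) = p/(1-p)$ as the induced parameter, and reports the estimation error of $g(\hat{p})$ for increasing sample sizes.

```python
import random

def bernoulli_mle(sample):
    """MLE of p for an i.i.d. Bernoulli(p) sample: the sample mean."""
    return sum(sample) / len(sample)

def odds(p):
    """Induced parameter g(p) = p / (1 - p), continuous on (0, 1)."""
    return p / (1.0 - p)

rng = random.Random(0)  # fixed seed so the run is reproducible
true_p = 0.3            # hypothetical true parameter

errors = {}
for n in (100, 10_000, 200_000):
    sample = [1 if rng.random() < true_p else 0 for _ in range(n)]
    errors[n] = abs(odds(bernoulli_mle(sample)) - odds(true_p))
    print(n, errors[n])
```

The error is random, so it need not shrink at every step, but it should be small for the largest sample size.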

Asymptotic efficiency

MLEs are asymptotically efficient under somewhat stronger regularity conditions.

Theorem: The MLE of an induced parameter is asymptotically efficient, pointwise in the parameter space, if in addition to all the conditions for consistency the following assumptions hold:

$\forall x \in \Omega, f(x|\boldsymbol{\theta}) \in C^3 ( \boldsymbol{\Theta}, \mathbb{R} )$, and $\int f(x|\boldsymbol{\theta}) \mathrm{d} x$ is three times differentiable under the integral sign.
$\forall \boldsymbol{\theta}_0 \in \boldsymbol{\Theta}, \exists c>0$ and a function $M(x)$ such that $\lVert \nabla^3 \log f(x|\boldsymbol{\theta}) \rVert \leq M(x), \forall x \in \Omega, \boldsymbol{\theta} \in B_c(\boldsymbol{\theta}_0)$, and $\mathbb{E}_{\boldsymbol{\theta}_0} M(X) < \infty$.
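Asymptotic efficiency means that the variance of the MLE, scaled by the sample size, approaches the Cramér-Rao lower bound. The following simulation is a rough sketch of that statement under assumed settings: an exponential model with a hypothetical true rate of 2, for which the MLE is the reciprocal of the sample mean and the Fisher information per observation is $1/\text{rate}^2$.

```python
import random
import statistics

def exp_rate_mle(sample):
    """MLE of the rate of an Exponential(rate) sample: 1 / sample mean."""
    return len(sample) / sum(sample)

rng = random.Random(1)  # fixed seed so the run is reproducible
true_rate = 2.0         # hypothetical true parameter
n, replications = 400, 2000

estimates = [
    exp_rate_mle([rng.expovariate(true_rate) for _ in range(n)])
    for _ in range(replications)
]

# Fisher information per observation is 1 / rate^2, so the Cramér-Rao
# bound for n * Var(estimator) is rate^2.
scaled_variance = n * statistics.pvariance(estimates)
cramer_rao_bound = true_rate ** 2
print(scaled_variance, cramer_rao_bound)
```

The two printed numbers should be close but not identical, since the simulation uses a finite sample size and a finite number of replications.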

Dependence on sufficient statistics

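MLEs depend on the data only through sufficient statistics: a unique MLE is a function of every sufficient statistic, and when the maximizer is not unique, an MLE can be chosen that is such a function.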
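The reason can be sketched with the factorization theorem. If $T(\mathbf{X})$ is sufficient, the likelihood factors as $$L(\boldsymbol{\theta}|\mathbf{x}) = g( T(\mathbf{x}) | \boldsymbol{\theta} ) \, h(\mathbf{x})$$ where $h$ does not depend on $\boldsymbol{\theta}$, so the set of maximizers over $\boldsymbol{\theta}$ depends on $\mathbf{x}$ only through $T(\mathbf{x})$. As a worked instance, for a Bernoulli sample with $t = \sum_{i=1}^n x_i$: $$L(p|\mathbf{x}) = p^{t} (1-p)^{n-t}, \quad \hat{p} = \frac{t}{n}$$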

Techniques for finding/verifying MLEs

differentiation
direct maximization (unique attainable global upper bound)
log likelihood (the MLE also solves the score equation, i.e. it sets the gradient of the log likelihood function to zero)
successive maximizations
the EM algorithm

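Using the second-derivative condition to verify a maximum requires showing that the Hessian matrix of the log likelihood is negative definite at the candidate point, which can be laborious in multi-parameter models. Even then, it only establishes a local maximum.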

It is always important to analyze the likelihood function as much as possible, to find the number and nature of its local maxima, before resorting to numerical maximization.
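As a worked illustration of the log likelihood and successive maximization techniques, consider the $N(\theta,\sigma^2)$ model with both parameters unknown. The log likelihood is $$\log L(\theta,\sigma^2|\mathbf{x}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\theta)^2$$

For any fixed $\sigma^2$, this is maximized over $\theta$ at $\theta = \bar{x}$, because $\sum_{i=1}^n (x_i-\theta)^2$ is minimized there. Substituting $\bar{x}$ and differentiating with respect to $\sigma^2$: $$\frac{\partial}{\partial \sigma^2} \log L(\bar{x},\sigma^2|\mathbf{x}) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (x_i-\bar{x})^2 = 0 \implies \widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (x_i-\bar{x})^2$$

The derivative is positive for $\sigma^2 < \widehat{\sigma^2}$ and negative for $\sigma^2 > \widehat{\sigma^2}$, so this is the global maximum, and no Hessian check is needed.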

The EM algorithm

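The EM (expectation-maximization) algorithm is suited to finding MLEs in missing-data problems. It constructs a sequence of parameter estimates along which the likelihood never decreases; under regularity conditions the sequence converges to a stationary point of the likelihood, which coincides with the MLE when the likelihood is unimodal or the starting value is well chosen.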
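A minimal sketch of the idea, assuming a two-component normal mixture with known unit variances, where the unobserved component labels play the role of the missing data. The model, starting values, function names, and simulated data are all hypothetical choices for illustration.

```python
import math
import random

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x (unit variance is an assumption of this sketch)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, weight, mean_a, mean_b):
    """Observed-data log-likelihood of weight*N(mean_a,1) + (1-weight)*N(mean_b,1)."""
    return sum(
        math.log(weight * normal_pdf(x, mean_a) + (1.0 - weight) * normal_pdf(x, mean_b))
        for x in data
    )

def em_two_normals(data, iterations=50):
    """Run EM and return the final estimates plus the log-likelihood history."""
    weight, mean_a, mean_b = 0.5, min(data), max(data)  # crude starting values
    history = [log_likelihood(data, weight, mean_a, mean_b)]
    for _ in range(iterations):
        # E-step: posterior probability that each point came from component a.
        resp = []
        for x in data:
            pa = weight * normal_pdf(x, mean_a)
            pb = (1.0 - weight) * normal_pdf(x, mean_b)
            resp.append(pa / (pa + pb))
        # M-step: maximize the expected complete-data log-likelihood.
        total_a = sum(resp)
        total_b = len(data) - total_a
        mean_a = sum(r * x for r, x in zip(resp, data)) / total_a
        mean_b = sum((1.0 - r) * x for r, x in zip(resp, data)) / total_b
        weight = total_a / len(data)
        history.append(log_likelihood(data, weight, mean_a, mean_b))
    return weight, mean_a, mean_b, history

rng = random.Random(2)  # fixed seed so the run is reproducible
data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.4 else rng.gauss(3.0, 1.0)
        for _ in range(500)]
weight, mean_a, mean_b, history = em_two_normals(data)
print(weight, mean_a, mean_b, history[-1])
```

The log-likelihood history should be non-decreasing, which is the property EM guarantees; convergence to the global maximum depends on the starting values.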

List of MLEs

Parametric model	MLE
$N(\theta,b^2)$, $b$ known	$\hat{\theta} = \bar{X}$
$N(a,\sigma^2)$, $a$ known	$\widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i-a)^2$
$N(\theta,\sigma^2)$	$\left( \hat{\theta},\widehat{\sigma^2} \right) = \left( \bar{X},\frac{1}{n} \sum_{i=1}^n (X_i- \bar{X})^2 \right)$
$\text{Bernoulli}(p)$	$\hat{p}=\bar{X}$
$U(0,\theta)$	$\hat{\theta} = X_{(n)}$
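The closed forms in the table translate directly into code. A small sketch, with hypothetical function names and made-up data:

```python
def mle_normal_mean(sample):
    """MLE of the mean of a normal model: the sample mean."""
    return sum(sample) / len(sample)

def mle_normal_variance(sample, mean=None):
    """MLE of the variance: (1/n) * sum of squared deviations.

    Deviations are taken about the known mean if given, otherwise about
    the sample mean. Note the divisor is n, not n - 1.
    """
    center = mle_normal_mean(sample) if mean is None else mean
    return sum((x - center) ** 2 for x in sample) / len(sample)

def mle_bernoulli(sample):
    """MLE of p for 0/1 data: the sample proportion of ones."""
    return sum(sample) / len(sample)

def mle_uniform_upper(sample):
    """MLE of theta for U(0, theta): the largest observation."""
    return max(sample)

measurements = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
print(mle_normal_mean(measurements))
print(mle_normal_variance(measurements))
print(mle_normal_variance(measurements, mean=2.0))
print(mle_bernoulli([1, 0, 0, 1, 1]))
print(mle_uniform_upper(measurements))
```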

🏷 Category=Statistics