Likelihood is the hypothetical probability that an event which has already occurred would yield a certain outcome. The concept differs from probability in that probability refers to the occurrence of future events, while likelihood refers to past events with known outcomes.
In statistical inference, where a random sample has been realized with certain sample values, likelihood refers to the hypothetical probability that the random sample yields such observations, under some probability model of the population.
A likelihood function is the probability (density) of the occurrence of a sample realization, viewed as a function of the parameter, given that the form of the probability density \( f(x;\theta) \) is known; for an i.i.d. sample, \[ L(\theta) = P \{\mathbf{X}=\mathbf{x} \mid X \sim f(x;\theta) \} = \prod_{i=1}^n f(x_i;\theta) \]
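As a minimal numerical sketch of this definition (my own illustration, not from the original text), assume an i.i.d. \( N(\theta, \sigma^2) \) sample with \( \sigma \) known; the function names and the NumPy/SciPy dependencies are choices of the illustration:

```python
import numpy as np
from scipy.stats import norm

def likelihood(theta, x, sigma=1.0):
    """L(theta): product of N(theta, sigma^2) densities over the sample."""
    return np.prod(norm.pdf(x, loc=theta, scale=sigma))

def log_likelihood(theta, x, sigma=1.0):
    """Log-likelihood; numerically safer than the raw product."""
    return np.sum(norm.logpdf(x, loc=theta, scale=sigma))

x = np.array([1.2, 0.7, 1.9, 1.1])   # a hypothetical sample realization
print(likelihood(1.0, x), log_likelihood(1.0, x))
```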
Maximum likelihood is the procedure of searching through the parameter space for the probability model that maximizes the likelihood of the current observations.
The maximum likelihood estimator (MLE) of a model parameter is the maximizer of the likelihood function: \[ \hat{\boldsymbol{\theta}}(\mathbf{X}) = \arg \sup_{\boldsymbol{\theta}} L( \boldsymbol{\theta} | \mathbf{X} ) \]
The MLE is an optimization estimator: its objective function is the likelihood of the current observations as a function of the model parameters, and the feasible domain is the parameter space. From another perspective, the feasible domain is a space of parametric probability models, and the objective is to search through that space for the model that best explains the current observations.
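A sketch of this optimization view (my own example, assuming an i.i.d. \( N(\theta, \sigma^2) \) model and using scipy.optimize; neither is prescribed by the text) maximizes the log-likelihood numerically:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, x):
    """Negative log-likelihood of an i.i.d. N(theta, sigma^2) sample."""
    theta, log_sigma = params              # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=theta, scale=sigma))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated observations

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,), method="Nelder-Mead")
theta_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(theta_hat, sigma_hat)   # close to the sample mean and the root mean squared deviation
```

Reparameterizing \( \sigma \) through its logarithm is a convenience of the sketch: it turns the constrained search over \( \sigma > 0 \) into an unconstrained one.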
Difficulties of optimization:
The induced likelihood function of a parameter that depends on the model parameter is the maximum likelihood of the observations within the family of models indexed by the induced parameter. Symbolically, given induced parameter \( \eta = g(\boldsymbol{\theta}) \), the induced likelihood function of \( \eta \) is: \[ L^{*}(\eta|\mathbf{x}) = \sup_{ \boldsymbol{\theta}: g(\boldsymbol{\theta}) = \eta } L(\boldsymbol{\theta}|\mathbf{x}) \]
The MLE of an induced parameter is the value of the induced parameter that maximizes the induced likelihood function: \[ \hat{\eta}(\mathbf{X}) = \arg \sup_{\eta} L^{*}( \eta | \mathbf{X} ) \]
Theorem: (Invariance property of MLEs) The MLE of any induced parameter equals the induced value of the MLE of the model parameter. \[ \widehat{g(\boldsymbol{\theta})} = g(\hat{\boldsymbol{\theta}}), \quad \forall g \]
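A quick numerical illustration of the invariance property (my own sketch, assuming the \( N(a, \sigma^2) \) model with \( a \) known and the induced parameter \( \sigma = g(\sigma^2) = \sqrt{\sigma^2} \)):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
a = 0.0                                    # known mean
x = rng.normal(loc=a, scale=2.0, size=500)

# MLE of the model parameter sigma^2 (closed form: mean squared deviation from a)
sigma2_hat = np.mean((x - a) ** 2)

# Direct MLE of the induced parameter sigma, by maximizing the induced likelihood
res = minimize_scalar(lambda s: -np.sum(norm.logpdf(x, loc=a, scale=s)),
                      bounds=(1e-6, 10.0), method="bounded")
sigma_hat_direct = res.x

print(np.sqrt(sigma2_hat), sigma_hat_direct)   # the two agree: MLE of g(theta) is g(theta_hat)
```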
MLEs are consistent in most cases.
Theorem: The MLE of an induced parameter is consistent, pointwise in the parameter space, if the induced parameter is a continuous function of the parameter and the following assumptions hold:
Symbolically, if \( g(\boldsymbol{\theta}) \in C^0(\boldsymbol{\Theta},\mathbb{R}) \), and
Then, \[ g(\hat{\boldsymbol{\theta}}) \overset{p}{\to} g(\boldsymbol{\theta}), \forall \boldsymbol{\theta} \in \boldsymbol{\Theta} \]
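A small simulation sketch of consistency (my own illustration, assuming a \( N(\theta, 1) \) model so that the MLE is the sample mean): the estimate concentrates around the true value as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 3.0

for n in (10, 100, 1_000, 10_000):
    x = rng.normal(loc=theta_true, scale=1.0, size=n)
    theta_hat = x.mean()                   # MLE of theta under N(theta, 1)
    print(n, abs(theta_hat - theta_true))  # the error shrinks as n grows
```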
MLEs are asymptotically efficient in most cases.
Theorem: The MLE of an induced parameter is asymptotically efficient, pointwise in the parameter space, if, in addition to all the conditions for consistency, the following assumptions hold:
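Asymptotic efficiency here means that the estimator attains the Cramér-Rao lower bound in the limit; under the usual regularity conditions, and for a scalar parameter \( \theta \) with Fisher information \( I(\theta) \), the conclusion is typically stated as \[ \sqrt{n}\left( g(\hat{\theta}) - g(\theta) \right) \overset{d}{\to} N\!\left( 0, \frac{[g'(\theta)]^2}{I(\theta)} \right) \]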
MLEs are always functions of sufficient statistics: if \( T(\mathbf{X}) \) is sufficient, the likelihood depends on the data only through \( T(\mathbf{X}) \) (by the factorization theorem), so its maximizer can be written as a function of \( T(\mathbf{X}) \).
Using the second-derivative condition to verify a maximum of the likelihood requires checking negative definiteness of the Hessian matrix of the (log-)likelihood, which can be formidable.
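A numerical sketch of such a check (my own example; the finite-difference Hessian helper and the \( N(\theta, \sigma^2) \) log-likelihood are assumptions of the illustration):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(params, x):
    theta, sigma = params
    return np.sum(norm.logpdf(x, loc=theta, scale=sigma))

def numerical_hessian(f, p, h=1e-4):
    """Central finite-difference Hessian of f at the point p."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(p + e_i + e_j) - f(p + e_i - e_j)
                       - f(p - e_i + e_j) + f(p - e_i - e_j)) / (4 * h * h)
    return H

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=300)
mle = np.array([x.mean(), x.std()])          # closed-form MLE of (theta, sigma)

H = numerical_hessian(lambda p: log_likelihood(p, x), mle)
print(np.linalg.eigvalsh(H))                 # all eigenvalues negative => negative definite
```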
It is always important to analyze the likelihood function as much as possible, to find the number and nature of its local maxima, before resorting to numerical maximization.
The EM (expectation-maximization) algorithm is suited to finding MLEs in missing-data problems: it constructs a sequence of parameter estimates whose likelihood never decreases and which, under regularity conditions, converges to a (local) maximizer of the likelihood.
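A minimal EM sketch (my own illustration; the two-component Gaussian mixture with known unit variances is an assumption of the example, with the unobserved component labels playing the role of missing data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Simulated data: mixture 0.4*N(-2, 1) + 0.6*N(3, 1); the labels are the missing data
z = rng.random(500) < 0.4
x = np.where(z, rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500))

pi, mu1, mu2 = 0.5, -1.0, 1.0             # initial guesses
for _ in range(200):
    # E-step: posterior probability that each observation came from component 1
    w1 = pi * norm.pdf(x, mu1, 1.0)
    w2 = (1 - pi) * norm.pdf(x, mu2, 1.0)
    r = w1 / (w1 + w2)
    # M-step: maximize the expected complete-data log-likelihood
    pi = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu2 = np.sum((1 - r) * x) / np.sum(1 - r)

print(pi, mu1, mu2)                        # close to 0.4, -2, 3
```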
| Parametric model (i.i.d. sample) | MLE |
|---|---|
| \(N(\theta, b^2)\), \(b^2\) known | \(\hat{\theta} = \bar{X}\) |
| \(N(a, \sigma^2)\), \(a\) known | \(\widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i-a)^2\) |
| \(N(\theta, \sigma^2)\) | \( \left( \hat{\theta},\widehat{\sigma^2} \right) = \left( \bar{X},\frac{1}{n} \sum_{i=1}^n (X_i- \bar{X})^2 \right) \) |
| \(\mathrm{Bernoulli}(p)\) | \(\hat{p}=\bar{X}\) |
| \(U(0,\theta)\) | \(\hat{\theta} = X_{(n)}\) |
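A quick check of two of these closed-form MLEs against direct numerical maximization (my own sketch; the use of scipy.optimize and scipy.stats is a choice of the illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import uniform

rng = np.random.default_rng(5)

# U(0, theta): the likelihood is theta^(-n) for theta >= X_(n), so it peaks at theta = X_(n)
x = rng.uniform(0.0, 4.0, size=50)
res = minimize_scalar(lambda t: -np.sum(uniform.logpdf(x, loc=0.0, scale=t)),
                      bounds=(x.max(), 20.0), method="bounded")
print(x.max(), res.x)                       # numerical maximizer sits at the boundary X_(n)

# Bernoulli(p): sum(x)*log(p) + (n - sum(x))*log(1 - p) is maximized at the sample mean
b = rng.random(200) < 0.3
res = minimize_scalar(lambda p: -(b.sum() * np.log(p) + (len(b) - b.sum()) * np.log(1 - p)),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(b.mean(), res.x)                      # both close to 0.3
```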