A **point estimator** is any function $W(\mathbf{X})$ of a sample.
That is, any statistic is a point estimator.

- Empirical Distribution Function
- Method of Moments Estimators

- Likelihood Function
- Maximum Likelihood
- Properties of MLE
- Techniques for finding/verifying MLEs
- List of MLEs

- The Bayesian approach to statistics
- Bayes estimators

In a broad sense, **MLE** and **Bayesian estimation** are both model selection methods, and they are closely related.
The former is comparatively easier to implement, while the latter is more robust to modeling assumptions.

*MLE* evaluates the likelihood function of a given sample and selects the model with the maximum score.
With likelihood as the objective function, MLE commits to the single model that best justifies the available observations, which is a strong assumption.
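As a minimal sketch of "pick the model with the maximum score", consider a hypothetical Bernoulli coin-flip sample: each candidate value of $p$ is a model, and we compare a grid search over the log-likelihood against the known closed-form MLE (the sample mean). The sample values are made up for illustration.

```python
import math

# Hypothetical coin-flip sample (1 = heads).
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_likelihood(p, xs):
    """Log-likelihood of a Bernoulli(p) model for observations xs."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

# MLE as model selection: score every candidate model, keep the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, sample))

closed_form = sum(sample) / len(sample)  # Bernoulli MLE is the sample mean
print(p_hat, closed_form)
```

The grid maximizer coincides with the sample mean, as the closed-form Bernoulli MLE predicts.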

*Bayesian estimation* provides a probability distribution over a subspace of probabilistic models, rather than specifying a single model.
It takes as input not only the random sample but also a prior distribution over the model subspace.
The output of Bayesian estimation is called the posterior: the normalized product of the prior and the likelihood function.
The posterior typically gives a sharper prediction than the prior, provided the prior is not too far from the likelihood.
The main difficulty with the Bayesian method is justifying the choice of prior.
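A minimal sketch of the posterior as "normalized product of prior and likelihood", using the conjugate Beta–Bernoulli pair so the normalization is available in closed form; the prior pseudo-counts and the sample are illustrative assumptions.

```python
# Beta(a, b) prior over the Bernoulli parameter p (illustrative pseudo-counts).
a, b = 2.0, 2.0
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
heads, tails = sum(sample), len(sample) - sum(sample)

# Conjugacy: posterior is Beta(a + heads, b + tails) -- the normalized
# product of the Beta prior density and the Bernoulli likelihood.
a_post, b_post = a + heads, b + tails

def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(beta_mean(a_post, b_post))  # posterior mean, pulled toward the prior mean 0.5
print(beta_var(a_post, b_post) < beta_var(a, b))  # posterior is sharper than the prior
```

Note the output is a whole distribution over models, not a single point estimate; the posterior mean is one way to summarize it.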

- luck (Corollary)
- solve (Corollary)
- conditional (Lehmann-Scheffe)
- information inequality
- linear dependence

Def: unbiased

Def: UMVU

Notes on Completeness & Sufficiency

Thm: (Rao-Blackwell)

Thm: (Lehmann-Scheffe)

Cor:

Notes on Information Inequality

The **score function** of a parametric family of probability distributions with PDFs $p(\mathbf{x}; \theta)$
is the partial gradient of the log PDF w.r.t. the parameters:
$u(\mathbf{x}; \theta) = \nabla_\theta \log p(\mathbf{x}; \theta)$.
That is, it measures how sensitive the log PDF is to the parameters.

The score as a random variable/vector has mean zero, regardless of the parameters:
$\mathbb{E} u(\chi; \theta) = 0$.
**Fisher information** of a parametric family is the variance/covariance of the score,
seen as a random variable/vector depending on the parameters:
$I_\chi(\theta) = \text{Var} u(\chi; \theta)$; equivalently,
it is the negative expectation of the partial Hessian of the log PDF w.r.t. the parameters:
$I_\chi(\theta) = - \mathbb{E} \nabla^2_\theta \log p(\chi; \theta)$.
Fisher information is additive over independent parametric families:
if $(\chi_i)_{i=1}^n$ are jointly independent and $\chi = (\chi_1, \dots, \chi_n)$,
then $I_\chi(\theta) = \sum_{i=1}^n I_{\chi_i}(\theta)$.
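The two score identities above can be checked exactly for a Bernoulli($p$) family, whose sample space has only two points, so the expectations are finite sums; the value of $p$ is an arbitrary illustration.

```python
# Exact check of the score identities for a Bernoulli(p) family.
p = 0.3

def score(x, p):
    """d/dp log p(x; p) for the Bernoulli PMF p^x (1 - p)^(1 - x)."""
    return x / p - (1 - x) / (1 - p)

pmf = {0: 1 - p, 1: p}
mean_score = sum(pmf[x] * score(x, p) for x in (0, 1))
var_score = sum(pmf[x] * score(x, p) ** 2 for x in (0, 1))

print(mean_score)  # the score has mean zero
print(var_score)   # the Fisher information, 1 / (p (1 - p))
```

Additivity then gives $I(\theta) = n / (p(1-p))$ for $n$ i.i.d. Bernoulli trials.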

*Information inequality, a.k.a. the Cramér-Rao bound*:
For any transformation of a parametric random variable,
its mean cannot change faster than its Fisher information-normalized standard deviation:
let $g(\theta) = \mathbb{E} T(\chi; \theta)$, then
(1) univariate version, $|g'(\theta)| \le \sqrt{\text{Var}T(\chi; \theta) I_\chi(\theta)}$;
(2) multivariate version, $(\nabla g(\theta))^T (I_\chi^{-1}(\theta)) (\nabla g(\theta))
\le \text{Var}T(\chi; \theta)$.
The equality holds if and only if $T(\chi; \theta)$ is an affine function of the score $u(\chi; \theta)$.

Corollary: Any unbiased estimator of the parameter must have a variance no less than the reciprocal of the Fisher information: if $\mathbb{E} T(\chi; \theta) = \theta$, then $\text{Var} T(\chi; \theta) \ge 1 / I_\chi(\theta)$.
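A worked instance of the corollary, under illustrative values of $p$ and $n$: the sample mean of $n$ Bernoulli($p$) trials is unbiased for $p$, and its exact variance $p(1-p)/n$ attains the bound $1/I(\theta)$, since the information of $n$ independent trials is $n/(p(1-p))$ by additivity.

```python
# Sample mean of n Bernoulli(p) trials vs. the Cramér-Rao bound.
p, n = 0.3, 25

fisher_info_n = n / (p * (1 - p))    # additivity over n independent trials
var_sample_mean = p * (1 - p) / n    # exact variance of the (unbiased) sample mean

print(var_sample_mean, 1 / fisher_info_n)  # the bound is attained here
```

Attainment is consistent with the equality condition: the sample mean is an affine function of the score for this family.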

Note: A UMVU estimator does not have to attain the information bound.

- Examples of sufficient/complete statistics & UMVU estimators
- Examples of Fisher information and UMVU estimators