A point estimator is any function $W(\mathbf{X})$ of a sample. That is, any statistic is a point estimator.

Methods of Finding Estimators

Substitution estimators

1. Empirical Distribution Function
2. Method of Moments Estimators

Maximum likelihood estimators

1. Likelihood Function
2. Maximum Likelihood
3. Properties of MLE
4. Techniques for finding/verifying MLEs
5. List of MLEs

Bayesian estimation

Notes on Bayesian estimation

1. The Bayesian approach to statistics
2. Bayes estimators

In a broad sense, MLE and Bayesian estimation are both model selection methods, and both are built on the likelihood function of the observed sample. While the former is comparatively easier to implement, the latter is more robust to assumptions.

MLE evaluates the likelihood function on a given sample, then selects the model that maximizes it. With likelihood as the objective function, MLE chooses the model that best explains the available observations; treating that best-fitting model as the true one is a strong assumption.
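As a minimal sketch of this procedure, assume an i.i.d. exponential model with unknown rate. The sample values, the candidate grid, and the function names below are hypothetical and chosen only for illustration. The code scores each candidate rate by its log-likelihood, keeps the best one, and compares it with the closed-form MLE (the reciprocal of the sample mean).

```python
import math

def exp_log_likelihood(rate, sample):
    """Log-likelihood of an i.i.d. Exponential(rate) sample."""
    n = len(sample)
    return n * math.log(rate) - rate * sum(sample)

def exp_mle_grid(sample, grid):
    """Return the candidate rate with the highest log-likelihood."""
    return max(grid, key=lambda rate: exp_log_likelihood(rate, sample))

# Hypothetical observations, assumed to come from an exponential model.
sample = [0.5, 1.5, 1.0, 2.0, 1.0]

# Candidate rates 0.01, 0.02, ..., 5.00 (an arbitrary illustrative grid).
grid = [k / 100 for k in range(1, 501)]

rate_grid = exp_mle_grid(sample, grid)
rate_closed = len(sample) / sum(sample)  # closed-form MLE: 1 / sample mean

print(rate_grid, rate_closed)
```

Because the exponential log-likelihood is concave in the rate, the grid maximizer lies within one grid step of the closed-form value.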

Bayesian estimation provides a probability distribution over a subspace of probabilistic models, rather than specifying a single model. It takes as input not only a random sample but also a prior distribution over the model subspace. Its output is called the posterior: the normalized product of the prior and the likelihood function. The posterior typically provides sharper predictions than the prior when the prior is close to the likelihood function. The main issue with the Bayesian method is how to justify the prior.
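To make the prior-to-posterior update concrete, here is a small sketch assuming a Bernoulli model with a Beta prior, a conjugate pair for which the normalized product of prior and likelihood is again a Beta distribution. The prior parameters, the data, and the function names are hypothetical. The posterior mean is shown because it is the Bayes estimator under squared error loss.

```python
def beta_bernoulli_posterior(alpha, beta, sample):
    """Update a Beta(alpha, beta) prior with 0/1 observations."""
    successes = sum(sample)
    failures = len(sample) - successes
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

# Hypothetical data: 6 successes and 2 failures.
sample = [1, 0, 1, 1, 0, 1, 1, 1]

# Hypothetical prior, symmetric around 0.5.
prior = (2.0, 2.0)
posterior = beta_bernoulli_posterior(*prior, sample)

print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
print("prior variance:", beta_variance(*prior))
print("posterior variance:", beta_variance(*posterior))
```

In this example the posterior mean falls between the prior mean (0.5) and the sample proportion (0.75), and the posterior variance is smaller than the prior variance.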

Methods of Evaluating Estimators

1. luck (Corollary)
2. solve (Corollary)
3. conditional (Lehmann-Scheffe)
4. information inequality
5. linear dependence

Def: unbiased

Def: UMVU

Completeness & Sufficiency

Notes on Completeness & Sufficiency

Thm: (Rao-Blackwell)

Thm: (Lehmann-Scheffe)

Cor:

Information Inequality (Cramer-Rao bound)

Notes on Information Inequality

The score function of a parametric family of probability distributions with PDFs is the gradient of the log PDF w.r.t. the parameters: $u(\mathbf{x}; \theta) = \nabla_\theta \log p(\mathbf{x}; \theta)$. That is, it measures how sensitive the log PDF is to the parameters.

Under the usual regularity conditions, the score as a random variable/vector has mean zero, regardless of the parameters: $\mathbb{E} u(\chi; \theta) = 0$. The Fisher information of a parametric family is the variance/covariance of the score, seen as a random variable/vector depending on the parameters: $I_\chi(\theta) = \text{Var} u(\chi; \theta)$; equivalently, it is the negative expectation of the Hessian of the log PDF w.r.t. the parameters: $I_\chi(\theta) = - \mathbb{E} \nabla^2_\theta \log p(\chi; \theta)$. Fisher information is additive over independent components: if $\chi = (\chi_i)_{i=1}^n$ and the $\chi_i$ are jointly independent, then $I_\chi(\theta) = \sum_{i=1}^n I_{\chi_i}(\theta)$.
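These identities can be checked exactly in a simple case. The sketch below assumes a single Bernoulli($p$) observation, for which the log PMF is $x \log p + (1 - x) \log(1 - p)$; the function names and the value of $p$ are illustrative. It computes the mean of the score, its variance, and the negative expected second derivative by summing over the two outcomes.

```python
def score(x, p):
    """Derivative of the Bernoulli log PMF w.r.t. p."""
    return x / p - (1 - x) / (1 - p)

def second_derivative(x, p):
    """Second derivative of the Bernoulli log PMF w.r.t. p."""
    return -x / p ** 2 - (1 - x) / (1 - p) ** 2

def expect(f, p):
    """Exact expectation of f(X) for X ~ Bernoulli(p)."""
    return (1 - p) * f(0) + p * f(1)

p = 0.3  # illustrative parameter value

mean_score = expect(lambda x: score(x, p), p)
fisher_var = expect(lambda x: score(x, p) ** 2, p) - mean_score ** 2
fisher_hess = -expect(lambda x: second_derivative(x, p), p)

print("mean of score:", mean_score)
print("variance of score:", fisher_var)
print("negative expected second derivative:", fisher_hess)
print("closed form 1 / (p (1 - p)):", 1 / (p * (1 - p)))
```

Both expressions for the Fisher information agree with the closed form $1 / (p(1-p))$ up to floating-point error. By additivity, $n$ independent draws carry information $n / (p(1-p))$.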

Information inequality, aka Cramer-Rao bound: For any statistic of a parametric random variable, its mean cannot change with the parameters faster than its standard deviation times the square root of the Fisher information: let $g(\theta) = \mathbb{E} T(\chi; \theta)$, then (1) univariate version, $|g'(\theta)| \le \sqrt{\text{Var}T(\chi; \theta) I_\chi(\theta)}$; (2) multivariate version, $(\nabla g(\theta))^\top (I_\chi^{-1}(\theta)) (\nabla g(\theta)) \le \text{Var}T(\chi; \theta)$. Equality holds if and only if the statistic is linearly dependent on the score, i.e. $T(\chi; \theta) - g(\theta) = a(\theta)^\top u(\chi; \theta)$ almost surely for some $a(\theta)$.

Corollary: Any unbiased estimator of a scalar parameter must have a variance no less than the reciprocal of the Fisher information: if $\mathbb{E} T(\chi; \theta) = \theta$, then $\text{Var} T(\chi; \theta) \ge 1 / I_\chi(\theta)$.
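A quick numerical check of the corollary, assuming $n$ i.i.d. Bernoulli($p$) draws, where one draw has Fisher information $1 / (p(1-p))$ and the sample mean is an unbiased estimator of $p$. The values of `n` and `p` and the function names are illustrative. The variance of the sample mean is computed exactly from the binomial PMF and compared with the bound $1 / (n I(p))$.

```python
import math

def sample_mean_variance(n, p):
    """Exact variance of the mean of n i.i.d. Bernoulli(p) draws."""
    pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(w * k / n for k, w in enumerate(pmf))
    return sum(w * (k / n - mean) ** 2 for k, w in enumerate(pmf))

def cramer_rao_bound(n, p):
    """1 / (n * I(p)), with I(p) = 1 / (p (1 - p)) for one Bernoulli draw."""
    return p * (1 - p) / n

n, p = 20, 0.3  # illustrative values

variance = sample_mean_variance(n, p)
bound = cramer_rao_bound(n, p)

print("variance of sample mean:", variance)
print("Cramer-Rao bound:", bound)
```

Here the two numbers coincide up to floating-point error, so the sample mean attains the bound in this model. This is consistent with the equality condition, since the sample mean is a linear function of the score.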

Note: A UMVU estimator does not have to achieve the information bound.
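A standard textbook illustration of this note (not taken from the notes above): let $X_1, \dots, X_n$ be i.i.d. Poisson($\lambda$) and estimate $g(\lambda) = e^{-\lambda}$. The sum $S = \sum_{i=1}^n X_i \sim \text{Poisson}(n\lambda)$ is complete and sufficient, and $T = (1 - 1/n)^S$ is unbiased for $g(\lambda)$, so $T$ is UMVU by Lehmann-Scheffe. With $I(\lambda) = 1/\lambda$ for a single observation, its variance is strictly above the information bound:

$$
\text{Var}\, T = e^{-2\lambda} \left( e^{\lambda / n} - 1 \right) > \frac{\lambda e^{-2\lambda}}{n} = \frac{g'(\lambda)^2}{n I(\lambda)}
$$

The strict inequality follows from $e^x - 1 > x$ for $x > 0$, applied with $x = \lambda / n$.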