This article discusses least squares in its linear, nonlinear, and weighted forms, in the context of regression analysis.
Convention on notations:
Note that a standalone $1$ denotes a vector of ones, just as a standalone $0$ denotes a vector of zeros.
The ordinary least squares (OLS), or linear least squares, estimator of regression coefficients minimizes the residual sum of squares: $\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' (y - X \beta)$. The closed-form solution of the OLS estimator is $\hat{\beta} = (X' X)^{-1} X' y$.
For a regression model with $k$ predictors estimated on a sample of $n$ observations, the outcome vector $y$ is n-by-1, the predictor matrix $X$ is n-by-k, and the coefficient vector $\beta$ is k-by-1. Typically $n > k$ is assumed, and sometimes $p$ is used instead of $k$.
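The closed-form estimator is straightforward to verify numerically. A minimal sketch with NumPy and synthetic data (the sample size, coefficients, and noise level are all illustrative); note that solving the normal equations is preferred over explicitly inverting $X'X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
# Predictor matrix with an intercept column and two random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

# Closed-form OLS estimator: beta_hat = (X'X)^{-1} X'y,
# computed by solving the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and $n \gg k$, `beta_hat` recovers the data-generating coefficients closely.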
Terminology:
Properties of OLS residuals and fitted values:
The most widely used estimator for the residual variance $\sigma^2$ is the unbiased sample variance of residuals, $s^2 = \text{RSS} / (n-k)$. It is unbiased conditional on the predictors, $\mathbb{E}( s^2 \mid X ) = \sigma^2$. Its square root $s$ is called the standard error of the regression or the residual standard error (RSE).
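The $n-k$ divisor (rather than $n$) is what makes $s^2$ unbiased. A sketch, continuing in NumPy with a toy data-generating process whose true $\sigma$ is known by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=2.0, size=n)  # true sigma = 2

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
rss = residuals @ residuals
s2 = rss / (n - k)   # unbiased estimator of sigma^2
rse = np.sqrt(s2)    # residual standard error, should be near 2.0
```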
Gauss-Markov Theorem:
Among all linear unbiased estimators, the OLS estimator has the "smallest" variance-covariance matrix, in the sense that the difference from any competitor's covariance matrix is positive semi-definite.
Notes:
If the Conditional Normality (of residuals) Assumption holds, then additionally we have:
Estimators of OLS estimator's statistical properties:
As an alternative to the standard random-variable perspective, the variables can be seen as vectors in a finite-dimensional sample space, $y = X \beta + u$. From this viewpoint, the OLS estimate of a linear regression model is the projection of the sample outcome onto the linear span of the sample predictors.
Terminology:
The projection matrix $P_X$ projects a vector orthogonally onto the column space of $X$.
In algebra, an idempotent matrix is equivalent to a projection. The projection matrix appearing here is in fact an orthogonal projection, since it is both idempotent and symmetric.
The projection matrix $P_X$ is sometimes called the hat matrix, because it transforms the outcome into the fitted values ( $\hat{y} = X (X'X)^{-1} X' y$ ). The diagonal elements of the hat matrix are called the hat values; observations with relatively large hat values are influential on the regression coefficients. Inspecting the hat values can be used to detect potential coding errors or other suspicious values that may unduly affect the regression coefficients.
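The idempotence and symmetry of $P_X$, and the use of hat values for flagging high-leverage observations, can be illustrated directly. A sketch in NumPy with one deliberately outlying predictor value (all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
X[0, 1] = 10.0  # an outlying predictor value with high leverage

# Hat matrix P_X = X (X'X)^{-1} X'
P = X @ np.linalg.solve(X.T @ X, X.T)
hat_values = np.diag(P)

# P is idempotent (P @ P == P) and symmetric, so it is an orthogonal
# projection; its trace equals its rank k, so the hat values sum to k.
```

The first observation, whose predictor value sits far from the rest, has the largest hat value.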
Model: $\mathbf{y} \mid \mathbf{X} \sim N(\mathbf{X}_1 \beta_1 + \mathbf{X}_2 \beta_2, \sigma^2 I)$. OLS Estimator: $\begin{cases} \hat{\beta}_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 y \\ \hat{\beta}_2 = (X_2' M_1 X_2)^{-1} X_2' M_1 y \end{cases}$, where $M_i$ is shorthand for the annihilator matrix $M_{X_i}$.
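This partitioned-regression identity (the Frisch-Waugh-Lovell theorem) can be checked numerically: the formula for $\hat{\beta}_1$ reproduces the corresponding block of the full OLS fit. A sketch in NumPy on synthetic data (dimensions and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # first block, k1 = 2
X2 = rng.normal(size=(n, 2))                            # second block, k2 = 2
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

def annihilator(Z):
    # M_Z = I - Z (Z'Z)^{-1} Z', projects onto the orthogonal complement of col(Z)
    return np.eye(Z.shape[0]) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Partitioned formula: beta_1 = (X1' M2 X1)^{-1} X1' M2 y
M2 = annihilator(X2)
beta1_partitioned = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

# Full OLS on the stacked predictor matrix; first two coefficients should match
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
```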
If both the outcome and the (non-constant) predictors are centered before regression, the regression coefficients will be the same as those obtained when an intercept is explicitly included in a regression on the raw data. So save the effort of centering, and always include an intercept in the model.
If we run a randomized experiment on a sample, the randomized attribute will be uncorrelated (when centered) with any latent variable, so the regression coefficient on this variable will not be affected by additional predictors.
Strategies to correctly estimate the marginal effect of a predictor:
Generalized least squares (GLS) [@Aitken1936] is a linear least squares problem with a known covariance matrix of the residuals: given a linear regression model $y = X \beta + \varepsilon$ where $\mathbb{E}(\varepsilon | X) = 0$ and $\text{Cov}(\varepsilon | X) = \Omega$, the GLS estimator is defined as $\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' \Omega^{-1} (y - X \beta)$. Notice that the covariance matrix $\Omega$ only needs to be known up to a scale factor. The closed-form solution of the GLS estimator is $\hat{\beta} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y$.
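The closed-form GLS estimator translates directly into code. A sketch in NumPy with heteroskedastic errors whose (diagonal) covariance $\Omega$ is known by construction; the variance process here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Heteroskedastic errors with a known diagonal covariance Omega
var = np.exp(rng.uniform(-1, 2, size=n))
y = X @ beta + rng.normal(size=n) * np.sqrt(var)

# GLS estimator: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.diag(1.0 / var)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

For a general (non-diagonal) $\Omega$, the same formula applies with `Omega_inv` computed from the full matrix.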
Weighted least squares (WLS) is a least squares problem where each residual is assigned a non-negative weight. This is useful when the data points are of varying quality, or when error variances are not the same.
The optimal weights are inversely proportional to the error variances.
Weighted least squares can also be applied to nonlinear least squares problems.
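WLS is the special case of GLS with a diagonal $\Omega$, and it can be computed as OLS on a rescaled system: multiply each row of $X$ and $y$ by $\sqrt{w_i}$. A sketch in NumPy where the error standard deviation grows with the predictor (the variance model is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([0.5, 1.0])
sigma = 0.5 * x                   # error sd grows with x (known here)
y = X @ beta + rng.normal(size=n) * sigma

w = 1.0 / sigma**2                # weights inversely proportional to variance
sw = np.sqrt(w)
# WLS = OLS on the rescaled system (sqrt(w_i) * y_i, sqrt(w_i) * x_i)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
```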
A nonlinear least squares (NLS) regression model is one where the conditional mean function is nonlinear, $\mathbf{y} \mid \mathbf{x} \sim N(f(\mathbf{x}, \beta), \sigma^2)$. The optimization problem is generally nonconvex and is often solved with iterative algorithms. A typical iteration takes the step $\beta^{k+1} - \beta^k = (J^{k\prime} J^k)^{-1} J^{k\prime} \varepsilon^k$, where the $i$-th row of the Jacobian is $J^k_i = \nabla_{\beta} f(X_i, \beta^k)'$, and $\varepsilon^k = y - \hat{y}^k$ with $\hat{y}^k_i = f(X_i, \beta^k)$.
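This iteration is the Gauss-Newton step: each update solves a linear least squares problem in the Jacobian. A sketch in NumPy for an assumed toy model $f(x, \beta) = \beta_0 e^{\beta_1 x}$ (the model, starting values, and data are all illustrative, and no step-size safeguard is included):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 4, size=n)
beta_true = np.array([2.0, 0.7])
# Toy nonlinear mean function: f(x, b) = b0 * exp(b1 * x)
y = beta_true[0] * np.exp(beta_true[1] * x) + rng.normal(scale=0.1, size=n)

beta = np.array([1.5, 0.6])  # starting values near the truth
for _ in range(100):
    yhat = beta[0] * np.exp(beta[1] * x)
    resid = y - yhat
    # Jacobian of f w.r.t. beta, evaluated at the current iterate:
    # columns are df/db0 = exp(b1 x) and df/db1 = b0 * x * exp(b1 x)
    J = np.column_stack([np.exp(beta[1] * x),
                         beta[0] * x * np.exp(beta[1] * x)])
    step = np.linalg.solve(J.T @ J, J.T @ resid)  # Gauss-Newton step
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break
```

In practice, damped variants (e.g. Levenberg-Marquardt) are preferred when the plain step overshoots.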
Other algorithms include Gauss-Newton type methods and variable projection [@GolubPereyra1973].