This article discusses least squares in its linear, nonlinear, and weighted forms, in the context of regression analysis.

Convention on notations:

- normal font for deterministic entities, boldface for random entities;
- lowercase for vectors (and scalars), uppercase for matrices, subscript for components;
- math symbols refer to separate entities, text symbols refers to one entity.

Note that, a standalone $1$ refers to a vector of ones, just like a standalone $0$ for a vector of zeros.

**Ordinary least squares** (OLS) or **linear least squares** estimator of regression coefficients
minimizes the residual sum of squares,
$\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' (y - X \beta)$.
This is an unconstrained quadratic optimization problem.
The closed form solution of the OLS estimator is $\hat{\beta} = (X' X)^{-1} X' y$.

For a regression model with $k$ predictors estimated with a sample of $n$ observations, the outcome vector $y$ is n-by-1, the predictor matrix $X$ is n-by-k, and the regression coefficients $\beta$ is k-by-1. Typically $n > k$ is assumed, and sometimes people use $p$ instead of $k$.

Terminology:

- Fitted (or predicted) value of outcome: $\hat{y} = X \hat{\beta}$
- Residuals: $\varepsilon = y - X \hat{\beta}$
- Residual sum of squares: $\text{RSS} = \varepsilon' \varepsilon$
- Total sum of squares: $\text{TSS} = (y - \bar{y} 1)' (y - \bar{y} 1)$
- Model sum of squares: $\text{MSS} = (\hat{y} - \bar{y} 1)' (\hat{y} - \bar{y} 1)$

Properties of OLS residuals and fitted values:

- The residuals is orthogonal to the sample of predictors, $X' \varepsilon = 0$, which implies that:
- If a constant/intercept is included in the predictors, then the sample mean of the residuals is zero. $\bar{\varepsilon} = 0$
- The sample covariance of the residuals with each of the predictors is zero.
- The sample covariance of the residuals and the fitted values is zero.

- If the model includes a constant:
- The sample averages of the outcome and predictors are on the regression line, $\bar{y} = \bar{x}' \hat{\beta}$
- Average fitted value equals sample average of the outcome, $\bar{y} = \bar{\hat{y}}$
- The sample variance of the outcome equals the sum of the sample variances of the fitted values and the residuals, TSS = MSS + RSS.

The most widely used estimator for residual variance $\sigma^2$
is the **unbiased sample variance of residuals**,
$s^2 = \text{RSS} / (n-k)$.
It is unbiased conditioning on predictors, $\mathbb{E}( \mathbf{s}^2 \mid \mathbf{X} ) = \sigma^2$.
Its square root $s$ is called the **standard error of the regression**
or **residual standard error** (RSE).

Gauss-Markov Theorem:

Among all linear unbiased estimators, OLS estimator has the "smallest" variance/covariance matrix.

Notes:

- OLS estimator, conditioned on predictors, is linear in the outcome: $\hat{\boldsymbol{\beta}} \mid \mathbf{X} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$
- OLS estimator is (conditionally) unbiased: $\mathbb{E}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \beta$
- OLS estimator has a conditional sampling variance $\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}' \mathbf{X})^{-1}$

If the Conditional Normality (of residuals) Assumption holds, then additionally we have:

- Regression coefficient estimators are conditionally normally distributed: $\hat{\boldsymbol{\beta}} \mid \mathbf{X} \sim N(\beta, \sigma^2 (\mathbf{X}' \mathbf{X})^{-1})$
- Sample variance has a scaled chi-squared distribution: $\mathbf{s}^2 \mid \mathbf{X} \sim \frac{\sigma^2}{n-k} \chi^2_{n-k}$
- Regression coefficient estimators and sample variance are conditionally independent (on predictors): $\hat{\boldsymbol{\beta}}$ ∐ $\mathbf{s}^2 \mid \mathbf{X}$
- t-statistics under null hypothesis that a regression coefficient is correctly estimated has a Student's t-distribution, with degree of freedom being the "excess of information": $\frac{\hat{\boldsymbol{\beta}}_i - \beta_i}{\hat{\sigma}_{\hat{\boldsymbol{\beta}}_i}} \sim t_{n-k}$

Estimators of OLS estimator's statistical properties:

- An unbiased estimator of sampling variance of the OLS estimator: $\widehat{\text{Var} \hat{\boldsymbol{\beta}}} = s^2 (\mathbf{X}' \mathbf{X})^{-1}$
- An unbiased estimator of sampling variance of the k-th OLS estimator: $\widehat{\sigma^2}_{\hat{\boldsymbol{\beta}}_k} = s^2 [(\mathbf{X}' \mathbf{X})^{-1}]_{kk}$
- An estimator of the standard error of the k-th OLS estimator: $\hat{\sigma}_{\hat{\boldsymbol{\beta}}_k} = \sqrt{s^2 [(\mathbf{X}' \mathbf{X})^{-1}]_{kk}}$

An alternative to the standard perspective of random variables,
the variables can be seen as vectors in a finite sample space, $y = X' \beta + u$.
Hereby the OLS estimate of a linear regression model
is the *projection* of the sample outcome on the linear span of sample predictors.

Terminology:

- Projection matrix $P_X = X (X'X)^{-1} X' = X X^\dagger$
- Annihilator matrix $M_X = I - P$
- Centering matrix $M_1 = I - \frac{1}{n} 1 1'$
- Idempotent matrix $A = A A$

The projection matrix $P_X$ projects a vector orthogonally onto the column space of $X$.

In algebra, an idempotent matrix is equivalent to a projection. While the projection matrix appreared here is actually an orthogonal projection since it's both idempotent and symmetric.

The projection matrix $P_X$ is sometimes called the **hat matrix**,
because it transforms the outcome to the fitted values ( $\hat{y} = X (X'X)^{-1} X' y$ ).
The diagonal elements of the hat matrix is called the **hat values**,
where relatively large values are influential to the regression coefficients.
Inspecting the hat values can be used to detect potential coding errors
or other suspicious values that may unduly affect regression coefficients.

Model: $\mathbf{y} \mid \mathbf{X} \sim N(\mathbf{X}_1 \beta_1 + \mathbf{X}_2 \beta_2, \sigma^2)$. OLS Estimator: $\begin{cases} \hat{\beta}_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 y \\ \hat{\beta}_2 = (X_2' M_1 X_2)^{-1} X_2' M_1 y \end{cases}$, where $M_i$ stands for $M_{X_i}$.

If both the outcome and the (non-constant) predictors are centered before regression, the regression coefficients will be the same with those when an intercept is explicitly included in a regression on raw data. So save the effort of centering, and always include an intercept in the model.

If we do randomized experiment to a sample, the corresponding attribute will be uncorrelated (when centered) to any latent variable, so regression on this variable will not be affected by additional predictors.

Strategies to correctly estimate the marginal effect of a predictor:

- Include all variables that are correlated with the predictor in the "true" model (control for potential confounders);
- Use a random experiment.

**Generalized least squares** (GLS) [@Aitken1936]
is a linear least squares problem with a *known* covariance matrix of the residuals:
with a linear regression model $y = X \beta + \varepsilon$ where
$\mathbb{E}(\varepsilon | X) = 0$ and $\text{Cov}(\varepsilon | X) = \Omega$,
the GLS estimator is defined as
$\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' \Omega^{-1} (y - X \beta)$.
Notice that the covariance matrix $\Omega$ is assumed to be known up to scaling.
The closed form solution of the GLS estimator is
$\hat{\beta} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y$.
The GLS estimator is unbiased, efficient, and asymptotically normal,
with $\text{Cov}(\hat\beta | X) = (X' \Omega^{-1} X)^{-1}$.
GLS is more efficient than OLS in case of heteroscedasticity or autocorrelation.

If the residual covariance $\Omega$ is unknown and needs to be estimated,
which is usually the case, one may use a **feasible generalized least squares** (FGLS) estimator:
fit the linear regression model by a consistent estimator such as the OLS;
use the residuals to estimate the residual covariance by a consistent estimator;
and then apply the GLS estimator.
Under appropriate conditions, FGLS estimators have the same asymptotic properties as the GLS;
however, FGLS estimators are not always consistent, and can be less efficient than the OLS
at small or medium sample sizes.

**Weighted least squares** (WLS) is a least squares problem
where each residual is assigned a non-negative weight:
$\hat{\beta} \equiv \arg\min_{\beta} \sum_{i=1}^m w_i \|y_i - f(x_i, \beta)\|_2^2$.
This is useful when the data points are of varying quality.
For linear least squares problems, WLS is a special form of the GLS
where the covariance matrix is diagonal:
$\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' \text{diag}(w) (y - X \beta)$.
This is applicable when the residuals do not the same variance, but are uncorrelated.
To achieve the statistical properties of the GLS, here the weights should be
inversely proportional to the error variances.

An **iteratively reweighted least squares** (IRLS) algorithm is
an algorithm for a learning problem where the loss function is an $L^p$ norm:
at every iteration it solves a weighted least squares problem,
where the weights depend on the estimate at the previous iteration.

A **nonlinear least squares** (NLS) regression model is one
where the conditional mean function is nonlinear,
$\mathbf{y} \mid \mathbf{x} \sim N(f(\mathbf{x}, \beta), \sigma^2)$.
The optimization problem is nonconvex, and is often solved with iterative algorithms.
For iterative approximation, $\beta^{k+1} - \beta^k = (J^{k'} J^k)^{-1} J^{k'} \varepsilon^k$.
Here, $J^k_i = \nabla_{\beta} f(X_i, \beta^k)$, and
$\varepsilon^k = y - \hat{y}^k$ with $\hat{y}^k_i = f(X_i, \beta^k)$.

Other algorithms include Gauss-Newton type methods, and variable projection [@Golub and Pereyra, 1973].