This article discusses least squares in its linear, nonlinear, and weighted forms, in the context of regression analysis.
Convention on notations:
Note that a standalone $1$ denotes a vector of ones, just as a standalone $0$ denotes a vector of zeros.
The ordinary least squares (OLS), or linear least squares, estimator of regression coefficients minimizes the residual sum of squares: $\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' (y - X \beta)$. The closed-form solution of the OLS estimator is $\hat{\beta} = (X' X)^{-1} X' y$.
For a regression model with $k$ predictors estimated on a sample of $n$ observations, the outcome vector $y$ is n-by-1, the predictor matrix $X$ is n-by-k, and the coefficient vector $\beta$ is k-by-1. Typically $n > k$ is assumed, and sometimes $p$ is used instead of $k$.
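The closed-form estimator is straightforward to verify numerically. A minimal sketch with NumPy and synthetic data (the sample size, coefficients, and noise level are all illustrative); note that solving the normal equations is preferred over explicitly inverting $X'X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
# Predictor matrix with an intercept column and two random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

# Closed-form OLS estimator: beta_hat = (X'X)^{-1} X'y,
# computed by solving the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and $n \gg k$, `beta_hat` recovers the data-generating coefficients closely.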
Terminology:
Properties of OLS residuals and fitted values:
The most widely used estimator for the residual variance $\sigma^2$ is the unbiased sample variance of residuals, $s^2 = \text{RSS} / (n-k)$. It is unbiased conditional on the predictors, $\mathbb{E}( s^2 \mid X ) = \sigma^2$. Its square root $s$ is called the standard error of the regression or the residual standard error (RSE).
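The $n-k$ divisor (rather than $n$) is what makes $s^2$ unbiased. A sketch, continuing in NumPy with a toy data-generating process whose true $\sigma$ is known by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=2.0, size=n)  # true sigma = 2

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
rss = residuals @ residuals
s2 = rss / (n - k)   # unbiased estimator of sigma^2
rse = np.sqrt(s2)    # residual standard error, should be near 2.0
```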
Gauss-Markov Theorem:
Among all linear unbiased estimators, the OLS estimator has the "smallest" variance-covariance matrix, in the sense that the difference from any competitor's covariance matrix is positive semi-definite.
Notes:
If the Conditional Normality (of residuals) Assumption holds, then additionally we have:
Estimators of OLS estimator's statistical properties:
As an alternative to the standard random-variable perspective, the variables can be seen as vectors in a finite-dimensional sample space, $y = X \beta + u$. From this viewpoint, the OLS estimate of a linear regression model is the projection of the sample outcome onto the linear span of the sample predictors.
Terminology:
The projection matrix $P_X$ projects a vector orthogonally onto the column space of $X$.
In algebra, an idempotent matrix is equivalent to a projection. The projection matrix appearing here is in fact an orthogonal projection, since it is both idempotent and symmetric.
The projection matrix $P_X$ is sometimes called the hat matrix, because it transforms the outcome into the fitted values ( $\hat{y} = X (X'X)^{-1} X' y$ ). The diagonal elements of the hat matrix are called the hat values; observations with relatively large hat values are influential on the regression coefficients. Inspecting the hat values can be used to detect potential coding errors or other suspicious values that may unduly affect the regression coefficients.
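The idempotence and symmetry of $P_X$, and the use of hat values for flagging high-leverage observations, can be illustrated directly. A sketch in NumPy with one deliberately outlying predictor value (all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
X[0, 1] = 10.0  # an outlying predictor value with high leverage

# Hat matrix P_X = X (X'X)^{-1} X'
P = X @ np.linalg.solve(X.T @ X, X.T)
hat_values = np.diag(P)

# P is idempotent (P @ P == P) and symmetric, so it is an orthogonal
# projection; its trace equals its rank k, so the hat values sum to k.
```

The first observation, whose predictor value sits far from the rest, has the largest hat value.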
Model: $\mathbf{y} \mid \mathbf{X} \sim N(\mathbf{X}_1 \beta_1 + \mathbf{X}_2 \beta_2, \sigma^2 I)$. OLS Estimator: $\begin{cases} \hat{\beta}_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 y \\ \hat{\beta}_2 = (X_2' M_1 X_2)^{-1} X_2' M_1 y \end{cases}$, where $M_i$ is shorthand for the annihilator matrix $M_{X_i}$.
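This partitioned-regression identity (the Frisch-Waugh-Lovell theorem) can be checked numerically: the formula for $\hat{\beta}_1$ reproduces the corresponding block of the full OLS fit. A sketch in NumPy on synthetic data (dimensions and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # first block, k1 = 2
X2 = rng.normal(size=(n, 2))                            # second block, k2 = 2
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

def annihilator(Z):
    # M_Z = I - Z (Z'Z)^{-1} Z', projects onto the orthogonal complement of col(Z)
    return np.eye(Z.shape[0]) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Partitioned formula: beta_1 = (X1' M2 X1)^{-1} X1' M2 y
M2 = annihilator(X2)
beta1_partitioned = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

# Full OLS on the stacked predictor matrix; first two coefficients should match
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
```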
If both the outcome and the (non-constant) predictors are centered before regression, the regression coefficients will be the same as those obtained when an intercept is explicitly included in a regression on the raw data. So save the effort of centering, and always include an intercept in the model.
If we run a randomized experiment on a sample, the randomized attribute will be uncorrelated (when centered) with any latent variable, so the regression coefficient on this variable will not be affected by additional predictors.
Strategies to correctly estimate the marginal effect of a predictor:
Generalized least squares (GLS) [@Aitken1936] is a linear least squares problem with a known covariance matrix of the residuals: given a linear regression model $y = X \beta + \varepsilon$ where $\mathbb{E}(\varepsilon | X) = 0$ and $\text{Cov}(\varepsilon | X) = \Omega$, the GLS estimator is defined as $\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' \Omega^{-1} (y - X \beta)$. Notice that the covariance matrix $\Omega$ only needs to be known up to a scale factor. The closed-form solution of the GLS estimator is $\hat{\beta} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y$.
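The closed-form GLS estimator translates directly into code. A sketch in NumPy with heteroskedastic errors whose (diagonal) covariance $\Omega$ is known by construction; the variance process here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Heteroskedastic errors with a known diagonal covariance Omega
var = np.exp(rng.uniform(-1, 2, size=n))
y = X @ beta + rng.normal(size=n) * np.sqrt(var)

# GLS estimator: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.diag(1.0 / var)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

For a general (non-diagonal) $\Omega$, the same formula applies with `Omega_inv` computed from the full matrix.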
Weighted least squares (WLS) is a least squares problem where each residual is assigned a non-negative weight. This is useful when the data points are of varying quality, or when error variances are not the same.
The optimal weights are inversely proportional to the error variances.
Weighted least squares can also be applied to nonlinear least squares problems.
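WLS is the special case of GLS with a diagonal $\Omega$, and it can be computed as OLS on a rescaled system: multiply each row of $X$ and $y$ by $\sqrt{w_i}$. A sketch in NumPy where the error standard deviation grows with the predictor (the variance model is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([0.5, 1.0])
sigma = 0.5 * x                   # error sd grows with x (known here)
y = X @ beta + rng.normal(size=n) * sigma

w = 1.0 / sigma**2                # weights inversely proportional to variance
sw = np.sqrt(w)
# WLS = OLS on the rescaled system (sqrt(w_i) * y_i, sqrt(w_i) * x_i)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
```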
A nonlinear least squares (NLS) regression model is one where the conditional mean function is nonlinear, $\mathbf{y} \mid \mathbf{x} \sim N(f(\mathbf{x}, \beta), \sigma^2)$. The optimization problem is generally nonconvex and is often solved with iterative algorithms. A typical iteration takes the step $\beta^{k+1} - \beta^k = (J^{k\prime} J^k)^{-1} J^{k\prime} \varepsilon^k$, where the $i$-th row of the Jacobian is $J^k_i = \nabla_{\beta} f(X_i, \beta^k)'$, and $\varepsilon^k = y - \hat{y}^k$ with $\hat{y}^k_i = f(X_i, \beta^k)$.
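This iteration is the Gauss-Newton step: each update solves a linear least squares problem in the Jacobian. A sketch in NumPy for an assumed toy model $f(x, \beta) = \beta_0 e^{\beta_1 x}$ (the model, starting values, and data are all illustrative, and no step-size safeguard is included):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 4, size=n)
beta_true = np.array([2.0, 0.7])
# Toy nonlinear mean function: f(x, b) = b0 * exp(b1 * x)
y = beta_true[0] * np.exp(beta_true[1] * x) + rng.normal(scale=0.1, size=n)

beta = np.array([1.5, 0.6])  # starting values near the truth
for _ in range(100):
    yhat = beta[0] * np.exp(beta[1] * x)
    resid = y - yhat
    # Jacobian of f w.r.t. beta, evaluated at the current iterate:
    # columns are df/db0 = exp(b1 x) and df/db1 = b0 * x * exp(b1 x)
    J = np.column_stack([np.exp(beta[1] * x),
                         beta[0] * x * np.exp(beta[1] * x)])
    step = np.linalg.solve(J.T @ J, J.T @ resid)  # Gauss-Newton step
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break
```

In practice, damped variants (e.g. Levenberg-Marquardt) are preferred when the plain step overshoots.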
Other algorithms include Gauss-Newton type methods and variable projection [@GolubPereyra1973].