Convention on notations:
Note that a standalone $1$ denotes a vector of ones, just as a standalone $0$ denotes a vector of zeros.
The ordinary least squares (OLS) estimator of the regression coefficients minimizes the residual sum of squares. OLS is also called linear least squares, in contrast with nonlinear least squares (NLS).
$$\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' (y - X \beta)$$
The closed form of the OLS estimator is
$$\hat{\beta} = (X' X)^{-1} X' y$$
For a regression model with $k$ predictors estimated on a sample of $n$ observations, the outcome vector $y$ is n-by-1, the predictor matrix $X$ is n-by-k, and the coefficient vector $\beta$ is k-by-1. Typically $n > k$ is assumed, and some authors use $p$ instead of $k$.
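As a quick numerical check, the closed form above can be computed directly with NumPy. The data below are simulated and the coefficient values are hypothetical; in practice a least-squares solver is preferred over forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3

# Design matrix with an intercept column; true coefficients are made up.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y, via a linear solve.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit from a dedicated least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```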
Terminology:
Properties of OLS residuals and fitted values:
The most widely used estimator for residual variance $\sigma^2$ is the unbiased sample variance of residuals.
$$s^2 = \frac{\text{RSS}}{n-k}$$
It is unbiased conditional on the predictors, $\mathbb{E}( s^2 \mid \mathbf{X} ) = \sigma^2$. Its square root $s$ is called the standard error of the regression, or the residual standard error (RSE).
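A minimal sketch of computing $s^2$ and $s$ from simulated data; the noise level and sample size here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 0.5  # true error standard deviation (assumed for the simulation)
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
rss = resid @ resid

s2 = rss / (n - k)   # unbiased estimator of sigma^2
s = np.sqrt(s2)      # residual standard error
```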
Gauss-Markov Theorem:
Among all linear unbiased estimators, the OLS estimator has the "smallest" variance-covariance matrix, in the sense that the covariance matrix of any other linear unbiased estimator exceeds that of OLS by a positive semi-definite matrix.
Notes:
If the Conditional Normality (of residuals) Assumption holds, then additionally we have:
Estimators of OLS estimator's statistical properties:
Instead of the standard random-variable perspective, the variables can alternatively be seen as vectors in a finite sample space. In this view, the ordinary least squares estimate of a linear regression model is the projection of the sample outcome onto the linear span of the sample predictors.
$$y = X \beta + u$$
Terminology:
The projection matrix $P_X$ projects a vector orthogonally onto the column space of $X$.
In linear algebra, an idempotent matrix is equivalent to a projection. The projection matrix appearing here is moreover an orthogonal projection, since it is both idempotent and symmetric.
The projection matrix $P_X$ is sometimes called the hat matrix, because it transforms the outcome into the fitted values ( $\hat{y} = X (X'X)^{-1} X' y$ ). The diagonal elements of the hat matrix are called the hat values; observations with relatively large hat values are influential on the regression coefficients. Inspecting the hat values can help detect potential coding errors or other suspicious values that may unduly affect the regression coefficients.
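The projection properties and the hat values can be illustrated numerically. The extreme predictor value injected below is a hypothetical outlier; note it receives the largest hat value, and the hat values sum to $k$ (the trace of a projection equals its rank).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
X[0, 1] = 10.0  # one extreme predictor value (a hypothetical outlier)

# Hat matrix P_X = X (X'X)^{-1} X'; its diagonal gives the hat values.
P = X @ np.linalg.solve(X.T @ X, X.T)
hat_values = np.diag(P)
```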
Model: $$\mathbf{y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}_1 \beta_1 + \mathbf{X}_2 \beta_2, \sigma^2)$$
OLS Estimator: $$\begin{aligned} \hat{\beta}_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 y \\ \hat{\beta}_2 = (X_2' M_1 X_2)^{-1} X_2' M_1 y \end{aligned}$$ Here, $M_i$ is shorthand for the annihilator matrix $M_{X_i} = I - P_{X_i}$.
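This partitioned closed form (the Frisch-Waugh-Lovell result) can be verified numerically against a full regression; all data below are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.5, 2.0]) + rng.normal(size=n)

# Full regression on all predictors at once.
X = np.hstack([X1, X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Annihilator M_2 = I - X2 (X2'X2)^{-1} X2' residualizes out X2.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
beta1 = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)
```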
If both the outcome and the (non-constant) predictors are centered before regression, the regression coefficients will be the same as those obtained when an intercept is explicitly included in a regression on the raw data. So save the effort of centering, and always include an intercept in the model.
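A small check of this equivalence for simple regression, with simulated data and made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# Regression with an explicit intercept on raw data.
X = np.column_stack([np.ones(n), x])
b_raw, *_ = np.linalg.lstsq(X, y, rcond=None)

# Regression through the origin after centering both variables.
xc = x - x.mean()
yc = y - y.mean()
b_centered = (xc @ yc) / (xc @ xc)
```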
If we apply a randomized experiment to a sample, the randomized attribute will be uncorrelated (when centered, in expectation) with any latent variable, so the regression coefficient on this variable will not be affected by additional predictors.
Strategies to correctly estimate the marginal effect of a predictor:
Generalized least squares (GLS); when $\Omega$ is diagonal, it reduces to weighted least squares (WLS).
$$\hat{\beta} \equiv \arg\min_{\beta} (y - X \beta)' \Omega^{-1} (y - X \beta)$$
$$\hat{\beta} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y$$
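A sketch of the GLS closed form with a known diagonal $\Omega$, also showing the equivalent weighted-least-squares rescaling; the error variances here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Heteroskedastic errors with known (hypothetical) variances w_i.
w = rng.uniform(0.5, 3.0, size=n)
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(w))

# GLS closed form: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y.
Oinv = np.diag(1.0 / w)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# With diagonal Omega, GLS reduces to WLS:
# scale each row by 1/sqrt(w_i), then run ordinary least squares.
Xw = X / np.sqrt(w)[:, None]
yw = y / np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
```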
NLS
$$\mathbf{y} \mid \mathbf{x} \sim \text{Normal}(f(\mathbf{x}, \beta), \sigma^2)$$
For iterative approximation (the Gauss-Newton method),
$$\beta^{k+1} - \beta^k = (J^{k'} J^k)^{-1} J^{k'} \varepsilon^k$$
Here, $J^k_i = \nabla_{\beta} f(X_i, \beta^k)$, and $\varepsilon^k = y - \hat{y}^k$ with $\hat{y}^k_i = f(X_i, \beta^k)$.
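A minimal Gauss-Newton iteration for a hypothetical exponential-decay model $f(x, \beta) = \beta_0 e^{-\beta_1 x}$; the model form, starting values, and noise level are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 4, size=n)

# Simulated data from the assumed model.
beta_true = np.array([2.0, 0.7])
y = beta_true[0] * np.exp(-beta_true[1] * x) + rng.normal(scale=0.01, size=n)

beta = np.array([1.0, 1.0])  # starting values (arbitrary)
for _ in range(50):
    fitted = beta[0] * np.exp(-beta[1] * x)
    resid = y - fitted
    # Jacobian of f w.r.t. beta, one row per observation.
    J = np.column_stack([np.exp(-beta[1] * x),
                         -beta[0] * x * np.exp(-beta[1] * x)])
    # Gauss-Newton step: (J'J)^{-1} J' epsilon.
    step = np.linalg.solve(J.T @ J, J.T @ resid)
    beta = beta + step
```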