Or should I say, regression analysis?
Econometrics is concerned with the measurement of economic relations, and measuring marginal effects is the goal of most empirical economics research. Examples of economic relations include the relation between earnings and education or work experience; expenditure on a commodity and household income; the price and attributes of a good or service; the output of a firm and its inputs of labor, capital, and materials; inflation and unemployment rates; etc.
A regression model is often called a reduced-form regression if the focus is on predicting the outcome given the predictors rather than on a causal interpretation of the model parameters. Models for causal inference include simultaneous equations models, the potential outcomes model, etc.
In econometrics, a model (model family or model form) and an estimator (which yields a fitted model) are distinct concepts. A (parametric) model describes the functional relationship between predictors and an outcome variable, with the parameter values left unspecified. Thus, a model is essentially a family of probability models indexed by some parameters. In comparison, an estimator is a rule for choosing the parameter values, and hence the member of the family, considered most favorable under some criterion.
(To be explored: Model and estimator are in some way analogous to simulation and estimation of random processes.)
See also Applied Econometrics.
The fundamental construct in probability is the random variable. Correspondingly, the fundamental construct in statistics is the random sample: a sampling process from a hypothetical population, aka white noise. A basic and important example is Gaussian white noise, the statistical model for a deterministic value affected by numerous small, additive, zero-mean, uncorrelated error terms. Every statistical analysis should begin and end with verifying this hypothesis against real data, known as univariate analysis.
Graphical analysis: (4-plot)
Formal tests also exist for univariate analysis, see section "regression diagnostics".
Samples can alternatively be seen as time series: a time series is essentially a vector measured sequentially at a constant frequency. But time series have some peculiarities: trend (a shift in the moving average), autocorrelation (as in a random walk), and seasonality (detected by spectral analysis). If any of these is detected in univariate analysis, follow-up analysis is needed.
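As a rough illustration of the autocorrelation check, the sketch below compares the lag-1 sample autocorrelation of simulated Gaussian white noise with that of a random walk built from the same noise. The function name `lag_autocorrelation`, the seed, and the series length are assumptions made for this example, not part of the original notes.

```python
import random

def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: cross-products between the series and its lagged copy.
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

rng = random.Random(0)  # fixed seed so the illustration is reproducible
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # Gaussian white noise

# A random walk is the cumulative sum of white noise.
walk = []
level = 0.0
for e in noise:
    level += e
    walk.append(level)

print("white noise:", round(lag_autocorrelation(noise), 3))
print("random walk:", round(lag_autocorrelation(walk), 3))
```

The white-noise value should be near zero and the random-walk value near one, which is the kind of signal that calls for follow-up time-series analysis.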
For a given outcome and a set of predictors, (the ideal) regression function is the conditional mean function, which derives from (the hypothetical/theoretical) joint distribution. Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton. Conditional variance and higher order information about the joint distribution are completely ignored in regression analysis.
Parametric or not, regression models approximate the regression function using a finite sample. A linear regression model can be viewed as a linear approximation of the regression function, i.e. of the conditional expectation function of the outcome given the predictors. Different populations (and different samples) generally result in different linear approximations.
\[ \mathbb{E}[Y|\mathbf{X}] = f(\mathbf{X}) \approx \mathbf{X}'\beta \]
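To see why the linear approximation depends on the population, here is a minimal simulation sketch. It assumes a hypothetical nonlinear regression function \( f(x) = x^2 \) and two hypothetical predictor distributions; the helper names `linear_fit` and `simulate` are illustrative only.

```python
import numpy as np

def linear_fit(x, y):
    """Intercept and slope of the OLS line of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(6)
n = 5000

def simulate(low, high):
    # Hypothetical nonlinear regression function f(x) = x^2 plus noise.
    x = rng.uniform(low, high, size=n)
    y = x ** 2 + rng.normal(scale=0.1, size=n)
    return x, y

slope_a = linear_fit(*simulate(0.0, 1.0))[1]   # predictor spread over [0, 1]
slope_b = linear_fit(*simulate(1.0, 2.0))[1]   # predictor spread over [1, 2]
print(slope_a, slope_b)                        # roughly 1 and 3
```

The regression function is the same in both cases; only the distribution of the predictor changes, and with it the best linear approximation.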
Other Models:
Stages of regression:
Linear Regression (LR) Model:
\[ Y|\mathbf{X} \sim \text{Normal}(\mathbf{X} \beta, \sigma^2) \]
Terminology:
Model assumptions explained:
Multicollinearity (also collinearity) refers to the presence of highly correlated subsets of predictors. Perfect multicollinearity means that a subset of predictors is linearly dependent.
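One common way to quantify multicollinearity, not named in the text above and offered here only as an illustrative sketch, is the variance inflation factor (VIF): regress each predictor on the others and compute \( 1 / (1 - R_j^2) \). The function name `vif` and the simulated data are assumptions.

```python
import numpy as np

def vif(Z):
    """Variance inflation factor of each column of Z (predictors, no intercept)."""
    n, p = Z.shape
    out = np.empty(p)
    for j in range(p):
        target = Z[:, j]
        # Regress predictor j on an intercept and all remaining predictors.
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(7)
n = 500
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
z3 = z1 + z2 + 0.05 * rng.normal(size=n)   # nearly a linear combination of z1, z2

Z_indep = np.column_stack([z1, z2])
Z_collinear = np.column_stack([z1, z2, z3])
print("independent predictors:", vif(Z_indep))
print("nearly collinear set:  ", vif(Z_collinear))
```

Values near one indicate little collinearity; very large values indicate that a predictor is almost a linear combination of the others.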
Violations of the Exogeneity Assumption that still satisfy the weaker \( \mathbb{E}(u_i|X_i) = 0 \) only occur with dependent data structures. However, it is often hard to assess whether the Exogeneity Assumption is satisfied, even if the model does not explicitly imply a violation.
The Spherical Residuals Assumption is generally too strong, because data in economics and many other fields commonly have heteroskedasticity, autocorrelation (aka serial correlation in time series data), or intraclass correlation (in clustered data).
For most purposes, the Conditional Normality Assumption is optional.
In a finite sample, the outcome \(\mathbf{y}\) is n-by-1, the predictor matrix \(X\) is n-by-k, and the vector of regression coefficients \( \beta \) is k-by-1. Here \(n\) is the number of observations in the sample, and \(k\) is the number of predictors. Typically \(n > k\) is assumed, and some authors write \(p\) instead of \(k\).
The ordinary least squares (OLS) estimator of the regression coefficients minimizes the residual sum of squares (RSS). OLS is also called linear least squares, in contrast with nonlinear least squares (NLS).
\[ \hat{\beta} = \arg\min_{\beta} (\mathbf{y} - X \beta)' (\mathbf{y} - X \beta) \]
The closed form of the OLS estimator, which requires \( X'X \) to be invertible (i.e. no perfect multicollinearity), is
\[ \hat{\beta} = (X' X)^{-1} X' \mathbf{y} \]
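The closed form above can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, with hypothetical coefficients and a helper named `ols`. It solves the normal equations rather than forming the inverse explicitly, and cross-checks the result against NumPy's least-squares solver.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via the normal equations (X'X) b = X'y."""
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])      # intercept column plus one predictor
beta_true = np.array([1.0, 2.0])          # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = ols(X, y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least-squares solver
print(beta_hat, beta_lstsq)
```

The two estimates should agree to numerical precision. For ill-conditioned \( X \), a QR- or SVD-based solver such as `np.linalg.lstsq` is usually the safer choice.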
Terminology:
Properties of OLS residuals and fitted values:
The most widely used estimator for residual variance \( \sigma^2 \) is the unbiased sample variance of residuals.
\[ s^2 = \frac{\text{RSS}}{n-k} \]
It is unbiased conditional on the predictors, i.e. \( \mathbb{E}( s^2 | \mathbf{X} ) = \sigma^2 \). Its square root \(s\) is called the standard error of the regression or residual standard error (RSE).
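The unbiasedness claim can be checked informally by simulation. The sketch below holds a design matrix fixed, redraws the errors many times, and averages \( s^2 \). The design, coefficients, error variance, and the helper name `residual_variance` are all assumptions chosen for illustration.

```python
import numpy as np

def residual_variance(X, y):
    """Return s^2 = RSS / (n - k) for an OLS fit of y on X."""
    n, k = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ beta_hat
    rss = residuals @ residuals
    return rss / (n - k)

rng = np.random.default_rng(1)
n, k, sigma2 = 30, 3, 4.0                 # hypothetical design and error variance
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -2.0, 0.5])

# Hold X fixed and redraw the errors many times (conditioning on predictors).
draws = np.array([
    residual_variance(X, X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n))
    for _ in range(2000)
])
print("mean of s^2:          ", draws.mean())                 # close to sigma2
print("mean of RSS/n instead:", draws.mean() * (n - k) / n)   # biased downward
```

Dividing by \( n \) instead of \( n - k \) shrinks the estimate by the factor \( (n-k)/n \), which is why the degrees-of-freedom correction matters in small samples.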
Gauss-Markov Theorem:
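Among all linear unbiased estimators, the OLS estimator has the "smallest" variance-covariance matrix: for any other linear unbiased estimator \( \tilde{\beta} \), the difference \( \text{Var}(\tilde{\beta}|\mathbf{X}) - \text{Var}(\hat{\beta}|\mathbf{X}) \) is positive semidefinite.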
Notes:
If the Conditional Normality (of residuals) Assumption holds, then additionally we have:
Estimators of OLS estimator's statistical properties:
Instead of the standard perspective of random variables, the variables can alternatively be seen as vectors in the finite-dimensional sample space \( \mathbb{R}^n \). From this perspective, the ordinary least squares estimate of a linear regression model is the projection of the sample outcome onto the linear span of the sample predictors.
\[ \mathbf{y} = X \beta + \mathbf{u} \]
Terminology:
The projection matrix \( P_X \) projects a vector orthogonally onto the column space of X.
In algebra, an idempotent matrix is equivalent to a projection. The projection matrix that appears here is in fact an orthogonal projection, since it is both idempotent and symmetric.
The projection matrix \( P_X \) is sometimes called the hat matrix, because it transforms the outcome into the fitted values ( \(\hat{\mathbf{y}} = X (X'X)^{-1} X' \mathbf{y}\) ). The diagonal elements of the hat matrix are called the hat values (or leverages); observations with relatively large hat values can be influential for the regression coefficients. Inspecting the hat values can help detect potential coding errors or other suspicious values that may unduly affect the regression coefficients.
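The properties of the hat matrix described above are easy to verify numerically. The following sketch uses simulated data in which one predictor value is deliberately made extreme; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
x[0] = 10.0                                   # one deliberately extreme predictor value
X = np.column_stack([np.ones(n), x])

# Hat matrix P = X (X'X)^{-1} X'; solve() avoids the explicit inverse.
P = X @ np.linalg.solve(X.T @ X, X.T)
hat_values = np.diag(P)

print("idempotent:", np.allclose(P @ P, P))
print("symmetric: ", np.allclose(P, P.T))
print("trace:     ", hat_values.sum())        # equals k, the number of columns of X
print("largest hat value at index", hat_values.argmax())
```

The extreme observation at index 0 receives by far the largest hat value, which is how such points are flagged for inspection.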
Model
\[ Y|\mathbf{X} \sim \text{Normal}(\mathbf{X_1} \beta_1 + \mathbf{X_2} \beta_2, \sigma^2) \]
OLS Estimator ( \(M_i\) is shorthand for \(M_{X_i}\). )
\[ \hat{\beta}_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 \mathbf{y} \]
\[ \hat{\beta}_2 = (X_2' M_1 X_2)^{-1} X_2' M_1 \mathbf{y} \]
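The partitioned formulas can be checked against a full regression on the stacked predictors. The sketch below assumes simulated data and defines a hypothetical helper `annihilator` for \( M_X = I - X (X'X)^{-1} X' \).

```python
import numpy as np

def annihilator(X):
    """Residual-maker matrix M_X = I - X (X'X)^{-1} X'."""
    n = X.shape[0]
    return np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(3)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
# X2 is correlated with X1 so that partialling out actually matters.
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0]) + rng.normal(size=n)

# Full regression on the stacked predictors [X1, X2].
X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Partitioned formulas using M_2 and M_1.
M1, M2 = annihilator(X1), annihilator(X2)
beta_1 = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)
beta_2 = np.linalg.solve(X2.T @ M1 @ X2, X2.T @ M1 @ y)

print(beta_full, beta_1, beta_2)
```

The first two entries of `beta_full` should match `beta_1` and the last entry should match `beta_2`, up to floating-point error.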
If both the outcome and the (non-constant) predictors are centered before regression, the slope coefficients will be the same as those obtained when an intercept is explicitly included in a regression on raw data. So save the effort of centering, and always include an intercept in the model.
If we run a randomized experiment on a sample, the randomized attribute will be uncorrelated (when centered) with any latent variable, up to sampling error, so the regression coefficient on this variable will not be affected by including additional predictors.
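The equivalence between centering and including an intercept can be confirmed on simulated data. This is a small sketch; the data-generating values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
Z = rng.normal(loc=3.0, size=(n, 2))           # raw (non-constant) predictors
y = 5.0 + Z @ np.array([1.5, -0.7]) + rng.normal(size=n)

# Regression on raw data with an explicit intercept column.
X = np.column_stack([np.ones(n), Z])
beta_raw = np.linalg.solve(X.T @ X, X.T @ y)

# Regression on centered data without an intercept.
Zc = Z - Z.mean(axis=0)
yc = y - y.mean()
slopes_centered = np.linalg.solve(Zc.T @ Zc, Zc.T @ yc)

print(beta_raw[1:], slopes_centered)
```

The slopes coincide, and the intercept of the raw regression can be recovered as the outcome mean minus the predictor means times the slopes.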
Strategies to correctly estimate the marginal effect of a predictor:
Main effects vs. interactions; non-linear effects; transformations.
The hierarchy principle:
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.
While many observables in the social sciences have heavy-tailed distributions, the distributions of their logarithms are typically well behaved. In economics, for example, monetary variables such as earnings are often log-transformed. But beware that coefficients in a model of the log-transformed variable are not interpreted in the same way as those in a model of the raw variable.
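As a worked illustration of how the interpretation changes (the single-predictor specification and the value \( \beta_1 = 0.08 \) are assumptions made here for concreteness, not estimates from any data), consider a log-linear model:

\[ \log Y = \beta_0 + \beta_1 X + u \]

Holding \(u\) fixed, a one-unit increase in \(X\) multiplies \(Y\) by \( e^{\beta_1} \):

\[ \frac{Y(X+1)}{Y(X)} = e^{\beta_1} \approx 1 + \beta_1 \quad \text{for small } \beta_1 \]

So with \( \beta_1 = 0.08 \), \(Y\) rises by about 8% per unit of \(X\) (more precisely, \( e^{0.08} - 1 \approx 8.3\% \)), whereas in the level model \( Y = \beta_0 + \beta_1 X + u \) the same coefficient would mean a rise of 0.08 units of \(Y\).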
Regression diagnostics are procedures assessing the validity of a regression model, mostly LR models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test; each of which provides guidance for further stages of a regression analysis.
Testing heteroskedasticity (of residuals):
Testing correlation of residuals:
Testing (conditional) normality of residuals:
Graphical residual analysis: If the model form is adequate, residual scatter plots should appear to be a random field over all potential predictors. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable does show systematic structure, the functional form in that predictor should be adjusted, or the variable should be included as a predictor if it is not one already.
Lack-of-fit statistics/tests for model adequacy: (Testing model adequacy requires replicate measurements, e.g. data hold-back and cross validation.)
Multicollinearity:
Change of model structure between groups of observations
Comparing model structures
Outliers: observations that deviate markedly from other observations in the sample.
Influential observations (high leverage points): observations that have a relatively large effect on the regression model's predictions.
Model selection (variable selection): best subset, forward/backward selection.
Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallows's \(C_p\), cross-validation
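For a linear model with Gaussian errors, AIC and BIC can be written, up to an additive constant shared by all candidate models fitted to the same data, as \( n \log(\text{RSS}/n) \) plus a penalty of \( 2k \) or \( k \log n \). The sketch below uses that simplified form on simulated data; the helper name `gaussian_ic` and the data are assumptions, and statistical packages may report values that differ by a constant.

```python
import math
import numpy as np

def gaussian_ic(X, y):
    """AIC and BIC of an OLS fit, up to a constant shared by all models."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    aic = n * math.log(rss / n) + 2 * k
    bic = n * math.log(rss / n) + k * math.log(n)
    return aic, bic

rng = np.random.default_rng(8)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)     # hypothetical true model uses x

X0 = np.ones((n, 1))                       # intercept-only candidate
X1 = np.column_stack([np.ones(n), x])      # intercept plus x

aic0, bic0 = gaussian_ic(X0, y)
aic1, bic1 = gaussian_ic(X1, y)
print("intercept only:", aic0, bic0)
print("with x:        ", aic1, bic1)
```

Lower values are preferred under both criteria. BIC penalizes each extra parameter by \( \log n \) rather than 2, so it favors smaller models once \( n \) exceeds about 8.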
R-squared, or coefficient of determination, is defined as
\[ R^2 = \frac{\text{MSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \]
Properties:
In macroeconomics, \( R^2 \) is often very high; 0.8 or higher is not unusual. In microeconomics, it is typically very low, with 0.1 not unusual. The reason might be that the number of observations in macroeconometrics is often much lower (e.g., 25 OECD countries) than in microeconometrics (e.g., 10,000 households), while the number of predictors is not that different.
Seeing regression as an approximation of the regression function, \( R^2 \) does not have to be very close to one to justify a model, in contrast to practice in physics and mechanics experiments. This is because we are only modeling the average effect, not eliminating the error term.
Adjusted R-squared, which corrects for the number of predictors, is defined as
\[ \bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)} \]
Properties:
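Both definitions above can be computed directly from an OLS fit. In this minimal sketch on simulated data (the data-generating process and the helper name `r_squared` are illustrative assumptions), adding a pure-noise predictor cannot lower \( R^2 \), while the adjusted version applies a penalty for the extra column.

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include an intercept."""
    n, k = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))
    return r2, adj_r2

rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.normal(size=n)])   # add a pure-noise predictor

print("without noise predictor:", r_squared(X_small, y))
print("with noise predictor:   ", r_squared(X_big, y))
```

Note that the adjusted value never exceeds the unadjusted one and can even be negative when the fit is poor.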
MLE (for classification models), GMM
Asymptotic Inference: Asymptotic distribution of regression estimators.
Bootstrap: repeated resampling from the sample with replacement.
Bootstrap Inference: Estimate uncertainty of a statistic (point estimator).
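A minimal sketch of bootstrap inference, using only the Python standard library: resample the data with replacement, recompute the statistic each time, and take the standard deviation of the replicates as its standard error. The function name `bootstrap_se`, the number of replicates, and the simulated sample are assumptions for illustration.

```python
import random
import statistics

def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of `statistic` computed on `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations from the sample with replacement.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # The spread of the replicates estimates the statistic's sampling variability.
    return statistics.stdev(replicates)

data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(100)]   # hypothetical data

se_boot = bootstrap_se(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5  # textbook SE of the mean
print(se_boot, se_formula)
```

For the sample mean the bootstrap estimate should land close to the textbook formula. The appeal of the method is that the same function works for statistics such as the median, e.g. `bootstrap_se(sample, statistics.median)`, where no simple formula is available.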
observational data vs. experimental data (design of experiments).