Or should I say, regression analysis?

Econometrics is concerned with the measurement of economic relations. Measuring marginal effects is the goal of most empirical economics research. Examples of economic relations include the relation between earnings and education or work experience; expenditure on a commodity and household income; price and attributes of a good or service; output of a firm and its inputs of labor, capital, and materials; inflation and unemployment rates; etc.

A regression model is often referred to as a reduced-form regression if the focus is on predicting the outcome given the predictors rather than on the causal interpretation of the model parameters. Models for causal inference include simultaneous equations models, the potential outcome model, etc.

In econometrics, model (model family or model form) and estimator (fitted model) are distinct concepts. A (parametric) model specifies the probability distribution of an outcome variable conditioned on some predictors, where the parameters are left unspecified. In other words, a (parametric) model is a group of conditional probability models, indexed by some parameters. In comparison, an estimator gives the model parameters as a function of a finite sample, chosen to be the best under some criterion.

See also Applied Econometrics.

Univariate analysis

The fundamental construct in probability is the random variable. Correspondingly, the fundamental construct in statistics is the random sample, a sampling process from a hypothetical population, aka white noise. A basic and important example is Gaussian white noise, the statistical model for a deterministic value affected by numerous small, additive, zero-mean, uncorrelated error terms. Every statistical analysis should begin and end with verifying such a univariate model (not necessarily Gaussian) against real data, known as univariate analysis.

Graphical analysis: (4-plot)

  1. Run sequence plot: shifts in location or scale.
  2. Lag plot: auto-correlation.
  3. Histogram: distribution.
  4. Normal probability plot: fit of normal distribution.

Formal tests also exist for univariate analysis, see section "regression diagnostics".

Samples can be alternatively seen as time-series, and a time-series is essentially a vector sequentially measured at constant frequency. This fact is exploited in some of the univariate analysis techniques to detect the peculiarities of time-series data: trend (shift in moving average), autocorrelation (e.g. random walk), seasonality (requires spectral analysis). If any of these are detected in univariate analysis, follow-up analysis is needed.
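
As a rough numerical counterpart to the lag plot, the sketch below computes the lag-1 sample autocorrelation for simulated Gaussian white noise and for a random walk built from it. The helper name `lag_autocorr`, the seed, and the sample size are illustrative assumptions, not part of the notes above.

```python
import random

def lag_autocorr(xs, lag=1):
    """Sample autocorrelation coefficient of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return num / denom

rng = random.Random(0)  # fixed seed, illustrative only
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # Gaussian white noise

# A random walk is the cumulative sum of white noise.
walk, level = [], 0.0
for e in noise:
    level += e
    walk.append(level)

r_noise = lag_autocorr(noise)  # expected to be close to 0
r_walk = lag_autocorr(walk)    # expected to be close to 1
print(f"lag-1 autocorrelation: noise={r_noise:.3f}, walk={r_walk:.3f}")
```

A value near zero is consistent with the univariate (white noise) model, while a value near one signals the kind of autocorrelation that calls for follow-up analysis.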

Regression Analysis

For a given outcome and a given set of predictors, (the ideal) regression function is the conditional mean function, which derives from their (hypothetical/theoretical) joint distribution. Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton. Regression analysis is the estimation of the regression function, where conditional variance and higher order information about the joint distribution are completely ignored.

\[ f(\mathbf{X}) \equiv \mathbb{E}[Y|\mathbf{X}] \]

A regression model is a group of conditional probability models whose expectation function approximates the regression function, parametric or not.

\[ Y|\mathbf{X} \sim P(\mathbf{X}, \theta);~ \mathbb{E} P(\mathbf{X}, \theta) \approx f(\mathbf{X}) \]

Linear regression model gives a linear approximation of the regression function, with slope parameters and an intercept. Other model families:

  • Generalized linear model (GLM): linear model composed with a link function.
  • Linear discriminant analysis (LDA); support vector machines (SVM);
  • Nonlinear models: step function (piecewise null model); linear, cubic and cubic smoothing splines; generalized additive model (GAM);

There are two separate concerns regarding a fitted model: validity and prediction error. A fitted model is valid (or adequate) if the residuals behave according to a univariate model. The prediction error of a fitted model is typically measured as its mean square error (MSE) or root MSE (RMSE) on new observations.

Stages of regression (bottom-up view):

  1. No predictor: sample mean (null/univariate model)
  2. One predictor: simple linear regression (and other univariate parametric models); local regression, cubic smoothing splines;
  3. Multiple predictors: linear regression; k-nearest-neighbors, kernel methods, linear and cubic splines;
  4. Omitted covariates: check residual scatter plot for uncaptured pattern.
  5. The modeling procedure goes on until the model is valid (adequate) and prediction error estimates shrink below a level you consider negligible.

Linear Regression Model

Linear regression model (LM):

\[ Y|\mathbf{X} \sim \text{Normal}(\mathbf{X}' \beta, \sigma^2) \]

Terminology:

  • \(Y\): outcome, dependent variable, regressand, variable to be explained, left-hand side variable. If there are multiple outcomes, the model is called general linear model.
  • \(\mathbf{X}\): predictor/covariate, independent variable, regressor, explanatory variable, right-hand side variable. If there's only one predictor, this model is called Simple Linear Regression, to distinguish from (multiple) linear regression.
  • \(\beta\): regression coefficient. They are commonly interpreted as:
    • Marginal effect: \( \beta_k = \frac{\partial y}{\partial x_k} \) (\(y\) on \(x_k\)).
    • Percentage growth: \( \beta_k = \frac{1}{y} \frac{\partial y}{\partial x_k} \) (\(\ln y\) on \(x_k\)).
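    • Effect of a proportional change: \( \beta_k = \frac{\partial y}{\partial x_k} x_k \) (\(y\) on \(\ln x_k\)); a 1% increase in \(x_k\) changes \(y\) by about \(\beta_k / 100\), and doubling \(x_k\) changes \(y\) by \(\beta_k \ln 2\).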
    • Elasticity: \( \beta_k = \frac{1}{y} \frac{\partial y}{\partial x_k} x_k\) (\(\ln y\) on \(\ln x_k\)).
  • \(u\): residual, error term, disturbance. (not explicit in formula) Consists of variables omitted or unobservable (latent variables); often includes a constant/intercept term.
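
The elasticity interpretation can be checked on a toy example. The sketch below assumes a hypothetical noise-free relation \(y = 3 x^{0.5}\); the helper `simple_ols` and the data are made up for illustration.

```python
import math

def simple_ols(x, y):
    """Intercept and slope of the least squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical constant-elasticity relation y = 3 * x^0.5 (no noise).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * v ** 0.5 for v in x]

# y on x: an average slope that depends on the range of x.
_, b_level = simple_ols(x, y)

# ln y on ln x: the slope is the elasticity.
_, b_loglog = simple_ols([math.log(v) for v in x], [math.log(v) for v in y])
print(b_level, b_loglog)
```

Because \(\ln y = \ln 3 + 0.5 \ln x\) holds exactly here, the log-log slope recovers the elasticity 0.5 up to rounding, whereas the level-on-level slope has no such unit-free reading.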

Model assumptions explained:

  1. Linearity and random sample: \( Y_i = \mathbf{X}_i' \beta + u_i \), with \( (Y_i, \mathbf{X}_i) \) iid.
  2. No perfect multicollinearity (Sample matrix has full column rank): \(\nexists \lambda \ne 0 : \mathbf{X}' \lambda = 0 \).
  3. Exogeneity (mean independence): \( \mathbb{E}[u|\mathbf{X}] = 0 \)
    • A weaker form: \( \mathbb{E} u = 0 \) and \( \mathbb{E} u \mathbf{X} = 0 \).
  4. Spherical residuals: \( \text{Var}(u|\mathbf{X}) = \sigma^2 I \)
    1. Homoskedasticity: \( \text{Var}(u_i|\mathbf{X}) = \text{Var}(u_j|\mathbf{X}) \)
    2. No autocorrelation or intraclass correlation: \( Cov(u_i, u_j|\mathbf{X}) = 0 \)
  5. Normal residuals: \( u|\mathbf{X} \sim \text{Normal} \)

Multicollinearity (also collinearity) refers to the presence of highly correlated subsets of predictors. Perfect multicollinearity means a subset of predictors are linearly dependent.

Violations of the Exogeneity Assumption that still satisfy the weaker \( \mathbb{E}(u_i|X_i) = 0 \) only occur with dependent data structures. However, it is often hard to assess whether the Exogeneity Assumption is satisfied, even if the model does not explicitly imply a violation.

The Spherical Residuals Assumption is generally too strong, because data in economics and many other fields commonly have heteroskedasticity, autocorrelation (aka serial correlation in time series data), or intraclass correlation (in clustered data).

For most purposes, the Conditional Normality Assumption is optional.

Extensions to linear regression

  • Interaction terms: adding interactions to main effects (see the hierarchy principle below).
  • Nonlinear models of one predictor: polynomial model, rational function model.
  • Transformations (aka link functions) on outcome/predictors: square (square root), reciprocal, logarithm, sinusoid; Box-Cox transformations.

The hierarchy principle:

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.

While many observables in social sciences have heavy-tail distributions, the distributions of their logarithms are typically well behaved. For example, in econometrics, monetary variables such as earnings are often log-transformed. Transformation can also improve homogeneity of variances. (An alternative technique is weighted least squares.) But interpretations of the two are not the same.

Ordinary Least Squares Estimator

As a finite sample, outcome \(\mathbf{y}\) is n-by-1, predictors \(X\) is n-by-k, and the regression coefficients \( \beta \) is k-by-1. \(n\) is the number of observations in a sample, and \(k\) is the number of predictors. Typically \(n > k\) is assumed, and sometimes people use \(p\) instead of \(k\).

The ordinary least squares (OLS) estimator of the regression coefficients minimizes the residual sum of squares. OLS is also called linear least squares, in contrast with nonlinear least squares (NLS).

\[ \hat{\beta} = \arg\min_{\beta} (\mathbf{y} - X \beta)' (\mathbf{y} - X \beta) \]

The closed form of OLS estimator is

\[ \hat{\beta} = (X' X)^{-1} X' \mathbf{y} \]
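
A minimal NumPy sketch of the closed form, on simulated data. The coefficients in `beta_true`, the noise scale, and the seed are hypothetical choices for illustration; the normal equations are solved directly rather than forming the inverse, which is usually the numerically safer route.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, illustrative only
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])  # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: solve (X'X) b = X'y instead of inverting X'X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(beta_lstsq)
```

Both routes should agree up to rounding; `np.linalg.lstsq` is the more robust choice when \(X'X\) is close to singular.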

Terminology:

  1. Fitted or predicted value of outcome: \( \hat{\mathbf{y}} = X \hat{\beta} \)
  2. Residuals: \( \varepsilon = \mathbf{y} - X \hat{\beta} \)
  3. Residual sum of squares: \( \text{RSS} = \varepsilon' \varepsilon \)
  4. Total sum of squares: \( \text{TSS} = (\mathbf{y} - \bar{y} \mathbf{1})' (\mathbf{y} - \bar{y} \mathbf{1}) \)
  5. Model sum of squares: \( \text{MSS} = (\hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) \)

Properties of OLS residuals and fitted values:

  1. The residuals are orthogonal to the sample of predictors. \( X' \varepsilon = 0 \). Some corollaries are as follows:
    • If a constant/intercept is included in the predictors, then the sample mean of the residuals is zero. \( \bar{\varepsilon} = 0 \)
    • The sample covariance of the residuals with each of the predictors is zero.
    • The sample covariance of the residuals and the fitted values is zero.
  2. (If the model includes a constant) The sample averages of the outcome and predictors are on the regression line. \( \bar{y} = \bar{\mathbf{x}}' \hat{\beta} \)
  3. (If the model includes a constant) Average fitted value equals sample average of the outcome. \( \bar{y} = \bar{\hat{\mathbf{y}}} \)
  4. (If the model includes a constant) The sample variance of the outcome equals the sum of the sample variances of the fitted values and the residuals. TSS = MSS + RSS
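
The properties above can be verified numerically. The following sketch uses simulated data with an intercept column; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, illustrative only
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes a constant
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

rss = resid @ resid
tss = np.sum((y - y.mean()) ** 2)
mss = np.sum((y_hat - y.mean()) ** 2)

print("X'e:", X.T @ resid)                   # zero vector up to rounding
print("mean residual:", resid.mean())        # zero because of the constant
print("mean y, mean fitted:", y.mean(), y_hat.mean())
print("TSS, MSS + RSS:", tss, mss + rss)     # equal up to rounding
```

Dropping the column of ones from `X` breaks the last three checks, which is why those properties are stated only for models with a constant.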

The most widely used estimator for residual variance \( \sigma^2 \) is the unbiased sample variance of residuals.

\[ s^2 = \frac{\text{RSS}}{n-k} \]

It is unbiased conditioning on predictors, i.e. \( \mathbb{E}( s^2 | \mathbf{X} ) = \sigma^2 \). Its square root \(s\) is called the standard error of the regression or residual standard error (RSE).

R-squared

R-squared, or coefficient of determination, is defined as

\[ R^2 = \frac{\text{MSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \]

Properties:

  1. R-squared is within [0,1] (if the model includes a constant).
  2. R-squared never decreases as number of predictors increases.
  3. With a single predictor in LM, R-squared equals the square of the sample correlation coefficient between the outcome and the predictor.
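
A quick check of the single-predictor property on simulated data (the data-generating numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, illustrative only
x = rng.normal(size=50)
y = 1.0 + 0.8 * x + rng.normal(size=50)

# Simple linear regression with an intercept.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
corr = np.corrcoef(x, y)[0, 1]  # sample correlation of x and y
print(r_squared, corr ** 2)     # the two agree up to rounding
```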

In macroeconomics, R-squared is often very high; 0.8 or higher is not unusual. In microeconomics, it is typically very low, with 0.1 not unusual. The reason might be that the number of observations in macroeconomics is often much lower (e.g., 25 OECD countries) than in microeconomics (e.g., 10,000 households), while the number of predictors is not that different.

When regression is viewed as approximating the regression function, R-squared does not have to be close to one to justify a model, unlike common practice in physics and mechanics experiments. This is because we are only estimating the average effect, not trying to eliminate the error term.

Statistical Properties of OLS Estimators

Gauss-Markov Theorem:

Among all linear unbiased estimators, OLS estimator has the "smallest" variance/covariance matrix.

Notes:

  1. OLS estimator, conditioned on predictors, is linear in the outcome. \( \hat{\beta}|X = (X'X)^{-1} X'Y \)
  2. OLS estimator is (conditionally) unbiased. \( \mathbb{E}(\hat{\beta} | \mathbf{X}) = \beta \)
  3. OLS estimator has a conditional sampling variance \( \text{Var}(\hat{\beta} | \mathbf{X}) = \sigma^2 (X'X)^{-1} \)

If the Conditional Normality (of residuals) Assumption holds, then additionally we have:

  1. Regression coefficient estimators are conditionally normally distributed, i.e. \( \hat{\beta}|\mathbf{X} \sim N(\beta, \sigma^2 (X'X)^{-1}) \)
  2. Sample variance has a scaled chi-squared distribution, i.e. \( s^2|\mathbf{X} \sim \frac{\sigma^2}{n-k} \chi^2_{n-k} \)
  3. Regression coefficient estimators and sample variance are conditionally independent (on predictors). \( \hat{\beta} \) ∐ \(s^2|\mathbf{X}\)
  4. The t-statistic of a regression coefficient, evaluated at its true value \(\beta_i\) (the null hypothesis), has a Student's t-distribution, with degrees of freedom \(n-k\) being the excess of information. \( \frac{\hat{\beta}_i - \beta_i}{\widehat{\text{s.e.}}(\hat{\beta}_i)} \sim t_{n-k} \)

Estimators of OLS estimator's statistical properties:

  1. An unbiased estimator of sampling variance of the OLS estimator: \( \widehat{\text{Var}}(\hat{\beta}) = s^2 (X'X)^{-1} \)
  2. An unbiased estimator of sampling variance of the k-th OLS estimator: \( \widehat{\text{Var}}(\hat{\beta}_k) = s^2 [(X'X)^{-1}]_{kk} \)
  3. An estimator of the standard error of the k-th OLS estimator: \( \widehat{\text{s.e.}}(\hat{\beta}_k) = \left( s^2 [(X'X)^{-1}]_{kk} \right)^{\frac{1}{2}} \)
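
The three estimators above translate almost line by line into NumPy. This is a sketch on simulated data; `beta_true`, the noise scale, and the choice of testing each coefficient against zero are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, illustrative only
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, 0.0])  # hypothetical; last coefficient is truly zero
y = X @ beta_true + rng.normal(scale=2.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

s2 = resid @ resid / (n - k)     # unbiased estimate of sigma^2
var_hat = s2 * XtX_inv           # estimated Var(beta_hat | X)
se = np.sqrt(np.diag(var_hat))   # standard error of each coefficient
t_stats = beta_hat / se          # t-statistics for H0: beta_i = 0

for b, s, t in zip(beta_hat, se, t_stats):
    print(f"coef={b: .3f}  s.e.={s:.3f}  t={t: .2f}")
```

Under the normality assumption, each t-statistic would be compared with a \(t_{n-k}\) distribution.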

Linear Regression as Projection

Instead of the standard perspective of random variables, the variables can alternatively be seen as vectors in a finite sample space. From this perspective, the ordinary least squares estimate of a linear regression model is the projection of the sample outcome on the linear span of the sample predictors.

\[ \mathbf{y} = X \beta + \mathbf{u} \]

Terminology:

  1. Projection matrix \( P_X = X (X'X)^{-1} X' \)
  2. Annihilator matrix \( M_X = I - P_X \)
  3. Centering matrix \( M_1 = I - \frac{1}{n} \mathbf{1} \mathbf{1}' \)
  4. Idempotent matrix \( A = A A \)

The projection matrix \( P_X \) projects a vector orthogonally onto the column space of X.

In algebra, a projection is a linear map represented by an idempotent matrix. The projection matrix that appears here is in fact an orthogonal projection, since it is both idempotent and symmetric.

The projection matrix \( P_X \) is sometimes called the hat matrix, because it transforms the outcome to the fitted values ( \(\hat{\mathbf{y}} = X (X'X)^{-1} X' \mathbf{y}\) ). The diagonal elements of the hat matrix are called the hat values; observations with relatively large hat values are potentially influential on the regression coefficients. Inspecting the hat values can be used to detect potential coding errors or other suspicious values that may unduly affect regression coefficients.
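
The matrices defined in this section are easy to build and inspect for a small sample. In the sketch below, one predictor value is deliberately set far from the rest (an illustrative assumption) so that its hat value stands out.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed, illustrative only
n = 30
x = rng.normal(size=n)
x[0] = 8.0                                # one unusually extreme predictor value
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection (hat) matrix
M = np.eye(n) - P                         # annihilator matrix
hat_values = np.diag(P)

print("idempotent:", np.allclose(P @ P, P))
print("symmetric:", np.allclose(P, P.T))
print("M annihilates X:", np.allclose(M @ X, 0.0))
print("sum of hat values:", hat_values.sum(), "k =", k)
print("largest hat value at index:", hat_values.argmax())
```

The hat values sum to the trace of \(P_X\), which equals \(k\), so their average is \(k/n\); values far above that average are the ones worth inspecting.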

Partitioned Regression

Model

\[ Y|\mathbf{X} \sim \text{Normal}(\mathbf{X_1} \beta_1 + \mathbf{X_2} \beta_2, \sigma^2) \]

OLS Estimator ( \(M_i\) is shorthand for \(M_{X_i}\). )

\[ \hat{\beta_1} = (X_1' M_2 X_1)^{-1} X_1' M_2 \mathbf{y} \]

\[ \hat{\beta_2} = (X_2' M_1 X_2)^{-1} X_2' M_1 \mathbf{y} \]
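
The partitioned formula can be checked against a full regression. In this sketch \(X_2\) holds a constant and one control, and \(X_1\) is a single predictor correlated with that control; the helper `annihilator` and all numbers are hypothetical.

```python
import numpy as np

def annihilator(A):
    """M_A = I - A (A'A)^{-1} A', which maps a vector to its residual on A."""
    return np.eye(A.shape[0]) - A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(5)  # fixed seed, illustrative only
n = 150
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])     # constant + control
X1 = (0.6 * X2[:, 1] + rng.normal(size=n)).reshape(-1, 1)  # correlated with the control
y = 2.0 * X1[:, 0] - 1.0 * X2[:, 1] + rng.normal(size=n)

# Full regression on [X1, X2]; the first coefficient belongs to X1.
X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Partitioned formula: beta_1 = (X1' M2 X1)^{-1} X1' M2 y.
M2 = annihilator(X2)
beta_1 = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)
print(beta_full[0], beta_1[0])  # equal up to rounding
```

In words, the coefficient on \(X_1\) is what one gets by first removing from both \(X_1\) and \(\mathbf{y}\) everything that \(X_2\) can explain.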

If both the outcome and the (non-constant) predictors are centered before regression, the slope coefficients will be the same as those obtained when an intercept is explicitly included in a regression on raw data. So save the effort of centering, and always include an intercept in the model.

If a predictor is assigned by a randomized experiment, it will be uncorrelated (when centered) with any latent variable, so the regression on this variable will not be affected by additional predictors.

Strategies to correctly estimate the marginal effect of a predictor:

  1. Include all variables that are correlated with the predictor in the "true" model. (control for potential confounders)
  2. Use a randomized experiment.

Regression Diagnostics

Regression diagnostics are procedures assessing the validity (adequacy) of a regression model, mostly for linear models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test: each provides guidance for further stages of a regression analysis.

On Model Assumptions/Family

Testing heteroskedasticity (of residuals):

  1. regress squared residual on fitted value.
  2. regress squared residual on a quadratic function of fitted value.
  3. Formal tests:
    • Welch F-test (preferred), Brown and Forsythe test.
    • homogeneity of variances: Levene test, Bartlett's Test (for normal distribution).
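
The first item, regressing squared residuals on fitted values, can be sketched as follows. Scaling the auxiliary R-squared by \(n\) is borrowed from Lagrange-multiplier-style tests such as Breusch-Pagan; reading the result as roughly chi-squared with one degree of freedom is an assumption here, not something stated above. The helper `aux_r2` and the simulated data are hypothetical.

```python
import numpy as np

def aux_r2(x, y):
    """R-squared from regressing squared OLS residuals on the fitted values."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    e2 = (y - fitted) ** 2
    # Auxiliary regression: squared residuals on a constant and the fitted values.
    Z = np.column_stack([np.ones_like(fitted), fitted])
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    aux_resid = e2 - Z @ gamma
    return 1.0 - (aux_resid @ aux_resid) / np.sum((e2 - e2.mean()) ** 2)

rng = np.random.default_rng(6)  # fixed seed, illustrative only
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y_hom = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=n)  # constant error variance
y_het = 1.0 + 2.0 * x + rng.normal(size=n) * x         # error s.d. grows with x

stat_hom = n * aux_r2(x, y_hom)
stat_het = n * aux_r2(x, y_het)
print(f"n*R^2: homoskedastic={stat_hom:.2f}, heteroskedastic={stat_het:.2f}")
```

A large statistic for the second data set reflects that the squared residuals grow systematically with the fitted values.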

Testing correlation of residuals:

  1. serial correlation:
    • lag plot: scatter plot of a sequence against its lagged values, typically lag 1.
    • autocorrelation plot: autocorrelation coefficient at various time lags, with confidence band as reference lines.
    • estimate an autoregressive model at low lags (e.g. AR(1)) for the residuals.
    • runs test: standardized number of runs.
  2. intraclass correlation: large one-way ANOVA.

Testing (conditional) normality of residuals:

  1. Graphical techniques:
    • For location and scale families of distributions: probability plot (QQ plot of data against a simulated sample; such as normal probability plot);
    • For finding the shape parameter: probability plot correlation coefficient (PPCC) plot;
  2. Goodness-of-fit tests for distributional adequacy:
    • General distributions: chi-squared goodness-of-fit test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling test;
    • Normality: Shapiro-Wilk test, Shapiro-Francia test;

On Model Structure/Form

Graphical residual analysis: If the model form is adequate, residual scatter plots should appear to be a random field over all potential predictors. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable shows systematic structure, the model form should be adjusted in that predictor, or the variable should be included as a predictor if it is not one already.

Lack-of-fit statistics/tests for model adequacy: (Testing model adequacy requires replicate measurements, e.g. validation and cross validation.)

  • Adequacy of existing predictors (misspecified or missing terms)
    • F-test for the ratio of "mean square for lack-of-fit" (on fitted model) to "mean square of pure error" (on replicated observations): \( \hat{\sigma}_m^2 / \hat{\sigma}_r^2 \)
  • Adding predictors
    • t-test for inclusion of a single explanatory variable (statistical significance away from zero)
    • F-test for inclusion of a group of variables
  • Dropping predictors
    • t-test of parameter significance

Multicollinearity:

  • unusually high standard errors of regression coefficients.
  • unusually high R-squared when you regress one explanatory variable on the others

Change of model structure between groups of observations

Comparing model structures

On Subgroups of Observations

Outliers: observations that deviate markedly from other observations in the sample.

  • Graphical techniques: normal probability plot, box plot, histogram;
  • Z-score \( z_i = (y_i - \bar{y}) / s \); modified Z-score \( M_i = 0.6745 (y_i - \tilde{y}) / \text{MAD} \), where \(\tilde{y}\) is the median and MAD is the median absolute deviation.
  • Formal outlier tests (for normally distributed data): Grubbs' test, Tietjen-Moore test, generalized extreme Studentized deviate (ESD) test.
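
Both scores from the list above can be computed with the standard library. The data are made up, and the cutoffs mentioned afterwards (3 for the Z-score, 3.5 for the modified Z-score) are commonly used rules of thumb assumed here, not values taken from these notes.

```python
import statistics

def z_scores(ys):
    """Classical Z-scores using the sample mean and standard deviation."""
    mean, s = statistics.mean(ys), statistics.stdev(ys)
    return [(y - mean) / s for y in ys]

def modified_z_scores(ys):
    """Modified Z-scores using the median and the median absolute deviation."""
    med = statistics.median(ys)
    mad = statistics.median([abs(y - med) for y in ys])
    return [0.6745 * (y - med) / mad for y in ys]

data = [2.1, 2.3, 1.9, 2.0, 2.2, 9.5]  # hypothetical sample; last value is suspicious
z = z_scores(data)
mz = modified_z_scores(data)
print("Z-score of last value:", round(z[-1], 2))
print("modified Z-score of last value:", round(mz[-1], 2))
```

In this small sample the suspicious value inflates the standard deviation enough that its Z-score stays below 3, while its modified Z-score is far above 3.5, which is the usual argument for the median-based version.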

Influential observations (high leverage points): observations that have a relatively large effect on the regression model's predictions.

Model Selection

Top-down view of linear regression: Given any data set, there's a default full model (p-dimensional LM), and LS gives the optimal fit in RSS. However, optimal training error does not mean optimal prediction error. Model selection is used to counter overfitting.

Types of LM model selection:

  • Group by subset size (m): RSS - adjusted error estimator (adjusted R-squared, AIC, BIC) or CV;
  • Group by derived basis size (M): RSS - CV;
  • Group by regularization parameter (λ): regularized RSS - CV;

Model selection methods:

  • Subset selection: best subset (recommended for p<20), forward selection (suitable for p>n), and backward selection.
  • Selection by derived directions: principal component regression (PCR).
  • Shrinkage methods (or Regularization): ridge (\(L^2\) penalty, suitable for dense "true" model), lasso (\(L^1\) penalty, has "variable selection" property, suitable for sparse "true" model).
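
Of the methods listed, ridge has a simple closed form, \( (X'X + \lambda I)^{-1} X' \mathbf{y} \), which makes the shrinkage easy to see. The sketch below assumes centered data so that the intercept stays out of the penalty; the helper `ridge`, the grid of \(\lambda\) values, and the simulated data are illustrative. Lasso has no closed form and is not shown.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y for centered X and y (no intercept)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(7)  # fixed seed, illustrative only
n, k = 80, 5
X = rng.normal(size=(n, k))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

# Center so that the intercept can be left out of the penalty.
Xc, yc = X - X.mean(axis=0), y - y.mean()

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge(Xc, yc, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:6.1f}  ||beta||={norm:.3f}")
```

With \(\lambda = 0\) the estimate coincides with OLS, and the coefficient norm shrinks as \(\lambda\) grows; in practice \(\lambda\) would be chosen by cross-validation.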

Prediction error estimates:

  1. Adjusted estimates
    • Adjusted R-squared: doesn't need error variance estimate, applies to p>n;
    • Akaike information criterion (AIC)
      • Mallows' \(C_p\): equivalent to AIC for LM;
    • Bayesian information criterion (BIC)
  2. Direct estimates (of MSE or RMSE)
    • Cross-validation (CV)
      • Validation (not recommended)

One-standard-error rule of prediction error estimates (MSE): Choose the simplest (lowest m/M; highest λ) fitted model with prediction MSE within one standard error of the smallest prediction MSE.

Adjusted R-squared increases with one more predictor if and only if the absolute value of the t-ratio of the new predictor is greater than 1.

\[ \bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)} \]
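
The link between adjusted R-squared and the t-ratio of an added predictor can be checked numerically. The sketch below fits a model with and without a weak predictor `x2`; the helper `fit` and the simulated data are hypothetical.

```python
import numpy as np

def fit(X, y):
    """Return (beta_hat, RSS, adjusted R-squared) for OLS with a constant in X."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))
    return beta, rss, adj_r2

rng = np.random.default_rng(8)  # fixed seed, illustrative only
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.7 * x1 + 0.1 * x2 + rng.normal(size=n)  # x2 has a weak true effect

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x2])

_, _, adj_small = fit(X_small, y)
beta_big, rss_big, adj_big = fit(X_big, y)

# t-ratio of the added predictor x2 in the larger model.
s2 = rss_big / (n - X_big.shape[1])
se_x2 = np.sqrt(s2 * np.linalg.inv(X_big.T @ X_big)[2, 2])
t_x2 = beta_big[2] / se_x2

print("adjusted R^2 increased:", adj_big > adj_small)
print("|t| of new predictor > 1:", abs(t_x2) > 1)
```

The two printed statements always agree, whichever way they come out for a particular sample.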

Other Topics

MLE (for classification models), GMM

Asymptotic Inference: Asymptotic distribution of regression estimators.

Bootstrap: repeated resampling with replacement to estimate estimator variation (or any other population information).
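
A minimal bootstrap sketch, estimating the standard error of a sample mean and comparing it with the textbook formula \(s/\sqrt{n}\). The skewed sample, the number of replications, and the helper names are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(9)  # fixed seed, illustrative only
sample = [rng.expovariate(1.0) for _ in range(200)]  # hypothetical skewed sample

def mean(values):
    return sum(values) / len(values)

def bootstrap_se(data, stat, n_boot=2000):
    """Standard deviation of stat over resamples drawn with replacement."""
    n = len(data)
    reps = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)

se_boot = bootstrap_se(sample, mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap s.e.={se_boot:.4f}, formula s.e.={se_formula:.4f}")
```

For the mean the two numbers should be close; the appeal of the bootstrap is that the same function works for statistics, such as a median or a regression coefficient, where no simple formula is at hand.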

Observational data (case control sampling) vs. experimental data (design of experiments, DOE).

(To be explored: Model and estimator are in some way analogous to simulation and estimation of random processes.)

