Or should I say, regression analysis?

Econometrics is concerned with the measurement of economic relations. Measuring marginal effects is the goal of most empirical economics research. Examples of economic relations include the relation between earnings and education or work experience; between expenditure on a commodity and household income; between the price of a good or service and its attributes; between a firm's output and its inputs of labor, capital, and materials; and between inflation and unemployment rates.

A regression model in which the focus is on predicting y given predictors x, rather than on a causal interpretation of the regression parameters, is often referred to as a reduced-form regression.

In econometrics, model (model family or model form) and estimator (fitted model) are distinct concepts. A (parametric) model describes the functional relationship between predictors and an outcome variable, with its parameters left unspecified. Thus, a model is essentially a family of probability models indexed by some parameters. In comparison, an estimator is a rule that selects the parameter values corresponding to the model considered most favorable under some criterion.[^1]

[^1]: To be explored: model and estimator are in some way analogous to the simulation and estimation of random processes.

See also Applied Econometrics.

Regression Analysis

For a given outcome and a set of predictors, the (ideal) regression function is the conditional mean function, which derives from the (hypothetical/theoretical) joint distribution. Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton. Conditional variance and higher-order information about the joint distribution are completely ignored in regression analysis.

Parametric or not, regression models approximate the regression function using a finite sample. The linear regression model can be perceived as a linear approximation of the regression function, i.e. the conditional expectation function of the outcome given the predictors. Different populations (samples?) generally result in different linear approximations.

\[ \mathbb{E}[Y|\mathbf{X}] = f(\mathbf{X}) \approx \mathbf{X}'\beta \]
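As a minimal numerical sketch of this idea (Python with numpy; the nonlinear data-generating process and all numbers are simulated assumptions, not from the text), OLS recovers the best linear approximation to a nonlinear conditional mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true regression function E[Y|X=x] = exp(x/2) is nonlinear.
n = 10_000
x = rng.uniform(0, 2, n)
y = np.exp(x / 2) + rng.normal(scale=0.3, size=n)

# OLS fits the best linear approximation x'beta to the regression function.
X = np.column_stack([np.ones(n), x])          # include an intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y

# Compare the true conditional mean and the linear approximation at a few points.
grid = np.array([0.25, 1.0, 1.75])
print("E[Y|X=x]      :", np.exp(grid / 2).round(3))
print("linear approx.:", (beta_hat[0] + beta_hat[1] * grid).round(3))
```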

Stages of regression:

  1. No predictor: sample mean (null model)
  2. One predictor: local regression; simple linear regression (and other univariate parametric models);
  3. Multiple predictors: smoother (kNN); linear regression;
  4. Omitted covariates: check residual scatter plot for uncaptured pattern.
  5. The modeling procedure goes on until the residual standard error shrinks below a level you consider negligible.

Linear Regression Model

Linear Regression (LR) Model:

\[ Y|\mathbf{X} \sim \text{Normal}(\mathbf{X} \beta, \sigma^2) \]

Linear regression on a single variable is called Simple Linear Regression, to distinguish it from (multiple) linear regression.

Terminology:

  • \(y\): outcome, dependent variable, regressand, variable to be explained, left-hand side variable.
  • \(\mathbf{x}\): predictor/covariate, independent variable, regressor, explanatory variable, right-hand side variable. Often includes a constant/intercept term.
  • \(\beta\): regression coefficient.
  • \(u\): residual, error term, disturbance. Captures omitted or unobservable variables (latent variables).

Assumptions:

  1. Linearity and random sampling: the model is linear in the parameters, and the observations \( (Y_i, X_i) \) are iid.
  2. Full column rank (no perfect multicollinearity): \( \mathbb{E}(X'X) \) is positive definite.
    • Multicollinearity refers to the presence of highly correlated subsets of predictors.
  3. Exogeneity (mean independence). \( \mathbb{E}(u|X) = 0 \)
    • A weaker form is \( \mathbb{E}(u) = 0 \) and \( \mathbb{E}(X'u) = 0 \) (u and X are orthogonal).
  4. Spherical residuals. \( \text{Var}(u|X) = \sigma^2 I \)
    1. Homoskedasticity. \( \text{Var}(u_i|X) = \text{Var}(u_j|X) \)
    2. Non-autocorrelation. \( \text{Cov}(u_i, u_j|X) = 0 \) for \( i \neq j \)
  5. (Optional) Normal residuals. \( u|X \sim \text{Gaussian} \)

Violations of Assumption 3 (exogeneity) that are not violations of the weaker \( \mathbb{E}(u_i|X_i) = 0 \) only occur with dependent data structures. However, it is often hard to assess whether the Exogeneity Assumption is satisfied, even if the model does not explicitly imply a violation.

Assumption 4 (Spherical residuals) is generally too strong, since heteroskedasticity, serial correlation (in time series data), and intraclass correlation (in clustered data) are common in economics and many other fields.

For most purposes, Assumption 5 (Normality) is not necessary.

Interpretation of regression coefficients:

  1. Regress \(y\) on \(x_k\): \( \beta_k = \frac{\partial y}{\partial x_k} \). Marginal effect.
  2. Regress \(\ln y\) on \(x_k\): \( \beta_k = \frac{\partial \ln y}{\partial x_k} = \frac{1}{y} \frac{\partial y}{\partial x_k} \). Proportional (percentage) growth of \(y\) per unit change in \(x_k\) (semi-elasticity).
  3. Regress \(y\) on \(\ln x_k\): \( \beta_k = \frac{\partial y}{\partial \ln x_k} = x_k \frac{\partial y}{\partial x_k} \). Change in \(y\) per unit proportional change in \(x_k\); a 1% increase in \(x_k\) changes \(y\) by about \(\beta_k / 100\).
  4. Regress \(\ln y\) on \(\ln x_k\): \( \beta_k = \frac{\partial \ln y}{\partial \ln x_k} \). Elasticity.
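For instance, a log-log regression recovers a constant elasticity. The sketch below (Python with numpy and statsmodels; the data and the elasticity of 0.8 are simulated assumptions) illustrates interpretation 4:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated data with a known constant elasticity of 0.8: y = x^0.8 * noise.
n = 5_000
x = rng.lognormal(mean=0.0, sigma=1.0, size=n)
y = x ** 0.8 * rng.lognormal(mean=0.0, sigma=0.1, size=n)

# Log-log regression: the slope estimates the elasticity d(ln y)/d(ln x).
X = sm.add_constant(np.log(x))
fit = sm.OLS(np.log(y), X).fit()
print("estimated elasticity:", round(fit.params[1], 3))  # close to 0.8
```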

Ordinary least squares estimator

In a finite sample, the outcome Y is n-by-1, the predictor matrix X is n-by-k, and the regression coefficient vector \( \beta \) is k-by-1, where \(n\) is the number of observations and \(k\) is the number of predictors. Typically \(n > k\) is assumed, and sometimes \(p\) is used instead of \(k\).

The ordinary least squares (OLS) estimator of the regression coefficients minimizes the residual sum of squares (RSS). OLS is also called linear least squares, in contrast with nonlinear least squares.

\[ \hat{\beta} = \arg\min_{\beta} (Y-X \beta)' (Y-X \beta) \]

The closed form of the OLS estimator is

\[ \hat{\beta} = (X'X)^{-1} X'Y \]
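A minimal sketch of this closed form (Python with numpy; simulated sample with hypothetical coefficient values), cross-checked against numpy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated sample: n observations, k predictors (the first column is a constant).
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS estimator (X'X)^{-1} X'Y, computed via a linear solve for stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
print(beta_hat.round(3))
```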

Terminology:

  1. Fitted or predicted value of outcome \( \hat{y} = X \hat{\beta} \)
  2. Residuals \( \varepsilon = Y-X \hat{\beta} \)
  3. Residual sum of squares \( \text{RSS} = \varepsilon' \varepsilon \)
  4. Total sum of squares \( \text{TSS} = (Y - \bar{y} \mathbf{1})' (Y - \bar{y} \mathbf{1}) \)
  5. Model sum of squares \( \text{MSS} = (\hat{Y} - \bar{y} \mathbf{1})' (\hat{Y} - \bar{y} \mathbf{1}) \)

Properties of OLS residuals and fitted values:

  1. The residuals are orthogonal to the sample predictors. \( X' \varepsilon = 0 \). Some corollaries are as follows:
    • If a constant/intercept is included in the predictors, then the sample mean of the residuals is zero. \( \bar{\varepsilon} = 0 \)
    • The sample covariance of the residuals with each of the predictors is zero.
    • The sample covariance of the residuals and the fitted values is zero.
  2. (If the model includes a constant) The sample averages of the outcome and predictors are on the regression line. \( \bar{y} = \bar{x}' \hat{\beta} \)
  3. (If the model includes a constant) Average fitted value equals sample average of the outcome. \( \bar{y} = \bar{\hat{y}} \)
  4. (If the model includes a constant) The sample variance of the outcome equals the sum of the sample variances of the fitted values and the residuals. TSS = MSS + RSS

The most widely used estimator for residual variance \( \sigma^2 \) is the unbiased sample variance of residuals.

\[ s^2 = \frac{\text{RSS}}{n-k} \]

It is unbiased conditioning on predictors, i.e. \( \mathbb{E}( s^2 | \mathbf{X} ) = \sigma^2 \). Its square root \(s\) is called the standard error of the regression (SER).

Linear Regression as Projection

Instead of the standard perspective of random variables, the variables can alternatively be seen as vectors in a finite sample space. In this view, the ordinary least squares estimate of a linear regression model is the projection of the sample outcome onto the linear span of the sample predictors.

\[ \mathbf{y} = X \beta + \mathbf{u} \]

Terminology:

  1. Projection matrix \( P_X = X (X'X)^{-1} X' \)
  2. Annihilator matrix \( M_X = I - P_X \)
  3. Centering matrix \( M_{\mathbf{1}} = I - \frac{1}{n} \mathbf{1} \mathbf{1}' \)
  4. Idempotent matrix: \( A \) such that \( A A = A \)

The projection matrix \( P_X \) projects a vector orthogonally onto the column space of X.

In linear algebra, an idempotent matrix corresponds to a projection. The projection matrix appearing here is in fact an orthogonal projection, since it is both idempotent and symmetric.

The projection matrix \( P_X \) is sometimes called the hat matrix, because it transforms the outcome into the fitted values ( \(\hat{y} = X (X'X)^{-1} X' y\) ). The diagonal elements of the hat matrix are called the hat values; observations with relatively large hat values are influential for the regression coefficients. Inspecting the hat values can help detect coding errors or other suspicious values that may unduly affect the regression coefficients.
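A short sketch of this check (Python with numpy and statsmodels; the data and the planted high-leverage point are simulated assumptions), computing hat values directly from \(P_X\) and cross-checking against statsmodels' influence diagnostics:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Simulated sample; one observation is given an extreme predictor value.
n = 100
x = rng.normal(size=n)
x[0] = 8.0                                   # a high-leverage point
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = sm.add_constant(x)

# Hat values: diagonal of P_X = X (X'X)^{-1} X'.
P = X @ np.linalg.solve(X.T @ X, X.T)
hat_direct = np.diag(P)

# Cross-check against statsmodels' influence diagnostics.
fit = sm.OLS(y, X).fit()
hat_sm = fit.get_influence().hat_matrix_diag
print(np.allclose(hat_direct, hat_sm))                     # True
print("largest hat value at index", hat_direct.argmax())   # the extreme point
```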

Partitioned regression

Model

\[ Y = X_1 \beta_1 + X_2 \beta_2 + U \]

OLS Estimator ( \(M_i\) is shorthand for \(M_{X_i}\). )

\[ \hat{\beta_1} = (X_1' M_2 X_1)^{-1} X_1' M_2 y \]

\[ \hat{\beta_2} = (X_2' M_1 X_2)^{-1} X_2' M_1 y \]
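A numerical check of the partitioned-regression (Frisch-Waugh-Lovell) formula above (Python with numpy; the block structure and coefficient values are simulated assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated design split into two blocks X1 and X2 (X2 contains the constant).
n = 300
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

# OLS coefficients from the full regression.
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Partitioned formula: beta_1 = (X1' M2 X1)^{-1} X1' M2 y, with M2 = I - P_{X2}.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
beta1 = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

print(np.allclose(beta1, beta_full[:2]))  # True: both routes give the same beta_1
```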

If both the outcome and the (non-constant) predictors are centered before regression, the regression coefficients will be the same as those obtained when an intercept is explicitly included in a regression on the raw data. So save the effort of centering and always include an intercept in the model.

If we run a randomized experiment on a sample, the randomized attribute will be uncorrelated (after centering) with any latent variable, so the regression coefficient on this variable will not be affected by including additional predictors.

Strategies to correctly estimate the marginal effect of a predictor:

  1. Include all variables that are correlated with the predictor in the true model. (control for potential confounders)
  2. Run a randomized experiment.

Statistical Properties of OLS estimators

Gauss-Markov Theorem

Among all linear unbiased estimators, the OLS estimator has the "smallest" variance-covariance matrix, in the positive semi-definite sense.

Notes:

  1. OLS estimator, conditioned on predictors, is linear in the outcome. \( \hat{\beta}|X = (X'X)^{-1} X'Y \)
  2. OLS estimator is (conditionally) unbiased. \( \mathbb{E}(\hat{\beta} | \mathbf{X}) = \beta \)
  3. OLS estimator has a conditional sampling variance \( \text{Var}(\hat{\beta} | \mathbf{X}) = \sigma^2 (X'X)^{-1} \)

If normality of the disturbance (Assumption 5) is assumed, then additionally we have:

  1. Regression coefficient estimators are conditionally normally distributed, i.e. \( \hat{\beta}|\mathbf{X} \sim N(\beta, \sigma^2 (X'X)^{-1}) \)
  2. Sample variance has a scaled chi-squared distribution, i.e. \( s^2|\mathbf{X} \sim \frac{\sigma^2}{n-k} \chi^2_{n-k} \)
  3. Regression coefficient estimators and the sample variance are conditionally independent (given the predictors). \( \hat{\beta} \perp s^2 \mid \mathbf{X} \)
  4. The t-statistic, under the null hypothesis that a regression coefficient equals its true value, has a Student's t-distribution with \(n-k\) degrees of freedom. \( \frac{\hat{\beta}_i - \beta_i}{\widehat{\text{s.e.}}(\hat{\beta}_i)} \sim t_{n-k} \)

Estimators of OLS estimator's statistical properties:

  1. An unbiased estimator of sampling variance of the OLS estimator: \( \widehat{\text{Var}}(\hat{\beta}) = s^2 (X'X)^{-1} \)
  2. An unbiased estimator of sampling variance of the k-th OLS estimator: \( \widehat{\text{Var}}(\hat{\beta}_k) = s^2 [(X'X)^{-1}]_{kk} \)
  3. An estimator of the standard error of the k-th OLS estimator: \( \widehat{\text{s.e.}}(\hat{\beta}_k) = \left( s^2 [(X'X)^{-1}]_{kk} \right)^{1/2} \)
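The sketch below (Python with numpy and statsmodels; simulated homoskedastic data) computes \( s^2 (X'X)^{-1} \) and the resulting standard errors by hand and compares them with statsmodels' classical (non-robust) standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

# Simulated homoskedastic sample.
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)                      # unbiased variance estimator
cov_hat = s2 * np.linalg.inv(X.T @ X)             # estimated Var(beta_hat | X)
se_hat = np.sqrt(np.diag(cov_hat))                # standard errors

# Cross-check against statsmodels' default (non-robust) standard errors.
fit = sm.OLS(y, X).fit()
print(np.allclose(se_hat, fit.bse))               # True
print(se_hat.round(4))
```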

Asymptotic Inference

Asymptotic distribution of regression estimators

MLE GMM

Bootstrap Inference

Regression diagnostics

Regression diagnostics are procedures assessing the validity of a regression model, mostly LR models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test; each of which provides guidance for further stages of a regression analysis.

On Model Assumptions/Family

Testing heteroskedasticity (of residuals)

  1. regress squared residual on fitted value.
  2. regress squared residual on a quadratic function of fitted value.

Testing correlation (of residuals)

  1. testing serial correlation: estimate an AR(1) model for the residuals.
  2. testing intraclass correlation: Large one-way ANOVA.

Testing normality (of residuals)

  1. normal probability plot (QQ plot of data against a normal sample)
  2. Shapiro-Wilk test
  3. Shapiro-Francia test
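A hedged sketch of two of these checks (Python with numpy, statsmodels, and scipy; the heteroskedastic data-generating process is a simulated assumption): an auxiliary regression of squared residuals on a quadratic in the fitted values, and a Shapiro-Wilk test of the residuals.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)

# Simulated heteroskedastic data: the error standard deviation grows with x.
n = 500
x = rng.uniform(1, 5, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = fit.resid, fit.fittedvalues

# Heteroskedasticity check: regress squared residuals on a quadratic in the fitted values.
aux = sm.OLS(resid**2, sm.add_constant(np.column_stack([fitted, fitted**2]))).fit()
print("auxiliary R^2:", round(aux.rsquared, 3))
print("F p-value    :", round(aux.f_pvalue, 4))   # small => evidence of heteroskedasticity

# Normality check: Shapiro-Wilk test on the residuals.
w_stat, p_norm = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", round(p_norm, 4))
```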

On Model Structure/Form

Graphical residual analysis: if the model form is adequate, residual scatter plots should appear as a structureless random field over all potential predictors. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable shows systematic structure, adjust the model form in that predictor, or include it as a predictor if it is not already one.

Lack-of-fit statistics help augment ambiguous residual plots:

  • Adequacy of existing predictors
    • F-test for replicated observations (Testing model adequacy requires replicate measurements, or alternatives such as data hold-back and cross-validation.)
  • Adding or dropping predictors
    • t-test for inclusion of a single explanatory variable (statistical significance away from zero)
    • F-test for inclusion of a group of variables

Multicollinearity:

  • unusually high standard errors of regression coefficients.
  • unusually high \(R^2\) when you regress one explanatory variable on the others.
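The second symptom can be checked directly; a sketch (Python with numpy and statsmodels; the near-collinear predictors are simulated assumptions) regressing one predictor on the others and comparing the implied variance inflation factor with statsmodels' helper:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)

# Simulated predictors where x2 is nearly a linear combination of x1.
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# R^2 from regressing x1 on the other predictors, and the implied VIF = 1/(1-R^2).
others = sm.add_constant(np.column_stack([x2, x3]))
r2 = sm.OLS(X[:, 1], others).fit().rsquared
print("R^2 of x1 on others:", round(r2, 3))          # close to 1
print("VIF by hand        :", round(1 / (1 - r2), 1))
print("VIF (statsmodels)  :", round(variance_inflation_factor(X, 1), 1))
```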

Change of model structure between groups of observations

Comparing model structures

On Subgroups of Observations

Outliers (observations poorly represented by the model)

Influential observations (that have a relatively large effect on the regression model's predictions)

Hypothesis Testing

t-statistics

The t-statistic under the null \( H_0: \beta_i = \beta^{*} \) is:

\[ t = \frac{\hat{\beta}_i - \beta^{*}}{\widehat{\text{s.e.}}(\hat{\beta}_i)} \]
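A minimal sketch (Python with numpy, statsmodels, and scipy; simulated data) computing this t-statistic and its two-sided p-value from the \(t_{n-k}\) distribution, and checking them against statsmodels' reported values:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(9)

# Simulated sample with one predictor plus a constant.
n, k = 100, 2
x = rng.normal(size=n)
X = sm.add_constant(x)
y = 1.0 + 0.3 * x + rng.normal(size=n)

fit = sm.OLS(y, X).fit()

# t-statistic for H0: beta_1 = 0, and its two-sided p-value from t_{n-k}.
t_stat = fit.params[1] / fit.bse[1]
p_val = 2 * stats.t.sf(abs(t_stat), df=n - k)
print(np.isclose(t_stat, fit.tvalues[1]))  # True
print(np.isclose(p_val, fit.pvalues[1]))   # True
```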

t-test

F-statistics

F-statistic:

\[ F = \frac{\text{MSS}/p}{\text{RSS}/(n-p-1)} \]

where \(p\) is the number of (non-constant) predictors.

One-way ANOVA is an omnibus test that determines, for several independent groups of interest, whether any of the group means are statistically significantly different from each other. If there are only two groups, the one-way ANOVA F-test satisfies \(F = t^2\) and is equivalent to a t-test.

As an omnibus test, ANOVA does not have the problem of increased Type I error probability in multiple t-tests.

Three main assumptions of ANOVA:

  1. (The residuals in) Each group is normally distributed. (robust against violation of normality)
  2. All groups have the same variance.
  3. Observations are independent.

Alternative tests: Kruskal-Wallis H Test.

To determine which specific groups differ from one another, you need to use a post hoc test.
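A brief sketch of the omnibus test and its rank-based alternative (Python with scipy; three simulated groups, one with a shifted mean):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)

# Three simulated independent groups; the third has a shifted mean.
g1 = rng.normal(loc=0.0, size=40)
g2 = rng.normal(loc=0.0, size=40)
g3 = rng.normal(loc=0.8, size=40)

# Omnibus one-way ANOVA F-test.
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print("F =", round(f_stat, 2), " p =", round(p_anova, 4))

# Kruskal-Wallis H test: a rank-based alternative when normality or equal variances are in doubt.
h_stat, p_kw = stats.kruskal(g1, g2, g3)
print("H =", round(h_stat, 2), " p =", round(p_kw, 4))
```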

p-value

The p-value is the minimum test size at which the null hypothesis is rejected, given the observed value of the test statistic. It is essentially the tail probability of the test statistic under its (often asymptotic) null distribution.

Wald, LM, LR, J

Model Selection

Model selection (variable selection): best subset, forward/backward selection.

Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallows's \(C_p\), cross-validation
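As a small illustration of criterion-based comparison (Python with numpy and statsmodels; simulated data in which the second predictor is irrelevant), lower AIC/BIC is preferred:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)

# Simulated data: only x1 affects y; x2 is irrelevant.
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

fit_small = sm.OLS(y, sm.add_constant(x1)).fit()
fit_large = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# The information criteria penalize the irrelevant predictor.
print("small model: AIC", round(fit_small.aic, 1), " BIC", round(fit_small.bic, 1))
print("large model: AIC", round(fit_large.aic, 1), " BIC", round(fit_large.bic, 1))
```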

R-squared

R-squared, or coefficient of determination, is defined as

\[ R^2 = \frac{\text{MSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \]

Properties:

  1. R-squared is within [0,1].
  2. R-squared never decreases as the number of predictors increases.

In macroeconomics, the \(R^2\) is often very high; 0.8 or higher is not unusual. In microeconomics, it is typically very low, with 0.1 not unusual. The reason might be that the number of observations in macroeconometrics is often much lower (e.g., 25 OECD countries) than in microeconometrics (e.g., 10,000 households), while the number of predictors is not that different.

Seeing regression as an approximation of the regression function, \( R^2 \) does not have to be very close to one to justify the model, in contrast with the practice in physics and mechanics experiments. This is because we only consider the average effect, rather than the elimination of the error term.

Adjusted R-squared

\[ \bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)} \]

Properties:

  1. When a new predictor is added, adjusted R-squared increases if and only if the absolute value of the new predictor's t-ratio is greater than 1.
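A quick numerical check of the definitions of \(R^2\) and adjusted \(R^2\) (Python with numpy and statsmodels; simulated data), computing both from the sums of squares and comparing with statsmodels' attributes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)

# Simulated sample with an intercept and two predictors.
n, k = 150, 3
X = sm.add_constant(rng.normal(size=(n, k - 1)))
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
rss = fit.resid @ fit.resid
tss = ((y - y.mean()) ** 2).sum()

r2 = 1 - rss / tss
r2_adj = 1 - (rss / (n - k)) / (tss / (n - 1))
print(np.isclose(r2, fit.rsquared), np.isclose(r2_adj, fit.rsquared_adj))  # True True
```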

Miscellaneous

Other Models:

  • generalized linear model (GLM)
  • discrete outcomes models: logistic regression (logit), probit;
  • nonlinear models (nonlinear least squares, NLS), generalized additive model (GAM)

Other topics:

  • causality: simultaneous equations models, potential outcome model.
  • observational and experimental data

Log Transformation

While many observables in the social sciences have heavy-tailed distributions, the distributions of their logarithms are typically well behaved. In economics, for example, monetary variables such as earnings are often log-transformed. But beware that the interpretations of the two specifications are not the same.


🏷 Category=Economics