Econometrics

Econometrics applies regression analysis to economics, adding methods developed for specific problems.

Econometrics is concerned with the measurement of economic relations, and measuring marginal effects is the goal of most empirical research in economics. Examples of economic relations include those between earnings and education or work experience; expenditure on a commodity and household income; the price of a good or service and its attributes; a firm's output and its inputs of labor, capital, and materials; and inflation and unemployment rates.

A regression model is often referred to as a reduced-form regression if the focus is on predicting the outcome from the predictors rather than on the causal interpretation of the model parameters. Models for causal inference include simultaneous equations models and the potential outcomes model, among others.

Linear Regression Model

Linear regression model (LM):

\[ Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}' \beta, \sigma^2) \]

Terminology:

\(Y\): outcome, dependent variable, regressand, variable to be explained, left-hand side variable;
\(\mathbf{X}\): predictor/covariate, independent variable, regressor, explanatory variable, right-hand side variable;
\(\beta\): regression coefficient, commonly interpreted as:
- Marginal effect: \( \beta_k = \frac{\partial y}{\partial x_k} \) (\(y\) on \(x_k\)).
- Percentage growth: \( \beta_k = \frac{1}{y} \frac{\partial y}{\partial x_k} \) (\(\ln y\) on \(x_k\)).
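- Effect of a proportional change in \(x_k\): \( \beta_k = \frac{\partial y}{\partial x_k} x_k = \frac{\partial y}{\partial \ln x_k} \) (\(y\) on \(\ln x_k\)); a 1% increase in \(x_k\) changes \(y\) by about \(\beta_k / 100\), and doubling \(x_k\) changes \(y\) by \(\beta_k \ln 2\).
- Elasticity: \( \beta_k = \frac{1}{y} \frac{\partial y}{\partial x_k} x_k\) (\(\ln y\) on \(\ln x_k\)); constant elasticity means a power function: \(y = A x_k^{\beta_k}\) for some constant \(A > 0\).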
\(u\): residual, error term, disturbance (not explicit in formula);
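
The log-log and level-log interpretations above can be checked numerically. The following is a minimal sketch on hypothetical, noise-free data drawn from a power function; the helper `ols_line` and the values of `A` and `p` are illustrative assumptions, not part of the notes.

```python
import math

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical noise-free data from a power function y = A * x**p.
A, p = 2.0, 0.7
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [A * xi ** p for xi in x]
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

# Log-log regression: the slope estimates the (constant) elasticity p,
# and exp(intercept) recovers the scale factor A.
log_intercept, elasticity = ols_line(log_x, log_y)

# Level-log regression: the slope is the change in y per unit change in ln(x),
# so doubling x changes the fitted y by slope * ln(2).
_, level_log_slope = ols_line(log_x, y)
effect_of_doubling = level_log_slope * math.log(2)

print(elasticity, math.exp(log_intercept), effect_of_doubling)
```

Because the data are exactly a power function, the log-log slope reproduces `p` up to rounding; with real data it would only estimate an average elasticity.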

If there are multiple outcomes, the model is called a general linear model. If there is only one predictor, the model is called simple linear regression, as distinct from (multiple) linear regression. The residual collects the variables that are omitted or unobservable (latent variables); the predictors often include a constant (intercept) term.

While many observables in the social sciences have heavy-tailed distributions, the distributions of their logarithms are typically well behaved. For example, in econometrics, monetary variables such as earnings are often log-transformed. Transformation can also improve the homogeneity of variances. (An alternative technique is weighted least squares.) The two approaches, however, lead to different interpretations of the coefficients.

Model assumptions explained:

Linearity and random sampling: \( Y_i = \mathbf{X}_i' \beta + u_i \), with \( (Y_i, \mathbf{X}_i) \) iid.
No perfect multicollinearity (the sample matrix has full column rank): \( \nexists \lambda \ne 0 : \mathbf{X}' \lambda = 0 \).
Exogeneity (mean independence): \( \mathbb{E}[u|\mathbf{X}] = 0 \)
- A weaker form: \( \mathbb{E} u = 0 \) and \( \mathbb{E} u \mathbf{X} = 0 \).
Spherical residuals: \( \text{Var}(u|\mathbf{X}) = \sigma^2 I \)
- Homoskedasticity: \( \text{Var}(u_i|\mathbf{X}) = \text{Var}(u_j|\mathbf{X}) \)
- No autocorrelation or intraclass correlation: \( \text{Cov}(u_i, u_j|\mathbf{X}) = 0 \) for \( i \ne j \)
Normal residuals: \( u|\mathbf{X} \sim \text{Normal} \)

Multicollinearity (also collinearity) refers to the presence of highly correlated subsets of predictors. Perfect multicollinearity means that a subset of predictors is linearly dependent.

Violations of the Exogeneity Assumption that still satisfy the weaker condition \( \mathbb{E}(u_i|\mathbf{X}_i) = 0 \) occur only with dependent data structures, because under random sampling the two conditions coincide. However, it is often hard to assess whether the Exogeneity Assumption is satisfied, even if the model does not explicitly imply a violation.

The Spherical Residuals Assumption is generally too strong, because data in economics and many other fields commonly have heteroskedasticity, autocorrelation (aka serial correlation in time-series data), or intraclass correlation (in clustered data).

For most purposes, the Conditional Normality Assumption is optional.

Model Assessment

In macroeconomics, R-squared is often very high; 0.8 or higher is not unusual. In microeconomics, it is typically very low, with 0.1 not unusual. The reason might be that the number of observations in macroeconomics is often much lower (e.g., 25 OECD countries) than in micro-econometrics (e.g., 10,000 households), while the number of predictors is not that different.

When regression is viewed as an approximation of the regression function, R-squared does not have to be close to one to justify the model, unlike common practice in physics and mechanics experiments. This is because the aim is to estimate an average effect, not to eliminate the error term.
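
To illustrate that a low R-squared does not prevent the average effect from being recovered, here is a small sketch on hypothetical data. The error pattern is constructed (as an assumption, for illustration only) to sum to zero and to be orthogonal to the predictor, so the fitted slope is the same at both noise levels while R-squared differs sharply.

```python
def fit_and_r2(x, y):
    """Fit y = a + b*x by least squares; return (a, b, R-squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical design: the error pattern sums to zero and is orthogonal to x.
x = [-1.0, -1.0, 1.0, 1.0]
pattern = [1.0, -1.0, 1.0, -1.0]

def simulate(scale):
    """Outcome with true intercept 1, true slope 0.5, and scaled errors."""
    return [1.0 + 0.5 * xi + scale * ei for xi, ei in zip(x, pattern)]

_, slope_noisy, r2_noisy = fit_and_r2(x, simulate(3.0))   # large error term
_, slope_clean, r2_clean = fit_and_r2(x, simulate(0.1))   # small error term

print(slope_noisy, r2_noisy)
print(slope_clean, r2_clean)
```

Both fits return the same slope; only the share of variance explained changes with the size of the error term.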

Regression Diagnostics

Regression diagnostics are procedures assessing the validity (adequacy) of a regression model, mostly for linear models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test: each provides guidance for further stages of a regression analysis.

On Residuals

Testing heteroskedasticity (of residuals):

Regress the squared residuals on the fitted values.
Regress the squared residuals on a quadratic function of the fitted values.
Formal tests:
- Comparing group means when variances are unequal: Welch F-test (preferred), Brown and Forsythe test.
- Homogeneity of variances: Levene test, Bartlett's test (for normal distribution).
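
The first auxiliary regression above can be sketched as follows. The data are hypothetical, with an error spread that grows in proportion to the predictor. The statistic `lm_stat` (sample size times the auxiliary R-squared, compared with a chi-squared distribution with one degree of freedom) follows the usual Lagrange-multiplier form of Breusch-Pagan-type tests; treat it here as an illustrative assumption rather than a prescribed procedure.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def r_squared(x, y, intercept, slope):
    """R-squared of the fitted line intercept + slope * x."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: errors alternate in sign and grow in size with x.
x = [float(i) for i in range(1, 21)]
u = [0.2 * xi * (1 if i % 2 == 0 else -1) for i, xi in enumerate(x)]
y = [2.0 + 1.5 * xi + ui for xi, ui in zip(x, u)]

# Main regression, fitted values, and squared residuals.
a, b = ols_line(x, y)
fitted = [a + b * xi for xi in x]
resid_sq = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]

# Auxiliary regression of squared residuals on fitted values.
aux_a, aux_b = ols_line(fitted, resid_sq)
aux_r2 = r_squared(fitted, resid_sq, aux_a, aux_b)
lm_stat = len(x) * aux_r2  # compare with the chi-squared(1) 5% value, about 3.84

print(aux_b, aux_r2, lm_stat)
```

A clearly positive auxiliary slope together with a large `lm_stat` suggests that the residual variance rises with the fitted value.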

Testing correlation of residuals:

serial correlation:
- lag plot: scatter plot of a sequence against its lagged values, typically lag 1.
- autocorrelation plot: autocorrelation coefficient at various time lags, with confidence band as reference lines.
- estimate an autoregressive model at low lags (e.g. AR(1)) for the residuals.
- runs test: standardized number of runs.
intraclass correlation: large one-way ANOVA.
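
Two of the serial-correlation checks above, the lag-1 autocorrelation coefficient and the runs test, can be sketched as follows. The two residual sequences are hypothetical, and the runs statistic uses the standard Wald-Wolfowitz mean and variance for the number of sign runs (zeros are counted as negative, an arbitrary choice made for this sketch).

```python
import math

def lag1_autocorr(e):
    """Sample autocorrelation coefficient of the sequence e at lag 1."""
    n = len(e)
    m = sum(e) / n
    num = sum((e[t] - m) * (e[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in e)
    return num / den

def runs_z(e):
    """Standardized number of sign runs (Wald-Wolfowitz runs test)."""
    signs = [v > 0 for v in e]
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    n = n_pos + n_neg
    runs = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    expected = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)

# Hypothetical residuals: a slow wave (positive serial correlation) ...
smooth = [math.sin(2 * math.pi * (t + 0.5) / 20) for t in range(40)]
# ... and a sign-flipping sequence (negative serial correlation).
alternating = [(-1.0) ** t for t in range(40)]

r1_smooth, z_smooth = lag1_autocorr(smooth), runs_z(smooth)
r1_alt, z_alt = lag1_autocorr(alternating), runs_z(alternating)

print(r1_smooth, z_smooth)
print(r1_alt, z_alt)
```

Too few runs (a strongly negative z) points to positive serial correlation, and too many runs (a strongly positive z) points to negative serial correlation.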

Testing (conditional) normality of residuals:

Graphical techniques:
- For location and scale families of distributions: probability plot (QQ plot of the data against the theoretical quantiles of the reference distribution, such as a normal probability plot);
- For finding the shape parameter: probability plot correlation coefficient (PPCC) plot;
Goodness-of-fit tests for distributional adequacy:
- General distributions: chi-squared goodness-of-fit test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling test;
- Normality: Shapiro-Wilk test, Shapiro-Francia test;
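
The coordinates of a normal probability plot, and the correlation coefficient of that plot, can be computed with the standard library alone. This is a minimal sketch: the plotting positions `(i + 0.5) / n` are one common convention (an assumption here), and both samples are hypothetical and built from exact quantiles rather than random draws.

```python
import math
from statistics import NormalDist

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    s_aa = sum((x - a_bar) ** 2 for x in a)
    s_bb = sum((y - b_bar) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def normal_probability_plot(data):
    """Return (theoretical quantiles, ordered data, plot correlation)."""
    n = len(data)
    ordered = sorted(data)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered, pearson(theoretical, ordered)

n = 50
positions = [(i + 0.5) / n for i in range(n)]
# Hypothetical samples: exact normal quantiles and exact exponential quantiles.
normal_like = [NormalDist(5.0, 2.0).inv_cdf(p) for p in positions]
skewed = [-math.log(1.0 - p) for p in positions]

_, _, corr_normal = normal_probability_plot(normal_like)
_, _, corr_skewed = normal_probability_plot(skewed)

print(corr_normal, corr_skewed)
```

A correlation close to one means the plotted points lie near a straight line; the skewed sample bends away from the line and gives a visibly lower value.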

On Predictors

Graphical residual analysis: If the model form is adequate, scatter plots of the residuals against every potential predictor should look like a random field. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of the residuals against a variable shows systematic structure, the model form should be adjusted in that variable, or the variable should be added as a predictor if it is not one already.

Lack-of-fit statistics/tests for model adequacy: (The lack-of-fit F-test requires replicate measurements, i.e. repeated observations at the same predictor values; validation and cross-validation instead assess adequacy on held-out data.)

Adequacy of existing predictors (misspecified or missing terms)
- F-test for the ratio of "mean square for lack-of-fit" (on fitted model) to "mean square of pure error" (on replicated observations): \( \hat{\sigma}_m^2 / \hat{\sigma}_r^2 \)
Adding predictors
- t-test for inclusion of a single explanatory variable (statistical significance away from zero)
- F-test for inclusion of a group of variables
Dropping predictors
- t-test of parameter significance
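
The lack-of-fit F ratio above can be sketched on hypothetical data with two replicates at each of five predictor levels. The true mean is quadratic, so a straight-line fit should show clear lack of fit. The names `ms_lack_of_fit` and `ms_pure_error` stand in for \( \hat{\sigma}_m^2 \) and \( \hat{\sigma}_r^2 \); the degrees of freedom assume a two-parameter line.

```python
from collections import defaultdict

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Hypothetical data: two replicates per level, quadratic mean, small errors.
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0]
y = [xi ** 2 + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]

intercept, slope = ols_line(x, y)

# Group the replicated observations by predictor level.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)

# Pure error: variation of replicates around their own level mean.
ss_pure_error = sum((yi - sum(ys) / len(ys)) ** 2
                    for ys in groups.values() for yi in ys)
# Lack of fit: distance between level means and the fitted line.
ss_lack_of_fit = sum(len(ys) * (sum(ys) / len(ys) - (intercept + slope * xj)) ** 2
                     for xj, ys in groups.items())

n, m, p = len(x), len(groups), 2          # observations, levels, parameters
ms_lack_of_fit = ss_lack_of_fit / (m - p)
ms_pure_error = ss_pure_error / (n - m)
f_stat = ms_lack_of_fit / ms_pure_error   # refer to F with (m - p, n - m) df

print(ss_lack_of_fit, ss_pure_error, f_stat)
```

A large ratio indicates that the level means deviate from the fitted line by more than the replicate-to-replicate noise can explain.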

Multicollinearity:

Unusually high standard errors of regression coefficients.
Unusually high R-squared when one explanatory variable is regressed on the others.
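
The second symptom is often summarized by the variance inflation factor, \( 1 / (1 - R_k^2) \), where \( R_k^2 \) is the R-squared from regressing predictor \(k\) on the other predictors. The sketch below covers only the two-predictor case, where that R-squared equals the squared correlation; the data and the threshold of 10 mentioned in the comment are illustrative assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    s_aa = sum((x - a_bar) ** 2 for x in a)
    s_bb = sum((y - b_bar) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def vif_two_predictors(x_a, x_b):
    """VIF when the model has exactly two predictors: 1 / (1 - r**2)."""
    r2 = pearson(x_a, x_b) ** 2
    return 1.0 / (1.0 - r2)

# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
wiggle = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2]
x2 = [2.0 * v + d for v, d in zip(x1, wiggle)]   # nearly a multiple of x1
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]    # only loosely related to x1

vif_collinear = vif_two_predictors(x1, x2)
vif_unrelated = vif_two_predictors(x1, x3)

# A VIF above roughly 10 is a frequently quoted warning sign (rule of thumb).
print(vif_collinear, vif_unrelated)
```

With more than two predictors, the auxiliary R-squared has to come from a multiple regression of each predictor on all the others.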

Change of model structure between groups of observations

Comparing model structures

On Subgroups of Observations

Outliers: observations that deviate markedly from other observations in the sample.

Graphical techniques: normal probability plot, box plot, histogram;
Z-score \( z_i = (y_i - \bar{y}) / s \); modified Z-score \( M_i = 0.6745 (y_i - \tilde{y}) / \text{MAD} \), where \(\tilde{y}\) is the median and MAD is the median absolute deviation.
Formal outlier tests (for normally distributed data): Grubbs' test, Tietjen-Moore test, generalized extreme Studentized deviate (ESD) test.
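
The two scores above can be compared on a small hypothetical sample with one extreme value. The cutoffs of 3 for the Z-score and 3.5 for the modified Z-score are commonly quoted rules of thumb and are assumptions of this sketch, not values taken from these notes.

```python
from statistics import mean, median, stdev

# Hypothetical sample with one extreme value at the end.
y = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0]

# Classical Z-score: uses the mean and the sample standard deviation,
# both of which are pulled toward the extreme value.
y_bar, s = mean(y), stdev(y)
z = [(v - y_bar) / s for v in y]

# Modified Z-score: uses the median and the median absolute deviation (MAD).
y_med = median(y)
mad = median([abs(v - y_med) for v in y])
modified_z = [0.6745 * (v - y_med) / mad for v in y]

flagged_z = [i for i, v in enumerate(z) if abs(v) > 3.0]
flagged_m = [i for i, v in enumerate(modified_z) if abs(v) > 3.5]

print(flagged_z, flagged_m)
```

In this sample the extreme value inflates the standard deviation enough to keep its own Z-score below 3, while the median-based score flags it; this masking effect is one reason the modified score is often used in small samples.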

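Influential observations: observations that have a relatively large effect on the regression model's predictions. High-leverage points, whose predictor values lie far from those of the other observations, are the main candidates, but a high-leverage point is influential only if it also departs from the pattern of the remaining data.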
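
Leverage and influence can be sketched for simple linear regression, where the leverage of observation \(i\) is \( h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2 \). The data are hypothetical: the last point sits far from the others in \(x\) and off their trend. Influence is gauged here simply by refitting without that point, which is an illustrative choice rather than a formal influence measure.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def leverages(x):
    """Diagonal of the hat matrix for a line with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Hypothetical data: five points near y = x, plus one distant point below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 8.0]

leverage = leverages(x)

# Refit without the last observation to see how much it moves the slope.
_, slope_all = ols_line(x, y)
_, slope_without = ols_line(x[:-1], y[:-1])

print(leverage[-1], slope_all, slope_without)
```

The leverages sum to the number of fitted parameters (two here), so a single value close to one means that observation largely determines its own fitted value.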