Or should I say, regression analysis?
Econometrics is concerned with the measurement of economic relations, and measuring marginal effects is the goal of most empirical economic research. Examples of economic relations include the relation between earnings and education or work experience; between expenditure on a commodity and household income; between the price of a good or service and its attributes; between a firm's output and its inputs of labor, capital, and materials; between inflation and unemployment rates; etc.
A regression model is often referred to as a reduced-form regression if the focus is on predicting the outcome given the predictors, rather than on a causal interpretation of the model parameters. Models for causal inference include simultaneous-equations models, the potential-outcomes model, etc.
See also Applied Econometrics.
The fundamental construct in probability is the random variable; correspondingly, the fundamental construct in statistics is the random sample, a sampling process from a hypothetical population, aka white noise. A basic and important example is Gaussian white noise, the statistical model for a deterministic value affected by numerous additive, uncorrelated, zero-mean small error terms. Every statistical analysis should begin and end with verifying such a univariate model (not necessarily Gaussian) against real data, known as univariate analysis.
Graphical analysis: the 4-plot (run-sequence plot, lag plot, histogram, normal probability plot).
Formal tests also exist for univariate analysis, see section "regression diagnostics".
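A minimal numeric sketch of such univariate checks, using NumPy only; the simulated sample and the informal "near zero" targets are illustrative, not formal tests:

```python
import numpy as np

def univariate_diagnostics(x):
    """Informal checks that a sample behaves like (Gaussian) white noise:
    stable location/spread, no lag-1 autocorrelation, mild tails."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)
    z = (x - mean) / x.std()
    # Lag-1 autocorrelation: near 0 for white noise.
    lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    # Sample skewness and excess kurtosis: near 0 for Gaussian data.
    skew = np.mean(z**3)
    ex_kurt = np.mean(z**4) - 3.0
    return {"mean": mean, "std": std, "lag1_autocorr": lag1,
            "skewness": skew, "excess_kurtosis": ex_kurt}

rng = np.random.default_rng(0)
diag = univariate_diagnostics(rng.normal(size=1000))
```

For genuine white noise all five numbers should be close to their theoretical values (0, 1, 0, 0, 0); a large lag-1 autocorrelation would instead point to the time-series follow-up analysis discussed next.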
Samples can alternatively be seen as time series, and a time series is essentially a vector measured sequentially at a constant frequency. This fact is exploited by some univariate analysis techniques to detect the peculiarities of time-series data: trend, seasonality, and autocorrelation. If any of these is detected in univariate analysis, follow-up analysis is needed.
For a given outcome and a given set of predictors, the (ideal) regression function is the conditional mean function, which derives from their (hypothetical/theoretical) joint distribution. Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton. Regression analysis is the estimation of the regression function; the conditional variance and higher-order information about the joint distribution are ignored.
\[ f(\mathbf{X}) \equiv \mathbb{E}[Y|\mathbf{X}] \]
A regression model is a family of conditional probability models, parametric or not, whose expectation function approximates the regression function.
\[ Y|\mathbf{X} \sim P(\mathbf{X}, \theta);~ \mathbb{E} P(\mathbf{X}, \theta) \approx f(\mathbf{X}) \]
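A minimal Monte Carlo sketch of the regression function as a conditional mean; the data-generating process \(Y = \sin(2\pi X) + \varepsilon\) is invented for illustration, and binned sample averages stand in for \(\mathbb{E}[Y|X]\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)
# True regression function: E[Y|X=x] = sin(2*pi*x).
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=n)

# Estimate the regression function by averaging Y within narrow X bins.
bins = np.linspace(0.0, 1.0, 21)
centers = (bins[:-1] + bins[1:]) / 2
idx = np.clip(np.digitize(x, bins) - 1, 0, len(centers) - 1)
cond_mean = np.array([y[idx == b].mean() for b in range(len(centers))])
```

The binned means track \(\sin(2\pi x)\) even though the conditional variance (here constant at \(0.25\)) is entirely ignored, which is exactly the sense in which regression analysis discards higher-order information.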
The linear regression model gives a linear approximation of the regression function, with slope parameters and an intercept. Other model families:
There are two separate concerns regarding a fitted model: validity and prediction error. A fitted model is valid (or adequate) if the residuals behave according to a univariate model. The prediction error of a fitted model is typically measured as its mean square error (MSE) or root MSE (RMSE) on new observations.
Stages of regression (bottom-up view):
Linear regression model (LM):
\[ Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}' \beta, \sigma^2) \]
Terminology:
If there are multiple outcomes, the model is called the general linear model. If there is only one predictor, the model is called simple linear regression, to distinguish it from (multiple) linear regression. The residual consists of omitted or unobservable variables (latent variables); the model often includes a constant/intercept term.
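A minimal sketch of fitting an LM by ordinary least squares with NumPy; the data-generating process and its coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated data: Y | X ~ Normal(1 + 2*x1 - 0.5*x2, 1).
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
```

The estimated `beta` recovers the intercept and slopes up to sampling error, and `resid` is what the univariate diagnostics above should be run on.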
Constant elasticity means power function: \(y = x^p\).
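To see why, compute the elasticity of \(y = x^p\) directly:

\[ \frac{d \ln y}{d \ln x} = \frac{x}{y}\frac{dy}{dx} = \frac{x}{x^p}\, p\, x^{p-1} = p, \]

a constant independent of \(x\); equivalently, the power function is linear in logs, \(\ln y = p \ln x\).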
Model assumptions explained:
Multicollinearity (also collinearity) refers to the presence of highly correlated subsets of predictors. Perfect multicollinearity means a subset of predictors are linearly dependent.
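A standard quantitative check is the variance inflation factor. A minimal NumPy sketch (the near-collinear columns are simulated for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2 is
    the R^2 from regressing column j on the remaining columns (plus an
    intercept). Large values flag highly correlated predictor subsets."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        out[j] = 1.0 / (resid @ resid / ((y - y.mean()) @ (y - y.mean())))
    return out

rng = np.random.default_rng(3)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly collinear with a
c = rng.normal(size=300)                 # independent of both
vifs = vif(np.column_stack([a, b, c]))
```

Here the first two VIFs blow up while the third stays near 1; perfect multicollinearity would make the denominator exactly zero.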
Violations of the Exogeneity Assumption that still satisfy the weaker \( \mathbb{E}(u_i|X_i) = 0 \) only occur with dependent data structures. However, it is often hard to assess whether the Exogeneity Assumption is satisfied, even if the model does not explicitly imply a violation.
The Spherical Residuals Assumption is generally too strong, because data in economics and many other fields commonly have heteroskedasticity, autocorrelation (aka serial correlation in time-series data), or intraclass correlation (in clustered data).
For most purposes, the Conditional Normality Assumption is not necessary (optional).
The hierarchy principle:
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.
While many observables in the social sciences have heavy-tailed distributions, the distributions of their logarithms are typically well behaved. For example, in econometrics, monetary variables such as earnings are often log-transformed. Transformation can also improve homogeneity of variances (an alternative technique is weighted least squares), but the interpretations of the two are not the same.
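A quick sketch of the heavy-tail point, using a synthetic lognormal "earnings" sample (the parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Heavy right tail, as is typical of monetary variables.
earnings = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

def skewness(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

raw_skew = skewness(earnings)          # large and positive
log_skew = skewness(np.log(earnings))  # near zero: the log is normal here
```

The raw sample is strongly right-skewed while its logarithm is symmetric, which is why slopes in such models are usually read as approximate percentage effects rather than level effects.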
In macroeconometrics, R-squared is often very high; 0.8 or higher is not unusual. In microeconometrics, it is typically very low, with 0.1 not unusual. The reason might be that the number of observations in macroeconometrics is often much lower (e.g., 25 OECD countries) than in microeconometrics (e.g., 10,000 households), while the number of predictors is not that different.
Seeing regression as approximation of the regression function, R-squared does not have to be very close to one to justify a model, unlike the practice in physics and mechanics experiments. This is because we are only modeling the average effect, rather than eliminating the error term.
Regression diagnostics are procedures assessing the validity (adequacy) of a regression model, mostly for linear models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test: each provides guidance for further stages of a regression analysis.
Testing heteroskedasticity (of residuals):
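A minimal sketch of the Breusch-Pagan idea in NumPy (the LM-statistic form: n times the R-squared from regressing squared residuals on the predictors; the simulated data are illustrative):

```python
import numpy as np

def breusch_pagan_lm(X, y):
    """Breusch-Pagan LM statistic. Under homoskedasticity it is roughly
    chi-square with (number of slope predictors) degrees of freedom."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    u2 = (y - Z @ beta) ** 2          # squared OLS residuals
    gamma, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    resid = u2 - Z @ gamma
    r2 = 1.0 - resid @ resid / ((u2 - u2.mean()) @ (u2 - u2.mean()))
    return n * r2

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(1.0, 2.0, size=n)
y_homo = 1.0 + x + rng.normal(size=n)        # constant error variance
y_hetero = 1.0 + x + x * rng.normal(size=n)  # error variance grows with x
lm_homo = breusch_pagan_lm(x, y_homo)        # compare to chi2(1) 5% cutoff, 3.84
lm_hetero = breusch_pagan_lm(x, y_hetero)
```

The statistic stays small for the homoskedastic sample and far exceeds the cutoff for the heteroskedastic one.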
Testing correlation of residuals:
Testing (conditional) normality of residuals:
Graphical residual analysis: if the model form is adequate, residual scatter plots should appear to be a random field over all potential predictors. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable does show systematic structure, the model form should be adjusted in that predictor, or the variable should be included as a predictor if it is not already.
Lack-of-fit statistics/tests for model adequacy: (Testing model adequacy requires replicate measurements, e.g. validation and cross validation.)
Multicollinearity:
Change of model structure between groups of observations
Comparing model structures
Outliers: observations that deviate markedly from the other observations in the sample.
Influential observations: observations that have a relatively large effect on the regression model's predictions. High leverage points are observations extreme in predictor space; an influential observation typically combines high leverage with a large residual.
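Leverage is read off the diagonal of the hat matrix. A minimal NumPy sketch, with one artificially extreme observation for illustration:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'. The h_ii sum to the
    number of columns of X, so values far above that average flag high
    leverage points."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

rng = np.random.default_rng(7)
x = rng.normal(size=50)
x[0] = 10.0  # one observation far out in predictor space
X = np.column_stack([np.ones(50), x])
h = leverages(X)  # h[0] dwarfs the average leverage 2/50
```

A common rule of thumb flags observations with leverage more than two or three times the average; combining leverage with the residual gives influence measures such as Cook's distance.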
MLE (for classification models), GMM
Asymptotic Inference: Asymptotic distribution of regression estimators.
Observational data (case control sampling) vs. experimental data (design of experiments, DOE).
Model and estimator are in some way analogous to simulation and estimation of random processes.