A model form, often simply called a model, specifies a set of probability distributions of an outcome variable conditioned on some predictors. Specifically, a (parametric) model is a family of conditional probability distributions indexed by its parameters.

In comparison, an estimator specifies the optimal model parameters under some criterion, as a function of a sample.

A fitted model specifies a probability distribution of the outcome variable conditioned on predictors; it is the element of the model form selected by the estimator given the observed sample.

Stages of regression (bottom-up view):

  1. No predictor: sample mean (null/univariate model);
  2. One predictor: simple linear regression or other single-predictor models; local regression, cubic smoothing splines;
  3. Multiple predictors: linear regression; linear and cubic splines;
  4. Omitted covariates: check residual scatter plot for uncaptured pattern;
  5. The modeling procedure continues until the model is valid (adequate) and prediction-error estimates fall below a level you consider negligible (see the sketch after this list).
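
A minimal sketch of this bottom-up progression on simulated data (the variable names and the data-generating process are illustrative assumptions, not from any particular study):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Stage 1: no predictor -- the sample mean is the fitted null model.
null_pred = np.full(n, y.mean())

# Stage 2: one predictor -- simple linear regression on x1.
simple = sm.OLS(y, sm.add_constant(x1)).fit()

# Stage 3: multiple predictors -- linear regression on x1 and x2.
multiple = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Stage 4: check residuals of the simple fit against the omitted covariate x2;
# a clear trend suggests x2 should enter the model.
print("corr(resid, x2):", round(np.corrcoef(simple.resid, x2)[0, 1], 2))

# Stage 5: stop once residuals look adequate and prediction error is acceptable.
for name, resid in [("null", y - null_pred), ("simple", simple.resid), ("multiple", multiple.resid)]:
    print(name, "RSS =", round((resid ** 2).sum(), 1))
```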

Model Form

A regression model is a family of conditional probability distributions whose expectation function approximates the regression function; the model may be parametric or nonparametric.

$$Y \mid \mathbf{X} \sim P(\mathbf{X}, \theta)$$

Linear Regression Model

Linear regression model (LM) gives a linear approximation of the regression function, with slope parameters and an intercept; with a single predictor it is called simple linear regression.

$$Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X} \cdot \beta, \sigma^2)$$
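
A minimal sketch that simulates directly from this model form and recovers the parameters by ordinary least squares (the chosen values of $\beta$ and $\sigma$ are arbitrary assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
beta = np.array([1.0, 2.0, -3.0])          # intercept and two slopes (assumed values)
sigma = 0.7                                # residual standard deviation (assumed value)

X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.normal(loc=X @ beta, scale=sigma)  # Y | X ~ Normal(X.beta, sigma^2)

fit = sm.OLS(y, X).fit()
print(fit.params)                          # estimates close to beta
print(np.sqrt(fit.scale))                  # residual SD estimate close to sigma
```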

Generalized Linear Model

Generalized linear model (GLM) is a linear model composed with a link function $g$ [@Nelder1972]: the link of the conditional mean is linear in the predictors, and the response follows an exponential-family distribution.

$$g(\mathbb{E}[Y \mid \mathbf{X}]) = \mathbf{X} \cdot \beta$$

  • generalized additive model (GAM);
  • Box-Cox transformation: $(X^\lambda - 1) / \lambda$, if $\lambda \ne 0$; logarithm $\ln(X)$, if $\lambda = 0$;
  • linear, cubic and cubic smoothing splines;
  • binary outcomes (K=2): logistic regression (logit), probit regression;
  • count data: Poisson regression (log-linear);
  • seasonal time series: sinusoidal terms (see the GLM sketch after this list);
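
A minimal GLM sketch for two of the cases above, binary and count outcomes, using statsmodels (coefficients and sample size are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
X = sm.add_constant(rng.normal(size=(n, 1)))
eta = X @ np.array([-0.5, 1.5])            # linear predictor with assumed coefficients

# Binary outcome: logistic regression (logit link).
y_bin = rng.binomial(1, 1 / (1 + np.exp(-eta)))
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

# Count data: Poisson (log-linear) regression.
y_cnt = rng.poisson(np.exp(eta))
pois_fit = sm.GLM(y_cnt, X, family=sm.families.Poisson()).fit()

print(logit_fit.params, pois_fit.params)   # both close to (-0.5, 1.5)
```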

Other Models

Nonlinear models: step function (piecewise null model);

Hierarchical/multilevel model: adding interactions to main effects.

The hierarchy principle:

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.
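
A minimal sketch of the hierarchy principle with a formula interface: the term `x1 * x2` expands to both main effects plus their interaction (the data frame and coefficients are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
df["y"] = 1 + 0.2 * df.x1 + 0.1 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(scale=0.5, size=200)

# "x1 * x2" expands to x1 + x2 + x1:x2, so the main effects stay in the model
# alongside the interaction, as the hierarchy principle recommends.
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
```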

Graphical models (Bayesian networks): Bayesian inference leads to Bayesian networks.

Estimator

An estimator is the cost function (optimization criterion) applied to training data: least squares (LS), ridge, lasso.

Least Squares Estimators
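
A minimal sketch of the least squares estimator as the minimizer of the residual sum of squares, solved via the normal equations (the design matrix and coefficients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=100)

# Least squares minimizes ||y - X b||^2; the minimizer solves the
# normal equations (X'X) b = X'y.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_normal_eq, beta_lstsq))   # True
```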

Model Assessment

There are two separate concerns regarding a fitted model: validity and prediction error. A fitted model is valid (or adequate) if its residuals behave like a sample from a univariate model. The "gold standard" measure of prediction error of a fitted model is its mean square error (MSE) or root MSE (RMSE) on new observations.

R-squared $R^2$, or coefficient of determination, is the proportion of sample variance explained by a fitted model; in other words, it is the ratio of the explained (regression) sum of squares to the total sum of squares (TSS).

$$R^2 = \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2}$$

Properties:

  1. R-squared is within [0, 1].
  2. R-squared never decreases as the number of predictors increases.
  3. In an LM with a single predictor, the square root of R-squared equals the absolute value of the correlation coefficient between the predictor and the outcome.

In a linear model with a single regressor and a constant term, the coefficient of determination $R^2$ is the square of the correlation between the regressor and the dependent variable,

$$R^2 = \left( \frac{ \widehat{\mathrm{cov}(X,Y)} }{\hat{\sigma}_X \hat{\sigma}_Y} \right)^2 = \frac{\left( \frac{1}{n} \sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \right)^2} { \left( \frac{1}{n}\sum\limits_{i=1}^n (x_i-\bar{x})^2 \right) \left( \frac{1}{n}\sum\limits_{i=1}^n (y_i-\bar{y})^2 \right) }$$

If the number of regressors is more than one, $R^2$ can be seen as the square of the coefficient of multiple correlation; with $X_0$ and $Y_0$ denoting the centered data matrices,

$$R^2 = \widehat{\mathrm{cov}(Y,\mathbf{X})} \left( \hat{\sigma}_Y^2 \widehat{\mathrm{cov}( \mathbf{X},\mathbf{X})} \right) ^{-1} \widehat{\mathrm{cov}(\mathbf{X},Y)}$$

$$= (Y_0^{T} X_0) [(Y_0^{T} Y_0) (X_0^{T} X_0)]^{-1} (X_0^{T} Y_0)$$
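
A minimal numerical check of the multiple-correlation formula against the $R^2$ reported by an OLS fit (the data are simulated; the coefficients are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 400
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Centered data matrices reproduce R^2 from the formula above.
X0 = X - X.mean(axis=0)
Y0 = y - y.mean()
r2_formula = (Y0 @ X0) @ np.linalg.solve((Y0 @ Y0) * (X0.T @ X0), X0.T @ Y0)
print(np.isclose(r2_formula, fit.rsquared))   # True
```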

The F statistic is the ratio of explained variance to unexplained variance; equivalently, the ratio of between-group variance to within-group variance.

Model Selection

Model selection is the process of choosing the proper level of flexibility/complexity for a model form. Given any data set, LS gives the fit with the smallest RSS within the default full model (the p-dimensional LM). However, optimal training error does not imply optimal prediction error; this gap is what we call overfitting.

Model selection methods:

  • Subset selection: best subset (recommended for p < 20), forward selection (suitable for p > n), and backward selection.
  • Shrinkage methods (or regularization): ridge ($L^2$ penalty, suitable for a dense "true" model) and lasso ($L^1$ penalty, which has a "variable selection" property and suits a sparse "true" model); see the sketch after this list.
  • Selection by derived basis/directions: principal component regression (PCR).
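
A minimal sketch contrasting ridge and lasso on a sparse "true" model (penalty strengths and the data-generating process are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]                      # sparse "true" model: only two active predictors
y = X @ beta + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)         # L2 penalty: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.3).fit(X, y)          # L1 penalty: sets some coefficients exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))             # the zeros illustrate the variable-selection property
```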

Types of LM model selection:

  • Group by subset size, $m$: RSS vs adjusted error estimators (adjusted R-squared, AIC, BIC) or CV;
  • Group by derived basis size, $M$: RSS vs CV;
  • Group by regularization parameter, $\lambda$: regularized RSS vs CV;

One-standard-error rule for prediction-error estimates (MSE): choose the simplest fitted model (lowest $m$ or $M$, highest $\lambda$) whose prediction MSE is within one standard error of the smallest prediction MSE.
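
A minimal sketch of the one-standard-error rule on a lasso regularization path, using cross-validated MSE from scikit-learn (the data and the 5-fold CV setup are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

cv = LassoCV(cv=5, random_state=0).fit(X, y)
mean_mse = cv.mse_path_.mean(axis=1)                     # CV MSE per candidate lambda (alpha)
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

best = mean_mse.argmin()
threshold = mean_mse[best] + se_mse[best]
# Among all alphas whose CV MSE is within one SE of the minimum,
# pick the largest (most regularized, simplest) one.
alpha_1se = cv.alphas_[mean_mse <= threshold].max()
print(cv.alpha_, alpha_1se)                              # alpha_1se >= alpha_ (the CV minimizer)
```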

Model Selection Criteria

Estimates of prediction error (MSE or RMSE):

  1. Adjusted estimates:
    • Adjusted R-squared: does not need error variance estimate, applies to $p>n$;
    • Akaike information criterion (AIC): Mallows's $C_p$ is equivalent to AIC for LM;
    • Bayesian information criterion (BIC);
  2. Direct estimates: Cross-validation (CV) prediction error;

Adjusted R-squared increases when one more predictor is added, if and only if the absolute t-ratio of the new predictor is greater than 1.

$$\bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)}$$
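
A minimal sketch comparing adjusted R-squared, AIC, and BIC across nested linear models (the data-generating process, with only the first two predictors active, is an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 150
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Nested models using the first k predictors; the criteria stop improving
# once irrelevant predictors are added, unlike plain R^2.
for k in range(1, 6):
    fit = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(k, round(fit.rsquared_adj, 3), round(fit.aic, 1), round(fit.bic, 1))
```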

Resampling methods

Cross-validation (CV) estimates prediction error; the bootstrap estimates estimator variation. Cross-validation is preferred to a single validation set.

The bootstrap is repeated resampling with replacement, used to estimate estimator variation (or any other population quantity).
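
A minimal bootstrap sketch estimating the sampling variation of a slope estimator (the 1000 resamples and the simulated data are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = sm.add_constant(x)

# Resample rows with replacement, refit, and collect the slope estimates.
slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    slopes.append(sm.OLS(y[idx], X[idx]).fit().params[1])

print(np.std(slopes, ddof=1))            # bootstrap standard error of the slope
print(sm.OLS(y, X).fit().bse[1])         # analytic standard error, for comparison
```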

Regression Diagnostics

Regression diagnostics are procedures assessing the validity (adequacy) of a regression model, mostly for linear models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test: each provides guidance for further stages of a regression analysis.

Univariate analysis

Every regression should begin and end with verifying a univariate model (not necessarily Gaussian) against the data, a step known as univariate analysis.

Graphical analysis: (4-plot)

  1. Run sequence plot: shifts in location or scale.
  2. Lag plot: auto-correlation.
  3. Histogram: distribution.
  4. Normal probability plot: fit of normal distribution.

A sample can alternatively be seen as a time series, and a time series is essentially a vector measured sequentially at a constant frequency. Some univariate analysis techniques exploit this fact to detect the peculiarities of time-series data: trend, seasonality, and autocorrelation. If any of these are detected in univariate analysis, follow-up analysis is needed.
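
A minimal 4-plot sketch with matplotlib and scipy (Gaussian noise stands in for residuals or raw observations):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(10)
y = rng.normal(size=200)                  # stand-in for residuals or raw data

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].plot(y)
ax[0, 0].set_title("Run sequence plot")                  # shifts in location or scale
ax[0, 1].scatter(y[:-1], y[1:], s=10)
ax[0, 1].set_title("Lag plot (lag 1)")                   # auto-correlation
ax[1, 0].hist(y, bins=20)
ax[1, 0].set_title("Histogram")                          # distribution
stats.probplot(y, dist="norm", plot=ax[1, 1])
ax[1, 1].set_title("Normal probability plot")            # fit of normal distribution
fig.tight_layout()
plt.show()
```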

On Residuals

Testing heteroskedasticity (of residuals):

  1. regress squared residual on fitted value.
  2. regress squared residual on a quadratic function of fitted value.
  3. Formal tests:
    • Welch F-test (preferred), Brown-Forsythe test.
    • homogeneity of variances: Levene test, Bartlett's test (assumes normality); see the sketch after this list.
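
A minimal heteroskedasticity sketch: regress squared residuals on the fitted values (items 1 and 2), then apply formal group tests from scipy (the simulated data, with error variance growing in x, is an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(11)
n = 300
x = rng.uniform(0, 1, size=n)
y = 1 + 2 * x + rng.normal(scale=0.2 + x, size=n)   # error scale grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
sq_resid = fit.resid ** 2
fv = fit.fittedvalues

# 1-2. Regress squared residuals on the fitted values (and their square).
aux_lin = sm.OLS(sq_resid, sm.add_constant(fv)).fit()
aux_quad = sm.OLS(sq_resid, sm.add_constant(np.column_stack([fv, fv ** 2]))).fit()
print(aux_lin.f_pvalue, aux_quad.f_pvalue)          # small p-values flag heteroskedasticity

# 3. Formal tests on residual groups split at the median fitted value.
med = np.median(fv)
lo, hi = fit.resid[fv < med], fit.resid[fv >= med]
print(stats.levene(lo, hi, center="median"))        # Brown-Forsythe variant of Levene's test
print(stats.bartlett(lo, hi))                       # Bartlett's test (assumes normality)
```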

Testing correlation of residuals:

  1. serial correlation:
    • lag plot: scatter plot of a sequence against its lagged values, typically lag 1.
    • autocorrelation plots (ACF, PACF): autocorrelation coefficient at various time lags, with confidence band as reference lines.
    • estimate an autoregressive model at low lags (e.g. AR(1)) for the residuals.
    • runs test: standardized number of runs.
    • unit-root tests;
  2. intraclass correlation: large one-way ANOVA.
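
A minimal serial-correlation sketch on simulated AR(1) residuals (the autoregressive coefficient 0.6 is an illustrative assumption; the Durbin-Watson statistic is an extra convenience not named in the list above):

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(12)
e = np.zeros(300)
for t in range(1, 300):                  # AR(1) residuals with coefficient 0.6
    e[t] = 0.6 * e[t - 1] + rng.normal()

print(acf(e, nlags=5))                   # autocorrelation coefficients at lags 0..5
print(durbin_watson(e))                  # well below 2 under positive serial correlation
print(AutoReg(e, lags=1).fit().params)   # low-lag AR fit: slope near 0.6
```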

Testing (conditional) normality of residuals:

  1. Graphical techniques:
    • For location and scale families of distributions: probability plot (a QQ plot of the data against theoretical quantiles, such as the normal probability plot);
    • For finding the shape parameter: probability plot correlation coefficient (PPCC) plot;
  2. Goodness-of-fit tests for distributional adequacy:
    • General distributions: chi-squared goodness-of-fit test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling test;
    • Normality: Shapiro-Wilk test, Shapiro-Francia test;
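
A minimal normality-testing sketch with scipy (Gaussian noise stands in for model residuals; with parameters estimated from the data, the K-S p-value is only approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
resid = rng.normal(size=200)             # stand-in for model residuals

print(stats.shapiro(resid))              # Shapiro-Wilk test
print(stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1))))  # K-S test
print(stats.anderson(resid, dist="norm"))                                   # Anderson-Darling test
```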

On Predictors

Graphical residual analysis: If the model form is adequate, residual scatter plots should appear as a random field over all potential predictors. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable shows systematic structure, adjust the model form in that predictor, or include the variable as a predictor if it is not already one.

Lack-of-fit statistics/tests for model adequacy: (Testing model adequacy requires replicate measurements, e.g. validation and cross validation.)

  • Adequacy of existing predictors (misspecified or missing terms)
    • F-test for the ratio of "mean square for lack-of-fit" (on fitted model) to "mean square of pure error" (on replicated observations): $\hat{\sigma}_m^2 / \hat{\sigma}_r^2$
  • Adding predictors
    • t-test for inclusion of a single explanatory variable (statistical significance away from zero)
    • F-test for inclusion of a group of variables
  • Dropping predictors
    • t-test of parameter significance
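
A minimal lack-of-fit F-test sketch, using replicate measurements at each predictor level to separate pure error from lack of fit (the quadratic true curve and the deliberately misspecified straight-line fit are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(14)
levels = np.repeat(np.linspace(0, 1, 10), 5)     # 10 x-levels, 5 replicates each
y = 1 + 2 * levels + 3 * levels ** 2 + rng.normal(scale=0.3, size=levels.size)

fit = sm.OLS(y, sm.add_constant(levels)).fit()   # misspecified straight-line fit

# Pure error: variation of replicates around their group means.
groups = [y[levels == v] for v in np.unique(levels)]
ss_pe = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_pe = sum(len(g) - 1 for g in groups)

# Lack of fit: residual SS minus pure-error SS.
ss_lof = fit.ssr - ss_pe
df_lof = len(groups) - 2                         # number of levels minus number of model parameters

F = (ss_lof / df_lof) / (ss_pe / df_pe)
print(F, stats.f.sf(F, df_lof, df_pe))           # small p-value: the straight line lacks fit
```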

Multicollinearity:

  • unusually high standard errors of regression coefficients.
  • unusually high R-squared when you regress one explanatory variable on the others
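
A minimal multicollinearity sketch showing both symptoms, plus the variance inflation factor VIF = 1 / (1 - R^2) as a standard summary of the second one (the near-collinear simulated predictors are an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(15)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)         # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = x1 + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
print(fit.bse)                                   # inflated standard errors for x1 and x2

aux = sm.OLS(x1, sm.add_constant(x2)).fit()      # regress one explanatory variable on the other
print(aux.rsquared)                              # close to 1
print(variance_inflation_factor(X, 1))           # very large VIF for x1
```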

Change of model structure between groups of observations

Comparing model structures

On Subgroups of Observations

Outliers: observations that deviate markedly from other observations in the sample.

  • Graphical techniques: normal probability plot, box plot, histogram;
  • Z-score $z_i = (y_i - \bar{y}) / s$; modified Z-score $M_i = 0.6745 (y_i - \tilde{y}) / \text{MAD}$, where $\tilde{y}$ is the median and MAD is the median absolute deviation.
  • Formal outlier tests (for normally distributed data): Grubbs' test, Tietjen-Moore test, generalized extreme Studentized deviate (ESD) test.
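
A minimal sketch of the Z-score and modified Z-score formulas above (the data contain one planted outlier; the cutoffs 3 and 3.5 are common conventions, not prescribed here):

```python
import numpy as np

rng = np.random.default_rng(16)
y = np.append(rng.normal(size=100), 8.0)   # one planted outlier

z = (y - y.mean()) / y.std(ddof=1)         # Z-scores
mad = np.median(np.abs(y - np.median(y)))  # median absolute deviation
m = 0.6745 * (y - np.median(y)) / mad      # modified Z-scores

print(np.where(np.abs(z) > 3)[0])          # flagged by Z-score
print(np.where(np.abs(m) > 3.5)[0])        # flagged by modified Z-score
```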

Influential observations (high leverage points): observations that have a relatively large effect on the regression model's predictions.

References: [@ISL2015], [@ESL2013]


🏷 Category=Statistics