A model form, often simply called a model, specifies a set of probability distributions of an outcome variable conditioned on some predictors. Specifically, a (parametric) model is a family of conditional probability distributions indexed by its parameters.

In comparison, an estimator specifies the model parameters that are optimal under some criterion, as a function of a sample.

A fitted model specifies a probability distribution of the outcome variable conditioned on predictors; it is the element of the model form selected by the estimator given the observed sample.

Stages of regression (bottom-up view):

  1. No predictor: sample mean (null/univariate model);
  2. One predictor: simple linear regression or other one-predictor models; local regression, cubic smoothing splines;
  3. Multiple predictors: linear regression; linear and cubic splines;
  4. Omitted covariates: check residual scatter plot for uncaptured pattern;
  5. The modeling procedure continues until the model is valid (adequate) and the estimated prediction error shrinks below a level you consider negligible; a minimal end-to-end sketch follows this list.
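
A minimal sketch of these stages in Python (NumPy only; the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): two predictors with a linear signal plus noise.
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Stage 1 -- no predictor: the sample mean is the fitted null model.
y_null = np.full(n, y.mean())

# Stage 2 -- one predictor: simple linear regression of y on x1.
X1 = np.column_stack([np.ones(n), x1])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Stage 3 -- multiple predictors: add x2 to the design matrix.
X2 = np.column_stack([np.ones(n), x1, x2])
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# Stage 4 -- residual check: residuals should show no pattern against
# fitted values or omitted covariates (inspect a scatter plot in practice).
resid = y - X2 @ beta2
print(beta2, resid.std())
```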

Model Form

A regression model, parametric or not, is a family of conditional probability models whose expectation function approximates the regression function.

\[ Y \mid \mathbf{X} \sim P(\mathbf{X}, \theta) \]

Linear Regression Model

A linear regression model (LM) gives a linear approximation of the regression function, with slope parameters and an intercept; with a single predictor, it is called simple linear regression.

\[ Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X} \cdot \beta, \sigma^2) \]

Generalized Linear Model

A generalized linear model (GLM) is a linear model composed with a link function \(g(y)\) {Nelder1972}: square (square root), reciprocal, logarithm, sinusoid; Box-Cox transformations. Related models and special cases:

  • polynomial, rational function;
  • linear, cubic and cubic smoothing splines;
  • generalized additive model (GAM);
  • binary outcomes (K=2): logistic regression (logit), probit regression;
  • count data: Poisson regression (log-linear); the logit and Poisson cases are sketched after this list.
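
A minimal sketch of the logit and log-linear cases, assuming simulated data and the statsmodels GLM interface:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))  # intercept plus two predictors

# Logistic regression (logit link) for a binary outcome.
eta = X @ np.array([0.5, 1.0, -1.0])
y_bin = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

# Poisson regression (log link) for count data.
y_cnt = rng.poisson(np.exp(X @ np.array([0.2, 0.4, -0.3])))
pois_fit = sm.GLM(y_cnt, X, family=sm.families.Poisson()).fit()

print(logit_fit.params, pois_fit.params)
```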

Nonlinear models: step function (piecewise null model);

Hierarchical/multilevel model: adding interactions to main effects.

The hierarchy principle:

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.
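
With a formula interface, the principle is easy to respect. A sketch using the statsmodels formula API, where `x1 * x2` expands to both main effects plus their interaction (data simulated for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 1 + df.x1 + 0.5 * df.x2 + 2 * df.x1 * df.x2 + rng.normal(size=300)

# "x1 * x2" expands to x1 + x2 + x1:x2, so the main effects stay in the
# model alongside the interaction, as the hierarchy principle requires.
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
```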

Estimator

An estimator is the cost function (optimization criterion) minimized on training data: least squares (LS), ridge, lasso.
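
A sketch contrasting the three estimators on one simulated design, using scikit-learn (the alpha values are arbitrary choices here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta = np.array([3.0, -2.0] + [0.0] * 8)        # sparse "true" coefficients
y = X @ beta + rng.normal(scale=0.5, size=100)

for est in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    est.fit(X, y)
    print(type(est).__name__, np.round(est.coef_, 2))
# Lasso tends to zero out the irrelevant coefficients; ridge only shrinks them.
```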

Least Squares Estimator
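
The least squares estimator minimizes the residual sum of squares on the training data; for an LM with a full-rank design matrix \(X\), it has the standard closed form

\[ \hat{\beta} = \arg\min_{\beta} \lVert y - X \beta \rVert^2 = (X^{T} X)^{-1} X^{T} y \]

(In practice, it is computed via a QR or singular value decomposition rather than by inverting \(X^{T} X\).)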

Model Assessment

There are two separate concerns regarding a fitted model: validity and prediction error. A fitted model is valid (or adequate) if the residuals show no remaining structure, i.e. they behave according to a univariate model. The "gold standard" measure for the prediction error of a fitted model is its mean square error (MSE) or root MSE (RMSE) on new observations.

R-squared \( R^2 \), or the coefficient of determination, is the proportion of sample variance explained by a fitted model; in other words, it is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS).

\[ R^2 = \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2} \]

Properties:

  1. R-squared is within [0,1].
  2. R-squared never decreases as the number of predictors increases.
  3. With a single predictor in an LM, the square root of R-squared equals the correlation coefficient in absolute value.

In a linear model with a single regressor and a constant term, the coefficient of determination \( R^2 \) is the square of the correlation between the regressor and the dependent variable,

\[ R^2 = \left( \frac{ \widehat{\mathrm{cov}(X,Y)} }{\hat{\sigma}_X \hat{\sigma}_Y} \right)^2 = \frac{\left( \frac{1}{n} \sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \right)^2} { \left( \frac{1}{n}\sum\limits_{i=1}^n (x_i-\bar{x})^2 \right) \left( \frac{1}{n}\sum\limits_{i=1}^n (y_i-\bar{y})^2 \right) } \]
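
A quick numerical check of this identity (NumPy; simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Fit simple linear regression and compute R-squared as ESS/TSS.
b1, b0 = np.polyfit(x, y, 1)          # slope, intercept
y_hat = b0 + b1 * x
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Compare with the squared sample correlation; the two agree.
print(r2, np.corrcoef(x, y)[0, 1] ** 2)
```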

If there is more than one regressor, \( R^2 \) can be seen as the square of the coefficient of multiple correlation:

\[ R^2 = \widehat{\mathrm{cov}(Y,\mathbf{X})} \left( \hat{\sigma}_Y^2 \widehat{\mathrm{cov}( \mathbf{X},\mathbf{X})} \right) ^{-1} \widehat{\mathrm{cov}(\mathbf{X},Y)} \]

\[ = (Y_0^{T} X_0) [(Y_0^{T} Y_0) (X_0^{T} X_0)]^{-1} (X_0^{T} Y_0) \]

where \(X_0\) and \(Y_0\) denote the column-centered data matrices.

The F statistic is the ratio of explained variance to unexplained variance; in the ANOVA setting, it is the ratio of between-group variance to within-group variance.

Model Selection

Model selection is the process of selecting the proper level of flexibility/complexity for a model form. Given any data set, LS gives the fit with minimal RSS within the default full model (the p-predictor LM). However, optimal training error does not imply optimal prediction error; chasing the former at the expense of the latter is what we call overfitting.

Model selection methods:

  • Subset selection: best subset (recommended for p<20), forward selection (also suitable for p>n), and backward selection.
  • Shrinkage methods (or regularization): ridge (\(L^2\) penalty, suitable for a dense "true" model), lasso (\(L^1\) penalty, has the variable-selection property, suitable for a sparse "true" model).
  • Selection by derived basis/directions: principal component regression (PCR); see the sketch after this list.
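
A minimal PCR sketch, assuming scikit-learn and a simulated design (the number of components, 5, is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 15))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

# Principal component regression: regress y on the first M principal
# components of X instead of the raw predictors.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))   # in-sample R-squared
```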

Types of LM model selection:

  • Group by subset size, \(m\): RSS vs adjusted error estimators (adjusted R-squared, AIC, BIC) or CV;
  • Group by derived basis size, \(M\): RSS vs CV;
  • Group by regularization parameter, \(\lambda\): regularized RSS vs CV;

One-standard-error rule of prediction error estimates (MSE): Choose the simplest fitted model (lowest \(m\) or \(M\); highest \(\lambda\)) with prediction MSE within one standard error of the smallest prediction MSE.
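
A sketch of the rule on hypothetical CV results (the MSE and SE values below are made up for illustration):

```python
import numpy as np

# Hypothetical CV results: mean prediction MSE and its standard error,
# indexed by increasing model complexity (e.g. subset size m = 0..5).
cv_mse = np.array([5.2, 3.1, 2.4, 2.3, 2.35, 2.5])
cv_se  = np.array([0.4, 0.3, 0.2, 0.2, 0.25, 0.3])

best = np.argmin(cv_mse)
threshold = cv_mse[best] + cv_se[best]
# Simplest model (smallest m) whose CV MSE is within one SE of the minimum.
chosen = np.argmax(cv_mse <= threshold)
print(f"minimizer m={best}, one-SE choice m={chosen}")
```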

Model Selection Criteria

Estimates of prediction error (MSE or RMSE):

  1. Adjusted estimates:
    • Adjusted R-squared: does not need error variance estimate, applies to \(p>n\);
    • Akaike information criterion (AIC); Mallows's \(C_p\) is equivalent to AIC for LM;
    • Bayesian information criterion (BIC);
  2. Direct estimates: Cross-validation (CV) prediction error;

Adjusted R-squared increases when one more predictor is added, if and only if the t-ratio of the new predictor is greater than 1 in absolute value.

\[ \bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)} \]

Here \(n\) is the sample size and \(k\) is the number of estimated coefficients, intercept included.
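
As a function (a direct transcription of the formula above):

```python
def adjusted_r2(rss: float, tss: float, n: int, k: int) -> float:
    """Adjusted R-squared; k counts all coefficients, intercept included."""
    return 1 - (rss / (n - k)) / (tss / (n - 1))

# Example: RSS = 20, TSS = 100, n = 50 observations, k = 3 coefficients.
print(adjusted_r2(rss=20.0, tss=100.0, n=50, k=3))   # ~0.791
```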

Resampling methods

Cross-validation (CV) estimates prediction error; the bootstrap estimates estimator variation. Cross-validation is preferred to a single validation-set split.
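
A minimal CV sketch, assuming scikit-learn (10 folds is a conventional choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.4, size=100)

# 10-fold CV estimate of prediction MSE for a linear model.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=10)
print(-scores.mean())
```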

The bootstrap repeatedly resamples the observed sample with replacement to estimate estimator variation (or any other population quantity).
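
A minimal bootstrap sketch (NumPy; simulated data), estimating the standard error of a regression slope:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=80)
y = 1.5 * x + rng.normal(size=80)

# Resample (x, y) pairs with replacement and refit, collecting the slope.
slopes = []
for _ in range(2000):
    idx = rng.integers(0, len(x), size=len(x))
    slopes.append(np.polyfit(x[idx], y[idx], 1)[0])

print(np.std(slopes))   # bootstrap estimate of the slope's standard error
```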

References: {ISL2015}, {ESL2013}


🏷 Category=Statistics