A model form, often simply called a model, specifies a set of probability distributions of an outcome variable conditioned on some predictors. Specifically, a (parametric) model is a group of conditional probability distributions indexed by its parameters.

In comparison, an estimator specifies, as a function of a sample, the model parameters that are optimal under some criterion.

A fitted model specifies a single probability distribution of the outcome variable conditioned on the predictors; it is the element of the model form selected by the estimator given the observed sample.

Model Form

linear regression: simple linear regression;

\[ Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X} \cdot \beta, \sigma^2) \]

Generalized linear models (GLM): polynomial regression, splines; count data (e.g. Poisson regression); logistic regression (LR) (K=2);

Estimator

The estimator is the cost function (optimization criterion) minimized on training data: least squares (LS), ridge, lasso.
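As a sketch, the three criteria can be written as penalized residual sums of squares, where \(\lambda \ge 0\) is the regularization parameter (ridge and lasso reduce to LS at \(\lambda = 0\)):

\[ \hat{\beta}^{\text{LS}} = \arg\min_\beta \|y - X\beta\|_2^2, \quad \hat{\beta}^{\text{ridge}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2, \quad \hat{\beta}^{\text{lasso}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \]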

Least Squares Estimator
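Assuming the design matrix \(X\) has full column rank, the least squares estimator has the closed form

\[ \hat{\beta}^{\text{LS}} = \arg\min_\beta \|y - X\beta\|_2^2 = (X^T X)^{-1} X^T y \]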

Model Assessment

R-squared \( R^2 \), or coefficient of determination, is the proportion of sample variance explained by a fitted model; in other words, it is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS).

\[ R^2 = \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2} \]
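A minimal sketch of this computation in pure Python, with made-up data: fit a simple linear regression by least squares, then form \(R^2\) as the explained sum of squares over the total sum of squares. It also checks property 3 below, that with one predictor \(R^2\) is the squared correlation coefficient.

```python
# Toy data (hypothetical values, for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Simple linear regression coefficients by least squares.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar
y_hat = [beta0 + beta1 * xi for xi in x]

ess = sum((yh - y_bar) ** 2 for yh in y_hat)  # explained sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)      # total sum of squares
r2 = ess / tss

# With a single predictor, R^2 equals the squared correlation coefficient.
r = sxy / (sxx * tss) ** 0.5
print(round(r2, 4), round(r ** 2, 4))
```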

Properties:

  1. R-squared is within [0,1] for least-squares fits that include an intercept.
  2. R-squared never decreases as the number of predictors increases.
  3. With a single predictor in a linear model, the square root of R-squared equals the absolute value of the correlation coefficient.

In a linear model with a single regressor and a constant term, the coefficient of determination \( R^2 \) is the square of the correlation between the regressor and the dependent variable,

\[ R^2 = \left( \frac{ \widehat{\mathrm{cov}(X,Y)} }{\hat{\sigma}_X \hat{\sigma}_Y} \right)^2 = \frac{\left( \frac{1}{n} \sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \right)^2} { \left( \frac{1}{n}\sum\limits_{i=1}^n (x_i-\bar{x})^2 \right) \left( \frac{1}{n}\sum\limits_{i=1}^n (y_i-\bar{y})^2 \right) } \]

If the number of regressors is more than one, \( R^2 \) can be seen as the square of a coefficient of multiple correlation.

\[ R^2 = \widehat{\mathrm{cov}(Y,\mathbf{X})} \left( \hat{\sigma}_Y^2 \widehat{\mathrm{cov}( \mathbf{X},\mathbf{X})} \right) ^{-1} \widehat{\mathrm{cov}(\mathbf{X},Y)} \]

\[ = (Y_0^{T} X_0) [(Y_0^{T} Y_0) (X_0^{T} X_0)]^{-1} (X_0^{T} Y_0) \]

The F statistic is the ratio of explained variance to unexplained variance; equivalently, the ratio of between-group variance to within-group variance.

Mean squared error (MSE) of prediction is the gold standard; it exhibits the bias-variance trade-off:

MSE = variance + bias^2 + irreducible error (noise variance)
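A Monte Carlo sketch of this decomposition, with made-up numbers: the "fitted model" is the sample mean of n draws from Normal(mu, sigma^2), used to predict a fresh draw. The sample mean is unbiased, so the bias term is near zero, the variance term is near sigma^2/n, and the irreducible error is sigma^2.

```python
import random

random.seed(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 20000  # hypothetical parameters

preds = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    preds.append(sum(sample) / n)  # the "fitted model": sample mean

mean_pred = sum(preds) / reps
bias_sq = (mean_pred - mu) ** 2                              # ~ 0
variance = sum((p - mean_pred) ** 2 for p in preds) / reps   # ~ sigma^2/n
noise = sigma ** 2                                           # irreducible

mse = variance + bias_sq + noise  # expected squared prediction error
print(round(bias_sq, 4), round(variance, 4), round(mse, 4))
```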

Model Selection

Model selection is the process of selecting the proper level of flexibility/complexity for a model. Given any data set, LS gives the optimal fit in RSS within the default full model (the p-dimensional LM). However, optimal training error does not imply optimal prediction error; fitting noise in the training data is called overfitting.

Model selection methods:

  • Subset selection: best subset (recommended for p<20), forward selection (suitable for p>n), and backward selection.
  • Shrinkage methods (or regularization): ridge (\(L^2\) penalty, suitable for a dense "true" model), lasso (\(L^1\) penalty, has the "variable selection" property, suitable for a sparse "true" model).
  • Selection by derived basis/directions: principal component regression (PCR).

Types of LM model selection:

  • Group by subset size, \(m\): RSS vs adjusted error estimators (adjusted R-squared, AIC, BIC) or CV;
  • Group by derived basis size, \(M\): RSS vs CV;
  • Group by regularization parameter, \(\lambda\): regularized RSS vs CV;

One-standard-error rule of prediction error estimates (MSE): Choose the simplest fitted model (lowest \(m\) or \(M\); highest \(\lambda\)) with prediction MSE within one standard error of the smallest prediction MSE.
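The rule can be sketched as follows, using hypothetical CV results ordered from simplest to most complex model:

```python
# Hypothetical CV error means and standard errors, simplest model first.
cv_mean = [1.30, 1.10, 0.95, 0.90, 0.91, 0.93]
cv_se   = [0.08, 0.07, 0.06, 0.06, 0.06, 0.07]

# Index of the model with the smallest CV error.
best = min(range(len(cv_mean)), key=lambda i: cv_mean[i])
threshold = cv_mean[best] + cv_se[best]  # one SE above the minimum

# Simplest model whose CV error is within one SE of the minimum.
chosen = next(i for i in range(len(cv_mean)) if cv_mean[i] <= threshold)
print(best, chosen)
```

Here the minimum-error model is index 3, but the rule picks the simpler model at index 2, whose error 0.95 is within one SE (0.06) of the minimum 0.90.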

Model Selection Criteria

Estimates of prediction error (MSE or RMSE):

  1. Adjusted estimates:
    • Adjusted R-squared: doesn't need an error variance estimate; applies when p>n;
    • Akaike information criterion (AIC): Mallow's \(C_p\), equivalent to AIC for LM;
    • Bayesian information criterion (BIC);
  2. Direct estimates: Cross-validation (CV) prediction error;

Adjusted R-squared increases when one more predictor is added if and only if the t-ratio of the new predictor is greater than 1 in absolute value.

\[ \bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)} \]

where RSS is the residual sum of squares and \(k\) is the number of fitted parameters, including the intercept.
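Since \(R^2 = 1 - \text{RSS}/\text{TSS}\) for models with an intercept, the formula above can be rewritten as \(\bar{R}^2 = 1 - (1-R^2)(n-1)/(n-k)\). A minimal sketch with hypothetical numbers, showing that a larger model can have a higher \(R^2\) but a lower adjusted \(R^2\):

```python
def adjusted_r2(r2, n, k):
    # k counts all fitted parameters, including the intercept.
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Hypothetical fits on n = 20 observations: adding predictors raised
# R-squared from 0.90 to 0.91, yet adjusted R-squared falls.
small = adjusted_r2(0.90, 20, 3)  # 3-parameter model
large = adjusted_r2(0.91, 20, 9)  # 9-parameter model, higher R-squared
print(round(small, 4), round(large, 4))
```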

Resampling methods

Cross validation (CV) estimates prediction error; bootstrap estimates estimator variation.

Cross-validation is preferred to a single validation split: every observation is used for both fitting and evaluation, so the error estimate is less variable.

Bootstrap is repeated resampling with replacement, used to estimate the variation of an estimator (or any other population quantity).
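A minimal bootstrap sketch with made-up data: estimate the standard error of the sample median (a statistic with no simple closed-form SE) by resampling the data with replacement and taking the standard deviation of the resampled statistics.

```python
import random

random.seed(1)
data = [2.3, 3.1, 4.8, 5.0, 5.2, 6.7, 7.4, 8.9, 9.5, 11.2]  # toy sample

def median(xs):
    s = sorted(xs)
    m = len(s)
    return s[m // 2] if m % 2 else (s[m // 2 - 1] + s[m // 2]) / 2

B = 2000  # number of bootstrap resamples
stats = []
for _ in range(B):
    # Resample n observations with replacement from the original sample.
    resample = [random.choice(data) for _ in data]
    stats.append(median(resample))

# Bootstrap standard error: standard deviation of the resampled statistic.
mean_stat = sum(stats) / B
se = (sum((s - mean_stat) ** 2 for s in stats) / (B - 1)) ** 0.5
print(round(se, 3))
```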

References: {ISL2015}, {ESL2013}


🏷 Category=Statistics