A model form, often simply called a model, specifies a set of probability distributions of an outcome variable conditioned on some predictors. Specifically, a (parametric) model is a family of conditional probability distributions indexed by its parameters.
In comparison, an estimator specifies the model parameters that are optimal under some criterion, as a function of a sample.
A fitted model specifies a single probability distribution of the outcome variable conditioned on predictors: it is the element of the model form selected by the estimator given the observed sample.
Stages of regression (bottom-up view):
A regression model is a family of conditional probability models, parametric or not, whose expectation function approximates the regression function.
$$Y \mid \mathbf{X} \sim P(\mathbf{X}, \theta)$$
A linear regression model (LM) gives a linear approximation of the regression function, with slope parameters and an intercept; with a single predictor it is called simple linear regression.
$$Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X} \cdot \beta, \sigma^2)$$
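As a minimal sketch of fitting this model form, the snippet below simulates data from a simple linear model and recovers the intercept, slope, and error variance by least squares. The function name `fit_simple_lm`, the true parameter values, and the random seed are illustrative assumptions, not part of the notes.

```python
import random

def fit_simple_lm(x, y):
    """Least-squares estimates (intercept, slope) for y ~ b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Simulated sample (assumed truth: intercept 1.0, slope 2.0, sigma 0.5).
rng = random.Random(0)
x = [i / 10 for i in range(100)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]

b0, b1 = fit_simple_lm(x, y)

# Unbiased estimate of sigma^2: residual sum of squares over (n - 2).
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (len(x) - 2)

print(f"intercept={b0:.3f} slope={b1:.3f} sigma^2={sigma2_hat:.3f}")
```

The fitted model here is the single distribution $\text{Normal}(\hat\beta_0 + \hat\beta_1 x, \hat\sigma^2)$ picked out of the model form by the least-squares estimator.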
A generalized linear model (GLM) relates the conditional mean of the outcome to a linear predictor through a link function $g$, with the outcome following an exponential-family distribution [@Nelder1972]:
$$g(\mathbb{E}[Y \mid \mathbf{X}]) = \mathbf{X} \cdot \beta$$
Nonlinear models: step function (piecewise null model);
Hierarchical/multilevel model: adding interactions to main effects.
The hierarchy principle:
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.
Graphical models (Bayesian network): Bayesian inference leads to Bayesian networks.
An estimator is defined by a cost function (optimization criterion) on training data: least squares (LS), ridge, lasso.
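To make the idea of an estimator as a function of the sample concrete, here is a small sketch for a single predictor with an unpenalized intercept. It assumes the ridge criterion is $\text{RSS} + \lambda\beta^2$ and the lasso criterion is $\tfrac{1}{2}\text{RSS} + \lambda|\beta|$; other scalings of $\lambda$ are common, and the function names and data are illustrative.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def slope_estimators(x, y, lam):
    """Slope under three criteria for one predictor (intercept unpenalized)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return {
        "ls": sxy / sxx,                          # minimizes RSS
        "ridge": sxy / (sxx + lam),               # minimizes RSS + lam * b^2
        "lasso": soft_threshold(sxy, lam) / sxx,  # minimizes RSS/2 + lam * |b|
    }

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
est = slope_estimators(x, y, lam=5.0)
print(est)
```

All three share the same model form; only the criterion differs. Ridge shrinks the slope proportionally, while lasso subtracts a fixed amount and can set the slope exactly to zero.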
There are two separate concerns regarding a fitted model: validity and prediction error. A fitted model is valid (or adequate) if its residuals behave according to a univariate model. The "gold standard" measure of prediction error for a fitted model is its mean squared error (MSE) or root MSE (RMSE) on new observations.
R-squared $R^2$, or coefficient of determination, is the proportion of sample variance explained by a fitted model; in other words, it is the ratio of the explained (regression) sum of squares (ESS) to the total sum of squares (TSS). For a least-squares fit with an intercept, this equals one minus the ratio of the residual sum of squares (RSS) to TSS.
$$R^2 = \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2}$$
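The following sketch computes $R^2$ for a simple least-squares fit in two ways, as explained over total sum of squares and as one minus residual over total sum of squares. The two agree because the fit is least squares with an intercept; the data and function name are illustrative.

```python
def r_squared_two_ways(x, y):
    """Return (ESS/TSS, 1 - RSS/TSS) for a simple least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]

    ess = sum((f - y_bar) ** 2 for f in y_hat)           # explained
    rss = sum((yi - f) ** 2 for yi, f in zip(y, y_hat))  # residual
    tss = sum((yi - y_bar) ** 2 for yi in y)             # total
    return ess / tss, 1.0 - rss / tss

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_explained, r2_residual = r_squared_two_ways(x, y)
print(r2_explained, r2_residual)
```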
Properties:
In a linear model with a single regressor and a constant term, the coefficient of determination $R^2$ is the square of the correlation between the regressor and the dependent variable,
$$R^2 = \left( \frac{ \widehat{\mathrm{cov}(X,Y)} }{\hat{\sigma}_X \hat{\sigma}_Y} \right)^2 = \frac{\left( \frac{1}{n} \sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \right)^2} { \left( \frac{1}{n}\sum\limits_{i=1}^n (x_i-\bar{x})^2 \right) \left( \frac{1}{n}\sum\limits_{i=1}^n (y_i-\bar{y})^2 \right) }$$
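A quick numerical check of this property, as a sketch on made-up data: the squared sample correlation between the regressor and the outcome should match the $R^2$ of the simple least-squares fit. The helper names are illustrative.

```python
import math

def sample_correlation(x, y):
    """Pearson correlation; the 1/n factors cancel in the ratio."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared_simple_lm(x, y):
    """R^2 = 1 - RSS/TSS for a least-squares line with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss

x = [0.5, 1.5, 2.0, 3.5, 4.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
corr_sq = sample_correlation(x, y) ** 2
r2 = r_squared_simple_lm(x, y)
print(corr_sq, r2)
```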
If the number of regressors is more than one, $R^2$ can be seen as the square of a coefficient of multiple correlation.
$$R^2 = \widehat{\mathrm{cov}(Y,\mathbf{X})} \left( \hat{\sigma}_Y^2 \widehat{\mathrm{cov}( \mathbf{X},\mathbf{X})} \right) ^{-1} \widehat{\mathrm{cov}(\mathbf{X},Y)}$$
$$= (Y_0^{T} X_0) [(Y_0^{T} Y_0) (X_0^{T} X_0)]^{-1} (X_0^{T} Y_0)$$
The F-statistic is the ratio of explained variance to unexplained variance, each divided by its degrees of freedom; in analysis of variance it is the ratio of between-group variance to within-group variance.
Model selection is the process of selecting the proper level of flexibility/complexity for a model form. Given any data set, LS gives the fit with the smallest residual sum of squares (RSS) within the default full model (p-dimensional LM). However, optimal training error does not imply optimal prediction error; a model that fits the training data closely but predicts new observations poorly is said to overfit.
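For the regression case, a sketch of the overall F-statistic for a simple linear fit follows. It assumes $k$ counts all fitted coefficients including the intercept, so the explained part has $k-1$ degrees of freedom and the residual part has $n-k$; the data and names are illustrative.

```python
def f_statistic_simple_lm(x, y):
    """Return (F, R^2) for a least-squares line with intercept."""
    n = len(x)
    k = 2  # intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]

    ess = sum((f - y_bar) ** 2 for f in y_hat)
    rss = sum((yi - f) ** 2 for yi, f in zip(y, y_hat))
    f_stat = (ess / (k - 1)) / (rss / (n - k))
    return f_stat, ess / (ess + rss)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
f_stat, r2 = f_statistic_simple_lm(x, y)

# Equivalent form in terms of R^2 (n = 5, k = 2 here).
f_from_r2 = (r2 / (2 - 1)) / ((1.0 - r2) / (5 - 2))
print(f_stat, f_from_r2)
```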
Model selection methods:
Types of LM model selection:
One-standard-error rule of prediction error estimates (MSE): Choose the simplest fitted model (lowest $m$ or $M$, highest $\lambda$) with prediction MSE within one standard error of the smallest prediction MSE.
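The rule can be sketched as a small selection function. The candidate table below is made-up illustrative input, not measured results; each entry is (complexity, estimated prediction MSE, standard error of that estimate), and lower complexity is assumed to mean a simpler model.

```python
def one_standard_error_choice(candidates):
    """Pick the simplest model whose MSE is within one SE of the best MSE.

    candidates: list of (complexity, mse, se) tuples.
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    return min(eligible, key=lambda c: c[0])[0]

# Hypothetical cross-validation summary for models of size 1..5.
cv_table = [
    (1, 5.0, 0.40),
    (2, 3.1, 0.30),
    (3, 2.6, 0.30),
    (4, 2.5, 0.30),  # smallest estimated MSE
    (5, 2.7, 0.35),
]
chosen = one_standard_error_choice(cv_table)
print(chosen)
```

For a penalty parameter such as $\lambda$, where larger values mean simpler models, the same function applies if complexity is encoded as $-\lambda$.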
Estimates of prediction error (MSE or RMSE):
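Adjusted R-squared penalizes $R^2$ for the number of fitted coefficients $k$ (including the intercept). When one more predictor is added, it increases if and only if the absolute value of the t-ratio of the new predictor is greater than 1.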
$$\bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)}$$
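A direct transcription of this formula, as a sketch: `k` is assumed to count all fitted coefficients including the intercept, and the input numbers are made up for illustration.

```python
def adjusted_r_squared(rss, tss, n, k):
    """Adjusted R^2 from residual and total sums of squares."""
    return 1.0 - (rss / (n - k)) / (tss / (n - 1))

# Hypothetical fit: 30 observations, 3 coefficients.
rss, tss, n, k = 12.0, 40.0, 30, 3
r2 = 1.0 - rss / tss
adj_r2 = adjusted_r_squared(rss, tss, n, k)

# Same quantity written in terms of plain R^2.
adj_from_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
print(r2, adj_r2, adj_from_r2)
```

Because $(n-1)/(n-k) > 1$ whenever $k > 1$, the adjusted value is always below the plain $R^2$ for a fit that is not perfect.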
Cross-validation (CV) estimates prediction error; bootstrap estimates estimator variation. Cross-validation is generally preferred to a single validation split, because every observation is used for both fitting and evaluation.
Bootstrap is repeated resampling with replacement from the observed sample, used to estimate estimator variation (or any other population information).
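A minimal k-fold cross-validation sketch on simulated data, comparing a null model (predict the training mean) with a simple least-squares line. The function names, fold count, seed, and simulated truth are illustrative assumptions.

```python
import random

def fit_null(x, y):
    """Null model: always predict the training mean."""
    m = sum(y) / len(y)
    return lambda xi: m

def fit_lm(x, y):
    """Simple least-squares line; returns a prediction function."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda xi: intercept + slope * xi

def k_fold_cv_mse(x, y, fit, k=5, seed=0):
    """Average squared error on held-out folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::k] for j in range(k)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        sq_errors.extend((y[i] - predict(x[i])) ** 2 for i in held_out)
    return sum(sq_errors) / len(sq_errors)

# Simulated sample (assumed truth: y = 1 + 2x + Normal(0, 0.5^2) noise).
rng = random.Random(1)
x = [i / 10 for i in range(60)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]

cv_null = k_fold_cv_mse(x, y, fit_null)
cv_lm = k_fold_cv_mse(x, y, fit_lm)
print(f"null: {cv_null:.3f}  lm: {cv_lm:.3f}")
```

Each observation is held out exactly once, so the estimate uses the whole sample for evaluation without ever scoring a point with a model that was fitted to it.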
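As a sketch of the bootstrap, the snippet below resamples (x, y) pairs with replacement and uses the spread of the refitted slopes as an estimate of the slope estimator's standard error. The number of replicates, the seeds, and the simulated data are illustrative assumptions.

```python
import math
import random

def ls_slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

def bootstrap_se_slope(x, y, n_boot=500, seed=0):
    """Standard deviation of the slope over bootstrap resamples of pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        slopes.append(ls_slope([x[i] for i in idx], [y[i] for i in idx]))
    mean = sum(slopes) / n_boot
    return math.sqrt(sum((s - mean) ** 2 for s in slopes) / (n_boot - 1))

# Simulated sample (assumed truth: slope 2.0, noise sd 0.5).
rng = random.Random(2)
x = [i / 10 for i in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]

se_boot = bootstrap_se_slope(x, y)
print(f"slope={ls_slope(x, y):.3f}  bootstrap SE={se_boot:.3f}")
```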
Regression diagnostics are procedures assessing the validity (adequacy) of a regression model, mostly for linear models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test: each provides guidance for further stages of a regression analysis.
Every regression should begin and end with verifying a univariate model (not necessarily Gaussian) against real data, known as univariate analysis.
Graphical analysis: (4-plot)
Samples can be alternatively seen as time-series, and a time-series is essentially a vector sequentially measured at constant frequency. This fact is exploited in some of the univariate analysis techniques to detect the peculiarities of time-series data: trend, seasonality, autocorrelation. If any of these are detected in univariate analysis, follow-up analysis is needed.
Testing heteroskedasticity (of residuals):
Testing correlation of residuals:
Testing (conditional) normality of residuals:
Graphical residual analysis: If the model form is adequate, scatter plots of the residuals against every potential predictor should look like random scatter. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable shows systematic structure, the model form should be adjusted in that predictor, or the variable should be included as a predictor if it is not one already.
Lack-of-fit statistics/tests for model adequacy: (Testing model adequacy requires replicate measurements, e.g. validation and cross validation.)
Multicollinearity:
Change of model structure between groups of observations
Comparing model structures
Outliers: observations that deviate markedly from other observations in the sample.
Influential observations (high leverage points): observations that have a relatively large effect on the regression model's predictions.
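Two common diagnostics for such observations are leverage and Cook's distance; the notes do not name them, so treat this as one possible sketch for simple linear regression. It assumes the standard formulas $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$ and $D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_i}{(1-h_i)^2}$ with $p = 2$ coefficients; the data are made up, with the last point placed far from the others.

```python
def leverage_and_cooks(x, y):
    """Leverage h_i and Cook's distance D_i for a simple least-squares line."""
    n = len(x)
    p = 2  # intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance estimate
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    cooks = [(e ** 2 / (p * s2)) * h / (1.0 - h) ** 2
             for e, h in zip(resid, lev)]
    return lev, cooks

x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 15.0]
lev, cooks = leverage_and_cooks(x, y)
for xi, h, d in zip(x, lev, cooks):
    print(f"x={xi:5.1f}  leverage={h:.3f}  cook={d:.3f}")
```

Leverage depends only on the predictor values and sums to the number of coefficients, whereas Cook's distance also uses the residual, so a high-leverage point is influential only if it also pulls the fit away from where the other points would put it.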
References: [@ISL2015], [@ESL2013]