The theory of machine learning is typically discussed from two perspectives, computational learning theory and statistical learning theory.


Given a space of features $X$, a space of labels $Y$, and a joint probability distribution $P$ on $X \times Y$, currently available observations is a random sample of $(X,Y)$.

Concept over a domain $X$ is a total Boolean function: $c: X → \{0,1\}$; Concept class is a class of concepts.

Hypothesis is a function between the two spaces: $h: X → Y$; Hypothesis class is a class of hypotheses.

Loss function $L(\hat{y}, y)$ measures the distance between the prediction of a hypothesis and the true outcome. Risk associated with hypothesis $h(x)$ is its expected loss over the joint distribution: $R(h) = \mathbf{E}[L(h(x), y)]$.

Learning algorithm is a hypothesis $h^∗$ within a fixed hypothesis class $H$ that minimizes risk $R(h)$: $\min_{h \in \mathcal{H}} R(h)$.

Stability of a machine learning algorithm is a notion of how the algorithm is perturbed by small changes to its inputs.


Extracting features:

  • Feature extraction is representing observations with (numerical) attributes, aka features, by incorporating domain knowledge.
  • For a prediction (supervised learning) problem, a label can be separated from the features: think of $x$ (predictor variable) and $y$ (response/outcome variable) of statistics.

Partitioning observations:

  1. Training data: the observations used in building a statistical model.
    1. Validation data: the subset of training data reserved for a grid search (or other optimization methods) of the hyperparameter in model selection.
  2. Test data: the observations reserved for validating the statistical model.

Learning workflow:

  1. Model training: apply a learning algorithm to training data to obtain a statistical model.
  2. Model selection, aka hyperparameter optimization: choose the learning algorithm's hyperparameters $λ$, e.g. penalty on model complexity, that have the highest metric for model evaluation as evaluated by:
    • a held-out validation set from training-validation split;
    • cross validation on the training set;
  3. Model evaluation: evaluate the generalization performance of a model with some metric.

Generalization Error

In supervised learning, the predictive accuracy of a learning algorithm on previously unseen data is called generalization error, aka out-of-sample error: $G = R(h) - R_\text{emp}(h)$

An algorithm is said to generalize if its generalization error converges to zero in large sample: $\lim_{n \to \infty} G = 0$.

An algorithm will generalize if it satisfies the following conditions:

  1. Symmetry: the order of inputs does not affect the result.
  2. Bounded loss
  3. Leave-one-out cross-validation stability, $\text{CVloo}$: prediction error for each data point in leave-one-out cross validation converges to zero.
  4. Expected-leave-one-out error stability, $\text{Eloo}_\text{err}$: empirical leave-one-out risk converges to risk in probability.

Although generalization error cannot be computed, it may be bounded in probability: $P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta$. The plot of generalization bounds against experience is called learning curve: $\bar{\epsilon}(n;\delta)$.

Generalization error can be decomposed into three types of error:

  1. Bias (偏差) is error from erroneous assumptions in the learning algorithm. (prediction error on the training data)
  2. Variance (方差) is error from sensitivity to small fluctuations in the training set.
  3. Irreducible error is noise in the problem itself.

High bias can cause algorithm under-fitting: missing the relevant relations between features and target outcomes. High variance can cause over-fitting: modeling the random noise in the training data rather than the intended outcomes. Regularization (正则化) refers to techniques that prevent overfitting and improve stability.

The bias–variance trade-off refers to the observation that bias and variance cannot be simultaneously minimized:

MSE = variance + bias^2 + true variance


The Empirical Risk Minimization Principle

Empirical risk is the average loss on the training set: $R_\text{emp}(h) = 1/m \sum_{i=1}^m L(h(x_i), y_i)$

Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk: $\min_{h \in \mathcal{H}} R_\text{emp}(h)$.

The ERM principle thus defines a family of learning algorithms.

Occam Learning

Length of a concept $c$ in concept class $C$, denoted as $\text{size}(c)$, is the shortest bit string that can represent $c$ in $C$.

Given the concept classes containing target concepts $C$ and hypotheses $H$, a set $S$ of $m$ samples of maximum length $n$ labeled according to concepts $c ∈ C$, constants $α≥0, 0≤β<1$, a learning algorithm $L$ is an (α,β)-Occam algorithm for $C$ using $H$ if it outputs a hypothesis $h ∈ H$ such that

  • $h$ is consistent with $c$ on $S$: $h(x)=c(x), \forall x \in S$
  • $\text{size}(h) \le (n \cdot \text{size}(c))^\alpha m^\beta$

An Occam algorithm is efficient if it runs in polynomial time in $n$, $m$, and $\text{size}(c)$.

A concept class $C$ is Occam learnable with respect to a hypothesis class $H$ if there exists an efficient Occam algorithm for $C$ using $H$.

Probably Approximately Correct (PAC) Learning

PAC theory inspired boosting.

Vapnik-Chervonenkis (VC) Theory

VC theory led to support vector machines.


References given by Gerald Friedland: 1. Raul Rojas, 1996. Neural Networks: a Systematic Introduction 1. David MacKay, 2003. Information Theory, Inference and Learning

🏷 Category=Computation Category=Machine Learning