The theory of machine learning is typically discussed from two perspectives: computational learning theory and statistical learning theory.

Terminology

Given a space of features X, a space of labels Y and a joint probability distribution P(x,y), currently available observations is a random sample of (X,Y).

A concept over a domain X is a total Boolean function over X: c: X → {0,1}; a concept class is a class of concepts. A hypothesis is a function between the two spaces: h: X → Y; a hypothesis class is a class of hypotheses.

Loss function \( L(\hat{y}, y) \) measures the distance between the prediction of a hypothesis and the true outcome. Risk associated with hypothesis h(x) is its expected loss over the joint distribution: \( R(h) = \mathbf{E}[L(h(x), y)] \).

A learning algorithm is a hypothesis \( h^* \) within a fixed hypothesis class H that minimizes risk R(h): \( \min_{h \in \mathcal{H}} R(h) \).

Stability, aka algorithmic stability, is a notion of how a machine learning algorithm is perturbed by small changes to its inputs.

The Empirical Risk Minimization Principle

Empirical risk is the average loss on the training set: \( R_\text{emp}(h) = 1/m \sum_{i=1}^m L(h(x_i), y_i) \)

Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis \(\hat{h}\) which minimizes the empirical risk: \( \min_{h \in \mathcal{H}} R_\text{emp}(h) \).

The ERM principle thus defines a family of learning algorithms.

Generalization Error

In supervised learning, the predictive accuracy of a learning algorithm on previously unseen data is called generalization error, aka out-of-sample error: \( G = R(h) - R_\text{emp}(h) \)

An algorithm is said to generalize if its generalization error converges to zero in large sample: \( \lim_{n \to \infty} G = 0 \).

An algorithm will generalize if it satisfies the following conditions:

  1. Symmetry: the order of inputs does not affect the result.
  2. Bounded loss
  3. Leave-one-out cross-validation stability, \(CVloo\): prediction error for each data point in leave-one-out cross validation converges to zero.
  4. Expected-leave-one-out error stability, \(Eloo_\text{err}\): empirical leave-one-out risk converges to risk in probability.

Although generalization error cannot be computed, it may be bounded in probability: \( P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta \). The plot of generalization bounds against experience is called learning curve: \( \bar{\epsilon}(n;\delta) \).

Bias–Variance Tradeoff

Generalization error can be decomposed into three types of error:

  1. Bias (偏差) is error from erroneous assumptions in the learning algorithm. (prediction error on the training data)
  2. Variance (方差) is error from sensitivity to small fluctuations in the training set.
  3. Irreducible error is noise in the problem itself.

High bias can cause algorithm underfitting: missing the relevant relations between features and target outcomes. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outcomes. Regularization (正则化) refers to techniques that prevent overfitting and improve stability.

The bias–variance tradeoff refers to the observation that bias and variance cannot be simultaneously minimized.

Occam Learning

Length of a concept c in concept class C, denoted as size(c), is the shortest bit string that can represent c in C.

Given the concept classes containing target concepts C and hypotheses H, a set S of m samples of maximum length n labeled according to concepts c∈C, constants α≥0, 0≤β<1, a learning algorithm L is an (α,β)-Occam algorithm for C using H if it outputs a hypothesis h∈H such that

  • h is consistent with c on S: \( h(x)=c(x), \forall x \in S \)
  • \( \mathrm{size}(h) \le (n \cdot \mathrm{size}(c))^\alpha m^\beta \)

An Occam algorithm is efficient if it runs in polynomial time in n, m, and size(c).

A concept class C is Occam learnable with respect to a hypothesis class H if there exists an efficient Occam algorithm for C using H.

Probably Approximately Correct (PAC) Learning

PAC theory inspired boosting.

Vapnik-Chervonenkis (VC) Theory

VC theory led to support vector machines.

Bayesian Inference

Bayesian inference led to Bayesian networks.