The theory of machine learning is typically discussed from two perspectives: computational learning theory and statistical learning theory.
Given a space of features \( X \), a space of labels \( Y \), and a joint probability distribution \( P(x,y) \), the currently available observations form a random sample drawn from \( P(x,y) \).
A concept over a domain \( X \) is a total Boolean function over \( X \): \( c: X \to \{0,1\} \); a concept class is a class of concepts.
A hypothesis is a function between the two spaces: \( h: X \to Y \); a hypothesis class is a class of hypotheses.
A loss function \( L(\hat{y}, y) \) measures the discrepancy between the prediction of a hypothesis and the true outcome.
The risk associated with a hypothesis \( h \) is its expected loss over the joint distribution: \( R(h) = \mathbf{E}[L(h(x), y)] \).
The goal of a learning algorithm is to find a hypothesis \( h^* \) within a fixed hypothesis class \( \mathcal{H} \) that minimizes the risk: \( h^* = \arg\min_{h \in \mathcal{H}} R(h) \).
Stability, aka algorithmic stability, is a notion of how much the output of a machine learning algorithm is perturbed by small changes to its inputs.
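One standard formalization is uniform stability (Bousquet & Elisseeff): an algorithm is \( \beta \)-uniformly stable if, for every training set \( S \) of size \( m \), every index \( i \), and every point \( z = (x, y) \),

\[ \sup_{S,\, z} \, \bigl| L(h_S(x), y) - L(h_{S^{\setminus i}}(x), y) \bigr| \leq \beta, \]

where \( h_S \) is the hypothesis learned from \( S \) and \( S^{\setminus i} \) is \( S \) with the \( i \)-th example removed.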
Empirical risk is the average loss on the training set: \( R_\text{emp}(h) = \frac{1}{m} \sum_{i=1}^m L(h(x_i), y_i) \).
Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis \( \hat{h} \) that minimizes the empirical risk: \( \hat{h} = \arg\min_{h \in \mathcal{H}} R_\text{emp}(h) \).
The ERM principle thus defines a family of learning algorithms.
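A minimal sketch of ERM, assuming a toy finite hypothesis class of one-dimensional threshold classifiers under 0-1 loss (the class, data, and function names here are illustrative):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return (y_pred != y_true).astype(float)

def empirical_risk(h, xs, ys):
    """Average loss of hypothesis h on the training sample."""
    return zero_one_loss(h(xs), ys).mean()

def erm(hypothesis_class, xs, ys):
    """Empirical risk minimization: return the hypothesis in the
    (finite) class with the smallest average training loss."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, xs, ys))

# Toy hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
thresholds = np.linspace(-3, 3, 61)
hypothesis_class = [lambda xs, t=t: (xs >= t).astype(int) for t in thresholds]

# Synthetic sample where the true concept is 1[x >= 0.5].
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = (xs >= 0.5).astype(int)

h_hat = erm(hypothesis_class, xs, ys)
print("empirical risk of ERM hypothesis:", empirical_risk(h_hat, xs, ys))
```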
In supervised learning, the predictive accuracy of a learning algorithm on previously unseen data is measured by its generalization error, aka out-of-sample error: \( G = R(h) - R_\text{emp}(h) \).
An algorithm is said to generalize if its generalization error converges to zero as the sample size grows: \( \lim_{m \to \infty} G = 0 \).
An algorithm generalizes if, for example, its hypothesis class admits uniform convergence of empirical risk to true risk (e.g., it has finite VC dimension), or if the algorithm itself is uniformly stable.
Although the generalization error cannot be computed, it may be bounded in probability: \( P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta \). The plot of generalization bounds against experience is called the learning curve: \( \bar{\epsilon}(m;\delta) \).
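As a worked example for a finite hypothesis class (a standard result, assuming a loss bounded in \( [0,1] \)), Hoeffding's inequality plus a union bound gives

\[ P\left( \sup_{h \in \mathcal{H}} |R(h) - R_\text{emp}(h)| \leq \epsilon \right) \geq 1 - 2|\mathcal{H}| e^{-2m\epsilon^2}, \]

so setting the failure probability \( 2|\mathcal{H}| e^{-2m\epsilon^2} \) equal to \( \delta \) yields the learning curve \( \bar{\epsilon}(m;\delta) = \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2m}} \).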
Generalization error can be decomposed into three types of error: bias, variance, and irreducible error (the noise inherent in the problem).
High bias can cause an algorithm to underfit: miss the relevant relations between features and target outcomes. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outcomes. Regularization refers to techniques that prevent overfitting and improve stability.
The bias–variance tradeoff refers to the observation that bias and variance cannot be simultaneously minimized.
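A minimal sketch of how regularization trades variance for bias, assuming polynomial least squares with an L2 (ridge) penalty; the degree, penalty strengths, and data here are illustrative:

```python
import numpy as np

def fit_ridge_poly(xs, ys, degree, lam):
    """Fit a degree-`degree` polynomial by ridge regression:
    minimize ||Phi w - y||^2 + lam * ||w||^2."""
    Phi = np.vander(xs, degree + 1)                  # polynomial features
    A = Phi.T @ Phi + lam * np.eye(degree + 1)       # regularized normal equations
    return np.linalg.solve(A, Phi.T @ ys)

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 30)
ys = np.sin(np.pi * xs) + 0.3 * rng.normal(size=30)  # noisy target

# lam = 0: high-variance fit that chases the noise (overfitting);
# large lam: high-bias fit that flattens the curve (underfitting).
for lam in (0.0, 1e-3, 10.0):
    w = fit_ridge_poly(xs, ys, degree=15, lam=lam)
    train_mse = np.mean((np.vander(xs, 16) @ w - ys) ** 2)
    print(f"lam={lam:g}: training MSE={train_mse:.4f}")
```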
The length of a concept \( c \) in a concept class \( C \), denoted \( \text{size}(c) \), is the length of the shortest bit string that can represent \( c \) in \( C \).
Given the concept classes containing target concepts \( C \) and hypotheses \( H \), a set \( S \) of \( m \) samples of maximum length \( n \) labeled according to a concept \( c \in C \), and constants \( \alpha \geq 0 \), \( 0 \leq \beta < 1 \), a learning algorithm \( L \) is an (α,β)-Occam algorithm for \( C \) using \( H \) if it outputs a hypothesis \( h \in H \) such that \( h \) is consistent with \( c \) on \( S \): \( h(x) = c(x), \forall x \in S \), and \( h \) is succinct: \( \text{size}(h) \leq (n \cdot \text{size}(c))^\alpha m^\beta \). An Occam algorithm is efficient if it runs in time polynomial in \( n \), \( m \), and \( \text{size}(c) \).
A concept class \( C \) is Occam learnable with respect to a hypothesis class \( H \) if there exists an efficient Occam algorithm for \( C \) using \( H \).
PAC theory inspired boosting.
VC theory led to support vector machines.
Bayesian inference led to Bayesian networks.