The theory of machine learning is typically discussed from two perspectives: computational learning theory and statistical learning theory.
Given a space of features X
, a space of labels Y
and a joint probability distribution P(x,y)
, currently available observations is a random sample of (X,Y)
.
A concept over a domain X
is a total Boolean function over X
: c: X → {0,1}
;
a concept class is a class of concepts.
A hypothesis is a function between the two spaces: h: X → Y
;
a hypothesis class is a class of hypotheses.
Loss function \( L(\hat{y}, y) \) measures the distance between the prediction of a hypothesis and the true outcome.
Risk associated with hypothesis h(x)
is its expected loss over the joint distribution: \( R(h) = \mathbf{E}[L(h(x), y)] \).
A learning algorithm is a hypothesis \( h^* \) within a fixed hypothesis class H
that minimizes risk R(h)
: \( \min_{h \in \mathcal{H}} R(h) \).
Stability, aka algorithmic stability, is a notion of how a machine learning algorithm is perturbed by small changes to its inputs.
Extracting features:
x
(predictor variable) and y
(response/outcome variable) of statistics.Partitioning observations:
Learning workflow:
λ
, e.g. penalty on model complexity, that have the highest metric for model evaluation as evaluated by:
In supervised learning, the predictive accuracy of a learning algorithm on previously unseen data is called generalization error, aka out-of-sample error: \( G = R(h) - R_\text{emp}(h) \)
An algorithm is said to generalize if its generalization error converges to zero in large sample: \( \lim_{n \to \infty} G = 0 \).
An algorithm will generalize if it satisfies the following conditions:
Although generalization error cannot be computed, it may be bounded in probability: \( P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta \). The plot of generalization bounds against experience is called learning curve: \( \bar{\epsilon}(n;\delta) \).
Generalization error can be decomposed into three types of error:
High bias can cause algorithm under-fitting: missing the relevant relations between features and target outcomes. High variance can cause over-fitting: modeling the random noise in the training data rather than the intended outcomes. Regularization (正则化) refers to techniques that prevent overfitting and improve stability.
The bias–variance trade-off refers to the observation that bias and variance cannot be simultaneously minimized:
MSE = variance + bias^2 + true variance
Empirical risk is the average loss on the training set: \( R_\text{emp}(h) = 1/m \sum_{i=1}^m L(h(x_i), y_i) \)
Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis \(\hat{h}\) which minimizes the empirical risk: \( \min_{h \in \mathcal{H}} R_\text{emp}(h) \).
The ERM principle thus defines a family of learning algorithms.
Length of a concept c
in concept class C
, denoted as size(c)
, is the shortest bit string that can represent c
in C
.
Given the concept classes containing target concepts C
and hypotheses H
, a set S
of m
samples of maximum length n
labeled according to concepts c∈C
, constants α≥0, 0≤β<1
, a learning algorithm L
is an (α,β)-Occam algorithm for C
using H
if it outputs a hypothesis h∈H
such that
h
is consistent with c
on S
: \( h(x)=c(x), \forall x \in S \)An Occam algorithm is efficient if it runs in polynomial time in n
, m
, and size(c)
.
A concept class C
is Occam learnable with respect to a hypothesis class H
if there exists an efficient Occam algorithm for C
using H
.
PAC theory inspired boosting.
VC theory led to support vector machines.