The theory of machine learning is typically discussed from two perspectives, computational learning theory and statistical learning theory.
Given a space of features X
, a space of labels Y
and a joint probability distribution P(x,y)
,
currently available observations is a random sample of (X,Y)
.
A concept over a domain X
is a total Boolean function over X
: c: X → {0,1}
;
a concept class is a class of concepts.
A hypothesis is a function between the two spaces: h: X → Y
;
a hypothesis class is a class of hypotheses.
Loss function $L(\hat{y}, y)$ measures the distance
between the prediction of a hypothesis and the true outcome.
Risk associated with hypothesis h(x)
is its expected loss over the joint distribution:
$R(h) = \mathbf{E}[L(h(x), y)]$.
A learning algorithm is a hypothesis $h^∗$ within a fixed hypothesis class H
that minimizes risk R(h)
: $\min_{h \in \mathcal{H}} R(h)$.
Stability, aka algorithmic stability, is a notion of how a machine learning algorithm is perturbed by small changes to its inputs.
Extracting features:
x
(predictor variable) and y
(response/outcome variable) of statistics.Partitioning observations:
Learning workflow:
λ
, e.g. penalty on model complexity,
that have the highest metric for model evaluation as evaluated by:
In supervised learning, the predictive accuracy of a learning algorithm on previously unseen data is called generalization error, aka out-of-sample error: $G = R(h) - R_\text{emp}(h)$
An algorithm is said to generalize if its generalization error converges to zero in large sample: $\lim_{n \to \infty} G = 0$.
An algorithm will generalize if it satisfies the following conditions:
Although generalization error cannot be computed, it may be bounded in probability: $P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta$. The plot of generalization bounds against experience is called learning curve: $\bar{\epsilon}(n;\delta)$.
Generalization error can be decomposed into three types of error:
High bias can cause algorithm under-fitting: missing the relevant relations between features and target outcomes. High variance can cause over-fitting: modeling the random noise in the training data rather than the intended outcomes. Regularization (正则化) refers to techniques that prevent overfitting and improve stability.
The bias–variance trade-off refers to the observation that bias and variance cannot be simultaneously minimized:
MSE = variance + bias^2 + true variance
Empirical risk is the average loss on the training set: $R_\text{emp}(h) = 1/m \sum_{i=1}^m L(h(x_i), y_i)$
Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk: $\min_{h \in \mathcal{H}} R_\text{emp}(h)$.
The ERM principle thus defines a family of learning algorithms.
Length of a concept c
in concept class C
, denoted as size(c)
,
is the shortest bit string that can represent c
in C
.
Given the concept classes containing target concepts C
and hypotheses H
,
a set S
of m
samples of maximum length n
labeled according to concepts c∈C
,
constants α≥0, 0≤β<1
,
a learning algorithm L
is an (α,β)-Occam algorithm for C
using H
if it outputs a hypothesis h∈H
such that
h
is consistent with c
on S
: $h(x)=c(x), \forall x \in S$An Occam algorithm is efficient if it runs in polynomial time in n
, m
, and size(c)
.
A concept class C
is Occam learnable with respect to a hypothesis class H
if there exists an efficient Occam algorithm for C
using H
.
PAC theory inspired boosting.
VC theory led to support vector machines.
References given by Gerald Friedland: 1. Raul Rojas, 1996. Neural Networks: a Systematic Introduction 1. David MacKay, 2003. Information Theory, Inference and Learning