The theory of machine learning is typically discussed from two perspectives, computational learning theory and statistical learning theory.
Given a space of features X, a space of labels Y
and a joint probability distribution P(x,y),
currently available observations is a random sample of (X,Y).
A concept over a domain X is a total Boolean function over X: c: X → {0,1};
a concept class is a class of concepts.
A hypothesis is a function between the two spaces: h: X → Y;
a hypothesis class is a class of hypotheses.
Loss function $L(\hat{y}, y)$ measures the distance
between the prediction of a hypothesis and the true outcome.
Risk associated with hypothesis h(x) is its expected loss over the joint distribution:
$R(h) = \mathbf{E}[L(h(x), y)]$.
A learning algorithm is a hypothesis $h^∗$ within a fixed hypothesis class H
that minimizes risk R(h): $\min_{h \in \mathcal{H}} R(h)$.
Stability, aka algorithmic stability, is a notion of how a machine learning algorithm is perturbed by small changes to its inputs.
Extracting features:
x (predictor variable) and y (response/outcome variable) of statistics.Partitioning observations:
Learning workflow:
λ, e.g. penalty on model complexity,
that have the highest metric for model evaluation as evaluated by:
In supervised learning, the predictive accuracy of a learning algorithm on previously unseen data is called generalization error, aka out-of-sample error: $G = R(h) - R_\text{emp}(h)$
An algorithm is said to generalize if its generalization error converges to zero in large sample: $\lim_{n \to \infty} G = 0$.
An algorithm will generalize if it satisfies the following conditions:
Although generalization error cannot be computed, it may be bounded in probability: $P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta$. The plot of generalization bounds against experience is called learning curve: $\bar{\epsilon}(n;\delta)$.
Generalization error can be decomposed into three types of error:
High bias can cause algorithm under-fitting: missing the relevant relations between features and target outcomes. High variance can cause over-fitting: modeling the random noise in the training data rather than the intended outcomes. Regularization (正则化) refers to techniques that prevent overfitting and improve stability.
The bias–variance trade-off refers to the observation that bias and variance cannot be simultaneously minimized:
MSE = variance + bias^2 + true variance
Empirical risk is the average loss on the training set: $R_\text{emp}(h) = 1/m \sum_{i=1}^m L(h(x_i), y_i)$
Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk: $\min_{h \in \mathcal{H}} R_\text{emp}(h)$.
The ERM principle thus defines a family of learning algorithms.
Length of a concept c in concept class C, denoted as size(c),
is the shortest bit string that can represent c in C.
Given the concept classes containing target concepts C and hypotheses H,
a set S of m samples of maximum length n labeled according to concepts c∈C,
constants α≥0, 0≤β<1,
a learning algorithm L is an (α,β)-Occam algorithm for C using H
if it outputs a hypothesis h∈H such that
h is consistent with c on S: $h(x)=c(x), \forall x \in S$An Occam algorithm is efficient if it runs in polynomial time in n, m, and size(c).
A concept class C is Occam learnable with respect to a hypothesis class H
if there exists an efficient Occam algorithm for C using H.
PAC theory inspired boosting.
VC theory led to support vector machines.
References given by Gerald Friedland: 1. Raul Rojas, 1996. Neural Networks: a Systematic Introduction 1. David MacKay, 2003. Information Theory, Inference and Learning