The theory of machine learning is typically discussed from two perspectives: computational learning theory and statistical learning theory.
Given a space of features $X$, a space of labels $Y$, and a joint probability distribution $P$ on $X \times Y$, the available observations form a random sample drawn from $(X, Y)$.
A concept over a domain $X$ is a total Boolean function $c: X \to \{0,1\}$; a concept class is a set of concepts.
A hypothesis is a function from the feature space to the label space, $h: X \to Y$; a hypothesis class is a set of hypotheses.
A loss function $L(\hat{y}, y)$ measures the discrepancy between a hypothesis's prediction $\hat{y} = h(x)$ and the true outcome $y$. The risk of a hypothesis $h$ is its expected loss over the joint distribution: $R(h) = \mathbf{E}_{(x,y) \sim P}[L(h(x), y)]$.
A learning algorithm selects a hypothesis $h^*$ within a fixed hypothesis class $\mathcal{H}$ that minimizes the risk: $h^* = \arg\min_{h \in \mathcal{H}} R(h)$.
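For concreteness, a minimal sketch of risk minimization over a finite hypothesis class, assuming a toy joint distribution $P$, the 0-1 loss, and threshold hypotheses $h_t(x) = \mathbf{1}[x > t]$ (all illustrative choices, not from the notes):

```python
# A minimal sketch of risk minimization over a finite hypothesis class.
# The distribution P, the threshold hypotheses, and the 0-1 loss are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    """Draw n samples (x, y) from an assumed joint distribution P."""
    x = rng.uniform(0.0, 1.0, n)
    y = (x > 0.5).astype(int)          # true concept: x > 0.5
    flip = rng.uniform(size=n) < 0.1   # 10% label noise
    return x, np.where(flip, 1 - y, y)

def zero_one_loss(y_hat, y):
    return (y_hat != y).astype(float)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0.0, 1.0, 21)

# Estimate the risk R(h) of each hypothesis by Monte Carlo over P.
x, y = sample_P(100_000)
risks = [zero_one_loss((x > t).astype(int), y).mean() for t in thresholds]
h_star = thresholds[int(np.argmin(risks))]
print(f"risk-minimizing threshold ~= {h_star:.2f}, risk ~= {min(risks):.3f}")
```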
The stability of a machine learning algorithm is a notion of how much its output hypothesis is perturbed by small changes to its training data.
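A rough sketch of one way to probe stability, assuming a stand-in learner (here, fitting a sample mean) and leave-one-out perturbations:

```python
# Leave-one-out stability probe: how much does the learned hypothesis
# change when a single training point is removed? The mean-fitting
# learner is an assumed stand-in, not an algorithm from the notes.
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(loc=3.0, scale=1.0, size=200)   # training set

h_full = S.mean()                              # hypothesis trained on all of S
perturbations = [abs(h_full - np.delete(S, i).mean()) for i in range(len(S))]
print(f"max change from removing one point: {max(perturbations):.4f}")
```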
Extracting features: map raw observations into the feature space $X$, typically as fixed-length numeric vectors.
Partitioning observations: split the sample into disjoint training, validation, and test sets.
Learning workflow: extract features, partition the observations, fit hypotheses on the training set, select among them on the validation set, and estimate performance on the held-out test set.
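A minimal sketch of such a workflow, assuming a 60/20/20 split and polynomial models as the hypothesis family (both illustrative choices):

```python
# Partition observations into train / validation / test, fit on train,
# select on validation, report on test. Split ratios and the polynomial
# model family are assumptions.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 300)
y = 2.0 * X + rng.normal(scale=0.3, size=300)

idx = rng.permutation(len(X))
train, val, test = np.split(idx, [int(0.6 * len(X)), int(0.8 * len(X))])

def mse(d, fit_idx, eval_idx):
    """Fit a degree-d polynomial on fit_idx, return MSE on eval_idx."""
    coeffs = np.polyfit(X[fit_idx], y[fit_idx], deg=d)
    return np.mean((np.polyval(coeffs, X[eval_idx]) - y[eval_idx]) ** 2)

degrees = [1, 3, 5, 9]
best = min(degrees, key=lambda d: mse(d, train, val))   # model selection
print(f"selected degree {best}, test MSE = {mse(best, train, test):.4f}")
```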
In supervised learning, the generalization error (aka out-of-sample error) measures a learning algorithm's predictive accuracy on previously unseen data; it is the gap between the risk and the empirical risk: $G = R(h) - R_\text{emp}(h)$.
An algorithm is said to generalize if its generalization error converges to zero as the sample size grows: $\lim_{n \to \infty} G = 0$.
An algorithm will generalize if, for example, the empirical risks converge uniformly to the true risks over its hypothesis class (guaranteed when the class has finite VC dimension), or if the algorithm is uniformly stable.
Although the generalization error cannot be computed, it can be bounded in probability: $P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta$. The plot of such a generalization bound against experience (sample size) is called a learning curve: $\bar{\epsilon}(n; \delta)$.
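One concrete bound of this form, assuming a single fixed hypothesis and a loss bounded in $[0,1]$, comes from Hoeffding's inequality, which gives $\bar{\epsilon}(n;\delta) = \sqrt{\ln(2/\delta) / (2n)}$:

```python
# Hoeffding learning curve: for one fixed hypothesis and a loss in [0, 1],
# P(|G| <= eps) >= 1 - delta with eps(n; delta) = sqrt(ln(2/delta) / (2n)).
import math

def learning_curve(n, delta=0.05):
    """Hoeffding generalization bound eps-bar(n; delta)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: |G| <= {learning_curve(n):.4f} with prob >= 0.95")
```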
Generalization error can be decomposed into three types of error: bias, variance, and irreducible error (noise).
High bias can cause an algorithm to under-fit: missing the relevant relations between features and target outcomes. High variance can cause over-fitting: modeling the random noise in the training data rather than the intended outcomes. Regularization refers to techniques that prevent over-fitting and improve stability.
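One such technique is L2 regularization (ridge regression); a minimal sketch, where the penalty strength and the polynomial features are illustrative assumptions:

```python
# Ridge regression as an example of L2 regularization: penalizing ||w||^2
# shrinks the coefficients of an otherwise over-fitting model. The data,
# degree-9 features, and penalty values lam are assumptions.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=30)

Phi = np.vander(x, 10)   # degree-9 polynomial features: prone to over-fitting

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: (Phi'Phi + lam I)^-1 Phi'y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(Phi, y, lam)
    print(f"lam = {lam:<6}: ||w|| = {np.linalg.norm(w):8.2f}")
```

Larger penalties shrink the coefficient norm, trading a little bias for lower variance.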
The bias–variance trade-off refers to the observation that bias and variance cannot be simultaneously minimized:
$\text{MSE} = \text{bias}^2 + \text{variance} + \text{noise variance}$, i.e. $\mathbf{E}[(y - \hat{f}(x))^2] = (\mathbf{E}[\hat{f}(x)] - f(x))^2 + \mathbf{Var}[\hat{f}(x)] + \sigma^2$, where $\sigma^2$ is the irreducible variance of the noise.
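A small simulation can check this decomposition at a single test point: averaging predictions over many resampled training sets, the MSE should match $\text{bias}^2 + \text{variance} + \sigma^2$, with a degree-1 learner showing high bias and a degree-9 learner high variance. The target function, noise level, and polynomial learners are assumptions:

```python
# Empirical check of MSE = bias^2 + variance + sigma^2 at one test point,
# averaging over many training sets resampled from an assumed P.
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true regression function
sigma = 0.3                           # noise std; sigma^2 is irreducible
x0 = 0.25                             # test point

for degree in (1, 9):                 # under-fitting vs over-fitting learner
    preds = []
    for _ in range(2000):             # resample training sets from P
        x = rng.uniform(0, 1, 20)
        y = f(x) + rng.normal(scale=sigma, size=20)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}, "
          f"bias^2 + var + sigma^2 = {bias2 + var + sigma**2:.4f}")
```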
Empirical risk is the average loss on the training set: $R_\text{emp}(h) = \frac{1}{m} \sum_{i=1}^m L(h(x_i), y_i)$
Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk: $\hat{h} = \arg\min_{h \in \mathcal{H}} R_\text{emp}(h)$.
The ERM principle thus defines a family of learning algorithms.
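A minimal ERM sketch, again over a toy threshold class and an assumed data-generating process:

```python
# ERM over threshold classifiers: choose the hypothesis with the lowest
# average 0-1 loss on a finite training sample. The data-generating
# process and the 10% label noise are assumptions.
import numpy as np

rng = np.random.default_rng(5)
m = 50
x = rng.uniform(0, 1, m)
y = ((x > 0.5) ^ (rng.uniform(size=m) < 0.1)).astype(int)  # noisy labels

thresholds = np.linspace(0, 1, 101)
emp_risk = [np.mean((x > t).astype(int) != y) for t in thresholds]
h_hat = thresholds[int(np.argmin(emp_risk))]
print(f"ERM threshold = {h_hat:.2f}, empirical risk = {min(emp_risk):.3f}")
```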
The length of a concept $c$ in a concept class $C$, denoted $\text{size}(c)$, is the length of the shortest bit string that can represent $c$ in $C$.
Given a concept class $C$ of target concepts and a hypothesis class $H$, a set $S$ of $m$ samples of maximum length $n$ labeled according to a concept $c \in C$, and constants $\alpha \geq 0$, $0 \leq \beta < 1$, a learning algorithm $L$ is an $(\alpha,\beta)$-Occam algorithm for $C$ using $H$ if it outputs a hypothesis $h \in H$ such that $h$ is consistent with $c$ on $S$ and $\text{size}(h) \leq (n \cdot \text{size}(c))^\alpha m^\beta$.
An Occam algorithm is efficient if it runs in polynomial time in $n$, $m$, and $\text{size}(c)$.
A concept class $C$ is Occam learnable with respect to a hypothesis class $H$ if there exists an efficient Occam algorithm for $C$ using $H$.
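A classic textbook example is learning monotone conjunctions over $n$ Boolean variables: starting from the conjunction of all variables and dropping any variable that is $0$ in some positive example yields a hypothesis consistent with the sample, with $\text{size}(h) \leq n \leq (n \cdot \text{size}(c))^\alpha m^\beta$ for $\alpha = 1$, $\beta = 0$, i.e. a $(1,0)$-Occam algorithm. A sketch (the sample and target below are illustrative assumptions):

```python
# Occam algorithm for monotone conjunctions: keep only the variables that
# are 1 in every positive example. The output is consistent with the
# sample and has size(h) <= n, so this is a (1, 0)-Occam algorithm.
def learn_monotone_conjunction(samples):
    """samples: list of (x, label) with x a tuple of 0/1 values."""
    n = len(samples[0][0])
    keep = set(range(n))                       # start with all variables
    for x, label in samples:
        if label == 1:
            keep -= {i for i in range(n) if x[i] == 0}
    return keep                                # h(x) = AND of x[i], i in keep

# Assumed target concept c = x0 AND x2 over n = 4 variables.
S = [((1, 1, 1, 0), 1), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0), ((1, 1, 0, 0), 0)]
h = learn_monotone_conjunction(S)
print("hypothesis variables:", sorted(h))      # expect [0, 2]
```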
PAC theory inspired boosting.
VC theory led to support vector machines.