The theory of machine learning is typically discussed from two perspectives: computational learning theory and statistical learning theory.
Given a space of features X, a space of labels Y, and a joint probability distribution P(x,y), the currently available observations are a random sample of (X,Y).
A concept over a domain X is a total Boolean function over X: c: X → {0,1};
a concept class is a class of concepts.
A hypothesis is a function between the two spaces: h: X → Y;
a hypothesis class is a class of hypotheses.
A loss function \( L(\hat{y}, y) \) measures the discrepancy between the prediction of a hypothesis and the true outcome.
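For concreteness, two common loss functions can be sketched as follows (a minimal illustration; the specific function names are assumptions, not from the text):

```python
# Two standard loss functions, written as plain Python (illustrative names).

def squared_loss(y_hat, y):
    """Squared error: a standard choice for regression."""
    return (y_hat - y) ** 2

def zero_one_loss(y_hat, y):
    """0-1 loss: a standard choice for classification."""
    return 0 if y_hat == y else 1

print(squared_loss(2.5, 2.0))  # 0.25
print(zero_one_loss(1, 0))     # 1
```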
The risk associated with a hypothesis h is its expected loss over the joint distribution: \( R(h) = \mathbf{E}_{(x,y) \sim P}[L(h(x), y)] \).
A learning algorithm selects a hypothesis \( h^* \) within a fixed hypothesis class \( \mathcal{H} \) that minimizes the risk: \( h^* = \arg\min_{h \in \mathcal{H}} R(h) \).
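Since the risk is an expectation, it can be estimated by Monte Carlo sampling when the joint distribution is known. A minimal sketch, where the distribution and the hypothesis are illustrative assumptions:

```python
import random

rng = random.Random(0)

# Sketch: estimate R(h) = E[L(h(x), y)] by Monte Carlo sampling.
def draw():
    # Assumed joint distribution: x uniform on [0, 1], y = 1 iff x > 0.5.
    x = rng.random()
    return x, 1 if x > 0.5 else 0

def h(x):
    # A fixed hypothesis: threshold at 0.7 (deliberately misplaced).
    return 1 if x > 0.7 else 0

n = 100_000
# Average 0-1 loss over n samples approximates the expected loss.
risk = sum(1 for x, y in (draw() for _ in range(n)) if h(x) != y) / n
# Analytically, R(h) = P(0.5 < x <= 0.7) = 0.2; the estimate should be close.
print(round(risk, 3))
```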
Stability, aka algorithmic stability, is a notion of how much the output of a machine learning algorithm is perturbed by small changes to its training data.
Extracting features:
x (predictor variable) and y (response/outcome variable) of statistics.
Partitioning observations:
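Partitioning can be sketched as a random split of the observations into training, validation, and test sets; the 60/20/20 proportions below are an illustrative assumption:

```python
import random

# A minimal sketch of partitioning observations (60/20/20 split assumed).
def partition(observations, seed=0):
    rng = random.Random(seed)
    shuffled = observations[:]          # copy, so the input is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    i, j = 6 * n // 10, 8 * n // 10     # split points at 60% and 80%
    return shuffled[:i], shuffled[i:j], shuffled[j:]

train, val, test = partition(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```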
Learning workflow:
Tuning hyperparameters λ, e.g. a penalty on model complexity, by choosing the values that score best on the model-evaluation metric, as evaluated by, e.g., a held-out validation set or cross-validation.
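The selection step can be sketched as a grid search over λ scored on a held-out validation set. The data-generating setup and one-coefficient ridge model below are illustrative assumptions:

```python
import random

rng = random.Random(1)

# Hypothetical setup: y = 2x + noise; we fit y ≈ w·x with L2 penalty λ·w².
train = [(x / 10, 2 * (x / 10) + rng.gauss(0, 0.1)) for x in range(20)]
val = [(x / 10 + 0.05, 2 * (x / 10 + 0.05) + rng.gauss(0, 0.1)) for x in range(20)]

def fit(data, lam):
    # Closed-form ridge solution for a single coefficient: w = Σxy / (Σx² + λ).
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / (sxx + lam)

def mse(w, data):
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

# Grid search: keep the λ whose fitted model scores best on validation data.
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: mse(fit(train, lam), val))
print(best_lam)
```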
In supervised learning, the generalization error, aka out-of-sample error, measures how accurately a learning algorithm predicts previously unseen data; it is the difference between the risk and the empirical risk: \( G = R(h) - R_\text{emp}(h) \)
An algorithm is said to generalize if its generalization error converges to zero as the sample size grows: \( \lim_{n \to \infty} G = 0 \).
An algorithm will generalize if it satisfies the following conditions:
Although the generalization error cannot be computed, it can be bounded in probability: \( P_G(\epsilon) \equiv P(|G| \leq \epsilon) \geq 1 - \delta \). The plot of the generalization bound \( \bar{\epsilon}(n; \delta) \) against experience (the sample size n) is called the learning curve.
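The shrinking of the generalization gap with sample size can be illustrated empirically. In this sketch (the noisy threshold setup is an assumption), an ERM-chosen threshold classifier is trained at increasing n, and |R(h) − R_emp(h)| is estimated with a large held-out sample standing in for the true risk:

```python
import random

rng = random.Random(0)

# Sketch: estimate the generalization gap of an ERM-chosen threshold
# classifier as the sample size n grows (illustrative setup).
def sample(n):
    out = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < 0.1:          # 10% label noise
            y = 1 - y
        out.append((x, y))
    return out

def emp_risk(t, data):
    # Average 0-1 loss of the threshold classifier 1[x > t] on data.
    return sum(1 for x, y in data if (x > t) != y) / len(data)

grid = [i / 20 for i in range(21)]
big_test = sample(50_000)               # proxy for the true risk R(h)
gaps = []
for n in [10, 100, 1000, 10_000]:
    train = sample(n)
    h = min(grid, key=lambda t: emp_risk(t, train))
    gaps.append(abs(emp_risk(h, big_test) - emp_risk(h, train)))
print([round(g, 3) for g in gaps])      # gap tends to shrink as n grows
```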
Generalization error can be decomposed into three types of error: bias, variance, and irreducible error (noise).
High bias can cause an algorithm to under-fit: missing the relevant relations between features and target outcomes. High variance can cause over-fitting: modeling the random noise in the training data rather than the intended outcomes. Regularization refers to techniques that prevent overfitting and improve stability.
The bias–variance trade-off refers to the observation that bias and variance cannot be simultaneously minimized.
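The trade-off can be seen empirically by resampling training sets and measuring bias² and variance of two estimators at a test point. The setup below (true function x², a constant-prediction model vs. a 1-nearest-neighbour model) is an illustrative assumption:

```python
import random
import statistics

rng = random.Random(0)

# Sketch: empirical bias²/variance at a test point x0 for two estimators.
def true_f(x):
    return x * x

def train_set(n):
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, 0.1)) for x in xs]

x0 = 0.9

def constant_model(data):
    # Predict the mean label everywhere: high bias, low variance.
    return statistics.mean(y for _, y in data)

def nn_model(data):
    # 1-nearest-neighbour prediction at x0: low bias, high variance.
    return min(data, key=lambda p: abs(p[0] - x0))[1]

def bias2_and_var(model, trials=500, n=30):
    preds = [model(train_set(n)) for _ in range(trials)]
    mean_pred = statistics.mean(preds)
    return (mean_pred - true_f(x0)) ** 2, statistics.variance(preds)

b_const, v_const = bias2_and_var(constant_model)
b_nn, v_nn = bias2_and_var(nn_model)
print(round(b_const, 3), round(v_const, 4))   # large bias², small variance
print(round(b_nn, 3), round(v_nn, 4))         # small bias², larger variance
```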
Empirical risk is the average loss on the training set: \( R_\text{emp}(h) = \frac{1}{m} \sum_{i=1}^m L(h(x_i), y_i) \)
Empirical risk minimization (ERM) principle: the learning algorithm should choose a hypothesis \(\hat{h}\) which minimizes the empirical risk: \( \min_{h \in \mathcal{H}} R_\text{emp}(h) \).
The ERM principle thus defines a family of learning algorithms.
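The ERM principle can be sketched over a finite hypothesis class of threshold classifiers; the data-generating setup below is an illustrative assumption:

```python
import random

rng = random.Random(0)

# A minimal ERM sketch over threshold classifiers h_t(x) = 1[x > t].
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 1 if x > 0.6 else 0))   # true decision boundary at 0.6

def emp_risk(t):
    # Average 0-1 loss of h_t on the training set.
    return sum(1 for x, y in data if (x > t) != y) / len(data)

grid = [i / 10 for i in range(11)]
h_hat = min(grid, key=emp_risk)             # empirical risk minimizer
print(h_hat, emp_risk(h_hat))  # 0.6 0.0
```

Since the labels here are a deterministic function of x with the boundary at 0.6, the ERM solution recovers the true threshold with zero empirical risk.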
The length of a concept c in a concept class C, denoted size(c), is the length of the shortest bit string that can represent c in C.
Given concept classes C (containing the target concepts) and H (containing the hypotheses), a set S of m samples of maximum length n labeled according to a concept c ∈ C, and constants α ≥ 0, 0 ≤ β < 1, a learning algorithm L is an (α,β)-Occam algorithm for C using H if it outputs a hypothesis h ∈ H such that:
h is consistent with c on S: \( h(x) = c(x), \forall x \in S \);
h is succinct: \( \text{size}(h) \leq (n \cdot \text{size}(c))^\alpha \, m^\beta \).
An Occam algorithm is efficient if it runs in time polynomial in n, m, and size(c).
A concept class C is Occam learnable with respect to a hypothesis class H if there exists an efficient Occam algorithm for C using H.
PAC theory inspired boosting.
VC theory led to support vector machines.