A weak learner is a classifier that is only slightly correlated with the true classification (though still better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
Ensemble learning (集成学习) refers to machine learning methods that use an ensemble of (weak) statistical classifiers to obtain better (stronger) predictive performance than any constituent classifier alone.
Bootstrap aggregating, aka bagging, is an ensemble learning meta-algorithm that combines models from bootstrapped training sets:

1. Given a training set \(D\) of size \(n\), iterate \(m\) times: sample from \(D\) uniformly with replacement a bootstrap sample of size \(n'\) and fit a model to it.
2. Combine the \(m\) models by their mode (classification) or mean (regression).

An optimal number of trees \(m\) can be found using cross-validation or out-of-bag error: the mean prediction error on each training sample \(x_i\), using only the trees that did not have \(x_i\) in their bootstrap sample.
The training and test errors tend to level off after some number of trees have been fit.
Bagging decreases the variance of the model without increasing the bias, thus helping to prevent overfitting.
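A minimal sketch of bagging with out-of-bag error, assuming integer class labels, \(n' = n\), and scikit-learn decision trees as the constituent models; function names such as `bagging_fit` and `oob_error` are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def majority_vote(votes):
    """Mode of a 1-D array of integer class labels."""
    return np.bincount(votes.astype(int)).argmax()

def bagging_fit(X, y, n_models=100, random_state=0):
    """Fit n_models trees, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models, samples = [], []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)               # bootstrap sample, n' = n
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        samples.append(idx)
    return models, samples

def bagging_predict(models, X):
    """Combine the models by their mode (majority vote)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (n_models, n)
    return np.array([majority_vote(votes[:, j]) for j in range(votes.shape[1])])

def oob_error(models, samples, X, y):
    """Mean prediction error on each x_i, using only trees whose bootstrap sample missed x_i."""
    errors = []
    for i in range(len(X)):
        oob = [m for m, idx in zip(models, samples) if i not in idx]
        if not oob:                                    # x_i appeared in every bootstrap sample
            continue
        votes = np.array([m.predict(X[i:i + 1])[0] for m in oob])
        errors.append(majority_vote(votes) != y[i])
    return float(np.mean(errors))
```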
Random subspace method (feature bagging): each model is trained on a random subset of the features rather than (or in addition to) a bootstrap sample of the training examples.
Random forest (随机森林) {Breiman2001} combines bagging with the random subspace method: it constructs multiple decision trees at training time and outputs the mode of the classes (for classification) or the mean prediction (for regression) given by the individual trees.
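As a usage sketch (the iris data and hyperparameter values are placeholders), scikit-learn's RandomForestClassifier exposes both the out-of-bag estimate and cross-validation mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each tree sees a bootstrap sample of the data and a random subset of features
# at each split (max_features), i.e. bagging combined with feature bagging.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB score:  ", forest.oob_score_)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```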
Boosting (提升方法) is an ensemble learning meta-algorithm that turns a weak learner into a strong learner. Boosting is sometimes used synonymously with arcing (adaptive resampling and combining). Strictly, only algorithms that are provable boosting algorithms in the probably approximately correct (PAC) learning formulation can accurately be called boosting algorithms.
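For example, AdaBoost (one of the classic provable boosting algorithms) turns depth-1 decision stumps, which are weak learners on their own, into a much stronger classifier. A hedged sketch with synthetic placeholder data, assuming a recent scikit-learn (≥ 1.2) for the `estimator` parameter name:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)        # weak learner: a single split
boosted = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)

print("stump accuracy:  ", cross_val_score(stump, X, y, cv=5).mean())
print("boosted accuracy:", cross_val_score(boosted, X, y, cv=5).mean())
```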
Boosting algorithms can be interpreted as iterative functional gradient descent algorithms which optimize a convex cost function over function space (weak hypotheses/learners). Any boosting algorithm that fits into the AnyBoost framework in {Mason1999} is a gradient boosting algorithm (梯度提升).
Problem: given a training set \(\{(x_i, y_i) \mid i = 1,\dots,n\}\) and base learners \(\{h_i(x) \mid i = 1,\dots,M\}\), find a learner \(F(x)\) in the linear function space generated by \(\{1, h_1(x), \dots, h_M(x)\}\) that optimizes the average value of a differentiable loss function \(L(y, F(x))\) on the training set.
Base learners for gradient boosting are typically decision trees of a fixed size, especially classification and regression trees (CART), as they are fast.
Generic gradient boosting algorithm:

1. Initialize with the constant model that minimizes the average loss on the training set: \(F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)\).
2. For \(m\) from 1 through \(M\):
    1. Compute the pseudo-residuals, i.e. the negative gradient of the loss with respect to the model's predictions, evaluated at the previous model: \(r_{mi} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}}\) for \(i = 1,\dots,n\).
    2. Fit a base learner \(h_m(x)\) to the pseudo-residuals \(\{(x_i, r_{mi}) \mid i = 1,\dots,n\}\).
    3. Update to the \(m\)-th model \(F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\), where \(\gamma_m\) solves the line search \(\min_\gamma \sum_i L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))\).
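A minimal sketch of this algorithm for the special case of squared-error loss \(L(y, F) = \tfrac{1}{2}(y - F)^2\), where the pseudo-residuals reduce to ordinary residuals \(y_i - F_{m-1}(x_i)\) and \(F_0\) is the mean of \(y\). A fixed shrinkage rate stands in for the line search over \(\gamma\), and scikit-learn CART regressors are the assumed base learners (all names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting with squared-error loss and small CART base learners."""
    # F_0(x): the constant gamma minimizing sum_i (y_i - gamma)^2 is the mean of y.
    F0 = np.mean(y)
    F = np.full(len(y), F0)
    learners = []
    for m in range(M):
        # Pseudo-residuals: for squared-error loss, -dL/dF at F_{m-1} is y_i - F_{m-1}(x_i).
        r = y - F
        # Fit the m-th base learner h_m(x) to {(x_i, r_mi)}.
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        # F_m(x) = F_{m-1}(x) + nu * h_m(x); the fixed shrinkage nu replaces
        # the line-search step gamma_m.
        F += learning_rate * h.predict(X)
        learners.append(h)
    return F0, learners

def gradient_boost_predict(F0, learners, X, learning_rate=0.1):
    """Evaluate F_M(x); use the same learning_rate as in fitting."""
    return F0 + learning_rate * sum(h.predict(X) for h in learners)
```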
xgboost is a scalable implementation of gradient boosting.
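A hedged usage sketch of its scikit-learn-style wrapper (dataset and hyperparameter values are placeholders; the California housing data is fetched over the network):

```python
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted regression trees: M boosting rounds of depth-limited
# tree base learners, with shrinkage (learning_rate) applied to each update.
model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```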