A weak learner is a classifier that is only slightly correlated with the true classification (though still better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
Ensemble learning (集成学习) refers to machine learning methods that use an ensemble of (weak) statistical classifiers to obtain better (stronger) predictive performance than any constituent classifier alone.
Bootstrap aggregating, aka bagging, is an ensemble learning meta-algorithm that combines models from bootstrapped training sets:

1. Given a training set \(D\) of size \(n\), iterate \(m\) times: sample from \(D\) uniformly with replacement a bootstrap sample of size \(n'\) and fit a model to it.
2. Combine the \(m\) models by their mode (classification) or mean (regression).

An optimal number of trees \(m\) can be found using cross-validation or out-of-bag error: the mean prediction error on each training sample \(x_i\), using only the trees that did not have \(x_i\) in their bootstrap sample.
The training and test errors tend to level off after some number of trees have been fit.
Bagging decreases the variance of the model without increasing the bias, thus helping to prevent overfitting.
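A minimal sketch of bagging with out-of-bag error, assuming integer class labels, \(n' = n\), and scikit-learn decision trees as the constituent models; function names such as `bagging_fit` and `oob_error` are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def majority_vote(votes):
    """Mode of a 1-D array of integer class labels."""
    return np.bincount(votes.astype(int)).argmax()

def bagging_fit(X, y, n_models=100, random_state=0):
    """Fit n_models trees, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models, samples = [], []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)               # bootstrap sample, n' = n
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        samples.append(idx)
    return models, samples

def bagging_predict(models, X):
    """Combine the models by their mode (majority vote)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (n_models, n)
    return np.array([majority_vote(votes[:, j]) for j in range(votes.shape[1])])

def oob_error(models, samples, X, y):
    """Mean prediction error on each x_i, using only trees whose bootstrap sample missed x_i."""
    errors = []
    for i in range(len(X)):
        oob = [m for m, idx in zip(models, samples) if i not in idx]
        if not oob:                                    # x_i appeared in every bootstrap sample
            continue
        votes = np.array([m.predict(X[i:i + 1])[0] for m in oob])
        errors.append(majority_vote(votes) != y[i])
    return float(np.mean(errors))
```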
Random subspace method (feature bagging): each model is trained on a random subset of the features rather than (or in addition to) a bootstrap sample of the training examples.
Random forest (随机森林) {Breiman2001} combines bagging with the random subspace method: it constructs multiple decision trees at training time and outputs the mode of the classes (for classification) or the mean prediction (for regression) given by the individual trees.
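As a usage sketch (the iris data and hyperparameter values are placeholders), scikit-learn's RandomForestClassifier exposes both the out-of-bag estimate and cross-validation mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each tree sees a bootstrap sample of the data and a random subset of features
# at each split (max_features), i.e. bagging combined with feature bagging.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB score:  ", forest.oob_score_)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```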
Boosting (提升方法) is an ensemble learning meta-algorithm that turns a weak learner into a strong learner. Boosting is sometimes used synonymously with arcing (adaptive resampling and combining). Strictly, only algorithms that are provable boosting algorithms in the probably approximately correct (PAC) learning formulation can accurately be called boosting algorithms.
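For example, AdaBoost (one of the classic provable boosting algorithms) turns depth-1 decision stumps, which are weak learners on their own, into a much stronger classifier. A hedged sketch with synthetic placeholder data, assuming a recent scikit-learn (≥ 1.2) for the `estimator` parameter name:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)        # weak learner: a single split
boosted = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)

print("stump accuracy:  ", cross_val_score(stump, X, y, cv=5).mean())
print("boosted accuracy:", cross_val_score(boosted, X, y, cv=5).mean())
```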
Boosting algorithms can be interpreted as iterative functional gradient descent algorithms which optimize a convex cost function over function space (weak hypotheses/learners). Any boosting algorithm that fits into the AnyBoost framework in {Mason1999} is a gradient boosting algorithm (梯度提升).
Problem: given a training set \(\{(x_i, y_i) \mid i = 1,\dots,n\}\) and base learners \(\{h_i(x) \mid i = 1,\dots,M\}\), find a learner \(F(x)\) in the linear function space generated by \(\{1, h_1(x), \dots, h_M(x)\}\) that optimizes the average value of a differentiable loss function \(L(y, F(x))\) on the training set.
Base learners for gradient boosting are typically decision trees of a fixed size, especially classification and regression trees (CART), as they are fast.
Generic gradient boosting algorithm:

1. Initialize with the constant model that minimizes the average loss on the training set: \(F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)\).
2. For \(m\) from 1 through \(M\):
    1. Compute the pseudo-residuals, i.e. the negative gradient of the loss with respect to the model's predictions, evaluated at the previous model: \(r_{mi} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}}\) for \(i = 1,\dots,n\).
    2. Fit a base learner \(h_m(x)\) to the pseudo-residuals \(\{(x_i, r_{mi}) \mid i = 1,\dots,n\}\).
    3. Update to the \(m\)-th model \(F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\), where \(\gamma_m\) solves the line search \(\min_\gamma \sum_i L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))\).
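A minimal sketch of this algorithm for the special case of squared-error loss \(L(y, F) = \tfrac{1}{2}(y - F)^2\), where the pseudo-residuals reduce to ordinary residuals \(y_i - F_{m-1}(x_i)\) and \(F_0\) is the mean of \(y\). A fixed shrinkage rate stands in for the line search over \(\gamma\), and scikit-learn CART regressors are the assumed base learners (all names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting with squared-error loss and small CART base learners."""
    # F_0(x): the constant gamma minimizing sum_i (y_i - gamma)^2 is the mean of y.
    F0 = np.mean(y)
    F = np.full(len(y), F0)
    learners = []
    for m in range(M):
        # Pseudo-residuals: for squared-error loss, -dL/dF at F_{m-1} is y_i - F_{m-1}(x_i).
        r = y - F
        # Fit the m-th base learner h_m(x) to {(x_i, r_mi)}.
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        # F_m(x) = F_{m-1}(x) + nu * h_m(x); the fixed shrinkage nu replaces
        # the line-search step gamma_m.
        F += learning_rate * h.predict(X)
        learners.append(h)
    return F0, learners

def gradient_boost_predict(F0, learners, X, learning_rate=0.1):
    """Evaluate F_M(x); use the same learning_rate as in fitting."""
    return F0 + learning_rate * sum(h.predict(X) for h in learners)
```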
xgboost is a scalable implementation of gradient boosting.
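A hedged usage sketch of its scikit-learn-style wrapper (dataset and hyperparameter values are placeholders; the California housing data is fetched over the network):

```python
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted regression trees: M boosting rounds of depth-limited
# tree base learners, with shrinkage (learning_rate) applied to each update.
model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```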