A weak learner is a classifier which is only slightly correlated with the true classification (and necessarily better than random guessing). In contrast, a strong learner is arbitrarily well-correlated with the true classification.

Ensemble learning (集成学习) are machine learning methods that use an ensemble of (weak) statistical classifiers to obtain better predictive performance (stronger) than any constituent classifier.

## Random forest

Bootstrap aggregating, aka bagging, is an ensemble learning meta-algorithm that combines models from bootstrapped training sets:

1. Given a standard training set D of size n, iterate m times:
1. Bootstrap a new training set $D_i$ of size n′.
2. Fit learning algorithm to the bootstrapped training set.
2. Aggregate the m models by their mode (classification) or mean (regression).

An optimal number of trees m can be found using cross-validation or out-of-bag error: the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.

Variable importance can be estimated as the difference in out-of-bag error through permutation:

1. Compute the out-of-bag error for each data point;
2. Permute the values of a particular feature, fit a random forest, and compute the out-of-bag error again;
3. The importance score for this feature is the average difference in out-of-bag error, normalized by the standard deviation of these differences.

As long as the tree ensemble is uncorrelated, the aggregate model is not sensitive to the training set. Bagging thus decreases the variance of the model without increasing the bias, preventing overfit.

Random forest (随机森林) [@Breiman2001] constructs multiple decision trees at training time and output the mode of the classes (for classification) or the mean prediction (for regression) given by individual trees. The first random forest algorithm is a random subspace method (feature/attribute bagging) [@Ho1998], which is bagging with feature selection. Feature selection prevents a particular set of strong predictors dominating the tree ensemble, reducing the risk of correlation.

Feature selection is building models with a subset of all available features. The premise of feature selection is that many features are redundant or irrelevant. This method is suitable for classification problems where the number of features is much larger than observations in the training set, e.g. image and gene expression data. With fewer features, it could potentially simplify the model, shorten training time, and decrease model variance (preventing overfit). A rule of thumb for the size of feature subset is: $\lfloor \sqrt{p} \rfloor$ for classification, and $\lfloor p/3 \rfloor$ for regression, where $p$ is the size of the original feature set. This method has been applied to many types of classifiers: linear classifier, support vector machines, nearest neighbors, decision trees.

Boosting (提升) is an ensemble learning meta-algorithm that turns a weak learner into a strong learner. Boosting is sometimes synonymous to arcing (adaptive resampling and combining) Only provable boosting algorithms in the probably approximately correct (PAC) learning formulation can be accurately called boosting algorithms.

Boosting algorithms can be interpreted as iterative functional gradient descent algorithms which optimize a convex cost function over function space (weak hypotheses/learners). Any boosting algorithm that fits into the AnyBoost framework in [@Mason1999] is a gradient boosting algorithm (梯度提升).

Problem: Given a training set {(x_i, y_i) | i = 1,…,n} and base learners {h_i(x) | i = 1,…,M}, find a learner F(x) in the linear function space generated by {1, h_1(x), …, h_M(x)} that optimizes the average value of a differentiable loss function L(y,F(x)) on the training set.

Base learners for gradient boosting are typically decision trees of a fixed size, especially classification and regression trees (CART), as they are fast.

1. Initialize model with a constant function F_0(x) that minimizes average loss on the training set: \min_γ \sum_i L(y_i, γ);
2. Iterate m from 1 through M:
1. Compute pseudo-residuals, i.e. the negative gradient of loss function w.r.t. predictions on the training set r_m = -∇_F L(y,F(x)) |_{(y_i, F_{m-1}(x_i)) | i=1,…,n};
2. Train base learner h_m(x) to pseudo-residuals {(x_i, r_{mi}) | i = 1,…,n};
3. Update model with the m-th base learner F_m(x) = F_{m−1}(x) + γ_m h_m(x): \min_γ \sum_i L(y_i, F_{m−1}(x) + γ h_m(x));