Decision trees, also known as classification and regression trees (CART) [@Breiman1984], are of two main types: classification trees, in which the predicted target is a discrete class, and regression trees, in which the predicted target is a continuous value.
In a decision tree, each internal node represents a decision rule based on one feature, and each leaf represents the model prediction (a class label or a regressant value).
Top-down induction of decision trees (TDIDT) recursively partitions the training set until the subset at a node contains a single value of the target variable, or until splitting no longer adds value to the predictions.
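The recursion can be sketched in a few lines of Python; the `(features, label)` sample representation and the `find_best_split` helper are illustrative assumptions, not part of any particular CART implementation:

```python
from collections import Counter

def build_tree(samples, find_best_split):
    """Minimal TDIDT sketch. `samples` is assumed to be a list of
    (feature_dict, label) pairs; `find_best_split` is a hypothetical helper
    returning a (feature, value) test, or None when no split adds value."""
    labels = [label for _, label in samples]
    # Stop if the node is pure: all samples share the same target value.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    split = find_best_split(samples)
    # Stop if splitting no longer adds value to the predictions.
    if split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    feature, value = split
    true_part = [s for s in samples if s[0][feature] == value]
    false_part = [s for s in samples if s[0][feature] != value]
    if not true_part or not false_part:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {
        "test": (feature, value),
        "true": build_tree(true_part, find_best_split),
        "false": build_tree(false_part, find_best_split),
    }
```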
The grown tree can then be simplified by pruning, a technique to avoid overfitting in decision tree models.
At each split, a chosen metric is applied to each candidate subset, and the resulting values are combined to measure the quality of the split.
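One common way to combine the per-subset values, shown here as an assumption rather than something taken from the text, is a size-weighted average of the chosen impurity metric over the candidate subsets (lower is better):

```python
def split_quality(subsets, impurity):
    """Size-weighted average impurity over the candidate subsets (lower is better)."""
    total = sum(len(subset) for subset in subsets)
    return sum(len(subset) / total * impurity(subset) for subset in subsets)
```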
Gini impurity is used by the CART algorithm [@Breiman1984]. Given a subset with \(m\) classes, let \(f_i\) be the fraction of the subset belonging to class \(i\); the Gini impurity is:
\[ I_{G}(f) = \sum_{i=1}^{m} f_i (1-f_i) = \sum_{i=1}^m f_i - \sum_{i=1}^{m} {f_i}^2 = 1 - \sum^{m}_{i=1} {f_i}^{2} \]
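A direct transcription of the last form of the formula, assuming the class labels are given as a plain Python list:

```python
from collections import Counter

def gini_impurity(labels):
    """I_G = 1 - sum_i f_i^2, where f_i is the fraction of the subset in class i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a"]))       # 0.0: a pure subset has no impurity
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5: the maximum for two classes
```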
Information gain is used by the ID3 [@Quinlan1986], C4.5 [@Quinlan1993] and C5.0 tree-generation algorithms. With entropy defined as \( H(f) = - \sum^{m}_{i=1} f_i \log_2 f_i \), the information gain of the a-th feature on training set T is:
\[ IG(T,a) = H(T) - H(T \mid a) = H(T) - \sum_{v} \frac{ |\{\mathbf{x}\in T \mid x_a=v\}| }{ |T| } \cdot H(\{\mathbf{x} \in T \mid x_a=v \}) \]
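The same quantities in Python, again assuming each sample is an illustrative `(feature_dict, label)` pair:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(f) = -sum_i f_i log2 f_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, feature):
    """IG(T, a) = H(T) - sum_v |T_v|/|T| * H(T_v), with T_v = {x in T | x_a = v}."""
    labels = [label for _, label in samples]
    partitions = defaultdict(list)
    for x, label in samples:
        partitions[x[feature]].append(label)
    remainder = sum(len(part) / len(samples) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder
```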
Variance reduction, introduced in the CART algorithm [@Breiman1984], is suitable for regression trees. Let \(S\) be the set of pre-split sample indices, \(S_t\) the set of sample indices for which the split test is true, and \(S_f\) the set of sample indices for which the split test is false. Denoting the variance of the regressant over a set \(S\) as \( \text{Var}_S X = \frac{1}{|S|^2}\sum_{i\in S} \sum_{j\in S} \frac{1}{2}(x_i - x_j)^2 \), the variance reduction of a node \(N\) is the total reduction of the variance of the regressant due to the split at this node:
\[ I_{V}(N) = \text{Var}_S X - (\text{Var}_{S_t} X + \text{Var}_{S_f} X) \]
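A transcription of the two formulas above, assuming the regressant values of the parent and child sets are plain Python lists; the unweighted sum over the child sets mirrors the formula as written:

```python
def pairwise_variance(values):
    """Var_S X = 1/|S|^2 * sum_i sum_j (x_i - x_j)^2 / 2."""
    n = len(values)
    return sum((xi - xj) ** 2 / 2.0 for xi in values for xj in values) / n ** 2

def variance_reduction(parent, true_branch, false_branch):
    """I_V(N) = Var_S X - (Var_{S_t} X + Var_{S_f} X)."""
    return (pairwise_variance(parent)
            - (pairwise_variance(true_branch) + pairwise_variance(false_branch)))
```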