Decision trees, also known as classification and regression trees (CART) [@Breiman1984], are of two main types: classification trees, whose target variable is categorical, and regression trees, whose target variable is continuous.
In a decision tree, each internal node represents a decision rule based on one feature, and each leaf represents the model prediction (a class label or a regressant value).
Top-down induction of decision trees (TDIDT) recursively partitions the training set until the samples at a node all share the same value of the target variable, or until splitting no longer adds value to the predictions.
Pruning, the removal of branches that provide little predictive power, is a technique to avoid overfitting in decision tree models.
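As a rough illustration, the TDIDT scheme above can be sketched in Python. This is a toy greedy splitter, not the actual CART implementation: the `max_depth` cap and majority-vote leaves stand in for real stopping and pruning strategies, and the split criterion (weighted Gini impurity) anticipates the metrics discussed below.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(f_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=3):
    """TDIDT sketch: stop on a pure node (or depth limit), else split and recurse."""
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    best_score, best = None, None
    for f in range(len(X[0])):  # try every (feature, threshold) split
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            left_set = set(left)
            right = [i for i in range(len(y)) if i not in left_set]
            # quality of the split: subset-size-weighted Gini impurity
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best_score is None or score < best_score:
                best_score, best = score, (f, t, left, right)
    if best is None:  # no valid split: splitting adds no value, make a leaf
        return Counter(y).most_common(1)[0][0]
    f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))

def predict(tree, x):
    """Walk the tree: internal nodes are (feature, threshold, left, right) tuples."""
    while isinstance(tree, tuple):
        f, t, l, r = tree
        tree = l if x[f] <= t else r
    return tree
```

For example, `build_tree([[1], [2], [8], [9]], [0, 0, 1, 1])` learns the single split `x[0] <= 2` and classifies `[1.5]` as `0` and `[8.5]` as `1`.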
At each split, a chosen metric is applied to each candidate subset, and the resulting values are combined (e.g., averaged, weighted by subset size) to measure the quality of the split.
Gini impurity is used by the CART algorithm [@Breiman1984]. Given a subset with $m$ classes, let $f_i$ be the fraction of items in the subset belonging to class $i$; the Gini impurity is:
$$I_{G}(f) = \sum_{i=1}^{m} f_i (1-f_i) = \sum_{i=1}^m f_i - \sum_{i=1}^{m} {f_i}^2 = 1 - \sum^{m}_{i=1} {f_i}^{2}$$
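A direct transcription of this formula (the function name is our own):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of f_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

gini_impurity([0, 0, 1, 1])    # → 0.5 (maximally mixed two-class subset)
gini_impurity(['a', 'a', 'a']) # → 0.0 (pure subset)
```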
Information gain is used by the ID3 [@Quinlan1986], C4.5 [@Quinlan1993] and C5.0 tree-generation algorithms. With entropy defined as $H(f) = - \sum^{m}_{i=1} f_i \log_2 f_i$, the information gain of the $a$-th feature on training set $T$ is:
$$IG(T,a) = H(T) - H(T \mid a) = H(T) - \sum_{v} \frac{ |\{\mathbf{x}\in T \mid x_a=v\}| }{ |T| } \cdot H(\{\mathbf{x} \in T \mid x_a=v \})$$
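The same quantity in Python, where entropy of a subset means entropy of its label distribution (names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(f) = -sum f_i log2 f_i of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, a):
    """IG(T, a): entropy of T minus the size-weighted entropy of the
    subsets obtained by splitting on each value v of feature a."""
    n = len(labels)
    ig = entropy(labels)
    for v in set(x[a] for x in samples):
        subset = [y for x, y in zip(samples, labels) if x[a] == v]
        ig -= len(subset) / n * entropy(subset)
    return ig

# Feature 0 perfectly separates the classes, so the gain equals H(T) = 1 bit:
information_gain([(0,), (0,), (1,), (1,)], [0, 0, 1, 1], 0)  # → 1.0
```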
Variance reduction was introduced in the CART algorithm [@Breiman1984] and is suitable for regression trees. Let $S$ be the set of pre-split sample indices, $S_t$ the set of sample indices for which the split test is true, and $S_f$ the set of sample indices for which the split test is false. Writing the variance of the regressant $x$ over a set $S$ in its pairwise form, $\text{Var}_S X = \frac{1}{|S|^2} \sum_{i\in S} \sum_{j\in S} \frac{1}{2}(x_i - x_j)^2$, the variance reduction of a node $N$ is the total reduction of the variance of the regressant due to the split at this node:
$$I_{V}(N) = \text{Var}_{S} X - (\text{Var}_{S_t} X + \text{Var}_{S_f} X)$$
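A small sketch of this criterion, following the pairwise variance form and the unweighted difference given here (the split `test` predicate and function names are our own):

```python
def variance(xs):
    """Variance via the pairwise form: (1/|S|^2) * sum_i sum_j (x_i - x_j)^2 / 2."""
    n = len(xs)
    if n == 0:
        return 0.0  # convention for an empty subset
    return sum((a - b) ** 2 / 2 for a in xs for b in xs) / n ** 2

def variance_reduction(values, test):
    """I_V(N): Var_S X minus the variances of the two subsets of the split."""
    s_t = [x for x in values if test(x)]      # split test true
    s_f = [x for x in values if not test(x)]  # split test false
    return variance(values) - (variance(s_t) + variance(s_f))

# Splitting [1, 1, 5, 5] at x < 3 yields two constant subsets,
# so the whole pre-split variance of 4.0 is removed:
variance_reduction([1, 1, 5, 5], lambda x: x < 3)  # → 4.0
```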