Decision tree pruning
Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by reducing overfitting.
One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples. A small tree might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information.[1]
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set. There are many techniques for tree pruning that differ in the measurement that is used to optimize performance.
Techniques
Pruning processes can be divided into two types: pre-pruning and post-pruning.
Pre-pruning procedures prevent a complete induction of the training set by using a stricter stopping criterion in the induction algorithm (e.g., a maximum tree depth, or requiring the information gain of an attribute to exceed a minimum threshold). Pre-pruning methods are considered more efficient because they never induce an entire tree; trees remain small from the start. Pre-pruning methods share a common problem, the horizon effect: the undesired premature termination of the induction by the stopping criterion.
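For a concrete, purely illustrative example, scikit-learn's DecisionTreeClassifier exposes such pre-pruning controls as constructor parameters; the dataset and threshold values below are arbitrary choices, not recommendations.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Pre-pruning: growth stops at a node as soon as any of these criteria is violated.
    clf = DecisionTreeClassifier(
        max_depth=4,                 # cap on tree depth
        min_samples_leaf=5,          # each leaf must cover at least 5 samples
        min_impurity_decrease=0.01,  # a split must reduce impurity by at least this much
        random_state=0,
    )
    clf.fit(X_train, y_train)
    print(clf.get_depth(), clf.get_n_leaves(), clf.score(X_test, y_test))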
Post-pruning (or just pruning) is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity. Pruning can not only significantly reduce the size of the tree but also improve classification accuracy on unseen objects. Accuracy on the training set may deteriorate, but the overall classification quality of the tree increases.
Pruning procedures are further differentiated by the direction in which they work through the tree (top-down or bottom-up).
Bottom-up pruning
These procedures start at the last node in the tree (the lowest point) and work recursively upwards, determining the relevance of each individual node. If a node is not relevant for the classification, it is dropped or replaced by a leaf. The advantage is that no relevant sub-trees can be lost with this method. These methods include Reduced Error Pruning (REP), Minimum Cost Complexity Pruning (MCCP), and Minimum Error Pruning (MEP).
Top-down pruning
In contrast to the bottom-up method, this method starts at the root of the tree and works downwards, carrying out a relevance check at each node that decides whether the node is relevant for the classification of all n items. By pruning the tree at an inner node, it can happen that an entire sub-tree (regardless of its relevance) is dropped. One representative of this approach is pessimistic error pruning (PEP), which gives quite good results on unseen items.
Pruning algorithms
Reduced error pruning
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
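As a purely illustrative calculation with invented counts: suppose a subtree misclassifies 4 of 30 validation cases, while the majority-class leaf that would replace it misclassifies 3 of the same 30 cases. Then

$$\operatorname{err}_{\text{val}}(\text{subtree}) = \tfrac{4}{30} \approx 0.133 \;\ge\; \operatorname{err}_{\text{val}}(\text{leaf}) = \tfrac{3}{30} = 0.100,$$

so the replacement does not hurt validation accuracy and the subtree is pruned.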
Cost complexity pruning
[edit]Cost complexity pruning generates a series of trees where is the initial tree and is the root alone. At step , the tree is created by removing a subtree from tree and replacing it with a leaf node with value chosen as in the tree building algorithm. The subtree that is removed is chosen as follows:
- Define the error rate of tree over data set as .
- The subtree that minimizes is chosen for removal.
The function defines the tree obtained by pruning the subtrees from the tree . Once the series of trees has been created, the best tree is chosen by generalized accuracy as measured by a training set or cross-validation.
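As a numerical illustration with invented values: if pruning a particular subtree $t$ raises the error on $S$ from $\operatorname{err}(T,S)=0.10$ to $\operatorname{err}(\operatorname{prune}(T,t),S)=0.12$ while reducing the leaf count from 8 to 5, the ratio above is

$$\frac{0.12-0.10}{8-5} \approx 0.0067,$$

and the candidate subtree with the smallest such ratio, i.e. the smallest error increase per leaf removed, is the one pruned at that step.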
Examples
Pruning can also be applied as a compression scheme for a learning algorithm, removing redundant detail without compromising the model's performance. In neural networks, pruning removes entire neurons or layers of neurons.
References
[edit]- Pearl, Judea (1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley. ISBN 978-0-201-05594-8.
- Mansour, Y. (1997). "Pessimistic decision tree pruning based on tree size". Proc. 14th International Conference on Machine Learning. pp. 195–201.
- Breslow, L. A.; Aha, D. W. (1997). "Simplifying Decision Trees: A Survey". The Knowledge Engineering Review. 12 (1): 1–47. doi:10.1017/S0269888997000015. S2CID 18782652.
- Quinlan, J. R. (1986). "Induction of Decision Trees". Machine Learning. 1. Kluwer: 81–106. doi:10.1007/BF00116251.
- Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). The Elements of Statistical Learning. Springer. pp. 269–272. ISBN 0-387-95284-5.
Introduction
Definition and Purpose
Decision tree pruning refers to techniques that simplify decision trees to reduce overfitting and complexity, either by halting growth during construction (pre-pruning) based on stopping criteria or by detecting and removing branches or subtrees from a fully grown tree (post-pruning). This process involves evaluating criteria, such as estimated error rates on validation data, to determine whether sections of the tree should be collapsed into leaves or eliminated entirely. In essence, pruning transforms an overly detailed tree into a more compact structure that captures the essential decision boundaries while discarding noise-induced splits.[5][6]

The core purpose of pruning is to strike a balance between bias and variance in the decision tree model, enabling better generalization to unseen data while preserving accuracy on the training set. Unpruned trees often exhibit low bias but high variance, fitting the training data too closely and performing poorly on new instances due to sensitivity to minor data fluctuations; pruning mitigates this by increasing bias slightly to substantially lower variance, thus minimizing the total expected error. This aligns with the decomposition of expected prediction error as the sum of squared bias, variance, and irreducible noise, where pruning primarily targets variance reduction by constraining tree size.[7]

Key benefits of pruning include enhanced interpretability through simpler tree structures that are easier for humans to understand and explain, lower computational costs for inference since predictions traverse fewer nodes, and improved empirical performance on held-out data by avoiding the pitfalls of overfit models. These advantages make pruned trees particularly valuable in practical applications where model transparency and efficiency are prioritized alongside predictive quality.[6][5]
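The decomposition referred to above is commonly written, for squared-error prediction at a point $x$, as

$$\mathbb{E}\big[(y-\hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)]-f(x)\big)^2 + \mathbb{E}\big[(\hat{f}(x)-\mathbb{E}[\hat{f}(x)])^2\big] + \sigma^2,$$

i.e. squared bias plus variance plus irreducible noise $\sigma^2$; pruning chiefly reduces the variance term by constraining tree size.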
Historical Context

The development of decision tree pruning originated in the early 1980s as a response to overfitting in tree-based models, with the Classification and Regression Trees (CART) framework by Leo Breiman and colleagues introducing cost-complexity pruning in 1984.[5] This method balanced tree complexity against error rates through a regularization parameter, marking the first systematic approach to post-pruning in statistical decision trees. Concurrently, J. Ross Quinlan's ID3 algorithm in 1986 laid foundational work for information gain-based tree induction but lacked built-in pruning, relying instead on manual simplification to mitigate overfitting.[8][9]

Key advancements followed in the late 1980s and early 1990s, with Quinlan proposing reduced-error pruning in 1987, a bottom-up technique that replaces subtrees with leaves if validation error decreases.[10] Subsequently, the C4.5 system in 1993 incorporated a distinct error-based pruning method using pessimistic estimates and confidence intervals for more robust generalization.[11] These milestones shifted decision trees from purely rule-based systems like ID3 toward statistically grounded methods exemplified by CART, emphasizing empirical validation to improve predictive performance.

The evolution continued into the 2000s with the rise of ensemble methods, such as Breiman's random forests in 2001, which grew unpruned trees to full depth but reduced overfitting through bagging and feature randomness, thereby diminishing the reliance on individual tree pruning. Early literature often underemphasized pre-pruning techniques, which stop tree growth early via parameters like minimum split size; however, post-2010 implementations in libraries like scikit-learn have prioritized these for efficiency in large-scale applications.[6]

As of 2025, decision tree pruning remains a foundational element in machine learning, with recent integrations into hybrid models combining trees with deep neural networks for enhanced interpretability, though no paradigm shifts have occurred since the ensemble era of the 2000s.[12]

Pruning Techniques
Pre-Pruning
Pre-pruning, also referred to as early stopping or forward pruning, involves applying stopping rules during the construction phase of a decision tree to prevent the expansion of subtrees that lack sufficient predictive value, thereby avoiding the creation of overly complex structures prone to overfitting. This top-down approach integrates checks at each node before a split is performed, ensuring that only meaningful branches are developed. In seminal implementations like C4.5, pre-pruning is employed to maintain tree simplicity while building from the root downward.[13]

Common criteria for pre-pruning include thresholds on the minimum number of samples required at a leaf or internal node, often set to 5-10 instances to guarantee reliable statistical estimates; a maximum tree depth, typically limited to 5-10 levels to curb exponential growth; and a minimum decrease in impurity metrics, such as requiring a Gini impurity reduction greater than 0.01 or an equivalent entropy-based threshold before permitting a split. These parameters can be adjusted based on dataset characteristics, with statistical tests such as the chi-square test sometimes used to assess the significance of potential splits under a null hypothesis of independence between attributes and outcomes. In practice, such criteria are drawn from foundational algorithms like CART and C4.5, where they balance model expressiveness against generalization.[13][14]

The advantages of pre-pruning lie in its computational efficiency, as it avoids the need to construct and subsequently prune an overly large tree, thereby reducing training time and memory usage, which is particularly beneficial for large datasets. Additionally, by preventing the formation of suboptimal branches early, it mitigates the risk of converging to local optima that might arise in post-pruning methods. However, pre-pruning carries the disadvantage of potentially underfitting the data if thresholds are set too conservatively, as it may halt growth prematurely and overlook splits that could capture subtle patterns in the training distribution.[14][15]

Implementation of pre-pruning follows a straightforward integration into the recursive tree-building process: at each node, candidate splits are evaluated against the predefined criteria, and expansion proceeds only if all conditions are satisfied; otherwise, the node is designated as a leaf. A simplified pseudocode outline of this check is as follows:

    function build_tree(samples, depth):
        if samples.size < min_samples_split or depth >= max_depth:
            return create_leaf_node(samples)
        best_split = find_best_split(samples)
        if impurity_decrease(best_split) < min_impurity_decrease:
            return create_leaf_node(samples)
        left_samples, right_samples = partition(samples, best_split)
        left_child = build_tree(left_samples, depth + 1)
        right_child = build_tree(right_samples, depth + 1)
        return create_internal_node(best_split, left_child, right_child)
Post-Pruning
Post-pruning is a technique in decision tree induction that involves first constructing the complete, unpruned tree from the training data and then iteratively removing branches or subtrees that do not significantly contribute to the model's predictive accuracy, thereby simplifying the structure while reducing overfitting. This approach contrasts with pre-pruning by allowing unrestricted initial growth to capture potentially complex patterns before refinement.[10]

The process employs a bottom-up strategy, beginning at the tree's leaves and progressing upward toward the root. A separate validation set is used to empirically assess the impact of potential prunings, ensuring decisions are based on generalization performance rather than training data alone. For instance, in reduced error pruning, a common post-pruning method, subtrees are evaluated by comparing the error rate before and after replacement with a leaf node labeled by the majority class in the subtree.[16]

Key steps in post-pruning include: constructing the full decision tree without early stopping; identifying candidate subtrees starting from terminal nodes; testing the replacement of each subtree with a single leaf and measuring the resulting error on validation data; and pruning the subtree if the replacement yields equal or lower error, repeating iteratively until no further improvements are possible. This validation-driven evaluation helps avoid reliance solely on training data statistics.[17]

One advantage of post-pruning is its ability to explore the full hypothesis space initially, potentially yielding more accurate representations of the data's underlying structure before simplification, which can lead to better generalization on unseen data compared to conservative early termination methods. However, it carries disadvantages such as increased computational expense from building and traversing the entire tree, and the potential for suboptimal prunings if the validation set is small, noisy, or unrepresentative.[18]

As the foundational bottom-up pruning paradigm, post-pruning systematically simplifies the tree from its extremities inward, enabling targeted removal of unreliable branches while preserving core decision paths. Algorithms such as pessimistic pruning exemplify this by using conservative error estimates derived from the training set to guide replacements without a dedicated validation holdout.[10]

Pruning Algorithms
Reduced Error Pruning
Reduced error pruning is a post-pruning technique for decision trees that simplifies the model by iteratively replacing subtrees with leaf nodes, using a separate validation set to ensure that the pruning does not increase the tree's error rate on unseen data.[10] This method, introduced by J. R. Quinlan in 1987 as an improvement over earlier systems like ID3, focuses on empirical error reduction by evaluating potential prunes based on their impact on validation accuracy.[19] It was developed in the context of constructing full, unpruned trees first and then refining them bottom-up to balance accuracy and simplicity.[10]

The algorithm proceeds in the following steps: first, a complete unpruned decision tree is grown using the training data; second, a separate validation set (typically 20-30% of the total data) is used to classify cases with the full tree and compute baseline errors; third, starting from the bottom of the tree, each non-leaf node's subtree is considered for replacement by a leaf node representing the majority class in that subtree, with the validation error recalculated after each potential replacement; fourth, the replacement is performed if the resulting error is less than or equal to the error before pruning, and this process iterates until no further beneficial replacements are possible.[10] This bottom-up approach ensures that pruning decisions are made locally but propagated globally through repeated passes.[20]

The strengths of reduced error pruning lie in its simplicity and direct linkage to observed performance improvements on the validation set, making it intuitive and effective for reducing overfitting without complex statistical adjustments.[10] However, it requires partitioning the data into training and validation sets, which can reduce the amount available for tree construction, and its conservative criterion (pruning only when error does not increase) often results in larger trees than more aggressive methods.[20]

Here is an outline of the pseudocode for the validation-based subtree replacement test in reduced error pruning:

    function ReducedErrorPruning(Tree T, ValidationSet V):
        errors = classify_all(V, T)  // compute initial error count on the validation set
        changed = true
        while changed:
            changed = false
            for each non-leaf node N in T (bottom-up):
                subtree = subtree_at(N)
                majority_class = most_common_class_in_subtree(subtree)
                temp_tree = replace_subtree_with_leaf(T, N, majority_class)
                temp_errors = classify_all(V, temp_tree)
                if temp_errors <= errors:
                    T = temp_tree
                    errors = temp_errors
                    changed = true
        return T
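The pseudocode above can be made concrete. The following is a minimal runnable sketch in Python, assuming a toy dict-based tree representation in which each internal node stores its split and the majority class of the training cases that reach it; the node layout, helper names, and data are invented for this illustration and do not come from any particular library.

    # Internal node: {"feature": idx, "value": v, "left": node, "right": node, "majority": label}
    # Leaf node:     {"label": label}

    def predict(node, x):
        # Walk down until a leaf (a node with a "label" key) is reached.
        while "label" not in node:
            node = node["left"] if x[node["feature"]] <= node["value"] else node["right"]
        return node["label"]

    def error_count(tree, X, y):
        return sum(predict(tree, xi) != yi for xi, yi in zip(X, y))

    def reduced_error_prune(tree, X_val, y_val):
        # Single bottom-up pass: collapse a subtree into its majority-class leaf
        # whenever doing so does not increase the validation error.
        def prune(node):
            if "label" in node:
                return node
            node["left"] = prune(node["left"])
            node["right"] = prune(node["right"])
            baseline = error_count(tree, X_val, y_val)
            backup = dict(node)
            node.clear()
            node["label"] = backup["majority"]   # tentatively replace with a majority-class leaf
            if error_count(tree, X_val, y_val) <= baseline:
                return node                      # keep the prune
            node.clear()
            node.update(backup)                  # revert to the original subtree
            return node
        return prune(tree)

    # Example usage with an invented toy tree and validation data:
    tree = {
        "feature": 0, "value": 2.5, "majority": "yes",
        "left": {"label": "yes"},
        "right": {"feature": 1, "value": 1.0, "majority": "no",
                  "left": {"label": "no"}, "right": {"label": "yes"}},
    }
    X_val = [[1.0, 0.5], [3.0, 0.2], [3.0, 2.0]]
    y_val = ["yes", "no", "no"]
    pruned = reduced_error_prune(tree, X_val, y_val)

In this toy example, collapsing the right-hand subtree into a "no" leaf removes the only validation error, so that prune is kept, while collapsing the root would introduce new errors and is therefore reverted.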
Cost-Complexity Pruning
Cost-complexity pruning, introduced in the Classification and Regression Trees (CART) methodology, is a post-pruning technique that balances a decision tree's accuracy against its complexity by introducing a regularization parameter $\alpha$. This algorithm generates a sequence of increasingly pruned subtrees, each corresponding to a different value of $\alpha$, allowing for systematic control over overfitting. The core idea is to minimize a penalized error measure that trades off the tree's misclassification error with its size, where the "effective $\alpha$" for pruning a specific subtree is the smallest value at which that subtree is replaced by a leaf node.

The cost-complexity measure is defined as
$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|,$$
where $R(T)$ represents the misclassification error (or resubstitution error) of the tree $T$, and $|\tilde{T}|$ is the number of terminal nodes (leaves) in $T$. $\alpha \geq 0$ acts as a penalty parameter for tree complexity. For a node $t$ with subtree $T_t$, pruning occurs when $R_\alpha(\{t\}) \leq R_\alpha(T_t)$, which simplifies to the condition $R(t) + \alpha \leq R(T_t) + \alpha\,|\tilde{T}_t|$. Rearranging yields the effective $\alpha$ for that node:
$$\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}.$$
This value indicates the complexity threshold at which pruning the subtree at $t$ becomes beneficial, as it equates the cost of the leaf to the cost of the full subtree. The derivation ensures that pruning decisions are made globally optimal within the family of subtrees, producing nested sequences where larger $\alpha$ yields simpler trees.[6]

The algorithm proceeds in the following steps. First, grow the full unpruned tree using the training data to minimize $R(T)$. Second, for every non-terminal node, compute the effective $\alpha$ using the formula above and identify the node with the smallest value. Third, prune the subtree at that node (replacing it with a leaf), updating the tree and recomputing effective $\alpha$ values for affected nodes; repeat this process, increasing $\alpha$ monotonically, until only the root remains, yielding a nested sequence of subtrees corresponding to increasing values of $\alpha$. Finally, select the optimal subtree from this sequence by cross-validation or a held-out validation set, evaluating the total error plus any out-of-sample penalty. This process, while rooted in post-pruning, differs from simpler alternatives like reduced error pruning by enabling a continuum of complexity levels rather than binary decisions.[6]

Key strengths of cost-complexity pruning include its systematic nature, which guarantees a unique sequence of nested subtrees for easy comparison, and its ability to handle continuous tuning of $\alpha$ for fine-grained control over model complexity. It effectively mitigates overfitting in large trees by penalizing excessive branching in a mathematically grounded way. However, limitations arise from its computational intensity, as computing the full sequence requires evaluating effective $\alpha$ across all nodes iteratively, which scales poorly for very large trees. Additionally, selecting the optimal $\alpha$ demands cross-validation, adding further overhead and sensitivity to validation set quality.[6]

In practice, cost-complexity pruning is widely implemented in machine learning libraries, such as scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor, where the ccp_alpha parameter directly sets $\alpha$ to control pruning during or after tree growth. Setting ccp_alpha = 0 yields the full unpruned tree, while higher values enforce stricter pruning; the library automates the minimal cost-complexity computation for efficiency.[21][22]
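A brief, hedged usage sketch of this scikit-learn interface follows; the dataset and the selection strategy (scoring candidate alphas on a single held-out split) are arbitrary illustrations, and in practice cross-validation would typically be used to choose ccp_alpha.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Compute the effective alphas along the minimal cost-complexity pruning path.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # Fit one tree per candidate alpha and keep the one that scores best on held-out data.
    best_alpha, best_score = 0.0, -1.0
    for alpha in path.ccp_alphas:
        clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        if score > best_score:
            best_alpha, best_score = alpha, score
    print(best_alpha, best_score)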
Pessimistic Pruning
Pessimistic pruning is a post-pruning technique for decision trees that applies statistical adjustments to the observed training error to derive conservative upper-bound estimates, thereby avoiding overly optimistic decisions that might retain unreliable subtrees. Introduced by Quinlan in 1987, this method uses the training data itself without requiring a separate validation set, making it computationally efficient for large datasets. It forms the basis for pruning in the C4.5 algorithm, where error estimates are inflated to account for the uncertainty inherent in finite samples.

The core of pessimistic pruning lies in computing a pessimistic error estimate for each node or subtree, which incorporates a penalty for model complexity to prevent overfitting. For a leaf node covering $n$ samples with $E_{\text{base}}$ observed errors, the basic pessimistic error $E$ is calculated as
$$E = \frac{E_{\text{base}} + 0.5}{n + 1},$$
applying a continuity correction to approximate the binomial distribution more conservatively. For subtrees with $k$ leaves, this extends to
$$E = \frac{E_{\text{base}} + 0.5\,k}{n},$$
where the penalty $0.5\,k$ reflects the added complexity of multiple decision outcomes, adjusted further using chi-squared tests or binomial confidence intervals for reliability. An alternative formulation employs confidence intervals:
$$E_{\text{pess}} = \hat{e} + z\sqrt{\frac{\hat{e}\,(1 - \hat{e})}{n}},$$
where $\hat{e}$ is the observed error rate and $z$ is typically set to a value like 0.67 for a conservative 50% confidence level, to upper-bound the true error rate.

The pruning process proceeds bottom-up through the tree. First, for each internal node, the pessimistic error upper bound is computed for the full subtree rooted at that node by summing the weighted pessimistic errors of its children and adding a 0.5 penalty for the node itself. Second, this is compared to the pessimistic error bound if the subtree were replaced by a single leaf predicting the majority class in the subtree's samples; pruning occurs if the leaf's bound is less than or equal to the subtree's bound. Third, decisions propagate upward, with recalculations ensuring global consistency while favoring simpler structures.

This approach reduces reliance on validation sets by embedding statistical caution directly into training data estimates, effectively preventing prunes based on noisy or optimistic error rates from small samples. However, its conservative nature can lead to retaining more complex trees than necessary, potentially under-pruning in low-noise scenarios, and it assumes a binomial model for classification errors, which may not hold for all data distributions.

Pessimistic pruning relates closely to the Minimum Description Length (MDL) principle, which balances model fit against complexity by minimizing the total description length of the data and tree structure; the error penalties in pessimistic estimates approximate MDL's complexity term, promoting trees that compress the data efficiently without separate optimization.[23]
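A minimal sketch of these estimates as Python helpers, assuming the formulas quoted above; the function names are invented for illustration and do not correspond to any library API.

    import math

    def pessimistic_error_leaf(errors, n):
        # Continuity-corrected error estimate for a single leaf, per the text above.
        return (errors + 0.5) / (n + 1)

    def pessimistic_error_subtree(total_errors, n, num_leaves):
        # Estimate for a subtree with num_leaves leaves covering n training cases.
        return (total_errors + 0.5 * num_leaves) / n

    def pessimistic_error_ci(observed_error_rate, n, z=0.67):
        # Upper-bound estimate via the normal-approximation confidence interval.
        return observed_error_rate + z * math.sqrt(
            observed_error_rate * (1 - observed_error_rate) / n)

    # A subtree would be pruned if replacing it with a majority-class leaf does not
    # raise the pessimistic estimate, e.g.:
    # prune = pessimistic_error_leaf(e_leaf, n) <= pessimistic_error_subtree(e_sub, n, k)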
Examples and Applications
Illustrative Example
To illustrate decision tree pruning, consider the classic "play tennis" toy dataset, which consists of 14 instances describing weather conditions (attributes: outlook with values sunny, overcast, rain; temperature with hot, mild, cool; humidity with high, normal; wind with weak, strong) and the target attribute play tennis (yes or no). This dataset is commonly used to demonstrate decision tree construction and serves as a basis for showing pruning effects.

The unpruned decision tree is grown using an algorithm like ID3 until all leaves are pure, resulting in 5 leaves that achieve 100% accuracy on the training set (0% error rate). The tree structure begins with outlook as the root split: overcast leads directly to a yes leaf; for sunny, it further splits on humidity (high to no, normal to yes); for rain, it splits on wind (weak to yes, strong to no). However, on a held-out validation set (e.g., a split of 9 training and 5 validation instances), the unpruned tree may show signs of overfitting, with reduced accuracy compared to the training set.

Applying reduced error pruning, the process evaluates bottom-up whether replacing a subtree with a leaf (assigned the majority class from its training instances) reduces validation error. For example, consider pruning the wind split under the rain branch: the unpruned subtree has weak → yes and strong → no, but replacing it with a single yes leaf (the majority over the 5 rain instances: 3 yes, 2 no) can be tested. If validation error decreases, the prune is applied, as the simplified branch generalizes better.

The final pruned tree might then have 4 leaves: outlook = overcast → yes; sunny → humidity (high → no, normal → yes); rain → yes (pruned leaf). This achieves better validation accuracy than the unpruned version, improving generalization at the cost of a slight increase in training error. The pruned tree can be represented textually as follows:

- Outlook
  - Overcast → Yes
  - Sunny → Humidity
    - High → No
    - Normal → Yes
  - Rain → Yes