Random forest
from Wikipedia

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees.[1][2] Random forests correct for decision trees' habit of overfitting to their training set.[3]: 587–588 

The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1] using the random subspace method,[2] which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.[4][5][6]

An extension of the algorithm was developed by Leo Breiman[7] and Adele Cutler,[8] who registered[9] "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).[10] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman[11] in order to construct a collection of decision trees with controlled variance.

History


The general method of random decision forests was first proposed by Salzberg and Heath in 1993,[12] with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995.[1] Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines[2] concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.[4][5][6]

The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman[11] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho[2] was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by Thomas G. Dietterich.[13]

The proper introduction of random forests was made in a paper by Leo Breiman,[7] that has become one of the world's most cited papers.[14] This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

  1. Using out-of-bag error as an estimate of the generalization error.
  2. Measuring variable importance through permutation.

The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

Algorithm


Preliminaries: decision tree learning


Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".[3]: 352 

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[3]: 587–588  This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Bagging

Illustration of training a Random Forest model. The training dataset (in this case, of 250 rows and 100 columns) is randomly sampled with replacement n times. Then, a decision tree is trained on each sample. Finally, for prediction, the results of all n trees are aggregated to produce a final decision.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
  1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
  2. Train a classification or regression tree fb on Xb, Yb.

After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x'),$

or by taking the plurality vote in the case of classification trees.
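The training loop and the two aggregation rules above can be sketched in a few lines of Python. This is a minimal illustration only: the scikit-learn tree learner and the toy make_regression data are assumptions for the example, not part of the algorithm description, and a classification variant would take the mode of the tree outputs instead of their mean.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.datasets import make_regression

    # Toy data, assumed for illustration only
    X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)
    n, B = len(X), 100
    rng = np.random.default_rng(0)

    trees = []
    for b in range(B):
        # 1. Sample, with replacement, n training examples (the bootstrap sample X_b, Y_b)
        idx = rng.integers(0, n, size=n)
        # 2. Train a regression tree f_b on X_b, Y_b
        trees.append(DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx]))

    # Prediction for unseen points: average the B individual tree predictions
    x_new = X[:5]
    pred = np.mean([t.predict(x_new) for t in trees], axis=0)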

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x':

$\sigma = \sqrt{\frac{\sum_{b=1}^{B} \left( f_b(x') - \hat{f} \right)^2}{B - 1}}.$

The number B of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. B can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample.[15]

The training and test error tend to level off after some number of trees have been fit.
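As a practical sketch of choosing B with the out-of-bag error, scikit-learn exposes an oob_score_ attribute when oob_score=True; the dataset and the particular values of B used here are illustrative assumptions, not recommendations.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=1000, n_features=20, noise=1.0, random_state=0)

    # Compare the out-of-bag score for different numbers of trees B (n_estimators)
    for B in (10, 100, 500):
        forest = RandomForestRegressor(n_estimators=B, bootstrap=True,
                                       oob_score=True, random_state=0).fit(X, y)
        print(B, forest.oob_score_)  # the OOB score tends to level off as B grows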

From bagging to random forests


The above procedure describes the original bagging algorithm for trees. Random forests add a second layer of randomization, sometimes called "feature bagging": they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.[16]

Typically, for a classification problem with p features, $\lfloor \sqrt{p} \rfloor$ features are used in each split.[3]: 592  For regression problems the inventors recommend $\lfloor p/3 \rfloor$ features with a minimum node size of 5 as the default.[3]: 592  In practice, the best values for these parameters should be tuned on a case-by-case basis for every problem.[3]: 592 
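In scikit-learn these defaults correspond roughly to the max_features and min_samples_leaf parameters; the sketch below is a hedged illustration of that mapping (the exact parameter semantics are the library's, and min_samples_leaf only approximates "minimum node size").

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # "sqrt" considers floor(sqrt(p)) candidate features at each split (classification rule)
    clf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

    # A float is interpreted as a fraction of the features; 1/3 mirrors the p/3 rule,
    # and min_samples_leaf=5 roughly mirrors the recommended minimum node size
    reg = RandomForestRegressor(n_estimators=300, max_features=1/3,
                                min_samples_leaf=5, random_state=0)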

ExtraTrees


Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node.

Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are $\sqrt{p}$ for classification and $p$ for regression, where $p$ is the number of features in the model.[17]
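A minimal sketch of the ExtraTrees variant using scikit-learn's implementation follows; the toy dataset and parameter values are assumptions for illustration. With bootstrap=False (the library default) each tree sees the whole learning sample, and splits are chosen among random cut-points rather than exhaustively optimized.

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

    # bootstrap=False trains each tree on the full sample; randomness comes from
    # the random cut-points and the random feature subsets at each node
    extra = ExtraTreesClassifier(n_estimators=300, max_features="sqrt",
                                 bootstrap=False, random_state=0)
    print(cross_val_score(extra, X, y, cv=5).mean())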

Random forests for high-dimensional data


The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:

  • Prefiltering: Eliminate features that are mostly just noise.[18][19]
  • Enriched Random Forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative.[20][21][22]
  • Tree-weighted random forest (TWRF): Give more weight to more accurate trees.[23][24]

Properties


Variable importance


Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper[7] and is implemented in the R package randomForest.[8]

Permutation importance


To measure a feature's importance in a data set, a random forest is first trained on the data. During training, the out-of-bag error for each data point is recorded and averaged over the forest. (If bagging is not used during training, we can instead compute errors on an independent test set.)

After training, the values of the feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance for the feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.[25]
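The permutation scheme can be sketched as follows. For brevity this sketch permutes each feature on a held-out test set (the parenthetical variant mentioned above) rather than per-tree on the out-of-bag samples; the dataset and the use of accuracy as the error measure are assumptions for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    rng = np.random.default_rng(0)
    baseline = forest.score(X_te, y_te)
    for j in range(X.shape[1]):
        X_perm = X_te.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute feature j only
        drop = baseline - forest.score(X_perm, y_te)   # larger drop = more important
        print(f"feature {j}: importance ~ {drop:.4f}")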

This method of determining variable importance has some drawbacks:

  • When features have different numbers of values, random forests favor features with more values. Solutions to this problem include partial permutations[26][27][28] and growing unbiased trees.[29][30]
  • If the data contain groups of correlated features of similar relevance, then smaller groups are favored over large groups.[31]
  • If there are collinear features, the procedure may fail to identify important features. A solution is to permute groups of correlated features together.[32]

Mean decrease in impurity feature importance


This approach to feature importance for random forests considers as important the variables which yield large decreases in impurity during splitting.[33] It is described in the book Classification and Regression Trees by Leo Breiman[34] and is the default implementation in scikit-learn and R. The unnormalized average importance of a feature x is defined as:

$\text{importance}(x) = \frac{1}{n_T} \sum_{i=1}^{n_T} \; \sum_{j \,\in\, \text{nodes of } T_i \text{ splitting on } x} p_{T_i}(j)\, \Delta i_{T_i}(j),$

where

  • x is a feature
  • $n_T$ is the number of trees in the forest
  • $T_i$ is tree i
  • $p_{T_i}(j) = \frac{n_j}{n}$ is the fraction of samples reaching node j
  • $\Delta i_{T_i}(j)$ is the change in impurity in tree $T_i$ at node j.

As the impurity measure for the samples falling in a node, for example, the entropy, the Gini coefficient, or the mean squared error can be used.

The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
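In scikit-learn, the fitted attribute feature_importances_ reports this normalized mean-decrease-in-impurity measure. The following is a minimal sketch with an assumed toy dataset.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                               random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Normalized MDI importances; they sum to 1 across features
    for j, imp in enumerate(forest.feature_importances_):
        print(f"feature {j}: {imp:.3f}")
    print(sum(forest.feature_importances_))  # ~1.0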

The default scikit-learn implementation can report misleading feature importance:[32]

  • it favors high cardinality features
  • it uses training statistics and so does not reflect a feature's usefulness for predictions on a test set[35]

Relationship to nearest neighbors


A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.[36] Both can be viewed as so-called weighted neighborhood schemes. These are models built from a training set $\{(x_i, y_i)\}_{i=1}^{n}$ that make predictions $\hat{y}$ for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

$\hat{y} = \sum_{i=1}^{n} W(x_i, x')\, y_i.$

Here, $W(x_i, x')$ is the non-negative weight of the i-th training point relative to the new point x' in the same tree. For any x', the weights for the points $x_i$ must sum to 1. Weight functions are as follows:

  • In k-NN, $W(x_i, x') = \frac{1}{k}$ if $x_i$ is one of the k points closest to x', and zero otherwise.
  • In a tree, $W(x_i, x') = \frac{1}{k'}$ if $x_i$ is one of the k' points in the same leaf as x', and zero otherwise.

Since a forest averages the predictions of a set of m trees with individual weight functions $W_j$, its predictions are

$\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x')\, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.$

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points $x_i$ sharing the same leaf as x' in any tree j. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.[36]
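The forest weight function can be reconstructed from shared leaves using scikit-learn's apply() method, which returns the leaf index of each sample in each tree. This is only a sketch under stated assumptions: the toy data are invented, and because the library's trees are fit on bootstrap samples the reconstructed weighted-neighborhood prediction approximates, rather than exactly reproduces, forest.predict.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)
    forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

    x_new = X[:1]                        # one query point x'
    train_leaves = forest.apply(X)       # shape (n_samples, n_trees): leaf index per tree
    query_leaves = forest.apply(x_new)   # shape (1, n_trees)

    # W_j(x_i, x') = 1/k'_j if x_i falls in the same leaf of tree j as x', else 0;
    # the forest weights average the per-tree weight functions
    same_leaf = (train_leaves == query_leaves)       # (n_samples, n_trees) booleans
    per_tree_w = same_leaf / same_leaf.sum(axis=0)   # each column sums to 1
    forest_w = per_tree_w.mean(axis=1)

    print(np.dot(forest_w, y))        # weighted-neighborhood prediction
    print(forest.predict(x_new)[0])   # close, but not identical when bootstrapping is used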

Unsupervised learning


As part of their construction, random forest predictors naturally lead to a dissimilarity measure among observations. One can analogously define dissimilarity between unlabeled data, by training a forest to distinguish original "observed" data from suitably generated synthetic data drawn from a reference distribution.[7][37] A random forest dissimilarity is attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. Random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. Random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.[38]

Variants


Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.[39][40][41] In cases where the relationship between the predictors and the target variable is linear, the base learners may have an accuracy as high as that of the ensemble learner.[42][39]

Kernel random forest


In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.[43]

History


Leo Breiman[44] was the first person to notice the link between random forests and kernel methods. He pointed out that random forests trained using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon[45] established the connection between random forests and adaptive nearest neighbors, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani[46] proposed Kernel Random Forest (KeRF) and showed that it can empirically outperform state-of-the-art kernel methods. Scornet[43] first defined KeRF estimates and gave the explicit link between KeRF estimates and random forests. He also gave explicit expressions for kernels based on the centered random forest[47] and the uniform random forest,[48] two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Notations and definitions


Preliminaries: Centered forests


Centered forest[47] is a simplified model of Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level $k$ is built, where $k \in \mathbb{N}$ is a parameter of the algorithm.

Uniform forest


Uniform forest[48] is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

From random forest to KeRF


Given a training sample $\mathcal{D}_n = ((\mathbf{X}_1, Y_1), \ldots, (\mathbf{X}_n, Y_n))$ of $[0,1]^p \times \mathbb{R}$-valued independent random variables distributed as the independent prototype pair $(\mathbf{X}, Y)$, where $\operatorname{E}[Y^2] < \infty$, we aim at predicting the response $Y$, associated with the random variable $\mathbf{X}$, by estimating the regression function $m(\mathbf{x}) = \operatorname{E}[Y \mid \mathbf{X} = \mathbf{x}]$. A random regression forest is an ensemble of $M$ randomized regression trees. Denote by $m_n(\mathbf{x}, \Theta_j)$ the predicted value at point $\mathbf{x}$ by the $j$-th tree, where $\Theta_1, \ldots, \Theta_M$ are independent random variables, distributed as a generic random variable $\Theta$, independent of the sample $\mathcal{D}_n$. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate $m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}, \Theta_j)$. For regression trees, we have $m_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}$, where $A_n(\mathbf{x}, \Theta_j)$ is the cell containing $\mathbf{x}$, designed with randomness $\Theta_j$ and dataset $\mathcal{D}_n$, and $N_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}$.

Thus random forest estimates satisfy, for all $\mathbf{x} \in [0,1]^p$, $m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} \left( \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)} \right)$. Random regression forests have two levels of averaging, first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than those of observations which belong to less populated cells. In order to improve the random forest methods and compensate for this misestimation, Scornet[43] defined KeRF by

$\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^{M} N_n(\mathbf{x}, \Theta_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)},$

which is equal to the mean of the $Y_i$'s falling in the cells containing $\mathbf{x}$ in the forest. If we define the connection function of the finite forest as $K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)}$, i.e. the proportion of cells shared between $\mathbf{x}$ and $\mathbf{z}$, then almost surely we have $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{X}_i)}{\sum_{\ell=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{X}_\ell)}$, which defines the KeRF.
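The difference between the two averaging schemes can be made concrete with a small sketch: the random forest averages within each tree's cell first and then across trees, while the KeRF pools all observations from the matching cells before averaging. The sketch below approximates cells with scikit-learn leaf memberships; the toy data, the choice bootstrap=False (so that apply() reflects exactly the data each tree saw), and the shallow depth are assumptions made only so the two estimates visibly differ.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=400, n_features=4, noise=1.0, random_state=0)
    forest = RandomForestRegressor(n_estimators=100, bootstrap=False,
                                   max_features=0.5, max_depth=3,
                                   random_state=0).fit(X, y)

    x_new = X[:1]
    leaves = forest.apply(X)       # cell membership per tree, one column per tree
    q = forest.apply(x_new)        # cells containing x'
    in_cell = (leaves == q)        # indicator of X_i falling in the cell of x' in tree j

    # Random forest estimate: average within each tree's cell, then across trees
    rf_est = np.mean([y[in_cell[:, j]].mean() for j in range(in_cell.shape[1])])

    # KeRF estimate: mean of all Y_i falling in the cells containing x', pooled over trees
    kerf_est = (in_cell * y[:, None]).sum() / in_cell.sum()

    print(rf_est, kerf_est)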

Centered KeRF


The construction of Centered KeRF of level $k$ is the same as for the centered forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$, the corresponding KeRF estimate; the corresponding kernel function, or connection function, is

$K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{j=1}^{d} \mathbf{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}, \qquad \mathbf{x}, \mathbf{z} \in [0,1]^d.$

Uniform KeRF


Uniform KeRF is built in the same way as the uniform forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$, the corresponding KeRF estimate; the corresponding kernel function, or connection function (evaluated at the point $\mathbf{0}$), is

$K_k^{uf}(\mathbf{0}, \mathbf{x}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{m=1}^{d} \left( 1 - |x_m| \sum_{j=0}^{k_m - 1} \frac{(-\ln |x_m|)^j}{j!} \right), \qquad \mathbf{x} \in [0,1]^d.$

Properties


Relation between KeRF and random forest


Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

Assume that there exist sequences $(a_n), (b_n)$ such that, almost surely,

$a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \quad \text{and} \quad a_n \leq \frac{1}{M} \sum_{m=1}^{M} N_n(\mathbf{x}, \Theta_m) \leq b_n.$

Then, almost surely,

$\left| m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x}) \right| \leq \frac{b_n - a_n}{a_n}\, \tilde{m}_{M,n}(\mathbf{x}).$

Relation between infinite KeRF and infinite random forest


When the number of trees $M$ goes to infinity, we obtain the infinite random forest and the infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

Assume that there exist sequences $(\varepsilon_n), (a_n), (b_n)$ such that, almost surely,

$\operatorname{E}[N_n(\mathbf{x}, \Theta)] \geq 1,$
$\operatorname{P}\left[ a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \mid \mathcal{D}_n \right] \geq 1 - \varepsilon_n / 2,$
$\operatorname{P}\left[ a_n \leq \operatorname{E}_\Theta[N_n(\mathbf{x}, \Theta)] \leq b_n \mid \mathcal{D}_n \right] \geq 1 - \varepsilon_n / 2.$

Then, almost surely,

$\left| m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x}) \right| \leq \frac{b_n - a_n}{a_n}\, \tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \left( \max_{1 \leq i \leq n} Y_i \right).$

Consistency results


Assume that $Y = m(\mathbf{X}) + \varepsilon$, where $\varepsilon$ is a centered Gaussian noise, independent of $\mathbf{X}$, with finite variance $\sigma^2 < \infty$. Moreover, $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $m$ is Lipschitz. Scornet[43] proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

Consistency of centered KeRF


Providing $k \to \infty$ and $n/2^k \to \infty$, there exists a constant $C_1 > 0$ such that, for all $n$, $\operatorname{E}\left[ \tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X}) \right]^2 \leq C_1 \, n^{-1/(3 + d \log 2)} (\log n)^2$.

Consistency of uniform KeRF


Providing $k \to \infty$ and $n/2^k \to \infty$, there exists a constant $C > 0$ such that $\operatorname{E}\left[ \tilde{m}_n^{uf}(\mathbf{X}) - m(\mathbf{X}) \right]^2 \leq C \, n^{-2/(6 + 3 d \log 2)} (\log n)^2$.

Disadvantages


While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability of decision trees. Decision trees are among a fairly small family of easily interpretable machine learning models, along with linear models, rule-based models, and attention-based models. This interpretability is one of the main advantages of decision trees: it allows developers to confirm that the model has learned realistic information from the data, and it allows end-users to have trust and confidence in the decisions made by the model.[39][3] For example, following the path that a single decision tree takes to make its decision is trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function.[39][49][50]

Another limitation of random forests is that if features are linearly correlated with the target, a random forest may not enhance the accuracy of the base learner.[39][42] The same holds in problems with multiple categorical variables.[51]

from Grokipedia
A random forest is an ensemble learning method for classification, regression, and other supervised machine learning tasks that constructs a multitude of decision trees during training and aggregates their outputs using majority voting for classification or averaging for regression, yielding a final prediction. This approach builds on the concept of bagging, where each tree is trained on a random bootstrap sample of the dataset, and introduces randomness in feature selection at each split to decorrelate the trees and enhance overall performance. Introduced by Leo Breiman in 2001, random forests combine ideas from earlier work on bootstrap aggregating (bagging) and random subspace methods, resulting in a robust algorithm that converges to a low generalization error as the number of trees increases, provided the trees are both strong predictors and weakly correlated.

The algorithm operates by growing an ensemble of decision trees, where for each tree a random sample of the training data is selected with replacement (bootstrap sampling), and at every node a random subset of features is considered for the best split, typically limited to the square root of the total number of features for classification. This randomness reduces the variance of individual decision trees, which are prone to overfitting, and allows the forest to handle noisy data and outliers more effectively than single trees or boosting methods. Key hyperparameters include the number of trees (n_estimators, often defaulting to 100), the maximum depth of trees to control complexity, and the fraction of features considered per split (max_features), which can be tuned to optimize accuracy and computational efficiency. Additionally, random forests provide built-in mechanisms for estimating generalization error (using out-of-bag samples not selected in a tree's bootstrap sample) and computing variable importance based on mean decrease in impurity or permutation importance.

Random forests offer several advantages over traditional models, including high predictive accuracy comparable to or exceeding boosting methods in many scenarios, inherent resistance to overfitting due to ensemble averaging, and the ability to process high-dimensional data without extensive feature engineering, with modern implementations handling missing values natively. They are computationally efficient and parallelizable, as trees can be built independently, making them suitable for large datasets. Since their introduction, random forests have seen advancements such as online learning variants and integration with genetic algorithms, further broadening their applicability.

Widely adopted across domains, random forests are employed in bioinformatics for gene selection and protein analysis, ecology for species distribution modeling, medicine for diagnostic prediction, and finance for risk assessment, owing to their interpretability through feature importance rankings and robustness in real-world, noisy environments. Their popularity stems from consistent performance on benchmark datasets and ease of implementation in libraries such as scikit-learn, where they serve as a baseline for more complex techniques.

Introduction and Background

Definition and Overview

A random forest is a supervised learning algorithm for classification and regression tasks that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Each tree in the forest is built using a bootstrap sample of the data and a random subset of features at each split to introduce diversity among the trees. The core intuition behind random forests is to combine multiple weak learners—typically unpruned decision trees, which individually exhibit low bias but high variance—into a single robust model that mitigates overfitting by averaging out errors across the ensemble. This approach leverages the law of large numbers, ensuring that as more trees are added, the ensemble's performance stabilizes without further degradation from overfitting. Unlike a single decision tree, which can be highly sensitive to noise and prone to high variance due to its recursive partitioning, the random forest reduces this variance through aggregation, leading to improved generalization on unseen data. For prediction in classification, the final output is determined by majority voting across all trees; a brief representation is:

    For each tree in forest:
        prediction = tree.predict(input)
    Aggregate: final prediction = mode of all tree predictions
    Return the aggregated prediction


In regression, predictions are averaged instead of taking the mode. At a high level, the procedure involves generating bootstrap samples from the training set to fit each tree independently, constructing the trees with randomized feature selections to decorrelate them, and then combining their outputs via voting or averaging for the final prediction. This ensemble structure makes random forests particularly effective in handling the high-dimensional data and noisy environments common in applications.

Historical Development

The concept of ensemble methods for decision trees emerged in the early 1990s, with Stephen Salzberg and colleagues exploring randomized approaches to improve tree-based classifiers. In 1993, Heath, Kasif, and Salzberg introduced methods for inducing oblique decision trees through randomized perturbations and heuristics, laying groundwork for diversifying tree structures in ensembles to enhance accuracy. This work highlighted the potential of randomization to mitigate overfitting in single trees, influencing subsequent ensemble techniques.

In 1995, Tin Kam Ho developed the random subspace method, which constructs decision forests by building trees in randomly selected subspaces of the feature space, particularly effective for high-dimensional data. Ho's approach addressed limitations in traditional decision trees by reducing correlation among ensemble members through feature subset selection, demonstrating improved accuracy on classification tasks. A key milestone came in 1996, when Leo Breiman introduced bootstrap aggregating (bagging), a method for generating multiple versions of a predictor by training on bootstrap samples and aggregating their outputs to reduce variance. Building on similar ideas, Amit and Geman in 1997 proposed randomized trees for shape recognition, incorporating bagging with random feature selection to create diverse classifiers from a large pool of possible features. Their method emphasized randomization at split points to handle complex, high-variety data like images.

The pivotal advancement occurred in 2001 with Leo Breiman's seminal paper "Random Forests," which formalized the algorithm by combining bootstrap aggregating (bagging) with random feature selection at each split, creating uncorrelated trees for robust predictions in classification and regression. Breiman's formulation integrated prior ideas, such as Dietterich's 2000 study on randomization via perturbed splits, which compared methods like bagging and boosting and demonstrated the benefits of randomization in reducing error. By 2025, Breiman's paper had amassed over 167,000 citations, underscoring its transformative impact on machine learning.

Following Breiman's introduction, random forests saw rapid adoption in specialized fields, notably bioinformatics, where the 2002 release of the randomForest package for R facilitated analysis and classification tasks in genomics. This tool enabled researchers to handle high-dimensional data, marking an early evolution toward domain-specific applications. In the 2010s, integration into libraries like scikit-learn further broadened accessibility, solidifying random forests as a standard ensemble method across disciplines.

Core Algorithm

Decision Tree Fundamentals

A decision tree is a predictive modeling tool that represents decisions and their possible consequences as a tree-like structure, consisting of internal nodes, branches, and leaf nodes. Internal nodes represent feature tests or splits on input variables, branches denote the outcome of those tests, and leaf nodes provide the final predictions, such as class labels for classification or mean values for regression. This hierarchical structure allows the tree to partition the input space into regions based on feature values, enabling interpretable and non-parametric modeling of complex relationships in data.

The learning procedure for a decision tree involves recursive partitioning of the training data, starting from the root node and repeatedly splitting subsets of the data until a stopping criterion is met, such as maximum depth or minimum samples per leaf. At each internal node, an impurity measure quantifies the homogeneity of the subset with respect to the target variable; common measures include the Gini index for classification, defined as $G = 1 - \sum_{i=1}^{c} p_i^2$, where $c$ is the number of classes and $p_i$ is the proportion of class $i$ in the subset, and information entropy, $H = - \sum_{i=1}^{c} p_i \log_2 p_i$. These measures evaluate how "pure" a split is, with lower impurity indicating better separation of classes. For regression, analogous measures such as the mean squared error are used.

Splitting criteria are determined using a greedy search that selects, at each node, the feature and threshold combination which maximizes the reduction in impurity across the resulting child nodes, often computed as the weighted average impurity decrease $\Delta I = I(\text{parent}) - \sum_{k} \frac{N_k}{N} I(\text{child}_k)$, where $N$ and $N_k$ are the sizes of the parent and child subsets. This exhaustive search over features and possible thresholds ensures locally optimal splits but does not guarantee a globally optimal tree. Binary splits are typical in methods like CART, though multi-way splits appear in algorithms such as ID3.

To mitigate overfitting, pruning techniques simplify the fully grown tree by removing branches that contribute little to predictive accuracy. Cost-complexity pruning, a post-pruning method, balances accuracy and size by minimizing the objective $R_\alpha(T) = R(T) + \alpha |T|$, where $R(T)$ is the resubstitution error (e.g., misclassification rate), $|T|$ is the number of terminal nodes, and $\alpha \geq 0$ is a complexity parameter tuned via cross-validation. As $\alpha$ increases, suboptimal subtrees are pruned, producing a sequence of nested trees; the optimal $\alpha$ is selected to minimize error on validation data. This approach prevents the tree from capturing noise in the training set.

Despite their strengths in interpretability and handling mixed data types, single decision trees suffer from high variance and a propensity for overfitting, leading to instability where small changes in the training data can produce substantially different trees. For instance, consider a toy dataset with features representing height and weight to predict gender; a single tree might split first on height > 170 cm, achieving near-perfect training accuracy but failing to generalize to a slightly perturbed test set with added noise, resulting in high error due to memorizing outliers rather than underlying patterns. This instability arises because greedy splitting amplifies small data variations into divergent structures, particularly in deeper trees.
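The impurity measures and the weighted impurity decrease above can be computed directly. This is a minimal sketch; the helper names gini and entropy and the tiny label arrays are invented for illustration.

    import numpy as np

    def gini(labels):
        # Gini impurity G = 1 - sum_i p_i^2 for the labels reaching a node
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def entropy(labels):
        # Information entropy H = -sum_i p_i log2 p_i
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    parent = np.array([0, 0, 0, 1, 1, 1])
    left, right = np.array([0, 0, 0, 1]), np.array([1, 1])

    # Weighted impurity decrease for the candidate split
    delta = gini(parent) - (len(left) / len(parent)) * gini(left) \
                         - (len(right) / len(parent)) * gini(right)
    print(gini(parent), entropy(parent), delta)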

Bagging Procedure

The bagging procedure, short for bootstrap aggregating, forms the foundational technique in random forests by generating multiple bootstrap samples from the original training dataset to train diverse decision trees, thereby reducing variance. Introduced by Leo Breiman in 1996, bagging involves resampling the dataset with replacement to create subsets, each used to fit an independent base learner, typically a decision tree, followed by aggregating their outputs to produce a final prediction. This method leverages the instability of decision trees—where small changes in training data lead to significant variations in the fitted trees—to stabilize the ensemble through averaging or voting.

Bootstrap sampling is performed by drawing $n$ observations with replacement from the original dataset of size $n$, resulting in $B$ such subsets (where $B$ is the number of trees, often 100 or more). On average, approximately 63.2% of the original observations appear in each bootstrap sample, leaving about 36.8% out-of-bag (OOB), calculated as the probability that a specific observation is not selected: $\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$. These OOB samples provide an internal validation set for unbiased error estimation without separate holdout data. Each tree is then trained on its respective bootstrap sample, ensuring diversity among the trees due to the randomness in sampling.

Aggregation combines the predictions from the $B$ trees: for regression tasks, the final prediction is the average of the individual tree outputs, which smooths out fluctuations; for classification, it is the majority vote across the trees, selecting the class with the most endorsements. This aggregation step is crucial, as it exploits the law of large numbers: as $B \to \infty$, the aggregated predictor converges to the expectation of a single tree's prediction over the bootstrap distribution, theoretically reducing variance while preserving the low bias of the base learners. Breiman demonstrated that for unstable procedures like trees, bagging can substantially improve accuracy by mitigating overfitting to noise in any single training set. The algorithm proceeds in the following steps:
  1. For b = 1 to B, generate a bootstrap sample by drawing n observations with replacement from the training data.
  2. Train a decision tree on the b-th bootstrap sample, using the full set of features at each split (unlike the random subspace extension).
  3. For a new input, obtain predictions from all B trees and aggregate them via averaging (regression) or majority vote (classification).
To illustrate, consider a small regression dataset with three observations: (x1=1, y1=2), (x2=2, y2=4), (x3=3, y3=6), where the true underlying function is linear (y = 2x). A single decision tree might overfit noise if the data includes minor perturbations, predicting, say, y = 5.5 for x = 2.5. With bagging using B = 3 bootstrap samples, one sample might be {(1,2), (2,4), (2,4)}, yielding a tree that predicts 4 for x = 2.5; another, {(2,4), (3,6), (3,6)}, predicts 5.33; and the third, {(1,2), (1,2), (3,6)}, predicts 3.33. Averaging these gives about 4.22. Each bootstrap tree's prediction varies considerably (from 3.33 to 5.33), but the aggregate is far less sensitive to which particular sample was drawn, illustrating bagging's variance reduction in practice, as extended to random forests.
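The in-bag and out-of-bag fractions quoted above can be checked numerically with a short simulation; the particular values of n and the number of repetitions are arbitrary choices for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    n, B = 1000, 2000

    in_bag_fracs = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # one bootstrap sample, with replacement
        in_bag_fracs.append(len(np.unique(idx)) / n)  # fraction of distinct observations drawn

    print(np.mean(in_bag_fracs))   # ~0.632: share of observations appearing in a sample
    print((1 - 1 / n) ** n)        # ~0.368: analytic probability of being out-of-bag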

Random Subspace Method

The random subspace method, a key component of random forests, involves selecting a random subset of features at each node during tree construction to determine potential splits, thereby introducing additional diversity among the ensemble members beyond data sampling techniques. This approach randomly chooses $m$ features out of the total $p$ available features, where $m \ll p$, and restricts the split search to this subset, with the process repeated independently at every node. Originally introduced by Tin Kam Ho as a technique for constructing decision forests in high-dimensional problems, the method builds decision trees within randomly projected subspaces of the feature space to mitigate overfitting and enhance generalization. Ho's formulation emphasized building entire trees within fixed random subspaces using the full training set, which proved effective for datasets with thousands of features, such as handwritten digit recognition.

The primary motivation in random forests is to further decorrelate the trees by reducing the correlation in their predictions, particularly when features are highly correlated, leading to more independent error structures across the ensemble. This decorrelation helps lower the overall variance without substantially compromising individual tree accuracy. In implementation, the feature subset is selected without replacement at each node, ensuring that the chosen $m$ features are distinct, after which standard splitting criteria (e.g., Gini impurity for classification) are applied solely within this subset to find the best split. This per-node randomization, as adapted in random forests, contrasts with Ho's whole-tree subspace projection but achieves similar diversity benefits while allowing trees to adapt to different feature combinations as they grow deeper. The impact on split quality is a modest reduction in the optimality of individual splits compared to searching all $p$ features, but this promotes broader exploration of the feature space across the forest, resulting in improved ensemble performance on unseen data.

The size of the feature subset $m$ is a tunable hyperparameter that balances individual tree strength and inter-tree correlation. For classification tasks, a common choice is $m = \lfloor \sqrt{p} \rfloor$.
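The per-node selection step can be sketched in a few lines; the helper name candidate_features is hypothetical and only illustrates drawing $m = \lfloor \sqrt{p} \rfloor$ distinct feature indices, which a tree learner would then search for the best split.

    import numpy as np

    def candidate_features(p, rng):
        # Draw m = floor(sqrt(p)) distinct feature indices for one node's split search
        m = max(1, int(np.sqrt(p)))
        return rng.choice(p, size=m, replace=False)

    rng = np.random.default_rng(0)
    p = 100
    print(candidate_features(p, rng))   # e.g. 10 of the 100 features; redrawn at every node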