Random forest
A random forest, or random decision forest, is an ensemble learning method for classification, regression, and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees.[1][2] Random forests correct for decision trees' habit of overfitting to their training set.[3]: 587–588
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1] using the random subspace method,[2] which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.[4][5][6]
An extension of the algorithm was developed by Leo Breiman[7] and Adele Cutler,[8] who registered[9] "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).[10] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman[11] in order to construct a collection of decision trees with controlled variance.
History
The general method of random decision forests was first proposed by Salzberg and Heath in 1993,[12] with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995.[1] Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines[2] concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.[4][5][6]
The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman,[11] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho[2] was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Thomas G. Dietterich.[13]
The proper introduction of random forests was made in a paper by Leo Breiman,[7] which has become one of the world's most cited papers.[14] This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:
- Using out-of-bag error as an estimate of the generalization error.
- Measuring variable importance through permutation.
The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.
Algorithm
Preliminaries: decision tree learning
Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".[3]: 352
In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[3]: 587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.
Bagging
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:
- Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
- Train a classification or regression tree fb on Xb, Yb.
After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$
or by taking the plurality vote in the case of classification trees.
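A minimal sketch of this bagging loop, with scikit-learn decision trees as the base learners; the dataset, the number of trees, and all variable names are illustrative choices, not part of the original description:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

B = 50                      # number of bootstrap samples / trees
n = len(X)
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)          # sample n rows with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x_new = X[:3]                                  # stand-in for unseen points
# Regression: average the B individual tree predictions
y_hat = np.mean([t.predict(x_new) for t in trees], axis=0)
print(y_hat)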
This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x′:

$\sigma = \sqrt{\frac{\sum_{b=1}^{B} \left(f_b(x') - \hat{f}\right)^2}{B - 1}}$
The number B of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. B can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample.[15]
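One way to watch the out-of-bag error while growing the forest is scikit-learn's warm_start option together with oob_score_; the dataset and the grid of tree counts below are arbitrary examples rather than recommendations:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# warm_start=True lets us add trees to the same forest and re-check the OOB score
forest = RandomForestRegressor(warm_start=True, oob_score=True,
                               bootstrap=True, random_state=0)
for n_trees in [50, 100, 200, 400]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    # oob_score_ is R^2 on the out-of-bag samples, so 1 - oob_score_ tracks the OOB error
    print(n_trees, round(1.0 - forest.oob_score_, 4))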
The training and test error tend to level off after some number of trees have been fit.
From bagging to random forests
The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.[16]
Typically, for a classification problem with p features, √p (rounded down) features are used in each split.[3]: 592 For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default.[3]: 592 In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem.[3]: 592
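These defaults map onto the max_features and minimum-node-size settings of common implementations; a hedged illustration with scikit-learn (synthetic data, arbitrary tree counts, and min_samples_leaf used as a stand-in for the minimum node size):

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Xc, yc = make_classification(n_samples=300, n_features=16, random_state=0)
Xr, yr = make_regression(n_samples=300, n_features=15, noise=1.0, random_state=0)

# Classification: consider roughly sqrt(p) features at each split (here sqrt(16) = 4)
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                             random_state=0).fit(Xc, yc)

# Regression: p/3 features per split (here 15/3 = 5) and at least 5 samples per leaf
reg = RandomForestRegressor(n_estimators=200, max_features=1/3,
                            min_samples_leaf=5, random_state=0).fit(Xr, yr)
print(clf.score(Xc, yc), reg.score(Xr, yr))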
ExtraTrees
Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node.
Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are √p for classification and p for regression, where p is the number of features in the model.[17]
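For comparison with ordinary random forests, a brief sketch using scikit-learn's ExtraTrees estimators, which implement this scheme (no bootstrap by default, random cut-points instead of optimized ones); the dataset and settings are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# ExtraTrees: whole training sample per tree, randomized cut-points at each split
et = ExtraTreesClassifier(n_estimators=300, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0)

print("ExtraTrees CV accuracy:  ", cross_val_score(et, X, y, cv=5).mean())
print("RandomForest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())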
Random forests for high-dimensional data
The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:
- Prefiltering: Eliminate features that are mostly just noise.[18][19]
- Enriched Random Forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative.[20][21][22]
- Tree-weighted random forest (TWRF): Give more weight to more accurate trees.[23][24]
Properties
Variable importance
Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper[7] and is implemented in the R package randomForest.[8]
Permutation importance
To measure a feature's importance in a data set, a random forest is first trained on the data. During training, the out-of-bag error for each data point is recorded and averaged over the forest. (If bagging is not used during training, errors can instead be computed on an independent test set.)
After training, the values of the feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance for the feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.
Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.[25]
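A short example of the permutation procedure using scikit-learn's permutation_importance helper, which permutes each feature on a held-out split rather than on the out-of-bag samples (the independent-test-set variant mentioned above); the dataset and settings are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permute each feature 20 times on the test split and measure the accuracy drop
result = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)
for name, mean, std in zip(load_iris().feature_names,
                           result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")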
This method of determining variable importance has some drawbacks:
- When features have different numbers of values, random forests favor features with more values. Solutions to this problem include partial permutations[26][27][28] and growing unbiased trees.[29][30]
- If the data contain groups of correlated features of similar relevance, then smaller groups are favored over large groups.[31]
- If there are collinear features, the procedure may fail to identify important features. A solution is to permute groups of correlated features together.[32]
Mean decrease in impurity feature importance
This approach to feature importance for random forests considers as important the variables that produce large decreases in impurity during splitting.[33] It is described in the book Classification and Regression Trees by Leo Breiman[34] and is the default implementation in scikit-learn and R. The definition is:

$\text{unnormalized average importance}(x) = \frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{\substack{\text{node } j \in T_i \\ \text{split variable}(j) = x}} p_{T_i}(j) \, \Delta i_{T_i}(j)$

where
- $x$ is a feature
- $n_T$ is the number of trees in the forest
- $T_i$ is tree $i$
- $p_{T_i}(j) = \frac{n_j}{n}$ is the fraction of samples reaching node $j$
- $\Delta i_{T_i}(j)$ is the change in impurity in tree $T_i$ at node $j$.
As impurity measure for the samples falling in a node, statistics such as the Gini impurity, information entropy, or (for regression trees) mean squared error can be used.
The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
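In scikit-learn, this normalized mean decrease in impurity is exposed as feature_importances_, which already sums to 1 across features; a short illustration on the Iris data (estimator settings are arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(data.data, data.target)

# Impurity-based (MDI) importances, normalized so they sum to 1
for name, imp in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
print("sum:", np.round(forest.feature_importances_.sum(), 6))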
The scikit-learn default implementation can report misleading feature importance:[32]
- it favors high cardinality features
- it uses training statistics and so does not reflect a feature's usefulness for predictions on a test set[35]
Relationship to nearest neighbors
A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.[36] Both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set $\{(x_i, y_i)\}_{i=1}^{n}$ that make predictions $\hat{y}$ for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

$\hat{y} = \sum_{i=1}^{n} W(x_i, x') \, y_i$

Here, $W(x_i, x')$ is the non-negative weight of the i-th training point relative to the new point x' in the same tree. For any x', the weights for points $x_i$ must sum to 1. Weight functions are as follows:
- In k-NN, the weights are $W(x_i, x') = \frac{1}{k}$ if $x_i$ is one of the k points closest to x', and zero otherwise.
- In a tree, $W(x_i, x') = \frac{1}{k'}$ if $x_i$ is one of the k' points in the same leaf as x', and zero otherwise.
Since a forest averages the predictions of a set of m trees with individual weight functions $W_j$, its predictions are

$\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x') \, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i$
This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points $x_i$ sharing the same leaf in any tree $j$. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.[36]
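The weights above can be computed explicitly from a fitted scikit-learn forest via apply(), which returns the leaf index of each sample in each tree; this sketch (synthetic data, bagging disabled so each leaf value is the mean of the original training targets in that leaf) reproduces the forest prediction as a weighted average:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, bootstrap=False,  # no bagging, so weights match exactly
                               random_state=0).fit(X, y)

x_new = X[0:1]
train_leaves = forest.apply(X)        # shape (n_samples, n_trees): leaf id per tree
new_leaves = forest.apply(x_new)      # shape (1, n_trees)

# W_j(x_i, x') = 1/k'_j if x_i shares tree j's leaf with x'; forest weight = average over trees
same_leaf = (train_leaves == new_leaves)              # boolean (n_samples, n_trees)
per_tree_weights = same_leaf / same_leaf.sum(axis=0)  # each column sums to 1
weights = per_tree_weights.mean(axis=1)               # average over the m trees

print("weighted-neighborhood prediction:", float(weights @ y))
print("forest.predict:                  ", float(forest.predict(x_new)[0]))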
Unsupervised learning
As part of their construction, random forest predictors naturally lead to a dissimilarity measure among observations. One can analogously define dissimilarity between unlabeled data, by training a forest to distinguish original "observed" data from suitably generated synthetic data drawn from a reference distribution.[7][37] A random forest dissimilarity is attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. Random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. Random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.[38]
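A hedged sketch of this synthetic-data trick and the resulting proximity-based dissimilarity, in the spirit of the "Addcl 1" construction; the permutation scheme, dataset, and tree count below are illustrative choices rather than the exact published procedure:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = load_iris().data                         # treated here as unlabeled data

# Synthetic "reference" data: permute each column independently, destroying dependencies
X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_all = np.vstack([X, X_synth])
y_all = np.r_[np.zeros(len(X)), np.ones(len(X_synth))]   # observed vs. synthetic

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_all, y_all)

# Proximity: fraction of trees in which two observed points land in the same leaf
leaves = forest.apply(X)                                  # (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dissimilarity = np.sqrt(1.0 - prox)                       # one common transformation
print(dissimilarity.shape, dissimilarity[:3, :3].round(2))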
Variants
Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.[39][40][41] In cases where the relationship between the predictors and the target variable is linear, the base learners may have an equally high accuracy as the ensemble learner.[42][39]
Kernel random forest
In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.[43]
History
Leo Breiman[44] was the first person to notice the link between random forest and kernel methods. He pointed out that random forests trained using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon[45] established the connection between random forests and adaptive nearest neighbors, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani[46] proposed Kernel Random Forest (KeRF) and showed that it can empirically outperform state-of-the-art kernel methods. Scornet[43] first defined KeRF estimates and gave the explicit link between KeRF estimates and random forests. He also gave explicit expressions for kernels based on centered random forest[47] and uniform random forest,[48] two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.
Notations and definitions
Preliminaries: Centered forests
Centered forest[47] is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level $k$ is built, where $k \in \mathbb{N}$ is a parameter of the algorithm.
Uniform forest
Uniform forest[48] is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.
From random forest to KeRF
Given a training sample $\mathcal{D}_n = \{(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$ of $[0,1]^p \times \mathbb{R}$-valued independent random variables distributed as the independent prototype pair $(\mathbf{X}, Y)$, where $\mathbb{E}[Y^2] < \infty$. We aim at predicting the response $Y$, associated with the random variable $\mathbf{X}$, by estimating the regression function $m(\mathbf{x}) = \mathbb{E}[Y \mid \mathbf{X} = \mathbf{x}]$. A random regression forest is an ensemble of $M$ randomized regression trees. Denote by $m_n(\mathbf{x}, \Theta_j)$ the predicted value at point $\mathbf{x}$ by the $j$-th tree, where $\Theta_1, \ldots, \Theta_M$ are independent random variables, distributed as a generic random variable $\Theta$, independent of the sample $\mathcal{D}_n$. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate $m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}, \Theta_j)$. For regression trees, we have $m_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}$, where $A_n(\mathbf{x}, \Theta_j)$ is the cell containing $\mathbf{x}$, designed with randomness $\Theta_j$ and dataset $\mathcal{D}_n$, and $N_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}$.
Thus random forest estimates satisfy, for all $\mathbf{x} \in [0,1]^p$, $m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}$. Random regression forest has two levels of averaging, first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than that of observations which belong to less populated cells. In order to improve the random forest methods and compensate for the misestimation, Scornet[43] defined KeRF by

$\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^{M} N_n(\mathbf{x}, \Theta_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)},$

which is equal to the mean of the $Y_i$'s falling in the cells containing $\mathbf{x}$ in the forest. If we define the connection function of the $M$ finite forest as $K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)}$, i.e. the proportion of cells shared between $\mathbf{x}$ and $\mathbf{z}$, then almost surely we have $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{X}_i)}{\sum_{\ell=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{X}_\ell)}$, which defines the KeRF.
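A small numeric sketch of the connection function K_{M,n} and the resulting KeRF estimate, using a fitted scikit-learn forest as a stand-in for the randomized trees (so this only illustrates the formulas above, not Scornet's exact construction; data and parameters are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=3.0, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

x = X[10:11]                                  # query point
leaves_train = forest.apply(X)                # (n, M) leaf ids
leaves_x = forest.apply(x)                    # (1, M)

# Connection function K_{M,n}(x, X_i): fraction of trees whose cell contains both x and X_i
K = (leaves_train == leaves_x).mean(axis=1)   # shape (n,)

# KeRF estimate: kernel-weighted average of the responses
kerf_pred = (K @ y) / K.sum()
print("KeRF estimate:   ", round(float(kerf_pred), 3))
print("forest.predict():", round(float(forest.predict(x)[0]), 3))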
Centered KeRF
The construction of Centered KeRF of level $k$ is the same as for centered forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$; the corresponding kernel function, or connection function, is

$K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left(\frac{1}{d}\right)^{k} \prod_{j=1}^{d} \mathbf{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}, \quad \text{for all } \mathbf{x}, \mathbf{z} \in [0,1]^d.$
Uniform KeRF
Uniform KeRF is built in the same way as uniform forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$; the corresponding kernel function, or connection function, is

$K_k^{uf}(\mathbf{0}, \mathbf{x}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left(\frac{1}{d}\right)^{k} \prod_{m=1}^{d} \left(1 - |x_m| \sum_{j=0}^{k_m - 1} \frac{(-\ln |x_m|)^j}{j!}\right), \quad \text{for all } \mathbf{x} \in [0,1]^d.$
Properties
Relation between KeRF and random forest
Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:
Assume that there exist sequences $(a_n), (b_n)$ such that, almost surely, $a_n \le N_n(\mathbf{x}, \Theta) \le b_n$ and $a_n \le \frac{1}{M} \sum_{m=1}^{M} N_n(\mathbf{x}, \Theta_m) \le b_n$. Then almost surely, $|m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x})| \le \frac{b_n - a_n}{a_n} \, \tilde{m}_{M,n}(\mathbf{x}).$
Relation between infinite KeRF and infinite random forest
When the number of trees goes to infinity, we obtain the infinite random forest and the infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:
Assume that there exist sequences $(\varepsilon_n), (a_n), (b_n)$ such that, almost surely, $\mathbb{E}[N_n(\mathbf{x}, \Theta)] \ge 1$, $\mathbb{P}[a_n \le N_n(\mathbf{x}, \Theta) \le b_n \mid \mathcal{D}_n] \ge 1 - \varepsilon_n/2$, and $\mathbb{P}[a_n \le \mathbb{E}_\Theta[N_n(\mathbf{x}, \Theta)] \le b_n \mid \mathcal{D}_n] \ge 1 - \varepsilon_n/2$.
Then almost surely, $|m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x})| \le \frac{b_n - a_n}{a_n} \, \tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \left( \max_{1 \le i \le n} Y_i \right).$
Consistency results
Assume that $Y = m(\mathbf{X}) + \varepsilon$, where $\varepsilon$ is a centered Gaussian noise, independent of $\mathbf{X}$, with finite variance $\sigma^2 < \infty$. Moreover, $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $m$ is Lipschitz. Scornet[43] proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.
Consistency of centered KeRF
Providing $k \to \infty$ and $n/2^k \to \infty$, there exists a constant $C_1 > 0$ such that, for all $n$, $\mathbb{E}\left[\tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X})\right]^2 \le C_1 \, n^{-1/(3 + d \log 2)} (\log n)^2.$
Consistency of uniform KeRF
Providing $k \to \infty$ and $n/2^k \to \infty$, there exists a constant $C > 0$ such that $\mathbb{E}\left[\tilde{m}_n^{uf}(\mathbf{X}) - m(\mathbf{X})\right]^2 \le C \, n^{-2/(6 + 3 d \log 2)} (\log n)^2.$
Disadvantages
While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability of decision trees. Decision trees belong to a fairly small family of easily interpretable machine learning models, along with linear models, rule-based models, and attention-based models. This interpretability is one of the main advantages of decision trees: it allows developers to confirm that the model has learned realistic information from the data, and allows end-users to have trust and confidence in the decisions made by the model.[39][3] For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function.[39][49][50]
Another limitation of random forests is that if features are linearly correlated with the target, random forests may not enhance the accuracy of the base learner.[39][42] The same holds in problems with multiple categorical variables.[51]
See also
- Boosting – Ensemble learning method
- Decision tree learning – Machine learning algorithm
- Ensemble learning – Statistics and machine learning technique
- Gradient boosting – Machine learning technique
- Non-parametric statistics – Type of statistical analysis
- Randomized algorithm – Algorithm that employs a degree of randomness as part of its logic or procedure
References
- ^ a b c d Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.
- ^ a b c d Ho TK (1998). "The Random Subspace Method for Constructing Decision Forests" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832–844. Bibcode:1998ITPAM..20..832T. doi:10.1109/34.709601. S2CID 206420153.
- ^ a b c d e f g Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5.
- ^ a b Kleinberg E (1990). "Stochastic Discrimination" (PDF). Annals of Mathematics and Artificial Intelligence. 1 (1–4): 207–239. CiteSeerX 10.1.1.25.6750. doi:10.1007/BF01531079. S2CID 206795835. Archived from the original (PDF) on 2018-01-18.
- ^ a b Kleinberg E (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition". Annals of Statistics. 24 (6): 2319–2349. doi:10.1214/aos/1032181157. MR 1425956.
- ^ a b Kleinberg E (2000). "On the Algorithmic Implementation of Stochastic Discrimination" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 22 (5): 473–490. Bibcode:2000ITPAM..22..473K. CiteSeerX 10.1.1.33.4131. doi:10.1109/34.857004. S2CID 3563126. Archived from the original (PDF) on 2018-01-18.
- ^ a b c d Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32. Bibcode:2001MachL..45....5B. doi:10.1023/A:1010933404324.
- ^ a b Liaw A (16 October 2012). "Documentation for R package randomForest" (PDF). Retrieved 15 March 2013.
- ^ U.S. trademark registration number 3185828, registered 2006/12/19.
- ^ "RANDOM FORESTS Trademark of Health Care Productivity, Inc. - Registration Number 3185828 - Serial Number 78642027 :: Justia Trademarks".
- ^ a b Amit Y, Geman D (1997). "Shape quantization and recognition with randomized trees" (PDF). Neural Computation. 9 (7): 1545–1588. CiteSeerX 10.1.1.57.6069. doi:10.1162/neco.1997.9.7.1545. S2CID 12470146. Archived from the original (PDF) on 2018-02-05. Retrieved 2008-04-01.
- ^ Heath, D., Kasif, S. and Salzberg, S. (1993). k-DT: A multi-tree learning method. In Proceedings of the Second Intl. Workshop on Multistrategy Learning, pp. 138-149.
- ^ Dietterich, Thomas (2000). "An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization". Machine Learning. 40 (2): 139–157. doi:10.1023/A:1007607513941.
- ^ Helen Pearson; Heidi Ledford; Matthew Hutson; Richard Van Noorden (15 April 2025). "Exclusive: the most-cited papers of the twenty-first century". Nature. 640 (8059): 588–592. doi:10.1038/D41586-025-01125-9. ISSN 1476-4687. Wikidata Q135104889.
- ^ Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. pp. 316–321.
- ^ Ho, Tin Kam (2002). "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors" (PDF). Pattern Analysis and Applications. 5 (2): 102–112. doi:10.1007/s100440200009. S2CID 7415435. Archived from the original (PDF) on 2016-04-17. Retrieved 2015-11-13.
- ^ Geurts P, Ernst D, Wehenkel L (2006). "Extremely randomized trees" (PDF). Machine Learning. 63: 3–42. doi:10.1007/s10994-006-6226-1.
- ^ Dessi, N. & Milia, G. & Pes, B. (2013). Enhancing random forests performance in microarray data classification. Conference paper, 99-103. 10.1007/978-3-642-38326-7_15.
- ^ Ye, Y., Li, H., Deng, X., and Huang, J. (2008) Feature weighting random forest for detection of hidden web search interfaces. Journal of Computational Linguistics and Chinese Language Processing, 13, 387–404.
- ^ Amaratunga, D., Cabrera, J., Lee, Y.S. (2008) Enriched Random Forest. Bioinformatics, 24, 2010-2014.
- ^ Ghosh D, Cabrera J. (2022) Enriched random forest for high dimensional genomic data. IEEE/ACM Trans Comput Biol Bioinform. 19(5):2817-2828. doi:10.1109/TCBB.2021.3089417.
- ^ Amaratunga, D., Cabrera, J., Shkedy, Z. (2014). Exploration and Analysis of DNA Microarray and Other High-Dimensional Data. New York: John Wiley. Second Edition. doi:10.1002/9781118364505.
- ^ Winham, Stacey & Freimuth, Robert & Biernacka, Joanna. (2013). A weighted random forests approach to improve predictive performance. Statistical Analysis and Data Mining. 6. 10.1002/sam.11196.
- ^ Li, H. B., Wang, W., Ding, H. W., & Dong, J. (2010, 10-12 Nov. 2010). Trees weighting random forest method for classifying high-dimensional noisy data. Paper presented at the 2010 IEEE 7th International Conference on E-Business Engineering.
- ^ Zhu R, Zeng D, Kosorok MR (2015). "Reinforcement Learning Trees". Journal of the American Statistical Association. 110 (512): 1770–1784. doi:10.1080/01621459.2015.1036994. PMC 4760114. PMID 26903687.
- ^ Deng, H.; Runger, G.; Tuv, E. (2011). Bias of importance measures for multi-valued attributes and solutions. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.
- ^ Altmann A, Toloşi L, Sander O, Lengauer T (May 2010). "Permutation importance: a corrected feature importance measure". Bioinformatics. 26 (10): 1340–7. doi:10.1093/bioinformatics/btq134. PMID 20385727.
- ^ Piryonesi S. Madeh; El-Diraby Tamer E. (2020-06-01). "Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems". Journal of Transportation Engineering, Part B: Pavements. 146 (2): 04020022. doi:10.1061/JPEODX.0000175. S2CID 216485629.
- ^ Strobl C, Boulesteix AL, Augustin T (2007). "Unbiased split selection for classification trees based on the Gini index" (PDF). Computational Statistics & Data Analysis. 52: 483–501. CiteSeerX 10.1.1.525.3178. doi:10.1016/j.csda.2006.12.030.
- ^ Painsky A, Rosset S (2017). "Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance". IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (11): 2142–2153. arXiv:1512.03444. Bibcode:2017ITPAM..39.2142P. doi:10.1109/tpami.2016.2636831. PMID 28114007. S2CID 5381516.
- ^ Tolosi L, Lengauer T (July 2011). "Classification with correlated features: unreliability of feature ranking and solutions". Bioinformatics. 27 (14): 1986–94. doi:10.1093/bioinformatics/btr300. PMID 21576180.
- ^ a b "Beware Default Random Forest Importances". explained.ai. Retrieved 2023-10-25.
- ^ Ortiz-Posadas, Martha Refugio (2020-02-29). Pattern Recognition Techniques Applied to Biomedical Problems. Springer Nature. ISBN 978-3-030-38021-2.
- ^ Breiman, Leo (2017-10-25). Classification and Regression Trees. New York: Routledge. doi:10.1201/9781315139470. ISBN 978-1-315-13947-0.
- ^ https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html 31. Aug. 2023
- ^ a b Lin, Yi; Jeon, Yongho (2002). Random forests and adaptive nearest neighbors (Technical report). Technical Report No. 1055. University of Wisconsin. CiteSeerX 10.1.1.153.9168.
- ^ Shi, T.; Horvath, S. (2006). "Unsupervised Learning with Random Forest Predictors". Journal of Computational and Graphical Statistics. 15 (1): 118–138. CiteSeerX 10.1.1.698.2365. doi:10.1198/106186006X94072. JSTOR 27594168. S2CID 245216.
- ^ Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S (April 2005). "Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma". Modern Pathology. 18 (4): 547–57. doi:10.1038/modpathol.3800322. PMID 15529185.
- ^ a b c d e Piryonesi, S. Madeh; El-Diraby, Tamer E. (2021-02-01). "Using Machine Learning to Examine Impact of Type of Performance Indicator on Flexible Pavement Deterioration Modeling". Journal of Infrastructure Systems. 27 (2): 04021005. doi:10.1061/(ASCE)IS.1943-555X.0000602. ISSN 1076-0342. S2CID 233550030.
- ^ Prinzie, A.; Van den Poel, D. (2008). "Random Forests for multiclass classification: Random MultiNomial Logit". Expert Systems with Applications. 34 (3): 1721–1732. doi:10.1016/j.eswa.2007.01.029.
- ^ Prinzie, Anita (2007). "Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB". In Roland Wagner; Norman Revell; Günther Pernul (eds.). Database and Expert Systems Applications: 18th International Conference, DEXA 2007, Regensburg, Germany, September 3-7, 2007, Proceedings. Lecture Notes in Computer Science. Vol. 4653. pp. 349–358. doi:10.1007/978-3-540-74469-6_35. ISBN 978-3-540-74467-2.
- ^ a b Smith, Paul F.; Ganesh, Siva; Liu, Ping (2013-10-01). "A comparison of random forest regression and multiple linear regression for prediction in neuroscience". Journal of Neuroscience Methods. 220 (1): 85–91. doi:10.1016/j.jneumeth.2013.08.024. PMID 24012917. S2CID 13195700.
- ^ a b c d Scornet, Erwan (2015). "Random forests and kernel methods". arXiv:1502.03836 [math.ST].
- ^ Breiman, Leo (2000). "Some infinity theory for predictor ensembles". Technical Report 579, Statistics Dept. UCB.
- ^ Lin, Yi; Jeon, Yongho (2006). "Random forests and adaptive nearest neighbors". Journal of the American Statistical Association. 101 (474): 578–590. CiteSeerX 10.1.1.153.9168. doi:10.1198/016214505000001230. S2CID 2469856.
- ^ Davies, Alex; Ghahramani, Zoubin (2014). "The Random Forest Kernel and other kernels for big data from random partitions". arXiv:1402.4293 [stat.ML].
- ^ a b Breiman L, Ghahramani Z (2004). "Consistency for a simple model of random forests". Statistical Department, University of California at Berkeley. Technical Report (670). CiteSeerX 10.1.1.618.90.
- ^ a b Arlot S, Genuer R (2014). "Analysis of purely random forests bias". arXiv:1407.3939 [math.ST].
- ^ Sagi, Omer; Rokach, Lior (2020). "Explainable decision forest: Transforming a decision forest into an interpretable tree". Information Fusion. 61: 124–138. doi:10.1016/j.inffus.2020.03.013. S2CID 216444882.
- ^ Vidal, Thibaut; Schiffer, Maximilian (2020). "Born-Again Tree Ensembles". International Conference on Machine Learning. 119. PMLR: 9743–9753. arXiv:2003.11132.
- ^ Piryonesi, Sayed Madeh (November 2019). The Application of Data Analytics to Asset Management: Deterioration and Climate Change Adaptation in Ontario Roads (Doctoral dissertation) (Thesis).
Further reading
- Prinzie A, Poel D (2007). "Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB". Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 4653. p. 349. doi:10.1007/978-3-540-74469-6_35. ISBN 978-3-540-74467-2.
- Denisko D, Hoffman MM (February 2018). "Classification and interaction in random forests". Proceedings of the National Academy of Sciences of the United States of America. 115 (8): 1690–1692. Bibcode:2018PNAS..115.1690D. doi:10.1073/pnas.1800256115. PMC 5828645. PMID 29440440.
External links
- Random Forests classifier description (Leo Breiman's site)
- Liaw, Andy & Wiener, Matthew "Classification and Regression by randomForest" R News (2002) Vol. 2/3 p. 18 (Discussion of the use of the random forest package for R)
Introduction and Background
Definition and Overview
A random forest is a supervised ensemble learning algorithm for classification and regression tasks that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.[2] Each tree in the forest is built using a bootstrap sample of the training data and a random subset of features at each split to introduce diversity among the trees.[2] The core intuition behind random forests is to combine multiple weak learners—typically unpruned decision trees, which individually exhibit low bias but high variance—into a single robust model that mitigates overfitting by averaging out errors across the ensemble.[2] This approach leverages the law of large numbers, ensuring that as more trees are added, the ensemble's performance stabilizes without further degradation from overfitting.[2] Unlike a single decision tree, which can be highly sensitive to noise and prone to high variance due to its recursive partitioning, the random forest reduces this variance through aggregation, leading to improved generalization on unseen data.[2]

For prediction in classification, the final output is determined by majority voting across all trees; a brief pseudocode representation is:

For each tree in forest:
    prediction = tree.predict(input)
Aggregate predictions = mode of all predictions
Return aggregate
Historical Development
The concept of ensemble methods for decision trees emerged in the early 1990s, with Stephen Salzberg and colleagues exploring randomized approaches to improve tree-based classifiers. In 1993, Heath, Kasif, and Salzberg introduced methods for inducing oblique decision trees through randomized perturbations and heuristics, laying groundwork for diversifying tree structures in ensembles to enhance generalization. This work highlighted the potential of randomization to mitigate overfitting in single trees, influencing subsequent ensemble techniques.

A key milestone came in 1996 when Leo Breiman introduced bootstrap aggregating (bagging), a method for generating multiple versions of a predictor by training on bootstrap samples and aggregating their outputs to reduce variance.[6] In 1995, Tin Kam Ho developed the random subspace method, which constructs decision forests by building trees in randomly selected subspaces of the feature space, particularly effective for high-dimensional data.[7] Ho's approach addressed limitations in traditional decision trees by reducing correlation among ensemble members through feature subset randomization, demonstrating improved accuracy on pattern recognition tasks. Building on similar ideas, Amit and Geman in 1997 proposed randomized trees for shape recognition, incorporating bagging with random feature selection to create diverse classifiers from a large pool of possible features. Their method emphasized randomization at split points to handle complex, high-variety data like images.

The pivotal advancement occurred in 2001 with Leo Breiman's seminal paper "Random Forests", which formalized the algorithm by combining bootstrap aggregating (bagging) with random feature selection at each split, creating uncorrelated trees for robust predictions in classification and regression.[3] Breiman's formulation integrated prior ideas, such as Dietterich's 2000 study on randomization via perturbed splits to compare ensemble methods like bagging and boosting, which demonstrated benefits of randomization in reducing ensemble error.[8] By 2025, Breiman's paper had amassed over 167,000 citations, underscoring its transformative impact on machine learning.

Following Breiman's introduction, random forests saw rapid adoption in specialized fields, notably bioinformatics, where the 2002 release of the randomForest package for R facilitated gene expression analysis and classification tasks in genomics.[9] This tool enabled researchers to handle high-dimensional microarray data, marking an early evolution toward domain-specific applications. In the 2010s, integration into libraries like scikit-learn further broadened accessibility, solidifying random forests as a standard ensemble method across disciplines.[1]
Core Algorithm
Decision Tree Fundamentals
A decision tree is a predictive modeling tool that represents decisions and their possible consequences as a tree-like structure, consisting of internal nodes, branches, and leaf nodes. Internal nodes represent feature tests or splits on input variables, branches denote the outcome of those tests, and leaf nodes provide the final predictions, such as class labels for classification or mean values for regression. This hierarchical structure allows the tree to partition the input space into regions based on feature values, enabling interpretable and non-parametric modeling of complex relationships in data.[10]

The learning process for a decision tree involves recursive partitioning of the dataset, starting from the root node and repeatedly splitting subsets of data until a stopping criterion is met, such as maximum depth or minimum samples per leaf. At each internal node, an impurity measure quantifies the homogeneity of the data subset with respect to the target variable; common measures include the Gini index for classification, defined as $G = 1 - \sum_{k=1}^{K} p_k^2$, where $K$ is the number of classes and $p_k$ is the proportion of class $k$ in the subset, and information entropy, $H = -\sum_{k=1}^{K} p_k \log_2 p_k$. These measures evaluate how "pure" a split is, with lower impurity indicating better separation of classes. For regression, analogous measures like variance reduction are used.[10][11]

Splitting criteria are determined using a greedy algorithm that selects, at each node, the feature and threshold combination which maximizes the reduction in impurity across the resulting child nodes, often computed as the weighted impurity decrease $\Delta I = I(\text{parent}) - \frac{N_L}{N} I(\text{left}) - \frac{N_R}{N} I(\text{right})$, where $N$ is the size of the parent subset and $N_L$, $N_R$ are the sizes of the child subsets. This exhaustive search over features and possible thresholds ensures locally optimal splits but does not guarantee a globally optimal tree. Binary splits are typical in methods like CART, though multi-way splits appear in algorithms such as ID3.[10][11]

To mitigate overfitting, pruning techniques simplify the fully grown tree by removing branches that contribute little to predictive accuracy. Cost-complexity pruning, a post-pruning method, balances tree accuracy and size by minimizing the objective $R_\alpha(T) = R(T) + \alpha |T|$, where $R(T)$ is the resubstitution error (e.g., misclassification rate), $|T|$ is the number of terminal nodes, and $\alpha$ is a complexity parameter tuned via cross-validation. As $\alpha$ increases, suboptimal subtrees are pruned, producing a sequence of nested trees; the optimal $\alpha$ is selected to minimize error on validation data. This approach prevents the tree from capturing noise in the training set.[10]

Despite their strengths in interpretability and handling mixed data types, single decision trees suffer from high variance and a propensity for overfitting, leading to instability where small changes in the training data can produce substantially different trees. For instance, consider a toy binary classification dataset with features representing height and weight to predict gender; a single tree might split first on height > 170 cm, achieving near-perfect training accuracy but failing to generalize to a slightly perturbed test set with added noise, resulting in high error due to memorizing outliers rather than underlying patterns. This instability arises because the greedy splitting amplifies small data variations into divergent structures, particularly in deeper trees.[12][10]
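A brief sketch of these quantities: the Gini index and entropy computed by hand, plus scikit-learn's cost-complexity pruning path as one concrete implementation of the R_alpha(T) objective (the dataset and the particular alpha chosen are arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    # Gini index: 1 - sum_k p_k^2 for class proportions p_k
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Information entropy: -sum_k p_k log2 p_k (0 log 0 treated as 0)
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

X, y = load_iris(return_X_y=True)
print("root Gini:", round(gini(y), 3), " root entropy:", round(entropy(y), 3))

# Cost-complexity pruning: the sequence of alphas for R_alpha(T) = R(T) + alpha * |T|
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-3], random_state=0).fit(X, y)
print("leaves after pruning:", pruned.get_n_leaves())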
Bagging Procedure
The bagging procedure, short for bootstrap aggregating, forms the foundational ensemble technique in random forests by generating multiple bootstrap samples from the original training dataset to train diverse decision trees, thereby reducing prediction variance. Introduced by Leo Breiman in 1996, bagging involves resampling the dataset with replacement to create $B$ subsets, each used to fit an independent base learner, typically a decision tree, followed by aggregating their outputs to produce a final prediction.[12] This method leverages the instability of decision trees—where small changes in training data lead to significant variations in predictions—to stabilize the ensemble through averaging or voting.[12]

Bootstrap sampling is performed by drawing $n$ observations with replacement from the original dataset of size $n$, resulting in $B$ such subsets (where $B$ is the number of trees, often 100 or more). On average, approximately 63.2% of the original observations appear in each bootstrap sample, leaving about 36.8% out-of-bag (OOB), calculated as the probability that a specific observation is not selected: $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. These OOB samples provide an internal validation set for unbiased error estimation without separate holdout data. Each tree is then trained on its respective bootstrap sample, ensuring independence among the trees due to the randomness in sampling.[12]

Aggregation combines the predictions from the $B$ trees: for regression tasks, the final prediction is the average of the individual tree outputs, which smooths out fluctuations; for classification, it is the majority vote across the trees, selecting the class with the most endorsements. This aggregation step is crucial, as it exploits the law of large numbers: as $B \to \infty$, the aggregated predictor converges to the expected value of a single tree's prediction over the bootstrap distribution, theoretically reducing variance while preserving the low bias of the base learners. Breiman demonstrated that for unstable procedures like trees, bagging can substantially improve accuracy by mitigating overfitting to noise in any single training set.[12]

The algorithm proceeds in the following steps (a small numerical check of the bootstrap arithmetic follows the list):
- For $b = 1$ to $B$, generate a bootstrap sample by drawing $n$ observations with replacement from the training data.
- Train a decision tree on the $b$-th bootstrap sample, using the full set of features at each split (unlike the random subspace extension).
- For a new input, obtain predictions from all trees and aggregate them via averaging (regression) or majority vote (classification).
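A small numerical check of the bootstrap arithmetic above, using only NumPy (sample size and number of trials are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200

unique_fracs = []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)              # bootstrap: draw n indices with replacement
    unique_fracs.append(len(np.unique(sample)) / n)  # fraction of distinct training points

print("mean in-bag fraction:", round(float(np.mean(unique_fracs)), 4))        # ~0.632
print("theoretical OOB fraction (1 - 1/n)^n:", round((1 - 1 / n) ** n, 4))    # ~0.368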
Random Subspace Method
The random subspace method, a key component of random forests, involves selecting a random subset of features at each node during tree construction to determine potential splits, thereby introducing additional diversity among the ensemble members beyond data sampling techniques. This approach randomly chooses $m$ features out of the $p$ total available features, where $m < p$, and restricts the split search to this subset, with the process repeated independently at every node.[2] Originally introduced by Tin Kam Ho in 1995 as a technique for dimensionality reduction in high-dimensional classification problems, the method constructs decision trees within randomly projected subspaces of the feature space to mitigate overfitting and enhance generalization.[13] Ho's formulation emphasized building entire trees within fixed random subspaces using the full training set, which proved effective for datasets with thousands of features, such as handwritten digit recognition.[13]

The primary motivation in random forests is to further decorrelate the trees by reducing the correlation in their predictions, particularly when features are highly correlated, leading to more independent error structures across the ensemble.[2] This decorrelation helps in lowering the overall variance without substantially compromising individual tree accuracy. In implementation, the feature subset is selected without replacement at each node, ensuring that the chosen features are distinct, after which standard splitting criteria (e.g., Gini impurity for classification) are applied solely within this subset to find the best split. This per-node randomization, as adapted in random forests, contrasts with Ho's whole-tree subspace projection but achieves similar diversity benefits while allowing trees to adapt to different feature combinations as they grow deeper. The impact on split quality is a modest reduction in the optimality of individual splits compared to searching all features, but this trade-off promotes broader exploration of the feature space across the forest, resulting in improved ensemble performance on unseen data.[2]

The size $m$ of the feature subset is a tunable parameter that balances tree strength and inter-tree correlation. For classification tasks, a common choice is $m = \sqrt{p}$, while for regression, $m = p/3$ is often used to maintain higher individual tree accuracy.[2] For example, with $p = 100$ features in a classification problem, $m = \sqrt{100} = 10$, limiting each node's split search to just 10 randomly selected features. Compared to exhaustive search over all features, this introduces a slight increase in bias due to suboptimal splits but yields a substantial decrease in variance through reduced tree correlation, as empirically demonstrated by Breiman on datasets like sonar, where minimum error occurred at values of 4 to 8, outperforming single trees and bagging alone.[2]
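A toy sketch of the per-node mechanism under these conventions: draw m of p feature indices without replacement and search for the best split only among them (the variance-reduction criterion, the quantile grid of candidate thresholds, and all names below are illustrative simplifications, not part of the original method):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))                 # p = 100 features
y = X[:, 3] * 2.0 + rng.normal(scale=0.1, size=200)

p = X.shape[1]
m = int(np.sqrt(p))                             # classification-style default: m = sqrt(p) = 10
candidate_features = rng.choice(p, size=m, replace=False)

def split_score(x_col, y, threshold):
    # Variance reduction of splitting y by x_col <= threshold (larger is better)
    left, right = y[x_col <= threshold], y[x_col > threshold]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    return np.var(y) - (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)

best = max(((f, t, split_score(X[:, f], y, t))
            for f in candidate_features
            for t in np.quantile(X[:, f], [0.25, 0.5, 0.75])),
           key=lambda item: item[2])
print("best (feature, threshold, score) among the m candidates:", best)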
Forest Construction and Prediction
The random forest algorithm integrates bagging and the random subspace method to construct an ensemble of decision trees, enhancing predictive performance through diversity and averaging. Specifically, it builds B independent trees, each trained on a bootstrap sample drawn with replacement from the original dataset of size n, ensuring approximately 63% unique samples per tree on average. At each internal node of a tree, a random subset of m features is selected from the total p features, and the best split is chosen among those m candidates using standard criteria like Gini impurity for classification or mean squared error for regression. This randomization decorrelates the trees, reducing overfitting compared to single trees or plain bagging.[2]

The complete procedure for forest construction can be outlined in pseudocode as follows:

Algorithm BuildRandomForest(D, B, m):
Input: Dataset D with n samples and p features, number of trees B, features per split m
Output: Forest ensemble F = {T_1, T_2, ..., T_B}
F ← empty list
for k = 1 to B do:
D_k ← BootstrapSample(D) // Sample n instances with replacement
T_k ← BuildTree(D_k, m) // Grow tree with random feature subsets at nodes
Append T_k to F
return F
Algorithm BuildTree(D_k, m):
// Recursive function to grow unpruned tree
If stopping criterion met (e.g., all samples same class or min samples):
Return leaf node with majority class or mean target
Else:
Select m random features from p
Find best split among those m
Split D_k into child subsets
Left child ← BuildTree(left subset, m)
Right child ← BuildTree(right subset, m)
Return internal node with split
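For comparison, the same prediction step can be expressed against a fitted scikit-learn forest, whose estimators_ list plays a role analogous to the ensemble F returned by BuildRandomForest above; the dataset is synthetic and the mapping to sklearn is only illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=101, random_state=0).fit(X, y)

x_new = X[:5]
# Each fitted tree in forest.estimators_ plays the role of one T_k; aggregate by majority vote.
# Note: sub-trees predict encoded class indices (here identical to the labels 0/1), and
# sklearn itself averages class probabilities, so results can differ from a hard vote on ties.
votes = np.stack([tree.predict(x_new) for tree in forest.estimators_])   # shape (B, 5)
majority = np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
print("majority vote: ", majority)
print("forest.predict:", forest.predict(x_new))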
Theoretical Properties
Bias-Variance Tradeoff
In machine learning, the expected prediction error for a model can be decomposed into bias squared, variance, and irreducible noise: $\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$, where the bias $\mathbb{E}[\hat{f}(x)] - f(x)$ measures systematic deviation from the true function $f$, the variance captures sensitivity to training data fluctuations, and $\sigma^2$ is the inherent noise variance.[14] Single decision trees typically exhibit low bias but high variance due to their tendency to overfit by capturing noise in the training data, leading to unstable predictions across different samples.[15]

The bagging procedure in random forests addresses this high variance by averaging predictions from bootstrap-aggregated trees, reducing the ensemble variance approximately by a factor of $B$ when tree errors are uncorrelated: $\mathrm{Var}\left[\frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\right] = \frac{\sigma_{\text{tree}}^2}{B}$ for $B$ uncorrelated trees with per-tree variance $\sigma_{\text{tree}}^2$.[12] In practice, residual correlations among trees temper this reduction, but the averaging still substantially lowers variance without altering bias, as each tree is trained on similar bootstrap samples.[14] The additional randomization in feature selection at each split further decorrelates trees, slightly increasing bias by restricting the search space but yielding a net decrease in overall error, as the variance reduction outweighs the minor bias inflation.[15]

Empirically, the generalization error of random forests decreases monotonically as the number of trees increases, reflecting diminishing variance contributions, before plateauing at a level close to the irreducible error once correlations stabilize.[15] In the limit of an infinite number of trees, the random forest estimator converges in probability to the true regression function under certain conditions on the data distribution and tree construction, with variance approaching the residual correlation term $\rho \, \sigma_{\text{tree}}^2$ if the average pairwise tree correlation $\rho$ is positive.[2][15]
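To make the variance-reduction argument concrete, a small simulation (all datasets, seeds, and parameter values below are arbitrary, and scikit-learn's BaggingRegressor stands in for the plain bagging step) compares the spread of single-tree and bagged predictions at a fixed query point across repeated training sets:

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_data(n=200):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

x0 = np.array([[1.0]])            # fixed query point
single, bagged = [], []
for _ in range(30):               # repeated training sets -> sampling distribution of predictions
    X, y = sample_data()
    single.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])
    bagged.append(BaggingRegressor(DecisionTreeRegressor(),
                                   n_estimators=50, random_state=0).fit(X, y).predict(x0)[0])

print("variance of single-tree prediction:", round(float(np.var(single)), 4))
print("variance of bagged prediction:     ", round(float(np.var(bagged)), 4))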
Variable Importance
Variable importance in random forests quantifies the contribution of each feature to the predictive performance of the ensemble, aiding in feature selection and model interpretability. Two primary methods are employed: mean decrease in impurity (MDI), which measures the average reduction in node impurity attributed to a feature across the forest, and permutation importance, which assesses the impact of disrupting a feature's relationship with the target by randomly shuffling its values.[16]

Mean decrease in impurity (MDI), also known as Gini importance for classification tasks, computes the total impurity reduction from splits using a feature $x_j$, summed over all trees and then averaged. The formula is $\mathrm{MDI}(x_j) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s \in S_t(x_j)} \frac{n_s}{n} \left[ i(s) - \frac{n_{s_L}}{n_s} i(s_L) - \frac{n_{s_R}}{n_s} i(s_R) \right]$, where $T$ is the number of trees, $S_t(x_j)$ are the nodes in tree $t$ split on $x_j$, $i(\cdot)$ is the impurity measure (e.g., Gini index), $s_L$ and $s_R$ are left and right child nodes, $n_s$ denotes node size, and $n$ is the total number of training samples.[17] This method favors features that frequently appear at the top of trees, providing a fast, built-in assessment without additional computation.[17]

Permutation importance evaluates the degradation in model accuracy when a feature's values are randomly permuted in out-of-bag (OOB) samples, breaking its predictive association while preserving others. For stability, multiple permutations are averaged. The score is the relative increase in error, $\frac{\mathrm{err}_{\mathrm{permuted}} - \mathrm{err}_{\mathrm{OOB}}}{\mathrm{err}_{\mathrm{OOB}}}$, expressed as a percentage.[16] This approach uses OOB estimates for unbiased evaluation and is applicable to both classification and regression.[16]

MDI is computationally efficient but biased toward continuous variables or high-cardinality categorical features, which offer more split opportunities and thus higher impurity reductions, potentially inflating their importance.[18] Permutation importance avoids this bias, providing a more robust measure of true predictive power, though it is more expensive, scaling with the number of features and permutations required.[18][16]

Importances are often visualized via horizontal bar plots, with features ranked by score. For the Iris dataset, MDI typically ranks petal length and width highest, reflecting their strong separation of species, while sepal measurements show lower importance. To facilitate comparison, importances are standardized by normalizing them to sum to 1 across all features, converting raw scores into relative contributions.[17]
Out-of-Bag Error Estimation
In random forests, out-of-bag (OOB) samples arise during the bootstrap aggregation process, where approximately 37% of the training data instances are excluded from the bootstrap sample used to grow each individual decision tree, providing an independent evaluation set for that tree.[2] This exclusion fraction, derived from the bootstrap sampling with replacement, is expected to be $(1 - 1/n)^n \approx e^{-1} \approx 0.368$, ensuring that each tree is trained on a subset that does not include these OOB instances.[2]

The OOB error estimation leverages these samples to assess the ensemble's performance without requiring a separate validation or test set. For each data instance, predictions are obtained by averaging the outputs from all trees in the forest for which that instance was OOB, treating it as an internal cross-validation mechanism.[2] The overall OOB error is then computed as the average loss across all instances, formalized as

$\mathrm{OOB\ error} = \frac{1}{n} \sum_{i=1}^{n} L\!\left(y_i, \hat{y}_{\mathrm{OOB},i}\right),$

where $n$ is the number of instances, $L$ is the loss function (e.g., mean squared error for regression or misclassification rate for classification), $y_i$ is the true label, and $\hat{y}_{\mathrm{OOB},i}$ is the prediction aggregated from the relevant OOB trees.[2]

This approach offers several advantages as an unbiased alternative to traditional cross-validation, delivering reliable error estimates that closely correlate with out-of-sample test error, as demonstrated in empirical evaluations on benchmark datasets where OOB estimates tracked test errors with minimal bias.[2] In Breiman's original analysis, OOB estimates were consistently slightly higher than test set errors but exhibited strong agreement in trends across varying forest sizes and parameters.[2] OOB error estimation is particularly useful for practical applications such as hyperparameter tuning, where it guides the selection of parameters like the number of trees or feature subsets by monitoring performance improvements, and for early stopping during forest construction to prevent overfitting without excessive computational cost.[2]
Extensions and Variants
ExtraTrees
Extremely Randomized Trees (ExtraTrees) is a variant of random forests that introduces additional randomization in the tree construction process to enhance efficiency and generalization. Unlike standard decision trees or random forests, which optimize split points by exhaustively searching for the best threshold within each feature's range, ExtraTrees selects split thresholds randomly from the empirical range of the selected feature values at each node. This method builds an ensemble of such trees without using bootstrap sampling, instead employing the entire training dataset for every tree, further emphasizing randomization to decorrelate the trees.[19]

The core algorithmic difference lies in the split selection procedure: at each internal node, a random subset of features is chosen—typically K attributes, where K is a hyperparameter similar to that in random forests but often set larger (e.g., K = n for regression tasks, with n being the number of features)—and for each candidate feature, a split value is drawn uniformly at random from the observed values in the node's data subset, rather than optimizing for criteria like Gini impurity or mean squared error. This eliminates the computational overhead of exhaustive search, making tree growth significantly faster; the time complexity per tree is O(N log N) for N samples in balanced trees, with reduced constants compared to the O(N log N · m) typical of optimized splits in random forests, where m is the number of features. By drawing on the random subspace method for feature selection while extending randomization to split points, ExtraTrees produces trees that are less prone to overfitting due to the increased stochasticity.[19]

This approach yields several benefits, including reduced variance and thus lower overfitting, particularly in noisy environments, as the random splits average out irregularities in the data. Computationally, the lack of optimization enables faster training, scaling well to large datasets. Empirically, ExtraTrees has demonstrated superior performance over random forests and bagging on regression tasks across multiple benchmarks, outperforming on 22 out of 24 datasets tested. It is particularly advantageous for high-dimensional or noisy data, where the extra randomization helps mitigate the curse of dimensionality and noise sensitivity. Hyperparameters like the number of trees (T), minimum samples per leaf (n_min, often 5 for regression), and K control the trade-off between bias and variance, with larger K values promoting more thorough exploration at the cost of some added computation.[19]
Unsupervised Random Forests
Unsupervised Random Forests
Unsupervised random forests adapt the ensemble structure of standard random forests to tasks lacking labeled data, such as clustering and anomaly detection, by exploiting proximity measures derived from the tree structures. In this approach, a synthetic contrast class is generated (for example by sampling from the product of the empirical marginal distributions) and labeled differently from the observed data, turning the problem into a supervised classification task. A random forest is then constructed using bootstrap sampling and random feature selection at each split, as in the supervised case. The resulting proximity matrix captures the similarity between pairs of observations as the fraction of trees in which they co-occur in the same terminal node; this proximity can be transformed into a dissimilarity measure, such as $\sqrt{1 - \operatorname{prox}(i,j)}$, for downstream unsupervised analyses.[20]

For clustering, the dissimilarity matrix serves as input to hierarchical clustering or partitioning algorithms such as PAM (Partitioning Around Medoids). One common variant is the random partitioning forest (RPF), in which trees are built by randomly partitioning the feature space without relying on class labels, and the similarity between observations $i$ and $j$ is defined as the proportion of trees in which they end in the same leaf. This method has been applied in genomics to cluster tumor samples based on DNA microarray expression data, identifying clinically meaningful subgroups with significant survival differences (e.g., p = 4×10⁻⁹ for the random forest clusters versus p = 0.019 for Euclidean-distance clusters). To mitigate sensitivity to the initial synthetic labeling, multiple forests are often averaged, with recommendations of at least five forests of 2000 trees each.[20]

In anomaly detection, points with low average proximity to the rest of the dataset are flagged as outliers, as they rarely co-occur in terminal nodes with other points, indicating isolation in the feature space. Path length, the average depth at which a point is isolated in the trees, can also serve as an anomaly score, with shorter paths suggesting anomalies because such points are easier to isolate with random splits; this principle underpins related methods such as Isolation Forests but also applies directly to standard random forest proximities. Despite these strengths, unsupervised random forests have limitations, including a lack of rotation invariance in the proximity measure and the potential for spurious clusters depending on the synthetic label distribution. They are generally less accurate than dedicated unsupervised methods such as k-means or spectral clustering for pure clustering tasks.[20]
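A rough sketch of this recipe, assuming scikit-learn and SciPy (the column-permutation contrast, parameter values, and cluster count are illustrative assumptions, not the cited procedure): proximities are read off from shared leaf membership and converted into a dissimilarity for hierarchical clustering.

```python
# Rough sketch: derive a random-forest proximity matrix from unlabeled data by
# contrasting it with a column-permuted synthetic copy, then cluster on dissimilarity.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                      # stand-in for unlabeled data

# Synthetic contrast class: permute each feature independently to break correlations.
X_synth = np.column_stack([rng.permutation(col) for col in X.T])
X_all = np.vstack([X, X_synth])
y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_synth))])

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_all, y_all)

# Proximity of two real observations = fraction of trees placing them in the same leaf.
leaves = forest.apply(X)                            # shape (n_samples, n_trees)
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Dissimilarity sqrt(1 - proximity) feeds standard hierarchical clustering.
dissimilarity = np.sqrt(1.0 - proximity)
np.fill_diagonal(dissimilarity, 0.0)
labels = fcluster(linkage(squareform(dissimilarity), method="average"),
                  t=3, criterion="maxclust")
print(np.bincount(labels))
```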
Kernel Random Forests
Kernel random forests (KeRF) are a theoretical reformulation of random forests as kernel methods, enabling non-linear embeddings in reproducing kernel Hilbert spaces (RKHS). Introduced by Scornet in 2016, KeRF exploits the partitioning nature of decision trees to define a kernel that measures similarity through leaf co-occupancy, providing a bridge between ensemble tree methods and kernel-based machine learning. This perspective allows random forests to be analyzed through the lens of kernel regression, improving interpretability and connecting them to functional analysis. Subsequent work by Scornet and others between 2016 and 2019 explored variants such as centered and uniform KeRF, which differ in their split selection mechanisms and admit sharper theoretical guarantees.

In formal notation, a single decision tree grown on a sample of size $n$ with randomization $\Theta$ acts as a partition function, assigning each point $\mathbf{x}$ to a leaf (cell) $A_n(\mathbf{x}, \Theta)$. The KeRF kernel for a finite forest of $M$ trees is defined as the average indicator of co-occupancy,

$K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}\left[\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)\right],$

where $\mathbf{1}[\cdot]$ is the indicator function. In the limit of an infinite forest ($M \to \infty$), this converges to the expected kernel $K_n(\mathbf{x}, \mathbf{z}) = \mathbb{P}_{\Theta}\left[\mathbf{z} \in A_n(\mathbf{x}, \Theta)\right]$, yielding the infinite-forest regression estimate

$\tilde{m}_{\infty,n}(\mathbf{x}) = \frac{\sum_{i=1}^{n} y_i\, K_n(\mathbf{x}, \mathbf{x}_i)}{\sum_{\ell=1}^{n} K_n(\mathbf{x}, \mathbf{x}_\ell)}.$

The centered variant splits each cell at its midpoint along a randomly chosen coordinate, which leads to an explicit kernel for trees of depth $k$:

$K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{\substack{k_1, \ldots, k_d \ge 0 \\ k_1 + \cdots + k_d = k}} \frac{k!}{k_1! \cdots k_d!} \left(\frac{1}{d}\right)^{k} \prod_{j=1}^{d} \mathbf{1}\left[\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil\right].$

The uniform variant, in contrast, selects splits uniformly at random within each cell's range; its kernel lacks such a simple closed form, with explicit expressions being considerably more involved because of the additional randomization. KeRF kernels are positive semi-definite under mild conditions on the tree construction, since each tree's co-occupancy kernel is a sum of products of leaf indicator features, so they induce valid RKHS norms. Furthermore, the infinite KeRF regression function approximates the standard random forest predictor under assumptions such as Lipschitz continuity of the target function and a uniform distribution of the inputs on $[0,1]^d$, with the difference bounded by the forest's approximation error. For the centered KeRF this relation holds with an RKHS of functions that are piecewise constant on dyadic cells, matching the piecewise constant nature of tree predictions.

Consistency results for KeRF establish convergence to the true regression function. For the centered variant, Scornet (2016) derived a rate of order $n^{-1/(3+d\log 2)}(\log n)^2$ under Lipschitz assumptions, while the uniform variant achieves a rate of order $n^{-2/(6+3d\log 2)}(\log n)^2$. Subsequent work by Iakovidis and Arcozzi (2023) refines these rates for both variants under basic regularity conditions. Under stronger margin assumptions, such as a margin parameter controlling the separation between regression levels, both variants attain fast minimax-type rates, matching kernel ridge regression performance in low-noise regimes.
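As an informal illustration (not Scornet's construction), the finite-forest kernel can be estimated empirically from any fitted forest by counting leaf co-occupancy, and the corresponding KeRF-style prediction is then a kernel-weighted average of the training responses. The scikit-learn usage and all parameter values below are assumptions made for the sketch.

```python
# Minimal sketch of the finite-forest KeRF idea: the empirical kernel K_M(x, z) is the
# fraction of trees whose leaf containing x also contains z, and the KeRF-style
# prediction is the kernel-weighted average of training responses.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=400)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def kerf_predict(forest, X_train, y_train, X_query):
    """Kernel-style prediction from leaf co-occupancy (illustrative only)."""
    train_leaves = forest.apply(X_train)     # (n_train, n_trees)
    query_leaves = forest.apply(X_query)     # (n_query, n_trees)
    # K[i, j] = fraction of trees in which query point i and training point j share a leaf.
    K = (query_leaves[:, None, :] == train_leaves[None, :, :]).mean(axis=2)
    return K @ y_train / K.sum(axis=1)       # normalized kernel regression estimate

X_query = rng.uniform(size=(5, 3))
print("KeRF-style:", np.round(kerf_predict(forest, X, y, X_query), 3))
print("Forest:    ", np.round(forest.predict(X_query), 3))
```

The two outputs differ slightly because the standard forest averages per-tree means, whereas the KeRF estimate normalizes a single global kernel sum, which is precisely the distinction the reformulation makes explicit.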
Modern Extensions
Mondrian forests are an adaptive extension of random forests designed for online learning and streaming data. Introduced by Lakshminarayanan et al., these forests use Mondrian processes to partition the feature space dynamically, allowing trees to grow incrementally without full retraining, which improves efficiency on large-scale or evolving datasets.[21] Subsequent extensions in 2016 adapted Mondrian forests to nonparametric regression, providing well-calibrated uncertainty estimates that outperform approximate Gaussian processes on large regression tasks while remaining trainable online.[22]

To address interpretability challenges in random forests, recent integrations with post-hoc explanation methods such as SHAP and LIME enable local feature attribution for individual predictions. For instance, a 2023 study applied these techniques to random forest classifiers on diabetes datasets, revealing how specific symptoms contribute to risk scores and improving trust in healthcare applications through visualizations of model decisions.[23] Such approaches facilitate explainable AI by decomposing ensemble predictions into additive feature importance scores, which is particularly useful in regulated domains like medicine.[24]

Evolutions of isolation forests, originally developed for anomaly detection, have focused on scalability for high-dimensional data. The 2025 introduction of kernel signature isolation forests incorporates kernel methods to better isolate anomalies in non-linear spaces, with reported improvements in detection accuracy and computational overhead relative to standard isolation forests on large datasets. These advances enable efficient processing of millions of points, making them suitable for real-time fraud detection and cybersecurity.

Honest random forests, developed for causal inference, mitigate bias in treatment effect estimation by splitting the data into separate training and estimation subsets during tree construction. Proposed by Wager and Athey in 2017, this "honest" approach yields heterogeneous treatment effect estimates with valid inference, outperforming parametric methods in observational studies such as policy evaluations.[25]

Recent benchmarks highlight random forests' continued relevance, with results sometimes surpassing gradient boosting on specific tabular tasks. In a 2025 analysis of solar irradiance prediction, random forests achieved the highest accuracy among the ensemble methods compared, attributed to their robustness to noisy or discontinuous data patterns.
Applications and Implementations
Supervised Learning Applications
Random forests have been widely applied in supervised learning tasks, particularly classification and regression problems involving high-dimensional or noisy data. In classification, they predict categorical outcomes by aggregating the votes of many decision trees, reducing overfitting through bagging and random feature selection. For regression, random forests estimate continuous values by averaging tree outputs, making them suitable for forecasting scenarios where interpretability and robustness matter. These applications exploit the ensemble's ability to capture non-linear relationships and interactions without assuming a data distribution.

In bioinformatics, random forests have been used since the early 2000s for gene selection and disease classification, especially in cancer genomics with microarray data. A seminal approach proposed in 2006 uses random forests to rank genes by importance measures such as mean decrease in accuracy, enabling the identification of a small subset of predictive biomarkers from thousands of features while achieving high classification accuracy in multi-class tumor discrimination. Applied to colon cancer datasets, for instance, the method selected around 16-32 genes for models with error rates around 17-24%, outperforming single decision trees. More recent balanced iterative variants further improve performance on imbalanced genomic data by iteratively refining gene subsets for disease prediction.[26][27]

In finance, random forests support credit scoring and stock price prediction and are robust to the imbalanced datasets common in fraud or default detection. For credit risk assessment, models combining random forests with oversampling techniques such as SMOTE have improved prediction on skewed lending data, with one study reporting AUC scores up to 0.965 by prioritizing key financial and macroeconomic features. In stock prediction, random forests have been used to regress closing prices or classify trends from historical trading volumes and technical indicators, capturing non-linear market dynamics. Variable importance scores from these models aid interpretability, highlighting influential factors such as volatility indices in trading decisions.[28][29]

Remote sensing applications use random forests for land cover classification from satellite imagery, processing multispectral bands to map vegetation, urban areas, and water bodies. On Sentinel-2 data, random forest classifiers achieve overall accuracies of 85-95% in delineating land covers, outperforming single classifiers by integrating spatial and temporal features such as NDVI indices. A 2023 study across diverse ecosystems reported kappa coefficients above 0.80, attributing the success to the algorithm's handling of mixed pixels and class imbalance in imagery of forests and croplands.[30][31]

Illustrative examples highlight random forests' effectiveness on benchmark datasets. On the Iris dataset, a classic multi-class classification problem, random forest models routinely attain near-perfect accuracy in distinguishing species from sepal and petal measurements, with feature importance scores identifying petal length as the top predictor.
In spam detection, random forests classify emails with accuracies exceeding 97%, using importance rankings to prioritize textual features such as word frequencies and sender domains, which reveal patterns in phishing attempts.[32] Performance benchmarks confirm random forests' strength on tabular data, where they often rival or surpass other machine learning models in supervised tasks, particularly on small-to-medium datasets (n < 10,000), though neural networks may edge them out on very large datasets.[33]
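A brief sketch of the Iris example described above, assuming scikit-learn (the model settings are illustrative): cross-validated accuracy typically lands in the mid-to-high 90s, and the impurity-based importances usually rank the petal measurements first.

```python
# Sketch: random forest on the Iris benchmark with impurity-based feature importances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)

accuracy = cross_val_score(forest, iris.data, iris.target, cv=5).mean()
forest.fit(iris.data, iris.target)

print(f"5-fold accuracy: {accuracy:.3f}")
for name, importance in sorted(zip(iris.feature_names, forest.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name:>20s}: {importance:.3f}")
```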
Unsupervised and Other Applications
Random forests extend beyond supervised classification and regression to unsupervised settings, notably through variants like isolation forests for anomaly detection. In this approach, anomalies are identified by measuring the average path length required to isolate data points in randomized decision trees; shorter paths signify outliers, as anomalous instances are separated from the majority with fewer splits. This method, introduced in the isolation forest algorithm, efficiently handles high-dimensional data without assuming normality and has been applied in fraud detection, where it isolates unusual transactions like those indicative of credit card scams, outperforming traditional distance-based techniques in scalability and speed (a minimal usage sketch appears at the end of this subsection).

In survival analysis, random survival forests adapt the ensemble framework to handle right-censored data, enabling predictions of time-to-event outcomes. Developed by Ishwaran et al. in 2008, this method employs survival-specific splitting rules, such as the log-rank test, to grow trees and aggregates predictions to estimate the cumulative hazard function, which quantifies the instantaneous risk of the event occurring over time. The resulting ensemble provides non-parametric estimates of survival probabilities and hazard rates, useful in medical research for prognostic modeling, such as predicting patient survival in clinical trials.[34]

For ranking tasks, random forests support learning-to-rank algorithms by optimizing objectives tailored to ordinal predictions, such as pairwise comparisons or listwise approximations of metrics like normalized discounted cumulative gain. Extremely randomized trees, a random forest variant, have demonstrated strong performance in re-ranking documents for information retrieval, as seen in search engine applications where they prioritize query-relevant results by aggregating tree-based scores. This adaptation maintains the robustness of random forests while addressing the partial order inherent in ranking problems.[35]

In multi-label classification, random forests handle scenarios with multiple simultaneous labels by treating each label as an independent binary classification problem, aggregating votes from individual trees via majority voting per label to produce a set of predictions. This binary relevance decomposition, combined with the ensemble's feature selection, has been effectively used in domains like non-intrusive load monitoring, where appliances are identified by multiple electrical signatures simultaneously. The approach scales well to high-label spaces without requiring problem transformation beyond per-label ensembling.[36]

Emerging applications include detecting AI-generated text, where random forests classify content based on linguistic features such as vocabulary richness, sentence complexity, and stylometric patterns. Studies from 2023 to 2025 have shown random forests achieving accuracies above 90% in distinguishing human-authored from AI-produced text, leveraging features like perplexity and burstiness to capture subtle stylistic differences; for instance, one framework integrates these with support vector machines for hybrid detection in educational and journalistic contexts.[37]
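A hedged sketch of the anomaly-detection idea described at the start of this subsection, assuming scikit-learn's IsolationForest (data, contamination level, and tree count are illustrative): points with shorter average isolation paths receive lower scores and are flagged as outliers.

```python
# Sketch: anomaly scoring with scikit-learn's IsolationForest, where shorter
# average isolation paths translate into lower scores (more anomalous points).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))     # bulk of the data
outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))   # scattered anomalies
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=0).fit(X)
scores = iso.score_samples(X)   # lower score = shorter path = more anomalous
flags = iso.predict(X)          # -1 for predicted outliers, +1 for inliers

print("flagged as outliers:", int((flags == -1).sum()))
print("lowest-scoring points:", np.argsort(scores)[:5])
```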
Software Implementations
The randomForest package in R, developed by Liaw and Wiener, provides a foundational implementation of Breiman's random forest algorithm for both classification and regression tasks.[17] It includes core functions such as randomForest(), which builds an ensemble of decision trees using bootstrap sampling and random feature selection, and varImpPlot(), which visualizes variable importance based on mean decrease in impurity or permutation-based mean decrease in accuracy.[17] First published in 2002, the package remains widely used for its simplicity and integration with R's statistical ecosystem, and it supports out-of-bag error estimation and proximity-based analyses.
In Python, the scikit-learn library offers random forest implementations through the RandomForestClassifier and RandomForestRegressor classes, which construct ensembles of decision trees via bagging and feature subsampling.[1] Key parameters include n_estimators, which controls the number of trees in the forest (defaulting to 100), and max_features, which limits the number of features considered for splits at each node (commonly 'sqrt' for classification, while regression typically considers all features).[38] These estimators integrate with scikit-learn's preprocessing and evaluation pipelines, making them suitable for rapid prototyping in machine learning workflows.[1]
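An illustrative configuration using the parameters named above (the dataset, chosen values, and the added n_jobs setting are placeholders rather than recommended defaults):

```python
# Illustrative configuration: ensemble size, per-split feature subsampling, parallelism.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=40, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features considered per split (common classification choice)
    n_jobs=-1,            # fit trees in parallel across available CPU cores
    random_state=0,
)
print("accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))

# Regression analogue: recent scikit-learn defaults consider all features per split.
reg = RandomForestRegressor(n_estimators=100, max_features=1.0, n_jobs=-1, random_state=0)
```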
Other notable implementations include H2O's Distributed Random Forest (DRF), which scales random forests across clusters for big-data environments using in-memory processing.[39] XGBoost, while primarily a gradient boosting framework, can emulate random forest behavior by growing multiple parallel trees per boosting round, and its boosted ensembles often outperform traditional random forests in predictive accuracy on structured data thanks to sequential error correction. For Java users, the Weka toolkit provides the RandomForest classifier, which builds forests of unpruned trees with configurable bag size and attribute selection, emphasizing ease of use in graphical and programmatic data mining.[40]
Implementation notes highlight efficiency enhancements like parallel processing in scikit-learn, achieved via the joblib backend, which distributes tree fitting across CPU cores to reduce training time on multicore systems.[41] GPU acceleration has advanced in the 2020s through RAPIDS cuML, a drop-in replacement for scikit-learn that ports random forest operations to NVIDIA GPUs, enabling 10-50x speedups on large datasets by leveraging CUDA for tree building and inference.
Regarding benchmarks, scikit-learn's random forest scales effectively to datasets with millions of instances on standard hardware, handling up to 10 million rows in under an hour for moderate tree counts, though performance plateaus beyond tens of millions without distributed extensions.[42] H2O and RAPIDS further improve scalability for enterprise-scale data, with RAPIDS demonstrating superior throughput on GPU clusters for high-dimensional problems.[43]
