Ensemble learning
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.[1][2][3] Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
Overview
Supervised learning algorithms search through a hypothesis space to find a suitable hypothesis that will make good predictions with a particular problem.[4] Even if this space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form one which should be theoretically better.
Ensemble learning trains two or more machine learning algorithms on a specific classification or regression task. The algorithms within the ensemble model are generally referred to as "base models", "base learners", or "weak learners" in the literature. These base models can be constructed using a single modelling algorithm or several different algorithms. The idea is to train a diverse set of weak models on the same modelling task, such that the outputs of each weak learner have poor predictive ability (i.e., high bias), and among all weak learners, the outcome and error values exhibit high variance. Fundamentally, an ensemble learning model trains at least two high-bias (weak) and high-variance (diverse) models to be combined into a better-performing model. The set of weak models, which would not produce satisfactory predictive results individually, is combined or averaged to produce a single, high-performing, accurate, and low-variance model to fit the task as required.
Ensemble learning typically refers to bagging (bootstrap aggregating), boosting, or stacking/blending techniques to induce high variance among the base models. Bagging creates diversity by generating random samples from the training observations and fitting the same model to each different sample; such ensembles are also known as homogeneous parallel ensembles. Boosting follows an iterative process by sequentially training each base model on the up-weighted errors of the previous base model, producing an additive model to reduce the final model errors; this is also known as sequential ensemble learning. Stacking or blending consists of different base models, each trained independently (i.e., diverse/high variance) to be combined into the ensemble model, producing a heterogeneous parallel ensemble. Common applications of ensemble learning include random forests (an extension of bagging), boosted tree models, and gradient-boosted tree models. Models in applications of stacking are generally more task-specific, such as combining clustering techniques with other parametric and/or non-parametric techniques.[5]
Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. On the other hand, the alternative is to do a lot more learning with one non-ensemble model. For the same increase in compute, storage, or communication resources, spreading that increase across two or more methods may improve overall accuracy more than spending it on a single method. Fast algorithms such as decision trees are commonly used in ensemble methods (e.g., random forests), although slower algorithms can benefit from ensemble techniques as well.
By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.
Ensemble theory
Empirically, ensembles tend to yield better results when there is significant diversity among the models.[6][7] Many ensemble methods, therefore, seek to promote diversity among the models they combine.[8][9] Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[10] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb down the models in order to promote diversity.[11] It is possible to increase diversity in the training stage of the model using correlation for regression tasks[12] or using information measures such as cross entropy for classification tasks.[13]

Theoretically, the emphasis on diversity can be justified because the lower bound on the error rate of an ensemble system can be decomposed into an accuracy term, a diversity term, and a remaining term.[14]
The geometric framework
Ensemble learning, including both regression and classification tasks, can be explained using a geometric framework.[15] Within this framework, the output of each individual classifier or regressor for the entire dataset can be viewed as a point in a multi-dimensional space. Additionally, the target result is also represented as a point in this space, referred to as the "ideal point."
The Euclidean distance is used as the metric to measure both the performance of a single classifier or regressor (the distance between its point and the ideal point) and the dissimilarity between two classifiers or regressors (the distance between their respective points). This perspective transforms ensemble learning into a deterministic problem.
For example, within this geometric framework, it can be proved that averaging the outputs (scores) of all base classifiers or regressors leads to results equal to or better than the average of all the individual models. It can also be proved that, if the optimal weighting scheme is used, a weighted averaging approach can outperform any of the individual classifiers or regressors that make up the ensemble, or at least perform as well as the best of them.
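The averaging claim can be checked numerically. The sketch below is only an illustration under assumed data: the score vectors and the helper names `euclidean` and `average_point` are hypothetical, not taken from the cited framework. It treats each model's scores on a four-example dataset as a point, and compares the distance from the averaged point to the ideal point with the mean of the individual distances.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_point(points):
    """Coordinate-wise mean of several points (simple output averaging)."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

# Hypothetical scores of three base models on a four-example dataset.
ideal = [1.0, 0.0, 1.0, 1.0]
models = [
    [0.9, 0.4, 0.6, 0.7],
    [0.6, 0.1, 0.9, 0.4],
    [0.8, 0.3, 0.5, 0.9],
]

individual = [euclidean(m, ideal) for m in models]
mean_individual = sum(individual) / len(individual)
ensemble = euclidean(average_point(models), ideal)

print(f"mean individual distance:   {mean_individual:.3f}")
print(f"averaged-ensemble distance: {ensemble:.3f}")
```

The averaged point can never be farther from the ideal point than the mean individual distance, because the Euclidean norm satisfies the triangle inequality.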
Ensemble size
While the number of component classifiers of an ensemble has a great impact on the accuracy of prediction, only a limited number of studies address this problem. Determining the ensemble size a priori, together with the volume and velocity of big data streams, makes this even more crucial for online ensemble classifiers. Statistical tests have mostly been used to determine the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer than this number of classifiers would deteriorate the accuracy. This is called "the law of diminishing returns in ensemble construction." The framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy.[16][17]
Common types of ensembles
Bayes optimal classifier
The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it.[18] The naive Bayes classifier is a version of this that assumes that the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes optimal classifier can be expressed with the following equation:
$$y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)$$
where $y$ is the predicted class, $C$ is the set of all possible classes, $H$ is the hypothesis space, $P$ refers to a probability, and $T$ is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in $H$. The hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in $H$).
This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior:
$$P(h_i \mid T) \propto P(T \mid h_i)\, P(h_i)$$
hence,
$$y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(h_i \mid T)$$
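As a small worked illustration, the posterior-weighted vote can be computed by hand for a tiny hypothesis space. The three hypotheses, their posteriors, and the function name `bayes_optimal` are assumptions made up for this sketch. Each class is scored by summing, over hypotheses, the probability the hypothesis assigns to that class times the posterior of the hypothesis given the training data.

```python
# Assumed posteriors P(h | T) for three hypothetical hypotheses (they sum to 1).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Assumed P(c | h): here each hypothesis predicts one class with certainty.
class_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(classes, posterior, class_given_h):
    """Return the class with the largest posterior-weighted vote, plus all scores."""
    scores = {
        c: sum(class_given_h[h][c] * posterior[h] for h in posterior)
        for c in classes
    }
    return max(scores, key=scores.get), scores

prediction, scores = bayes_optimal(["+", "-"], posterior, class_given_h)
print(prediction, scores)
```

Although h1 is the single most probable hypothesis and predicts "+", the combined vote favours "-", which shows that the ensemble's answer need not match that of any one best hypothesis.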
Bootstrap aggregating (bagging)
Bootstrap aggregation (bagging) involves training an ensemble on bootstrapped data sets. A bootstrapped set is created by sampling from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features (e.g., at the nodes of a decision tree), to encourage exploration of diverse features.[19] The variance of local information in the bootstrap sets and feature considerations promote diversity in the ensemble, and can strengthen the ensemble.[20] To reduce overfitting, a member can be validated using the out-of-bag set (the examples that are not in its bootstrap set).[21]
Inference is done by voting on the predictions of the ensemble members, called aggregation. For example, in an ensemble of four decision trees, the query example is classified by each tree; if three of the four predict the positive class, the ensemble's overall classification is positive. Random forests are a common application of bagging.
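The procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the base learner is an assumed one-dimensional threshold classifier ("stump") rather than a full decision tree, the data are synthetic, and five members are used so that a two-class vote cannot tie.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_members, rng):
    """Train each member on its own bootstrap set; record its out-of-bag indices."""
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(len(data)) for _ in data]       # draw with replacement
        out_of_bag = sorted(set(range(len(data))) - set(idx))  # examples never drawn
        threshold = fit_stump([data[i] for i in idx])
        members.append((threshold, out_of_bag))
    return members

def bagging_predict(members, x):
    """Aggregate member predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold, _ in members)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x / 10, int(x >= 10)) for x in range(20)]  # synthetic: label 1 when x >= 1.0
members = bagging_fit(data, n_members=5, rng=rng)
print(bagging_predict(members, 1.9), bagging_predict(members, 0.0))
```

The out-of-bag indices stored with each member are the examples that could be used to validate that member without touching its own training sample.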

Boosting
Boosting involves training successive models by emphasizing training data misclassified by previously learned models. Initially, all data (D1) has equal weight and is used to learn a base model M1. The examples misclassified by M1 are assigned a weight greater than correctly classified examples. This boosted data (D2) is used to train a second base model M2, and so on. Inference is done by voting.
In some cases, boosting has yielded better accuracy than bagging, but it tends to overfit more. The most common implementation of boosting is AdaBoost, but some newer algorithms are reported to achieve better results.[citation needed]
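The reweighting loop can be illustrated with an AdaBoost-style sketch. The choices here are assumptions for illustration only: labels are encoded as +1/-1, the base model is a one-dimensional threshold stump, the vote of each model is weighted by the usual AdaBoost coefficient, and the toy data are invented so that no single threshold separates the classes.

```python
import math

def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            preds = [polarity if x > threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n                      # D1: every example has equal weight
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this model
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified examples, down-weight the rest, then renormalise.
        preds = [polarity if x > threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x > thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1, 1, -1, -1, 1, 1, -1, -1]                # not separable by one threshold
model = adaboost(xs, ys, rounds=3)
training_accuracy = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)
print(training_accuracy)
```

On this toy set the best single stump misclassifies two of the eight examples, while the weighted vote of three stumps fits all of them.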
Bayesian model averaging
Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data.[22] BMA is known to generally give better answers than a single model, obtained, e.g., via stepwise regression, especially where very different models have nearly identical performance in the training set but may otherwise perform quite differently.
The question with any use of Bayes' theorem is the prior, i.e., the probability (perhaps subjective) that each model is the best to use for a given purpose. Conceptually, BMA can be used with any prior. R packages ensembleBMA[23] and BMA[24] use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995).[25] R package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models as well as priors over the coefficients.[26]
The difference between BIC and AIC is the strength of preference for parsimony. BIC's penalty for model complexity is $\ln(n)\,k$, while AIC's is $2k$, where $k$ is the number of estimated parameters and $n$ is the sample size. Large-sample asymptotic theory establishes that if there is a best model, then with increasing sample sizes, BIC is strongly consistent, i.e., will almost certainly find it, while AIC may not, because AIC may continue to place excessive posterior probability on models that are more complicated than they need to be. On the other hand, AIC and AICc are asymptotically "efficient" (i.e., minimum mean square prediction error), while BIC is not.[27]
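The different penalties are easy to see in a short numerical sketch. The log-likelihoods and parameter counts below are hypothetical, and the weights use the common approximation that, under equal prior model probabilities, each model's posterior probability is proportional to exp(-BIC/2). This is an assumption about how BIC-implied weights are typically formed, not a description of any particular R package's internals.

```python
import math

def bic(log_likelihood, k, n):
    """BIC = ln(n) * k - 2 * log-likelihood."""
    return k * math.log(n) - 2.0 * log_likelihood

def aic(log_likelihood, k):
    """AIC = 2 * k - 2 * log-likelihood."""
    return 2.0 * k - 2.0 * log_likelihood

def bic_weights(bics):
    """Approximate posterior model probabilities, proportional to exp(-BIC / 2)."""
    best = min(bics)                      # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical fits: (maximised log-likelihood, number of parameters k).
n = 200
fits = [(-310.0, 2), (-305.0, 4), (-304.5, 7)]
bics = [bic(ll, k, n) for ll, k in fits]
aics = [aic(ll, k) for ll, k in fits]
weights = bic_weights(bics)

print("BIC:", [round(b, 2) for b in bics])
print("AIC:", [round(a, 2) for a in aics])
print("BMA weights:", [round(w, 3) for w in weights])
```

With these numbers BIC ranks the two-parameter model first while AIC prefers the four-parameter model, reflecting BIC's stronger preference for parsimony once ln(n) exceeds 2.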
Haussler et al. (1994) showed that when BMA is used for classification, its expected error is at most twice the expected error of the Bayes optimal classifier.[28] Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and popularizing the methodology.[29] The availability of software, including other free open-source packages for R beyond those mentioned above, helped make the methods accessible to a wider audience.[30]
Bayesian model combination
Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weights drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. BMC has been shown to be better on average (with statistical significance) than BMA and bagging.[31]
Use of Bayes' law to compute model weights requires computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model-space, but this is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily complex method for doing model selection.
The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution.
The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
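The approximation described in the last sentence can be sketched as follows. Everything here is illustrative: the held-out predictions are invented, mean squared error stands in for the validation score, and the simplex vertices are added to the candidate list (an extra choice made for this sketch) so that the selected weighting is never worse on the held-out data than the best single model. Dirichlet draws with uniform parameters are obtained by normalising gamma variates.

```python
import random

def sample_dirichlet(k, rng, alpha=1.0):
    """Draw one weight vector from a symmetric Dirichlet via normalised gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_squared_error(weights, model_preds, targets):
    """Held-out error of the weighted combination of model predictions."""
    err = 0.0
    for i, t in enumerate(targets):
        combined = sum(w * preds[i] for w, preds in zip(weights, model_preds))
        err += (combined - t) ** 2
    return err / len(targets)

def select_weighting(model_preds, targets, n_samples, rng):
    """Pick the best weighting among the simplex vertices and random samples."""
    k = len(model_preds)
    candidates = [[1.0 if i == j else 0.0 for i in range(k)] for j in range(k)]
    candidates += [sample_dirichlet(k, rng) for _ in range(n_samples)]
    return min(candidates,
               key=lambda w: mean_squared_error(w, model_preds, targets))

rng = random.Random(0)
targets = [1.0, 0.0, 1.0, 0.0]
model_preds = [                      # hypothetical held-out predictions
    [1.2, 0.2, 1.2, 0.2],            # model A over-predicts
    [0.8, -0.2, 0.8, -0.2],          # model B under-predicts
    [0.5, 0.5, 0.5, 0.5],            # model C is uninformative
]
best = select_weighting(model_preds, targets, n_samples=200, rng=rng)
best_error = mean_squared_error(best, model_preds, targets)
print([round(w, 3) for w in best], round(best_error, 5))
```

Selecting only among the vertices corresponds to picking one model from a bucket, whereas the sampled interior points allow a combination of models to win.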
Bucket of models
A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.
The most common approach used for model-selection is cross-validation selection (sometimes called a "bake-off contest"). It is described with the following pseudo-code:
For each model m in the bucket:
    Do c times (where c is some constant):
        Randomly divide the training dataset into two sets: A and B
        Train m with A
        Test m with B
Select the model that obtains the highest average score
Cross-Validation Selection can be summed up as: "try them all with the training set, and pick the one that works best".[32]
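A runnable version of the pseudo-code might look like the sketch below. The two toy learners, the 50/50 split controlled by the hypothetical `train_fraction` parameter, and the use of accuracy as the score are assumptions made for illustration.

```python
import random

def cross_validation_select(bucket, dataset, c, rng, train_fraction=0.5):
    """bucket maps a model name to a train(rows) -> predictor function."""
    averages = {}
    for name, train in bucket.items():
        scores = []
        for _ in range(c):
            rows = dataset[:]
            rng.shuffle(rows)                      # random split into A and B
            cut = int(len(rows) * train_fraction)
            a, b = rows[:cut], rows[cut:]
            predictor = train(a)                   # train m with A
            scores.append(sum(predictor(x) == y for x, y in b) / len(b))  # test on B
        averages[name] = sum(scores) / c
    return max(averages, key=averages.get), averages

def train_majority(rows):
    """Always predict the most frequent training label."""
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def train_threshold(rows):
    """Predict 1 above the midpoint between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:                         # only one class seen: fall back
        return train_majority(rows)
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > mid)

rng = random.Random(0)
dataset = [(x, int(x >= 10)) for x in range(20)]   # synthetic: label 1 when x >= 10
bucket = {"majority": train_majority, "threshold": train_threshold}
best, averages = cross_validation_select(bucket, dataset, c=5, rng=rng)
print(best, averages)
```

Because every model is scored on data it was not trained on, the comparison favours models that generalise rather than those that merely memorise the training split.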
Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide which of the models in the bucket is best-suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket.
When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.[33]
Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble
The most common approach for training a classifier is to use the cross-entropy cost function. However, one would like to train an ensemble of models that are diverse, so that combining them provides the best results.[34][35] Assuming a simple ensemble that averages $K$ classifiers, the amended cross-entropy cost is
$$e^k = H(p, q^k) - \frac{\lambda}{K} \sum_{j \neq k} H(q^j, q^k)$$
where $e^k$ is the cost function of the $k$-th classifier, $q^k$ is the probability of the $k$-th classifier, $p$ is the true probability that we need to estimate, and $\lambda$ is a parameter between 0 and 1 that defines the diversity that we would like to establish. When $\lambda = 0$, we want each classifier to do its best regardless of the ensemble, and when $\lambda = 1$, we would like the classifier to be as diverse as possible.
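A minimal numerical sketch of such a cost follows. It assumes the cost for classifier k is its ordinary cross-entropy against the true distribution, minus a diversity parameter (called `lam` here) divided by the number of classifiers, times the summed cross-entropy between every other classifier's output and classifier k's output. The function names and the example distributions are hypothetical.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), with q clipped away from zero."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def amended_cross_entropy(k, p, outputs, lam):
    """Assumed form: H(p, q_k) - (lam / K) * sum over j != k of H(q_j, q_k)."""
    K = len(outputs)
    diversity = sum(cross_entropy(outputs[j], outputs[k])
                    for j in range(K) if j != k)
    return cross_entropy(p, outputs[k]) - (lam / K) * diversity

p = [1.0, 0.0]                                     # true distribution (one-hot)
outputs = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]     # hypothetical classifier outputs
plain = amended_cross_entropy(0, p, outputs, lam=0.0)
diverse = amended_cross_entropy(0, p, outputs, lam=1.0)
print(round(plain, 4), round(diverse, 4))
```

With the parameter at zero the cost reduces to plain cross-entropy, and larger values reward a classifier for disagreeing with the other ensemble members.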
Stacking
Stacking (sometimes called stacked generalization) involves training a model to combine the predictions of several other learning algorithms. First, all of the other algorithms (base estimators) are trained using the available data. Then a combiner algorithm (final estimator) is trained to make a final prediction, using either all the predictions of the base estimators as additional inputs or cross-validated predictions from the base estimators, which can prevent overfitting.[36] If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a logistic regression model is often used as the combiner.
Stacking typically yields performance better than any single one of the trained models.[37] It has been successfully used on both supervised learning tasks (regression,[38] classification and distance learning[39]) and unsupervised learning (density estimation).[40] It has also been used to estimate bagging's error rate.[3][41] It has been reported to outperform Bayesian model averaging.[42] The two top performers in the Netflix competition utilized blending, which may be considered a form of stacking.[43]
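The cross-validated variant can be sketched for a small regression problem. The setup is assumed for illustration: two simple base estimators (a constant predictor and a one-dimensional least-squares line), three interleaved folds, synthetic data, and a linear least-squares combiner without intercept instead of the logistic regression that is common for classification.

```python
def fit_constant(rows):
    """Base estimator 1: always predict the training mean."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Base estimator 2: one-dimensional least-squares line."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def out_of_fold_predictions(fit, rows, folds):
    """Predict each row with a model trained on the other folds only."""
    preds = [None] * len(rows)
    for f in range(folds):
        model = fit([r for i, r in enumerate(rows) if i % folds != f])
        for i, (x, _) in enumerate(rows):
            if i % folds == f:
                preds[i] = model(x)
    return preds

def fit_combiner(z1, z2, ys):
    """Least-squares weights (w1, w2) for y ~ w1*z1 + w2*z2 via normal equations."""
    a11 = sum(a * a for a in z1)
    a12 = sum(a * b for a, b in zip(z1, z2))
    a22 = sum(b * b for b in z2)
    b1 = sum(a * y for a, y in zip(z1, ys))
    b2 = sum(b * y for b, y in zip(z2, ys))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def sse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys))

# Synthetic data: a line plus a small alternating disturbance.
rows = [(float(x), 2.0 * x + 1.0 + (0.5 if x % 2 else -0.5)) for x in range(12)]
ys = [y for _, y in rows]
z1 = out_of_fold_predictions(fit_constant, rows, folds=3)
z2 = out_of_fold_predictions(fit_line, rows, folds=3)
w1, w2 = fit_combiner(z1, z2, ys)
stacked = [w1 * a + w2 * b for a, b in zip(z1, z2)]
print(round(w1, 3), round(w2, 3), round(sse(stacked, ys), 3))
```

Training the combiner on out-of-fold predictions means it only sees base-model outputs for examples those models were not fitted on, which is the overfitting safeguard mentioned above.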
Voting
Voting is another form of ensembling. See e.g. Weighted majority algorithm (machine learning).
Implementations in statistics packages
- R: at least three packages offer Bayesian model averaging tools,[44] including the BMS (an acronym for Bayesian Model Selection) package,[45] the BAS (an acronym for Bayesian Adaptive Sampling) package,[46] and the BMA package.[47]
- Python: scikit-learn, a package for machine learning in Python, offers tools for ensemble learning, including bagging, voting, and averaging methods.
- MATLAB: classification ensembles are implemented in Statistics and Machine Learning Toolbox.[48]
Ensemble learning applications
In recent years, growing computational power has made it possible to train large ensembles in a reasonable time frame, and the number of ensemble learning applications has grown accordingly.[49] Some of the applications of ensemble classifiers include:
Remote sensing
Land cover mapping
Land cover mapping is one of the major applications of Earth observation satellite sensors, using remote sensing and geospatial data to identify the materials and objects located on the surface of target areas. Generally, the classes of target materials include roads, buildings, rivers, lakes, and vegetation.[50] Several ensemble learning approaches, based on artificial neural networks,[51] kernel principal component analysis (KPCA),[52] decision trees with boosting,[53] random forest,[50][54] and automatic design of multiple classifier systems,[55] have been proposed to efficiently identify land cover objects.
Change detection
Change detection is an image analysis problem, consisting of the identification of places where the land cover has changed over time. Change detection is widely used in fields such as urban growth, forest and vegetation dynamics, land use, and disaster monitoring.[56] The earliest applications of ensemble classifiers in change detection were designed with majority voting,[57] Bayesian model averaging,[58] and the maximum posterior probability.[59] Given the growth of satellite data over time, the past decade has seen more use of time series methods for continuous change detection from image stacks.[60] One example is a Bayesian ensemble changepoint detection method called BEAST, with the software available as a package Rbeast in R, Python, and Matlab.[61]
Computer security
Distributed denial of service
Distributed denial of service is one of the most threatening cyber-attacks that may happen to an internet service provider.[49] By combining the output of single classifiers, ensemble classifiers reduce the total error of detecting and discriminating such attacks from legitimate flash crowds.[62]
Malware detection
Classification of malware codes such as computer viruses, computer worms, trojans, ransomware, and spyware using machine learning techniques is inspired by the document categorization problem.[63] Ensemble learning systems have proven effective in this area.[64][65]
Intrusion detection
An intrusion detection system monitors a computer network or computer systems to identify intruder codes, much like an anomaly detection process. Ensemble learning successfully helps such monitoring systems reduce their total error.[66][67]
Face recognition
Face recognition, which has recently become one of the most popular research areas of pattern recognition, deals with the identification or verification of a person from their digital images.[68]
Hierarchical ensembles based on Gabor Fisher classifier and independent component analysis preprocessing techniques are some of the earliest ensembles employed in this field.[69][70][71]
Emotion recognition
Speech recognition is mainly based on deep learning, since most of the industry players in this field, such as Google, Microsoft, and IBM, report that the core technology of their speech recognition is based on this approach. Speech-based emotion recognition, however, can also achieve satisfactory performance with ensemble learning.[72][73]
It is also being successfully used in facial emotion recognition.[74][75][76]
Fraud detection
Fraud detection deals with the identification of bank fraud, such as money laundering, credit card fraud, and telecommunication fraud, which are broad domains of research and application for machine learning. Because ensemble learning improves the robustness of normal behavior modelling, it has been proposed as an efficient technique to detect such fraudulent cases and activities in banking and credit card systems.[77][78]
Financial decision-making
The accuracy of business failure prediction is a crucial issue in financial decision-making. Therefore, different ensemble classifiers have been proposed to predict financial crises and financial distress.[79] Also, in the trade-based manipulation problem, where traders attempt to manipulate stock prices through buying and selling activities, ensemble classifiers are required to analyze changes in stock market data and detect suspicious symptoms of stock price manipulation.[79]
Medicine
Ensemble classifiers have been successfully applied in neuroscience, proteomics, and medical diagnosis, for example in the detection of neuro-cognitive disorders (e.g., Alzheimer's disease or myotonic dystrophy) based on MRI datasets[80][81][82] and in cervical cytology classification.[83][84]
Ensembles have also been successfully applied in medical segmentation tasks, for example brain tumor[85][86] and hyperintensities segmentation.[87]
References
- ^ Opitz, D.; Maclin, R. (1999). "Popular ensemble methods: An empirical study". Journal of Artificial Intelligence Research. 11: 169–198. arXiv:1106.0257. doi:10.1613/jair.614.
- ^ Polikar, R. (2006). "Ensemble based systems in decision making". IEEE Circuits and Systems Magazine. 6 (3): 21–45. doi:10.1109/MCAS.2006.1688199. S2CID 18032543.
- ^ a b Rokach, L. (2010). "Ensemble-based classifiers". Artificial Intelligence Review. 33 (1–2): 1–39. doi:10.1007/s10462-009-9124-7. hdl:11323/1748. S2CID 11149239.
- ^ Blockeel H. (2011). "Hypothesis Space". Encyclopedia of Machine Learning. pp. 511–513. doi:10.1007/978-0-387-30164-8_373. ISBN 978-0-387-30768-8.
- ^ Ibomoiye Domor Mienye, Yanxia Sun (2022). A Survey of Ensemble Learning: Concepts, Algorithms, Applications and Prospects.
- ^ Kuncheva, L. and Whitaker, C., Measures of diversity in classifier ensembles, Machine Learning, 51, pp. 181-207, 2003
- ^ Sollich, P. and Krogh, A., Learning with ensembles: How overfitting can be useful, Advances in Neural Information Processing Systems, volume 8, pp. 190–196, 1996.
- ^ Brown, G. and Wyatt, J. and Harris, R. and Yao, X., Diversity creation methods: a survey and categorisation., Information Fusion, 6(1), pp.5-20, 2005.
- ^ Adeva, J. J. García; Cerviño, Ulises; Calvo, R. (December 2005). "Accuracy and Diversity in Ensembles of Text Categorisers" (PDF). CLEI Journal. 8 (2): 1:1–1:12. doi:10.19153/cleiej.8.2.1.
- ^ Ho, T., Random Decision Forests, Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278–282, 1995.
- ^ Gashler, M.; Giraud-Carrier, C.; Martinez, T. (2008). "Decision Tree Ensemble: Small Heterogeneous is Better Than Large Homogeneous" (PDF). 2008 Seventh International Conference on Machine Learning and Applications. Vol. 2008. pp. 900–905. doi:10.1109/ICMLA.2008.154. ISBN 978-0-7695-3495-4. S2CID 614810.
- ^ Liu, Y.; Yao, X. (December 1999). "Ensemble learning via negative correlation". Neural Networks. 12 (10): 1399–1404. doi:10.1016/S0893-6080(99)00073-8. ISSN 0893-6080. PMID 12662623.
- ^ Shoham, Ron; Permuter, Haim (2019). "Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble (Brief Announcement)". Cyber Security Cryptography and Machine Learning. Lecture Notes in Computer Science. Vol. 11527. pp. 202–207. doi:10.1007/978-3-030-20951-3_18. ISBN 978-3-030-20950-6. S2CID 189926552.
- ^ Terufumi Morishita et al, Rethinking Fano’s Inequality in Ensemble Learning, International Conference on Machine Learning, 2022
- ^ Wu, S., Li, J., & Ding, W. (2023) A geometric framework for multiclass ensemble classifiers, Machine Learning, 112(12), pp. 4929-4958. doi:10.1007/S10994-023-06406-W
- ^ R. Bonab, Hamed; Can, Fazli (2016). A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams. CIKM. USA: ACM. p. 2053.
- ^ Bonab, Hamed; Can, Fazli (2017). "Less is More: A Comprehensive Framework for the Number of Components of Ensemble Classifiers". arXiv:1709.02925 [cs.LG].
- ^ Tom M. Mitchell, Machine Learning, 1997, pp. 175
- ^ Salman, R., Alzaatreh, A., Sulieman, H., & Faisal, S. (2021). A Bootstrap Framework for Aggregating within and between Feature Selection Methods. Entropy (Basel, Switzerland), 23(2), 200. doi:10.3390/e23020200
- ^ Breiman, L., Bagging Predictors, Machine Learning, 24(2), pp.123-140, 1996. doi:10.1007/BF00058655
- ^ Brodeur, Z. P., Herman, J. D., & Steinschneider, S. (2020). Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search. Water Resources Research, 56, e2020WR027184. doi:10.1029/2020WR027184
- ^ e.g., Jennifer A. Hoeting; David Madigan; Adrian Raftery; Chris Volinsky (1999). "Bayesian Model Averaging: A Tutorial". Statistical Science. ISSN 0883-4237. Wikidata Q98974344.
- ^ Chris Fraley; Adrian Raftery; J. McLean Sloughter; Tilmann Gneiting, ensembleBMA: Probabilistic Forecasting using Ensembles and Bayesian Model Averaging, Wikidata Q98972500
- ^ Adrian Raftery; Jennifer A. Hoeting; Chris Volinsky; Ian Painter; Ka Yee Yeung, BMA: Bayesian Model Averaging, Wikidata Q91674106.
- ^ Adrian Raftery (1995). "Bayesian model selection in social research". Sociological Methodology: 111–196. doi:10.2307/271063. ISSN 0081-1750. Wikidata Q91670340.
- ^ Merlise A. Clyde; Michael L. Littman; Quanli Wang; Joyee Ghosh; Yingbo Li; Don van den Bergh, BAS: Bayesian Variable Selection and Model Averaging using Bayesian Adaptive Sampling, Wikidata Q98974089.
- ^ Gerda Claeskens; Nils Lid Hjort (2008), Model selection and model averaging, Cambridge University Press, Wikidata Q62568358, ch. 4.
- ^ Haussler, David; Kearns, Michael; Schapire, Robert E. (1994). "Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension". Machine Learning. 14: 83–113. doi:10.1007/bf00993163.
- ^ Kenneth P. Burnham; David R. Anderson (1998), Model Selection and Inference: A practical information-theoretic approach, Springer Science+Business Media, Wikidata Q62670082 and Kenneth P. Burnham; David R. Anderson (2002), Model Selection and Multimodel Inference: A practical information-theoretic approach (2nd ed.), Springer Science+Business Media, doi:10.1007/B97636, Wikidata Q76889160.
- ^ The Wikiversity article on Searching R Packages mentions several ways to find available packages for something like this. For example, "sos::findFn('{Bayesian model averaging}')" from within R will search for help files in contributed packages that includes the search term and open two tabs in the default browser. The first will list all the help files found sorted by package. The second summarizes the packages found, sorted by the apparent strength of the match.
- ^ Monteith, Kristine; Carroll, James; Seppi, Kevin; Martinez, Tony. (2011). Turning Bayesian Model Averaging into Bayesian Model Combination (PDF). Proceedings of the International Joint Conference on Neural Networks IJCNN'11. pp. 2657–2663.
- ^ Saso Dzeroski, Bernard Zenko, Is Combining Classifiers Better than Selecting the Best One, Machine Learning, 2004, pp. 255-273
- ^ Bensusan, Hilan; Giraud-Carrier, Christophe (2000). "Discovering Task Neighbourhoods through Landmark Learning Performances" (PDF). Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1910. pp. 325–330. doi:10.1007/3-540-45372-5_32. ISBN 978-3-540-41066-9.
- ^ Shoham, Ron; Permuter, Haim (2019). "Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble (Brief Announcement)". Cyber Security Cryptography and Machine Learning. Lecture Notes in Computer Science. Vol. 11527. pp. 202–207. doi:10.1007/978-3-030-20951-3_18. ISBN 978-3-030-20950-6.
- ^ Shoham, Ron; Permuter, Haim (2020). "Amended Cross Entropy Cost: Framework For Explicit Diversity Encouragement". arXiv:2007.08140 [cs.LG].
- ^ "1.11. Ensemble methods".
- ^ Wolpert (1992). "Stacked Generalization". Neural Networks. 5 (2): 241–259. doi:10.1016/s0893-6080(05)80023-1.
- ^ Breiman, Leo (1996). "Stacked regressions". Machine Learning. 24: 49–64. doi:10.1007/BF00117832.
- ^ Ozay, M.; Yarman Vural, F. T. (2013). "A New Fuzzy Stacked Generalization Technique and Analysis of its Performance". arXiv:1204.0171 [cs.LG].
- ^ Smyth, Padhraic; Wolpert, David (1999). "Linearly Combining Density Estimators via Stacking" (PDF). Machine Learning. 36 (1): 59–83. doi:10.1023/A:1007511322260. S2CID 16006860.
- ^ Wolpert, David H.; MacReady, William G. (1999). "An Efficient Method to Estimate Bagging's Generalization Error" (PDF). Machine Learning. 35 (1): 41–55. doi:10.1023/A:1007519102914. S2CID 14357246.
- ^ Clarke, B., Bayes model averaging and stacking when model approximation error cannot be ignored, Journal of Machine Learning Research, pp 683-712, 2003
- ^ Sill, J.; Takacs, G.; Mackey, L.; Lin, D. (2009). "Feature-Weighted Linear Stacking". arXiv:0911.0460 [cs.LG].
- ^ Amini, Shahram M.; Parmeter, Christopher F. (2011). "Bayesian model averaging in R" (PDF). Journal of Economic and Social Measurement. 36 (4): 253–287. doi:10.3233/JEM-2011-0350.
- ^ "BMS: Bayesian Model Averaging Library". The Comprehensive R Archive Network. 2015-11-24. Retrieved September 9, 2016.
- ^ "BAS: Bayesian Model Averaging using Bayesian Adaptive Sampling". The Comprehensive R Archive Network. Retrieved September 9, 2016.
- ^ "BMA: Bayesian Model Averaging". The Comprehensive R Archive Network. Retrieved September 9, 2016.
- ^ "Classification Ensembles". MATLAB & Simulink. Retrieved June 8, 2017.
- ^ a b Woźniak, Michał; Graña, Manuel; Corchado, Emilio (March 2014). "A survey of multiple classifier systems as hybrid systems". Information Fusion. 16: 3–17. doi:10.1016/j.inffus.2013.04.006. hdl:10366/134320. S2CID 11632848.
- ^ a b Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. (January 2012). "An assessment of the effectiveness of a random forest classifier for land-cover classification". ISPRS Journal of Photogrammetry and Remote Sensing. 67: 93–104. Bibcode:2012JPRS...67...93R. doi:10.1016/j.isprsjprs.2011.11.002.
- ^ Giacinto, Giorgio; Roli, Fabio (August 2001). "Design of effective neural network ensembles for image classification purposes". Image and Vision Computing. 19 (9–10): 699–707. CiteSeerX 10.1.1.11.5820. doi:10.1016/S0262-8856(01)00045-2.
- ^ Xia, Junshi; Yokoya, Naoto; Iwasaki, Yakira (March 2017). "A novel ensemble classifier of hyperspectral and LiDAR data using morphological features". 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6185–6189. doi:10.1109/ICASSP.2017.7953345. ISBN 978-1-5090-4117-6. S2CID 40210273.
- ^ Mochizuki, S.; Murakami, T. (November 2012). "Accuracy comparison of land cover mapping using the object-oriented image classification with machine learning algorithms". 33rd Asian Conference on Remote Sensing 2012, ACRS 2012. 1: 126–133.
- ^ Liu, Dan; Toman, Elizabeth; Fuller, Zane; Chen, Gang; Londo, Alexis; Xuesong, Zhang; Kaiguang, Zhao (2018). "Integration of historical map and aerial imagery to characterize long-term land-use change and landscape dynamics: An object-based analysis via Random Forests" (PDF). Ecological Indicators. 95 (1): 595–605. Bibcode:2018EcInd..95..595L. doi:10.1016/j.ecolind.2018.08.004. S2CID 92025959.
- ^ Giacinto, G.; Roli, F.; Fumera, G. (September 2000). "Design of effective multiple classifier systems by clustering of classifiers". Proceedings 15th International Conference on Pattern Recognition. ICPR-2000. Vol. 2. pp. 160–163. CiteSeerX 10.1.1.11.5328. doi:10.1109/ICPR.2000.906039. ISBN 978-0-7695-0750-7. S2CID 2625643.
- ^ Du, Peijun; Liu, Sicong; Xia, Junshi; Zhao, Yindi (January 2013). "Information fusion techniques for change detection from multi-temporal remote sensing images". Information Fusion. 14 (1): 19–27. doi:10.1016/j.inffus.2012.05.003.
- ^ Defined by Bruzzone et al. (2002) as "The data class that receives the largest number of votes is taken as the class of the input pattern", this is simple majority, more accurately described as plurality voting.
- ^ Zhao, Kaiguang; Wulder, Michael A; Hu, Tongx; Bright, Ryan; Wu, Qiusheng; Qin, Haiming; Li, Yang (2019). "Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm". Remote Sensing of Environment. 232 111181. Bibcode:2019RSEnv.23211181Z. doi:10.1016/j.rse.2019.04.034. hdl:11250/2651134. S2CID 201310998.
- ^ Bruzzone, Lorenzo; Cossu, Roberto; Vernazza, Gianni (December 2002). "Combining parametric and non-parametric algorithms for a partially unsupervised classification of multitemporal remote-sensing images" (PDF). Information Fusion. 3 (4): 289–297. doi:10.1016/S1566-2535(02)00091-X. hdl:11572/358372.
- ^ Theodomir, Mugiraneza; Nascetti, Andrea; Ban., Yifang (2020). "Continuous monitoring of urban land cover change trajectories with landsat time series and landtrendr-google earth engine cloud computing". Remote Sensing. 12 (18): 2883. Bibcode:2020RemS...12.2883M. doi:10.3390/rs12182883.
- ^ Li, Yang; Zhao, Kaiguang; Hu, Tongxi; Zhang, Xuesong. "BEAST: A Bayesian Ensemble Algorithm for Change-Point Detection and Time Series Decomposition". GitHub.
- ^ Raj Kumar, P. Arun; Selvakumar, S. (July 2011). "Distributed denial of service attack detection using an ensemble of neural classifier". Computer Communications. 34 (11): 1328–1341. doi:10.1016/j.comcom.2011.01.012.
- ^ Shabtai, Asaf; Moskovitch, Robert; Elovici, Yuval; Glezer, Chanan (February 2009). "Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey". Information Security Technical Report. 14 (1): 16–29. doi:10.1016/j.istr.2009.03.003.
- ^ Zhang, Boyun; Yin, Jianping; Hao, Jingbo; Zhang, Dingxing; Wang, Shulin (2007). "Malicious Codes Detection Based on Ensemble Learning". Autonomic and Trusted Computing. Lecture Notes in Computer Science. Vol. 4610. pp. 468–477. doi:10.1007/978-3-540-73547-2_48. ISBN 978-3-540-73546-5.
- ^ Menahem, Eitan; Shabtai, Asaf; Rokach, Lior; Elovici, Yuval (February 2009). "Improving malware detection by applying multi-inducer ensemble". Computational Statistics & Data Analysis. 53 (4): 1483–1494. CiteSeerX 10.1.1.150.2722. doi:10.1016/j.csda.2008.10.015.
- ^ Locasto, Michael E.; Wang, Ke; Keromytis, Angeles D.; Salvatore, J. Stolfo (2005). "FLIPS: Hybrid Adaptive Intrusion Prevention". Recent Advances in Intrusion Detection. Lecture Notes in Computer Science. Vol. 3858. pp. 82–101. CiteSeerX 10.1.1.60.3798. doi:10.1007/11663812_5. ISBN 978-3-540-31778-4.
- ^ Giacinto, Giorgio; Perdisci, Roberto; Del Rio, Mauro; Roli, Fabio (January 2008). "Intrusion detection in computer networks by a modular ensemble of one-class classifiers". Information Fusion. 9 (1): 69–82. CiteSeerX 10.1.1.69.9132. doi:10.1016/j.inffus.2006.10.002.
- ^ Mu, Xiaoyan; Lu, Jiangfeng; Watta, Paul; Hassoun, Mohamad H. (July 2009). "Weighted voting-based ensemble classifiers with application to human face recognition and voice recognition". 2009 International Joint Conference on Neural Networks. pp. 2168–2171. doi:10.1109/IJCNN.2009.5178708. ISBN 978-1-4244-3548-7. S2CID 18850747.
- ^ Yu, Su; Shan, Shiguang; Chen, Xilin; Gao, Wen (April 2006). "Hierarchical ensemble of Gabor Fisher classifier for face recognition". 7th International Conference on Automatic Face and Gesture Recognition (FGR06). pp. 91–96. doi:10.1109/FGR.2006.64. ISBN 978-0-7695-2503-7. S2CID 1513315.
- ^ Su, Y.; Shan, S.; Chen, X.; Gao, W. (September 2006). "Patch-Based Gabor Fisher Classifier for Face Recognition". 18th International Conference on Pattern Recognition (ICPR'06). Vol. 2. pp. 528–531. doi:10.1109/ICPR.2006.917. ISBN 978-0-7695-2521-1. S2CID 5381806.
- ^ Liu, Yang; Lin, Yongzheng; Chen, Yuehui (July 2008). "Ensemble Classification Based on ICA for Face Recognition". 2008 Congress on Image and Signal Processing. pp. 144–148. doi:10.1109/CISP.2008.581. ISBN 978-0-7695-3119-9. S2CID 16248842.
- ^ Rieger, Steven A.; Muraleedharan, Rajani; Ramachandran, Ravi P. (2014). "Speech based emotion recognition using spectral feature extraction and an ensemble of KNN classifiers". The 9th International Symposium on Chinese Spoken Language Processing. pp. 589–593. doi:10.1109/ISCSLP.2014.6936711. ISBN 978-1-4799-4219-0. S2CID 31370450.
- ^ Krajewski, Jarek; Batliner, Anton; Kessel, Silke (October 2010). "Comparing Multiple Classifiers for Speech-Based Detection of Self-Confidence - A Pilot Study". 2010 20th International Conference on Pattern Recognition (PDF). pp. 3716–3719. doi:10.1109/ICPR.2010.905. ISBN 978-1-4244-7542-1. S2CID 15431610.
- ^ Rani, P. Ithaya; Muneeswaran, K. (25 May 2016). "Recognize the facial emotion in video sequences using eye and mouth temporal Gabor features". Multimedia Tools and Applications. 76 (7): 10017–10040. doi:10.1007/s11042-016-3592-y. S2CID 20143585.
- ^ Rani, P. Ithaya; Muneeswaran, K. (August 2016). "Facial Emotion Recognition Based on Eye and Mouth Regions". International Journal of Pattern Recognition and Artificial Intelligence. 30 (7): 1655020. doi:10.1142/S021800141655020X.
- ^ Rani, P. Ithaya; Muneeswaran, K (28 March 2018). "Emotion recognition based on facial components". Sādhanā. 43 (3) 48. doi:10.1007/s12046-018-0801-6.
- ^ Louzada, Francisco; Ara, Anderson (October 2012). "Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool". Expert Systems with Applications. 39 (14): 11583–11592. doi:10.1016/j.eswa.2012.04.024.
- ^ Sundarkumar, G. Ganesh; Ravi, Vadlamani (January 2015). "A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance". Engineering Applications of Artificial Intelligence. 37: 368–377. doi:10.1016/j.engappai.2014.09.019.
- ^ a b Kim, Yoonseong; Sohn, So Young (August 2012). "Stock fraud detection using peer group analysis". Expert Systems with Applications. 39 (10): 8986–8992. doi:10.1016/j.eswa.2012.02.025.
- ^ Savio, A.; García-Sebastián, M.T.; Chyzyk, D.; Hernandez, C.; Graña, M.; Sistiaga, A.; López de Munain, A.; Villanúa, J. (August 2011). "Neurocognitive disorder detection based on feature vectors extracted from VBM analysis of structural MRI". Computers in Biology and Medicine. 41 (8): 600–610. doi:10.1016/j.compbiomed.2011.05.010. PMID 21621760.
- ^ Ayerdi, B.; Savio, A.; Graña, M. (June 2013). "Meta-ensembles of Classifiers for Alzheimer's Disease Detection Using Independent ROI Features". Natural and Artificial Computation in Engineering and Medical Applications. Lecture Notes in Computer Science. Vol. 7931. pp. 122–130. doi:10.1007/978-3-642-38622-0_13. ISBN 978-3-642-38621-3.
- ^ Gu, Quan; Ding, Yong-Sheng; Zhang, Tong-Liang (April 2015). "An ensemble classifier based prediction of G-protein-coupled receptor classes in low homology". Neurocomputing. 154: 110–118. doi:10.1016/j.neucom.2014.12.013.
- ^ Xue, Dan; Zhou, Xiaomin; Li, Chen; Yao, Yudong; Rahaman, Md Mamunur; Zhang, Jinghua; Chen, Hao; Zhang, Jinpeng; Qi, Shouliang; Sun, Hongzan (2020). "An Application of Transfer Learning and Ensemble Learning Techniques for Cervical Histopathology Image Classification". IEEE Access. 8: 104603–104618. Bibcode:2020IEEEA...8j4603X. doi:10.1109/ACCESS.2020.2999816. ISSN 2169-3536. S2CID 219689893.
- ^ Manna, Ankur; Kundu, Rohit; Kaplun, Dmitrii; Sinitca, Aleksandr; Sarkar, Ram (December 2021). "A fuzzy rank-based ensemble of CNN models for classification of cervical cytology". Scientific Reports. 11 (1): 14538. Bibcode:2021NatSR..1114538M. doi:10.1038/s41598-021-93783-8. ISSN 2045-2322. PMC 8282795. PMID 34267261.
- ^ Valenkova, Daria; Lyanova, Asya; Sinitca, Aleksandr; Sarkar, Ram; Kaplun, Dmitrii (April 2025). "A fuzzy rank-based ensemble of CNN models for MRI segmentation". Biomedical Signal Processing and Control. 102 107342. doi:10.1016/j.bspc.2024.107342.
- ^ Rajput, Snehal; Kapdi, Rupal; Roy, Mohendra; Raval, Mehul S. (June 2024). "A triplanar ensemble model for brain tumor segmentation with volumetric multiparametric magnetic resonance images". Healthcare Analytics. 5 100307. doi:10.1016/j.health.2024.100307.
- ^ Sundaresan, Vaanathi; Zamboni, Giovanna; Rothwell, Peter M.; Jenkinson, Mark; Griffanti, Ludovica (October 2021). "Triplanar ensemble U-Net model for white matter hyperintensities segmentation on MR images". Medical Image Analysis. 73 102184. doi:10.1016/j.media.2021.102184. PMC 8505759. PMID 34325148.
Further reading
- Zhou Zhihua (2012). Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC. ISBN 978-1-439-83003-1.
- Robert Schapire; Yoav Freund (2012). Boosting: Foundations and Algorithms. MIT. ISBN 978-0-262-01718-3.
External links
- Robi Polikar (ed.). "Ensemble learning". Scholarpedia.
- The Waffles (machine learning) toolkit contains implementations of Bagging, Boosting, Bayesian Model Averaging, Bayesian Model Combination, Bucket-of-models, and other ensemble techniques
Ensemble learning
Introduction
Definition and Motivation
Ensemble learning is a machine learning paradigm in which multiple base learners are trained to address the same problem, and their predictions are combined to form a final output rather than relying on a single hypothesis.[8] This approach leverages the idea that a group of models can collectively produce more reliable results than an individual model by aggregating diverse perspectives on the data.[8]

The primary motivation for ensemble learning stems from the limitations of single models, particularly their susceptibility to errors arising from insufficient training data, imperfect learning algorithms, or complex underlying data distributions that are difficult to approximate accurately.[8] By combining multiple learners, ensemble methods improve generalization performance by averaging out individual errors. This reduces the variance of predictions, especially for unstable base learners that are sensitive to small changes in the training data, and may also mitigate bias to some extent.[3] This error reduction aligns with the bias-variance tradeoff, where ensembles prioritize lowering variance without excessively increasing bias.[3]

Key benefits of ensemble learning include higher predictive accuracy compared to standalone models, enhanced robustness to noise in the data, and better handling of intricate patterns in high-dimensional or non-linear datasets.[3][9] For instance, studies have shown error rate reductions of 20-47% in classification tasks and 22-46% in regression mean squared error when using ensembles over single predictors.[3] These advantages make ensembles particularly valuable in real-world applications where data quality varies and model reliability is paramount.[8]

Basic aggregation mechanisms in ensemble learning include majority voting for classification tasks, where the class predicted by the most base learners is selected, and averaging for regression, where predictions are combined via mean or weighted mean to yield the final output.[8] A simple illustration involves decision trees on noisy datasets: a single tree may overfit to noise, leading to poor generalization, but an ensemble of such trees, through aggregation, smooths out these inconsistencies and outperforms the individual tree by reducing sensitivity to outliers and errors in the training samples.[3][10]
Historical Development
The origins of ensemble learning can be traced to early concepts in statistical averaging during the late 1960s and 1970s, building on prior ideas of combining predictions to leverage the wisdom of crowds in statistical forecasting. A foundational contribution came from Bates and Granger, who in 1969 demonstrated that combining multiple forecasts from different models could reduce mean-square error compared to individual predictions, laying groundwork for averaging techniques in predictive modeling.[11] This idea was extended in the 1970s by John Tukey, who explored ensembles of simple linear models to improve prediction accuracy through exploratory data analysis.[12] In the 1960s, neural network architectures featured committee machines (structures combining multiple simple classifiers) as precursors to modern ensembles, emphasizing probabilistic methods for pattern recognition and decision-making.[13] A key theoretical advancement occurred in 1990 when Hansen and Salamon provided justification for neural network ensembles, showing how averaging diverse models reduces variance and enhances generalization in supervised learning tasks.[14]

The 1990s marked a pivotal era of breakthroughs that formalized ensemble learning in machine learning. Leo Breiman introduced bagging (bootstrap aggregating) in 1996, a method that generates multiple instances of a training dataset through bootstrapping and aggregates their predictions to stabilize variance-prone models like decision trees. Shortly thereafter, Yoav Freund and Robert Schapire developed AdaBoost in 1997, an adaptive boosting algorithm that iteratively weights misclassified examples to focus subsequent weak learners, achieving strong theoretical guarantees for improved accuracy. Stacking was formalized by David Wolpert in 1992, enabling meta-learning by training a higher-level model on the outputs of base learners to capture complex interactions. These innovations, driven by key figures like Breiman, Freund, and Schapire, shifted ensembles from ad hoc combinations to principled frameworks.

In the 2000s, ensemble methods expanded with greater integration into practical applications and theoretical refinements. Breiman further advanced the field in 2001 with random forests, which combine bagging with random feature selection to create diverse decision trees, yielding robust performance on high-dimensional data. Stacking saw increased formalization, while ensembles began incorporating kernel methods, such as in support vector machine combinations, to handle non-linear problems more effectively.

The 2010s onward witnessed a shift toward scalability in big data and deep learning contexts, with ensembles of deep neural networks addressing overfitting in complex architectures.[15] A notable development was XGBoost by Tianqi Chen and Carlos Guestrin, an optimized gradient boosting framework that scaled boosting to massive datasets with superior efficiency; it was first released as open source in 2014, and the seminal paper followed in 2016. By the mid-2010s, ensembles profoundly influenced competitive machine learning, powering many winning solutions in Kaggle competitions through blended models that outperformed single algorithms.[16]
Fundamental Principles
Bias-Variance Tradeoff
The bias-variance tradeoff represents a fundamental challenge in statistical learning, where the goal is to minimize the expected prediction error of a model while balancing two sources of error: bias and variance. Bias refers to the systematic error introduced by approximating a true function with a simpler model, leading to underfitting when the model lacks sufficient complexity to capture underlying patterns in the data. Variance, on the other hand, measures the model's sensitivity to fluctuations in the training data, resulting in overfitting when the model is overly complex and fits noise rather than signal.[17] The irreducible error, often denoted as noise, arises from inherent stochasticity in the data and cannot be reduced by any model.

In regression settings, the expected mean squared error (MSE) for a model's prediction $\hat{f}(x)$ at a point $x$ decomposes as:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \mathrm{Bias}^2[\hat{f}(x)] + \mathrm{Var}[\hat{f}(x)] + \sigma^2$$

where $\mathrm{Bias}^2[\hat{f}(x)] = \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2$ quantifies the squared difference between the average prediction and the true function $f(x)$, $\mathrm{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]$ is the variance of the predictions across different training sets, and $\sigma^2$ is the variance of the noise $\epsilon$ in the data-generating process $y = f(x) + \epsilon$. This decomposition is derived by expanding the MSE expectation: first conditioning on $x$, then using the law of total expectation to separate the error into components attributable to model misspecification (bias) and sampling variability (variance), with the noise term remaining constant.

For instance, linear regression typically exhibits low variance but high bias on nonlinear problems, leading to underfitting, while deep decision trees show low bias but high variance, prone to overfitting on finite datasets. The tradeoff is illustrated by error curves plotting MSE against model complexity: bias decreases monotonically, variance increases, and total error forms a U-shape with an optimal complexity minimizing the sum.

Ensemble methods address this tradeoff by combining multiple models to achieve lower overall error without substantially altering bias.
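The decomposition can be checked numerically. The sketch below is a minimal Monte Carlo illustration under assumed settings that do not come from the source: a true function sin(2πx), Gaussian noise with standard deviation 0.3, training sets of 30 points, and two hypothetical toy models (a constant mean predictor and a 1-nearest-neighbor predictor). It estimates the three terms at one query point and compares their sum with the measured squared error.

```python
import math
import random
from statistics import fmean, pvariance

NOISE_SD = 0.3  # assumed noise level sigma (illustrative choice)


def true_f(x):
    """Assumed true function f(x) for the simulation."""
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=30):
    """Draw one training set from y = f(x) + eps."""
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """High-bias, low-variance toy model: always predict the mean label."""
    return fmean(ys)


def predict_1nn(xs, ys, x0):
    """Low-bias, high-variance toy model: copy the label of the nearest point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def decompose(predict, x0=0.25, trials=5000, seed=0):
    """Estimate bias^2, variance, noise, and MSE at x0 over many training sets."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        pred = predict(xs, ys, x0)
        y_new = true_f(x0) + rng.gauss(0.0, NOISE_SD)  # fresh noisy target
        preds.append(pred)
        sq_errors.append((y_new - pred) ** 2)
    return {
        "bias_sq": (fmean(preds) - true_f(x0)) ** 2,
        "variance": pvariance(preds),
        "noise": NOISE_SD ** 2,
        "mse": fmean(sq_errors),
    }


if __name__ == "__main__":
    for name, model in [("mean", predict_mean), ("1-NN", predict_1nn)]:
        print(name, decompose(model))
```

In this setup the mean predictor should show the larger bias term and the 1-NN predictor the larger variance term, while in both cases the three components add up to roughly the measured MSE.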
Averaging predictions from an ensemble of $M$ models with uncorrelated errors reduces the variance term by a factor of approximately $1/M$, while the bias remains close to that of the individual models if they are unbiased or weakly biased. This variance reduction is particularly effective for high-variance base learners, such as unpruned trees, shifting the ensemble toward the low-bias, low-variance region of the tradeoff curve. Graphically, single-model error curves exhibit high variance and erratic performance across datasets, whereas ensemble curves show smoother, lower MSE trajectories, demonstrating improved stability and generalization.
Role of Diversity in Ensembles
Diversity in ensemble learning refers to the differences in the predictions or errors made by individual base learners when applied to the same data instances, which can manifest as ambiguity in their outputs or disagreement on classifications. This concept is fundamental because it allows the ensemble to compensate for the weaknesses of any single model, as the varied error patterns among members enable mutual correction during combination. Measures of diversity often focus on how classifiers vote correctly or incorrectly on instances, capturing the extent to which they fail differently.[18]

There are several types of diversity that can be engineered in ensembles. Algorithmic diversity arises from employing different learning algorithms or model architectures, leading to fundamentally distinct decision boundaries. Data diversity is achieved by training models on varied subsets of the data, such as through sampling techniques that expose each learner to unique examples. Parameter diversity involves varying initial conditions, hyperparameters, or random seeds during training, which can produce models with similar architectures but divergent behaviors. These types collectively promote disagreement among base learners without compromising their individual accuracies.[19]

Common metrics quantify diversity to evaluate ensemble quality. The Q-statistic is a pairwise measure assessing the correlation between the correct/incorrect votes of two classifiers $i$ and $j$, ranging from -1 (maximum diversity) to 1 (perfect agreement), calculated as $Q_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}$, where $N^{ab}$ represents the joint vote probabilities (the proportion of instances on which classifier $i$ is correct if $a = 1$ or incorrect if $a = 0$, and likewise $b$ for classifier $j$). The Kohavi-Wolpert variance provides a non-pairwise assessment by decomposing ensemble error into the average base error minus a diversity term, effectively measuring the variance in predictions across all members; higher variance indicates higher diversity.
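As a small illustration, the Q-statistic can be computed from two lists that record whether each classifier was correct on each instance. This is a minimal sketch using the usual count-based form Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10); the function names and the choice to return NaN when the denominator is zero are assumptions made here, not part of the source.

```python
from itertools import combinations


def q_statistic(correct_i, correct_j):
    """Pairwise Q-statistic from per-instance correctness flags.

    correct_i[k] is True when classifier i labels instance k correctly.
    Returns NaN when the denominator is zero (an assumed convention).
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_i, correct_j):
        if a and b:
            n11 += 1          # both correct
        elif not a and not b:
            n00 += 1          # both wrong
        elif a:
            n10 += 1          # only i correct
        else:
            n01 += 1          # only j correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float("nan")
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct_matrix):
    """Average Q over all classifier pairs; one row of flags per classifier."""
    values = [q_statistic(a, b) for a, b in combinations(correct_matrix, 2)]
    return sum(values) / len(values)


if __name__ == "__main__":
    same = [True, True, False, False]
    opposite = [False, False, True, True]
    print(q_statistic(same, same))      # identical error patterns
    print(q_statistic(same, opposite))  # errors never coincide
```

Classifiers that err on exactly the same instances give Q = 1, while classifiers whose errors never coincide give Q = -1.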
Interrater agreement, another non-pairwise metric, evaluates the consistency of classifiers in erring on the same instances, with lower agreement signaling greater diversity. These metrics help identify ensembles where errors are uncorrelated.[18][20]

Diversity matters because correlated errors among base learners do not cancel out during averaging, limiting the ensemble's ability to reduce overall error; in contrast, uncorrelated errors enable a variance reduction scaling as $1/M$ for $M$ models, enhancing generalization as part of the bias-variance tradeoff. Methods to generate diversity include randomizing inputs (e.g., varying training data), outputs (e.g., perturbing predictions), or training procedures (e.g., altering optimization paths), all at a high level to foster independence without targeting specific algorithms. Empirical evidence supports this, with studies on neural network ensembles demonstrating that diverse members yield accuracy gains over single models, though broader analyses show the relationship can be weak in complex real-world datasets, emphasizing the need for balanced accuracy and diversity.[19][18]
Core Ensemble Techniques
Bagging and Bootstrap Aggregating
Bagging, also known as bootstrap aggregating, is a parallel ensemble learning technique designed to improve the stability and accuracy of machine learning algorithms by reducing variance through the combination of multiple models trained on different subsets of data. Introduced by Leo Breiman in 1996, it leverages bootstrap sampling to generate diverse training sets, allowing base learners to be trained independently and their predictions aggregated to form a final output. This method is particularly suited for algorithms that exhibit high variance, such as decision trees, where small changes in the training data can lead to significantly different models.[3]

The core algorithm of bagging proceeds in the following steps: First, generate B bootstrap samples from the original training dataset D of size n, where each sample is created by drawing n instances with replacement, resulting in each bootstrap sample containing approximately 63.2% unique instances on average due to the probabilistic nature of sampling with replacement. Next, train B independent base learners (e.g., decision trees) on these bootstrap samples in parallel. Finally, aggregate the predictions: for classification tasks, use majority voting (mode) across the base predictions; for regression, compute the average of the predictions. This process ensures that the ensemble benefits from the averaging effect, which smooths out individual model fluctuations.[3]

Bootstrap mechanics underpin bagging's effectiveness by introducing controlled randomness into the training process. Each bootstrap sample follows a multinomial distribution where each original instance has an equal probability of 1/n of being selected in any draw, leading to the expected proportion of unique instances approximating 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632 for large n. The instances not selected for a particular bootstrap sample, known as out-of-bag (OOB) samples and comprising about 36.8% of the data, provide a natural validation set.
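The 63.2% figure is easy to verify. The following minimal sketch compares the closed-form expression 1 - (1 - 1/n)^n with the fraction of distinct indices in one simulated bootstrap draw; the function names and the seed are illustrative choices.

```python
import random


def unique_fraction_closed_form(n):
    """Expected share of distinct instances in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def unique_fraction_simulated(n, seed=0):
    """Share of distinct indices in one simulated bootstrap draw of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
    return len(drawn) / n


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        in_bag = unique_fraction_closed_form(n)
        # The out-of-bag share is the complement of the in-bag share.
        print(n, round(in_bag, 4), round(1.0 - in_bag, 4),
              round(unique_fraction_simulated(n), 4))
```

The in-bag share approaches 1 - 1/e from above as n grows, so the out-of-bag share settles near 36.8%.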
OOB error can be estimated by evaluating each base learner on the OOB instances for its sample and aggregating these errors, offering an unbiased performance measure without requiring a held-out test set.[3]

A key property of bagging is its ability to reduce prediction variance without altering the expected bias of the base learners, as the aggregation averages over multiple realizations of the same underlying procedure. This makes it especially valuable for unstable learners like unpruned decision trees, where variance dominates the error; empirical studies in the original work showed significant variance reductions in simulated high-variance scenarios, such as up to 46% in the Friedman #1 dataset. However, for stable, low-variance learners such as linear models, the benefits are minimal since there is little variance to mitigate.[3]

One prominent variant of bagging is random forests, developed by Breiman in 2001, which enhances diversity by incorporating feature randomness during tree construction. In addition to bootstrap sampling for instances, at each node split, a random subset of mtry features (typically √p for classification, where p is the total number of features) is considered, preventing individual trees from becoming overly correlated and further reducing variance while maintaining low bias. Random forests have demonstrated superior performance over plain bagging on various benchmarks, including those from the UCI repository.[4]

The following pseudocode illustrates the bagging procedure for a classification task:

Algorithm Bagging(D, B, BaseLearner):
    Input: Dataset D of size n, number of bootstrap samples B, base learning algorithm BaseLearner
    Output: Ensemble predictor f(x)
    for b = 1 to B do:
        Sample_b = BootstrapSample(D)  // Draw n instances with replacement
        h_b = BaseLearner(Sample_b)    // Train base model on Sample_b
    f(x) = mode({h_1(x), h_2(x), ..., h_B(x)})  // Majority vote for class prediction
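The pseudocode can be turned into a small runnable sketch. Everything below is illustrative rather than a reference implementation: `train_stump` is a hypothetical base learner (a single-feature threshold rule), and `oob_error` uses the common per-instance variant in which each instance is voted on only by the models whose bootstrap sample did not contain it.

```python
import random
from collections import Counter


def train_stump(X, y):
    """Hypothetical base learner: the best single-feature threshold rule."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            left_lab = Counter(left).most_common(1)[0][0]
            right_lab = Counter(right).most_common(1)[0][0] if right else left_lab
            errors = (sum(lab != left_lab for lab in left)
                      + sum(lab != right_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, thr, left_lab, right_lab)
    _, j, thr, left_lab, right_lab = best
    return lambda row: left_lab if row[j] <= thr else right_lab


def bagging_fit(X, y, n_models, base_learner, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        model = base_learner([X[i] for i in idx], [y[i] for i in idx])
        models.append((model, set(idx)))            # remember in-bag indices
    return models


def bagging_predict(models, row):
    """Majority vote over all base learners."""
    votes = Counter(model(row) for model, _ in models)
    return votes.most_common(1)[0][0]


def oob_error(models, X, y):
    """Out-of-bag error: each instance is judged only by models that never saw it."""
    wrong = counted = 0
    for i, (row, label) in enumerate(zip(X, y)):
        votes = Counter(model(row) for model, in_bag in models if i not in in_bag)
        if votes:
            counted += 1
            wrong += int(votes.most_common(1)[0][0] != label)
    return wrong / counted if counted else float("nan")


if __name__ == "__main__":
    X = [[float(i)] for i in range(20)]
    y = [0] * 10 + [1] * 10
    models = bagging_fit(X, y, n_models=25, base_learner=train_stump)
    print(bagging_predict(models, [2.0]), bagging_predict(models, [17.0]))
    print(oob_error(models, X, y))
```

Storing the in-bag index set next to each model is a design choice that makes the OOB estimate a by-product of training, with no separate validation split.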
Boosting Methods
Boosting is a machine learning ensemble technique that combines multiple weak learners sequentially to create a strong learner, with each subsequent model focusing on correcting the errors of the previous ones by assigning higher weights to misclassified training instances.[21] This adaptive process aims to reduce bias in the ensemble, differing from parallel methods by emphasizing sequential improvement on difficult examples.[22]

The seminal AdaBoost algorithm, introduced by Freund and Schapire in 1997, exemplifies this approach for binary classification.[21] It initializes equal weights for all training samples and iteratively trains weak classifiers (typically decision stumps with error less than 0.5) on the weighted data. After each iteration $t$, the weights of misclassified instances are updated as $w_i \leftarrow w_i \exp\left(\alpha_t \, \mathbb{I}(y_i \neq h_t(x_i))\right)$ and then renormalized, where $\mathbb{I}$ is the indicator function, and the classifier weight is $\alpha_t = \ln\frac{1 - \epsilon_t}{\epsilon_t}$ with $\epsilon_t$ being the weighted error.[21] The final ensemble prediction is given by $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, where $T$ is the number of iterations. AdaBoost minimizes an exponential loss function, $L(y, F(x)) = \exp(-y \, F(x))$ on the real-valued ensemble score $F(x)$, which upper-bounds the zero-one classification loss and promotes margin maximization.[21]

Variants of boosting extend this framework to broader settings. Gradient boosting, proposed by Friedman in 2001, generalizes AdaBoost by fitting an additive model along the negative gradient of an arbitrary differentiable loss function, such as least squares for regression ($L(y, F) = \tfrac{1}{2}(y - F)^2$) or log-loss for classification.[22] Each weak learner, often a regression tree, approximates the residuals from prior models, enabling handling of non-exponential losses and regression tasks.
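The AdaBoost loop can be sketched in a few lines. This is a minimal illustration rather than the authors' reference code: it assumes labels in {-1, +1}, uses the indicator-form update (multiply the weight of each misclassified instance by exp(α), then renormalize) with α = ln((1 - ε)/ε), and relies on a hypothetical weighted decision stump, `train_weighted_stump`, as the weak learner.

```python
import math


def train_weighted_stump(X, y, w):
    """Hypothetical weak learner: the threshold rule with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] <= thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, (lambda row: sign if row[j] <= thr else -sign)


def adaboost_fit(X, y, n_rounds):
    """Return a list of (alpha_t, h_t) pairs; labels must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n                               # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        eps, h = train_weighted_stump(X, y, w)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)     # guard against log(0)
        alpha = math.log((1.0 - eps) / eps)         # classifier weight
        ensemble.append((alpha, h))
        # Up-weight misclassified instances, then renormalize to sum to 1.
        w = [wi * math.exp(alpha * (h(row) != yi))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote."""
    score = sum(alpha * h(row) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # An interval pattern that no single stump can classify perfectly.
    X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
    y = [1, 1, -1, -1, 1, 1]
    model = adaboost_fit(X, y, n_rounds=3)
    print([adaboost_predict(model, row) for row in X])
```

On this toy dataset a single stump misclassifies at least two of the six points, whereas three boosted stumps are enough to fit the interval.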
XGBoost, developed by Chen and Guestrin in 2016, builds on gradient boosting with optimizations like L1 and L2 regularization, tree pruning to prevent overfitting, and parallel computation for scalability on large datasets.[6]

Under mild conditions, boosting algorithms converge to zero training error when using weak learners with accuracy better than random guessing (error < 0.5).[21] This theoretical guarantee ensures that the ensemble achieves strong generalization if the weak learners are sufficiently diverse and the training data is separable.[22]

Boosting methods, particularly gradient boosting variants like XGBoost, demonstrate high predictive accuracy on tabular data benchmarks, often outperforming other ensembles and deep learning models in structured datasets with mixed feature types.
Stacking and Meta-Learning
Stacking, also known as stacked generalization, is an ensemble learning technique that employs a hierarchical structure to combine the predictions of multiple base models through a meta-learner, aiming to reduce generalization error. Introduced by David Wolpert in 1992, it operates on two levels: level-0 consists of heterogeneous base models, such as decision trees or neural networks, which are trained on the original training data to produce initial predictions; level-1 involves a meta-learner, often a simple model like logistic regression, that takes these predictions (referred to as meta-features) as input to learn an optimal combination rule.[23]

A critical aspect of stacking is the generation of unbiased meta-features to train the meta-learner, which is achieved using k-fold cross-validation on the base models. In this process, the training data is divided into k folds; for each fold, the base models are trained on the remaining k-1 folds and used to predict the held-out fold, ensuring that no base model prediction is derived from data it was trained on. This out-of-fold prediction strategy prevents information leakage and overfitting, with the aggregated meta-features across all folds serving as the dataset for training the meta-learner.[23]

The full stacking algorithm proceeds in stages: first, apply cross-validation to the base models to generate the meta-feature dataset and train the meta-learner on it; second, retrain all base models on the entire original training set; finally, for new test instances, obtain predictions from the retrained base models and feed them into the meta-learner to produce the ensemble's output.
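The three stages can be sketched as follows. This is a simplified illustration under stated assumptions, not a standard API: it uses one-dimensional regression instead of classification, three hypothetical toy base learners (`fit_mean`, `fit_line`, `fit_1nn`), and a linear meta-learner fitted by plain gradient descent, whose step size is only suitable when the predictions are of order one.

```python
import random
from statistics import fmean


def fit_mean(xs, ys):
    """Toy base learner: constant prediction."""
    m = fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Toy base learner: simple least-squares line."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_1nn(xs, ys):
    """Toy base learner: 1-nearest-neighbor regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_linear_meta(Z, ys, steps=2000, lr=0.05):
    """Meta-learner: linear blend of base predictions via gradient descent."""
    w = [1.0 / len(Z[0])] * len(Z[0])
    n = len(Z)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for z, y in zip(Z, ys):
            err = sum(wi * zi for wi, zi in zip(w, z)) - y
            for k, zk in enumerate(z):
                grad[k] += 2.0 * err * zk / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return lambda z: sum(wi * zi for wi, zi in zip(w, z))


def stacking_fit(xs, ys, base_fitters, fit_meta, k=5, seed=0):
    """Return (predict, Z), where Z holds the out-of-fold meta-features."""
    n = len(xs)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    Z = [[0.0] * len(base_fitters) for _ in range(n)]
    # Stage 1: out-of-fold predictions, so no model predicts its own training data.
    for held_out in folds:
        held = set(held_out)
        tr_x = [xs[i] for i in range(n) if i not in held]
        tr_y = [ys[i] for i in range(n) if i not in held]
        for j, fit in enumerate(base_fitters):
            model = fit(tr_x, tr_y)
            for i in held_out:
                Z[i][j] = model(xs[i])
    meta = fit_meta(Z, ys)
    # Stage 2: retrain every base model on the full training set.
    base_models = [fit(xs, ys) for fit in base_fitters]

    # Stage 3: new inputs go through the base models, then the meta-learner.
    def predict(x):
        return meta([m(x) for m in base_models])

    return predict, Z


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.random() for _ in range(60)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    predict, Z = stacking_fit(xs, ys, [fit_mean, fit_line, fit_1nn], fit_linear_meta)
    print(predict(0.9))
```

Returning the meta-feature matrix alongside the predictor makes it easy to inspect whether the out-of-fold predictions differ from in-sample ones, which is the leakage check the cross-validation step exists for.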
This approach allows the meta-learner to adaptively weigh or transform base predictions based on their correlations.[23]

A common variant of stacking is blending, which replaces cross-validation with a single hold-out set for generating meta-features, typically reserving a portion (e.g., 10%) of the training data solely for this purpose while training base models on the rest. Blending is computationally faster and simpler but may be less accurate due to the reduced amount of data available for meta-learner training.[24]

Stacking offers significant advantages over simpler ensemble methods by enabling the meta-learner to capture non-linear interactions and dependencies among base model outputs, leading to more sophisticated aggregation that can yield superior predictive performance, as demonstrated in empirical evaluations on benchmark datasets. Its flexibility in choosing diverse base models further enhances robustness by exploiting complementary strengths.[23]

Despite these benefits, stacking presents challenges, particularly the potential for overfitting if cross-validation is inadequately configured or if the meta-learner is overly complex relative to the meta-feature dataset size. Proper implementation, such as using simpler meta-learners and sufficient folds in cross-validation, is essential to maintain generalization.
Voting and Simple Ensembles
Voting and simple ensembles represent foundational aggregation techniques in ensemble learning, where predictions from multiple base classifiers are combined using straightforward rules without additional training of a meta-learner. These methods rely on the principle that aggregating diverse or independent predictions can reduce variance and improve overall accuracy, particularly when individual models have comparable performance.
Hard voting, also known as majority voting, is applied in classification tasks by assigning each base classifier's predicted class a single vote, with the ensemble selecting the class receiving the most votes. This unweighted approach assumes equal reliability among classifiers and is particularly effective for discrete outputs.[25]
For classifiers that output probability distributions, soft voting extends hard voting by averaging the predicted probabilities across all base models for each class and selecting the class with the highest average probability. This method incorporates confidence levels, often leading to more nuanced decisions than hard voting, as it uses the full probabilistic information rather than only each model's top-ranked class.
Weighted voting builds on these by assigning higher weights to more accurate base classifiers, typically determined by their performance on a validation set, such as accuracy or error rate. Weights can be computed as proportional to the inverse of the error rate or through optimization to maximize ensemble performance, enhancing the aggregation when base models vary in quality.[25]
The bucket of models approach involves generating a large library of candidate models (often using varied algorithms, hyperparameters, or data subsets) and dynamically selecting the top-k performers based on a held-out validation metric, such as cross-validated accuracy. This selection can be static or adaptive per test instance, creating a simple yet effective ensemble from high-performing subsets without complex combination rules.
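The hard, soft, and weighted rules described above differ only in how the base outputs are aggregated, which a small NumPy sketch can make concrete. The probability table and the validation accuracies below are hypothetical values chosen so that the rules disagree on the first sample; weights proportional to validation accuracy are one simple choice among those mentioned above.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for
# four samples and three classes: shape (models, samples, classes).
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.35, 0.25], [0.1, 0.2, 0.7]],
    [[0.5, 0.4, 0.1], [0.1, 0.35, 0.55], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
    [[0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.5, 0.4], [0.3, 0.3, 0.4]],
])
n_classes = probs.shape[2]

# Hard voting: each model casts one vote for its top class.
labels = probs.argmax(axis=2)                       # (models, samples)
vote_counts = np.stack(
    [(labels == c).sum(axis=0) for c in range(n_classes)], axis=1
)                                                   # (samples, classes)
hard = vote_counts.argmax(axis=1)

# Soft voting: average the probabilities, then pick the top class.
soft = probs.mean(axis=0).argmax(axis=1)

# Weighted soft voting: weights proportional to (hypothetical)
# validation accuracies of the three models.
val_acc = np.array([0.9, 0.8, 0.6])
weights = val_acc / val_acc.sum()
weighted = np.tensordot(weights, probs, axes=1).argmax(axis=1)

print("hard:    ", hard.tolist())
print("soft:    ", soft.tolist())
print("weighted:", weighted.tolist())
```

On the first sample two models narrowly prefer class 0 while the third strongly prefers class 1, so hard voting and soft voting reach different decisions; down-weighting the least accurate model moves the weighted result back to class 0.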
Implementation of voting and simple ensembles is straightforward, especially for homogeneous setups where all base models share the same architecture (e.g., multiple support vector machines tuned with different kernels or regularization parameters), requiring only aggregation post-training. These techniques are ideal for rapid prototyping with similar models or as a baseline comparator to more advanced methods, offering low computational overhead for combination. Bagging serves as an example of a voting-based method, where bootstrap samples train base learners combined via majority or averaged voting.[26][25]
A key advantage of majority voting in simple ensembles is its ability to reduce error rates under assumptions of independence among classifiers. For instance, if each of $N$ base classifiers has an error rate $\epsilon < 0.5$, the ensemble error under majority vote follows from the binomial distribution: the probability of more than half erring is $\sum_{k > N/2} \binom{N}{k} \epsilon^{k} (1 - \epsilon)^{N-k}$, which decreases toward zero as $N$ increases for independent voters, per the Condorcet jury theorem. Hoeffding's inequality bounds this tail by $\exp(-2N(0.5 - \epsilon)^2)$, so the reduction can be substantial even for moderate $N$, illustrating how even simple aggregation leverages collective strength to outperform individuals.[25]
Advanced Ensemble Methods
Bayesian Model Averaging and Combination
Bayesian model averaging (BMA) operates within a fully probabilistic framework, where predictions are obtained by integrating over the posterior distribution of possible models given the observed data. The posterior predictive distribution for a new observation $y^*$ given input $x^*$ and data $D$ is given by $p(y^* \mid x^*, D) = \int_{\mathcal{M}} p(y^* \mid x^*, M, D)\, p(M \mid D)\, dM$, where $\mathcal{M}$ represents the model space. This integral accounts for model uncertainty by weighting each model's contribution according to its posterior probability $p(M \mid D)$.[27]
In practice, the continuous integral is approximated by a discrete weighted average over a finite set of candidate models $M_1, \dots, M_K$, with weights $w_k = p(M_k \mid D) \propto p(D \mid M_k)\, p(M_k)$, where $p(D \mid M_k)$ is the marginal likelihood or evidence for model $M_k$. This approach approximates the ideal Bayes optimal classifier by averaging predictions across models, providing a principled way to hedge against uncertainty in model choice.[28][29]
Unlike model selection, which commits to a single "best" model and often leads to overconfident predictions by ignoring uncertainty in the choice, BMA distributes probability mass across multiple models, yielding wider predictive distributions and more reliable uncertainty estimates. Model combination in BMA typically employs linear pooling, where the combined posterior predictive is $p(y^* \mid x^*, D) = \sum_{k=1}^{K} w_k\, p(y^* \mid x^*, M_k, D)$, representing a weighted arithmetic mean of individual model predictions. Alternatively, logarithmic opinion pools can be used for combining densities, forming $p(y^* \mid x^*, D) \propto \prod_{k=1}^{K} p(y^* \mid x^*, M_k, D)^{w_k}$, which emphasizes consensus and is particularly suited for proper scoring rules in probabilistic forecasting.[27][30]
BMA finds applications in uncertainty quantification for both regression and classification tasks, enabling calibrated probability estimates that reflect epistemic uncertainty from model ambiguity. For instance, in regression, it produces predictive intervals that incorporate model variance, improving reliability over single-model baselines.
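The discrete weighting and the two pooling rules described above can be illustrated numerically. The sketch below is only an illustration: the log marginal likelihoods, the uniform model prior, and the per-model predictive probabilities are hypothetical numbers, not results from any fitted model.

```python
import numpy as np

# Hypothetical log marginal likelihoods log p(D | M_k) for three models.
log_evidence = np.array([-120.0, -121.0, -125.0])
# Uniform prior over the candidate models (an assumption of this sketch).
log_prior = np.log(np.full(3, 1.0 / 3.0))

# Posterior model weights: normalise exp(log evidence + log prior),
# subtracting the maximum first for numerical stability.
log_post = log_evidence + log_prior
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()

# Hypothetical predictive probability of the positive class under each model.
p_model = np.array([0.80, 0.60, 0.30])

# Linear pool: weighted arithmetic mean of the model predictions.
p_linear = float(weights @ p_model)

# Logarithmic pool: weighted geometric mean, renormalised over both outcomes.
log_yes = weights @ np.log(p_model)
log_no = weights @ np.log(1.0 - p_model)
p_log = float(np.exp(log_yes) / (np.exp(log_yes) + np.exp(log_no)))

print("weights:", np.round(weights, 4))
print(f"linear pool: {p_linear:.4f}, log pool: {p_log:.4f}")
```

Because the weights depend exponentially on the log evidence, a gap of a few log units is enough for one model to dominate the average, which is why BMA can behave much like model selection when one candidate is clearly favoured.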
In classification, BMA enhances decision-making under uncertainty by averaging posterior probabilities, as demonstrated in high-dimensional settings where model selection risks overfitting.[27][29]
Computing the marginal likelihoods needed for BMA weights often relies on Markov Chain Monte Carlo (MCMC) methods to integrate over parameter spaces, though this can be computationally intensive for large model classes. For scalability, approximations such as the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) are used, where $-\mathrm{BIC}_k/2$ provides a large-sample approximation of the log marginal likelihood $\log p(D \mid M_k)$, allowing rapid computation of posterior model probabilities. These differ fundamentally from frequentist ensemble weights, which rely on empirical accuracy metrics like validation error, whereas BMA uses probabilistic evidence to prioritize models with strong data support relative to complexity.[27]
Specialized Ensembles for Diversity
Negative correlation learning (NCL) is a technique designed to construct ensembles by explicitly promoting diversity among component models through modification of the training objective. Introduced by Liu and Yao, it trains individual neural networks simultaneously, incorporating a penalty term in the loss function that discourages correlated errors across the ensemble members.[31] Specifically, the error function for each individual network $i$ in an ensemble of $M$ networks, trained on $N$ examples $(x_n, y_n)$, is defined as $E_i = \frac{1}{N}\sum_{n=1}^{N}\left[\tfrac{1}{2}\big(f_i(x_n) - y_n\big)^2 + \lambda\, p_i(x_n)\right]$, where the first term is the (halved) squared error of the individual, $p_i(x_n) = \big(f_i(x_n) - \bar{f}(x_n)\big)\sum_{j \neq i}\big(f_j(x_n) - \bar{f}(x_n)\big)$ measures the correlation of its error with the rest of the ensemble around the ensemble average $\bar{f}$, and $\lambda$ controls the strength of the diversity penalty, typically set between 0 and 1.[31] This approach decomposes the ensemble error into bias, variance, and covariance terms, aiming to reduce the covariance while balancing individual accuracy.[31]
To further encourage diversity in classification tasks, modified loss functions such as the amended cross-entropy cost adjust the standard cross-entropy by incorporating terms that penalize agreement on incorrect predictions among ensemble members. Frameworks have been proposed where diversity is integrated into the cost via measures like the ambiguity metric, which quantifies the average pairwise disagreement in classifier outputs, thereby promoting varied decision boundaries during training. This modification helps mitigate over-reliance on majority voting by fostering complementary errors, particularly useful in scenarios with overlapping class regions.
Decorrelation techniques extend these ideas by enforcing independence in the training processes of ensemble components, often through orthogonal projections or error decomposition.
For instance, the ensemble-based decorrelation method regularizes hidden layers of neural networks by minimizing the off-diagonal elements of the covariance matrix of activations, effectively orthogonalizing feature representations across models to enhance diversity without sacrificing individual performance.[32] Error decomposition approaches, such as those partitioning residuals into orthogonal components, ensure that each model focuses on unique aspects of the data variance, reducing redundancy in predictions.[32]
During training, pairwise diversity indices serve as metrics to enforce and monitor diversity, guiding optimization or selection processes. Key indices include the Q-statistic, which measures the correlation between two classifiers' errors adjusted for chance agreement, and the disagreement measure, defined as the proportion of samples where classifiers differ in their predictions. These indices are computed iteratively and incorporated into the objective function or used for pruning, ensuring the ensemble maintains high diversity levels, such as Q values below 0.5 indicating beneficial disagreement.
Representative examples of specialized ensembles include Forest-RK, which promotes kernel diversity by constructing random forests in reproducing kernel Hilbert spaces, where each tree operates on transformed features via diverse kernel approximations to capture nonlinear interactions uniquely.[33] Applications in imbalanced datasets leverage these methods to amplify minority class representation through diverse sampling and error focusing; for example, diversity-enhanced bagging variants analyze pairwise correlations to balance error distributions across classes, improving recall on rare events in benchmark studies.
Evaluation of these ensembles highlights inherent diversity-accuracy tradeoffs, where excessive diversity can degrade individual model strength, leading to suboptimal ensemble performance. Tang et al. demonstrated this tradeoff using multi-objective evolutionary algorithms to optimize both metrics simultaneously, showing that ensembles achieving moderate diversity yield the best generalization over non-diverse counterparts.[34] This balance is assessed via decomposition of ensemble error into bias-variance-covariance components, confirming that controlled diversity minimizes overall variance while preserving low bias.[34]
Recent Innovations in Ensemble Learning
Recent innovations in ensemble learning have focused on enhancing interpretability, fairness, and scalability, particularly in handling complex datasets and addressing ethical concerns in machine learning applications. These developments build on foundational techniques by incorporating mechanisms for explainability and bias mitigation, enabling more reliable deployment in sensitive domains such as healthcare and materials science.
One notable advancement is Hellsemble, a 2025 framework designed for binary classification that improves efficiency and interpretability by partitioning datasets based on complexity and routing instances through specialized models during training and inference. This approach dynamically selects and combines models, reducing computational overhead while maintaining high accuracy on challenging data subsets. Hellsemble's structure allows for transparent decision paths, making it suitable for scenarios requiring auditable predictions.[35]
In parallel, fairness-aware boosting algorithms have emerged since 2024 to mitigate demographic biases in ensemble predictions. For instance, extensions to AdaBoost incorporate reweighting techniques that enforce demographic parity constraints, balancing accuracy with equitable outcomes across protected groups without significant performance degradation. These methods adjust sample weights iteratively to prioritize fairness metrics, demonstrating improved equity in classification tasks on imbalanced datasets.[36]
For bioinformatics applications, stratified sampling blending (ssBlending), introduced in 2025, optimizes traditional blending ensembles by incorporating stratified sampling to ensure balanced representation across data strata, leading to more stable and accurate predictions in genomic analyses. This technique enhances ensemble robustness by reducing variance in meta-learner outputs, particularly beneficial for high-dimensional biological data where class imbalance is prevalent.[37]
In materials science, interpretable ensembles leveraging regression trees and selective model combination have advanced property forecasting as of 2025. These methods use classical interatomic potentials to train tree-based ensembles, followed by pruning and selection to identify key predictors, yielding precise predictions of material properties like elasticity while providing feature importance rankings for scientific insight.[38]
Broader trends include deeper integration of ensembles with neural networks, such as snapshot ensembles applied to masked autoencoders for improved visual representation learning in 2025, which capture diverse model states during training to boost generalization without additional computational cost. Efficiency gains in large-scale settings are also achieved through advanced pruning strategies that preserve out-of-distribution generalization by selectively retaining diverse base learners, reducing ensemble size by up to 50% while maintaining predictive power.
These innovations directly tackle persistent challenges like computational expense and lack of explainability. For example, applying SHAP values to ensemble outputs has become standard for attributing predictions to individual features or models, as seen in recent water quality monitoring frameworks as of 2025, enabling users to dissect complex interactions and build trust in high-stakes decisions.[39]
Theoretical Foundations
General Ensemble Theory
Ensemble learning's theoretical superiority over single models stems from principles that leverage diversity and aggregation to reduce error and improve generalization. A cornerstone is Condorcet's jury theorem, originally formulated in 1785, which provides a probabilistic justification for majority voting in binary classification. The theorem posits that if each of $N$ independent classifiers has accuracy $p > 0.5$, the probability that the majority vote errs decreases to 0 as $N \to \infty$; Hoeffding's inequality shows that the decay is exponential in $N$: $P(\text{error}) \le \exp(-2N(p - 0.5)^2)$. This convergence holds under the assumptions of independence and better-than-chance individual accuracy, under which the majority vote becomes asymptotically error-free even when each base model is only weakly accurate.
Building on this, Leo Breiman's work in the 1990s on bagging and related methods showed that ensembles of diverse classifiers can achieve large error reductions compared to individual models, with the benefit growing as the correlation between member errors falls. This aligns with the bias-variance tradeoff, where such ensembles primarily reduce variance while preserving the low bias of the base learners. The no free lunch theorem further contextualizes these gains: since no single algorithm excels across all problems, ensembles hedge against this by combining diverse hypotheses, broadening coverage of the hypothesis space without claiming a universal "free lunch."
Theoretical guarantees for finite-sample performance rely on generalization bounds tailored to ensemble classes. Using the VC dimension, the capacity of a weighted vote over $N$ classifiers from a base class with VC dimension $d$ is bounded by $O(Nd \log(Nd))$, so capacity grows only slightly faster than linearly in $N$, and the number of samples required for $\epsilon$-accurate generalization with confidence $1-\delta$ grows in proportion to it. Rademacher complexity provides tighter data-dependent bounds: the Rademacher complexity of convex combinations of base hypotheses equals that of the base class $H$ itself, yielding excess risk controls of the form $O(\mathcal{R}_n(H) + \sqrt{\log(1/\delta)/n})$, where $\mathcal{R}_n(H)$ is the empirical Rademacher average over $n$ training samples.[40] These bounds indicate that ensembles can maintain favorable generalization despite increased model count.
Asymptotically, the standard analysis of bagging for regression makes the role of correlation explicit: for $N$ identically distributed base predictors with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/N$. When $\rho$ is small the variance drops as $O(1/N)$, but the first term does not depend on $N$ and sets a floor of $\rho\sigma^2$. This floor underpins the practical observation that ensemble performance plateaus after a moderate $N$, balancing computational cost with theoretical benefits.
Geometric Framework for Ensembles
In the geometric framework for ensembles, individual linear classifiers are represented as hyperplanes in a vector space, where the decision boundary of each base learner separates the classes. Averaging the classifiers' scores corresponds to averaging their weight vectors, which shifts and smooths the overall boundary and reduces sensitivity to noise in any single classifier. This vector space perspective highlights how the combination leverages the geometric arrangement of base decisions to form a more robust boundary in the input space.
Margin-based geometry provides a key insight into boosting methods within this framework, where the ensemble iteratively adjusts weak learner hyperplanes to maximize the geometric margin (the perpendicular distance from the decision boundary to the closest data points of each class), analogous to support vector machines. By focusing on margin maximization, boosting geometrically expands the separation between classes, minimizing the risk of misclassification in regions near the boundary and improving generalization by increasing the "slack" around correct predictions. This process can be visualized as successive rotations and translations of hyperplanes that collectively widen the safe zone for the ensemble's final boundary.[41]
Error regions, defined as the subsets of the input space where individual classifiers misclassify instances, play a central role in understanding ensemble performance geometrically. Under majority voting the ensemble errs only where more than half of the base classifiers err at once, so its effective error region is the set of points covered by a majority of the individual error regions, not their union. High diversity among base classifiers ensures that these regions overlap minimally, which shrinks the ensemble's error region relative to the individual ones. As a result, the ensemble's decision boundary avoids large ambiguous areas, with the majority vote resolving disagreements in overlapping zones more effectively than any single classifier.[42]
Visualizations in two dimensions illustrate this framework clearly: a single linear classifier produces a straight decision boundary dividing the plane into class regions, but an ensemble of such classifiers combined by voting yields a composite boundary that is piecewise linear, adapting to nonlinear data separability. For instance, combining multiple tilted linear boundaries can approximate a circular or elliptical enclosure around one class, demonstrating how voting transforms simple hyperplanes into complex separators without explicit nonlinearity in the base learners.[43]
Kuncheva's framework further refines this view by mapping ensembles into a feature space constructed from the base classifiers' predictions, treating each prediction as a coordinate for a new "meta-feature" vector per instance. In this space, the ensemble combination acts as a meta-classifier operating on the geometric distribution of these prediction vectors, where the proximity and angular separation of vectors reflect classifier agreement and diversity. This transformation allows geometric analysis of how scattered prediction points reduce classification ambiguity in the meta-space.[18]
The implications of this geometric framework underscore why diversity is crucial: by geometrically dispersing error regions and prediction vectors, ensembles minimize overlap in ambiguous zones, leading to sharper decision boundaries and lower overall variance in predictions. This reduction in geometric ambiguity directly correlates with improved accuracy, as diverse base learners collectively cover the input space more comprehensively than correlated ones.[44]
Optimizing Ensemble Size
One common approach to optimizing ensemble size involves analyzing learning curves, which plot ensemble accuracy or error against the number of base models (N). As N increases, the ensemble error typically decreases initially but eventually plateaus, indicating the point of diminishing returns where additional models contribute minimally to performance improvement.
For bagging ensembles, out-of-bag (OOB) estimation provides an efficient way to assess performance without separate validation data, allowing early stopping when OOB error stabilizes. The OOB error, computed for each sample using only the base models whose bootstrap sample did not contain it, serves as a nearly unbiased proxy for generalization, guiding the selection of optimal N by monitoring its convergence.
Theoretical guidance comes from generalization error bounds, which suggest stopping ensemble growth when further additions no longer reduce the error. For instance, in random forests, Breiman's upper bound on the generalization error is $PE^* \le \bar{\rho}\,(1 - s^2)/s^2$, where $s$ is the strength of the forest (the expected margin by which the trees vote for the correct class) and $\bar{\rho}$ is the mean correlation between trees. The bound depends on strength and correlation rather than on the number of trees, so once enough trees have been added for the error to converge, further growth yields no gain, and a high $\bar{\rho}$ limits what any ensemble size can achieve.
Diminishing returns in ensemble performance arise primarily from increasing correlation among base models, which reduces diversity and caps error reduction, while computational tradeoffs, such as training time scaling linearly with N, necessitate balancing accuracy against resource costs.
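Monitoring the OOB error until it stabilizes, as described above, might look like the following sketch. It relies on scikit-learn's warm_start and oob_score options for RandomForestClassifier; the dataset, the step of 25 trees, and the stopping thresholds (tolerance and patience) are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25, warm_start=True, oob_score=True, bootstrap=True, random_state=0
)

# Illustrative stopping rule: stop after `patience` consecutive steps
# in which the OOB error improves by less than `tolerance`.
tolerance, patience = 0.005, 2
history, stalled = [], 0

for n_trees in range(25, 301, 25):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)                     # warm_start adds trees instead of refitting
    oob_error = 1.0 - forest.oob_score_
    if history and history[-1][1] - oob_error < tolerance:
        stalled += 1
    else:
        stalled = 0
    history.append((n_trees, oob_error))
    if stalled >= patience:
        break

chosen_n, final_error = history[-1]
print(f"stopped at {chosen_n} trees, OOB error {final_error:.3f}")
```

Plotting the (n_trees, oob_error) pairs in history gives the learning curve discussed at the start of this section; the stopping rule simply automates reading the plateau off that curve.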
Empirically, guidelines recommend ensembles of 10-100 trees for random forests, with optimal size determined by monitoring validation loss or OOB error until it plateaus, as larger sizes often yield negligible improvements beyond this range.[4]
For stacking ensembles, advanced methods like Bayesian optimization treat ensemble size as a hyperparameter, efficiently searching the space to maximize predictive performance by modeling the objective function with Gaussian processes and balancing exploration and exploitation.[45]
Practical Implementation
Software Packages and Libraries
Scikit-learn, a popular Python library for machine learning, offers comprehensive implementations of ensemble methods through its sklearn.ensemble module. This includes classes such as BaggingClassifier and BaggingRegressor for bagging, VotingClassifier and VotingRegressor for majority/soft voting ensembles, StackingClassifier and StackingRegressor for stacking, RandomForestClassifier and RandomForestRegressor for random forests, and boosting algorithms like GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier, and HistGradientBoostingRegressor.[46] These tools support both classification and regression tasks; several offer parallel training through the n_jobs parameter, and the stacking estimators generate their meta-features with internal cross-validation.
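As a brief usage sketch of the classes named above, the following compares a soft-voting ensemble with a stacked ensemble built from the same base estimators. The synthetic dataset, the particular base estimators, and the fold counts are assumptions made for the example, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6, random_state=0)

# Named base estimators shared by both ensembles.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Soft voting averages predicted probabilities; stacking trains a
# logistic-regression meta-learner on 5-fold out-of-fold predictions.
voter = VotingClassifier(estimators=estimators, voting="soft")
stacker = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(), cv=5
)

# Mean 3-fold cross-validated accuracy of each ensemble.
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in [("soft voting", voter), ("stacking", stacker)]
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Both estimators clone the base models before fitting, so the same estimators list can be passed to each without the ensembles interfering with one another.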
For gradient boosting, specialized libraries provide optimized implementations with enhanced performance features. XGBoost, an extensible gradient boosting library, supports distributed computing and GPU acceleration, enabling faster training on large datasets through its tree-based ensemble approach.[47] LightGBM, developed by Microsoft, employs histogram-based algorithms for leaf-wise tree growth, achieving up to 20 times faster training than traditional gradient boosting while maintaining high accuracy, and includes GPU support. CatBoost, from Yandex, focuses on handling categorical features natively via ordered boosting, reducing overfitting and supporting GPU computations for scalable ensemble building.[48]
H2O.ai provides enterprise-grade AutoML capabilities with ensemble support, particularly through its Stacked Ensembles in the H2O-3 platform, which automatically combines multiple base models (e.g., GBM, random forests, deep learning) using meta-learners to optimize predictive performance.[49] In the R ecosystem, the randomForest package implements Breiman and Cutler's random forests algorithm for classification and regression, emphasizing out-of-bag error estimation and variable importance measures.[50] The gbm package extends Friedman's gradient boosting machine with support for various loss functions and interactions.[51] Additionally, the caret package offers a unified interface for training ensemble models, integrating methods like random forests and boosting via streamlined workflows.[52]
For deep learning ensembles, TensorFlow and Keras enable model averaging and snapshot ensembling by combining multiple neural network predictions, often through custom layers or the tf.keras API, to improve generalization in complex tasks.[53] Benchmarks of these libraries on UCI Machine Learning Repository datasets demonstrate that ensemble methods like random forests and gradient boosting typically achieve higher classification accuracies (e.g., 95-99% on balanced datasets) compared to single decision trees or linear models, highlighting their robustness to noise and overfitting.[54]
