Automated machine learning
from Wikipedia

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. It is the combination of automation and ML.[1]

AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. AutoML was proposed as an artificial intelligence-based solution to the growing challenge of applying machine learning.[2][3] The high degree of automation in AutoML aims to allow non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models.[4]

Common techniques used in AutoML include hyperparameter optimization, meta-learning and neural architecture search.

Comparison to the standard approach


In a typical machine learning application, practitioners have a set of input data points to be used for training. The raw data may not be in a form that all algorithms can be applied to. To make the data amenable for machine learning, an expert may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods. After these steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their model. If deep learning is used, the architecture of the neural network must also be chosen manually by the machine learning expert.
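These manual decision points can be made concrete with a brief, hypothetical scikit-learn sketch; the dataset, the choice of a support vector classifier, and the parameter grid below all stand in for choices an expert would make by hand:

# Illustrative manual workflow (assumed, not from the article): the
# practitioner hand-picks preprocessing, algorithm, and search grid.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Expert choices: standardize features, use an SVM, tune C and gamma.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))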

Each of these steps may be challenging, resulting in significant hurdles to using machine learning. AutoML aims to simplify these steps for non-experts, and to make it easier for them to use machine learning techniques correctly and effectively.

AutoML plays an important role within the broader approach of automating data science, which also includes challenging tasks such as data engineering, data exploration and model interpretation and prediction.[5]

Targets of automation


Automated machine learning can target various stages of the machine learning process,[3] including data preparation and preprocessing, feature engineering, model selection, hyperparameter optimization, and neural architecture search.

Challenges and limitations


There are a number of key challenges being tackled in automated machine learning. A major issue is what has been called "development as a cottage industry".[7] The phrase refers to the reliance of machine learning development on the manual decisions and biases of experts, in contrast to the goal of creating systems that learn and improve from their own usage and analysis of the data. At its core, this is a tension between how much experts should intervene in a system's learning and how much freedom the machine should be given. Experts and developers must nonetheless help create and guide these systems toward autonomous learning, which requires labor-intensive work and knowledge of machine learning algorithms and system design.[8]

Other challenges include meta-learning[9] and computational resource allocation.

from Grokipedia
Automated machine learning (AutoML) is a subfield of machine learning focused on automating the end-to-end process of developing machine learning models, including tasks such as data preprocessing, feature engineering, algorithm selection, hyperparameter optimization, neural architecture search, and model evaluation, to generate high-performance configurations in a data-driven manner. This automation addresses the complexities of traditional workflows, which often require extensive domain expertise and manual tuning. The primary goals of AutoML are to achieve superior model performance, such as higher accuracy or better generalization on unseen data, while minimizing the time and resources needed for model development, thereby making advanced machine learning accessible to non-experts across various domains such as healthcare. By tackling the "combined algorithm selection and hyperparameter optimization" (CASH) problem, AutoML systems evaluate vast combinations of pipelines through systematic search strategies, often outperforming hand-crafted models in benchmark tasks. Its emergence stems from the rapid proliferation of machine learning techniques in the 2010s, which outpaced the ability of practitioners to manually configure them effectively.

Key components of AutoML frameworks include the search space, which defines possible algorithms, hyperparameters, and architectures to explore; the search strategy, employing methods like Bayesian optimization, evolutionary algorithms, or reinforcement learning to navigate this space efficiently; and performance evaluation, using techniques such as cross-validation or multi-fidelity approximations to assess model quality without exhaustive computation. Additional elements encompass data management automation and ensembling to combine multiple models for improved robustness.

Historically, AutoML traces its roots to foundational work on algorithm selection by John Rice in 1976, but modern developments accelerated with the 2013 release of Auto-WEKA, which automated algorithm selection and hyperparameter tuning for WEKA models, followed by auto-sklearn in 2015, which extended these capabilities to scikit-learn pipelines. The field gained momentum in 2017 with neural architecture search (NAS) methods, such as those by Zoph and Le, which used reinforcement learning to design deep neural networks, though initial approaches demanded substantial computational resources such as hundreds of GPUs running for weeks.

As of 2025, AutoML has matured into a vibrant ecosystem with open-source tools like TPOT, Auto-PyTorch, and AutoGluon, alongside commercial platforms such as Google Cloud AutoML and H2O.ai, supporting diverse data types including tabular, image, and text. Advancements emphasize efficiency through meta-learning, surrogate models, and benchmarks like NAS-Bench-301, which evaluate millions of architectures to guide reproducible research and deployment. These systems continue to evolve, integrating with large foundation models and generative AI to further democratize AI applications.
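The CASH problem mentioned above is commonly formalized, following the Auto-WEKA line of work, as a joint minimization over the algorithm choice and its hyperparameters; a standard statement of the objective (using k cross-validation folds) is:

% CASH objective: choose algorithm A^{(j)} from the set \mathcal{A} and
% hyperparameters \lambda from its space \Lambda^{(j)} to minimize the
% average loss \mathcal{L} over k train/validation splits.
A^{*}_{\lambda^{*}} \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A^{(j)}_{\lambda},\, D_{\mathrm{train}}^{(i)},\, D_{\mathrm{valid}}^{(i)}\right)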

Overview and Fundamentals

Definition and Scope

Automated machine learning (AutoML) encompasses a suite of techniques designed to automate the end-to-end process of applying machine learning to real-world problems, including data preparation, feature engineering, model choice, hyperparameter tuning, and deployment, thereby reducing the reliance on expert intervention. This automation addresses the labor-intensive nature of traditional workflows, where practitioners manually handle numerous design decisions that can significantly impact model performance. By streamlining these stages, AutoML democratizes access to machine learning, allowing domain experts in fields like healthcare to build effective models without extensive programming or algorithmic knowledge.

The scope of AutoML ranges from narrow implementations that target isolated components, such as hyperparameter tuning for a predefined model, to broader systems that orchestrate the full pipeline from raw data ingestion to production deployment. Narrow AutoML focuses on gains in specific optimization tasks, often using methods like grid search or random sampling, while full-pipeline AutoML integrates all pipeline elements to produce deployable solutions autonomously. This distinction highlights AutoML's flexibility, adapting to scenarios where partial automation suffices versus those demanding comprehensive hands-off operation.

Effective use of AutoML presupposes basic familiarity with machine learning fundamentals, including the distinction between supervised learning, which trains models on labeled data to predict specific outcomes like classifications or regressions, and unsupervised learning, which identifies inherent structures or patterns in unlabeled data through techniques like clustering. Users must articulate the problem type and provide suitable datasets, but AutoML handles the intricate configurations thereafter, assuming only this foundational understanding to ensure appropriate task formulation and result interpretation.

A pivotal milestone in AutoML's development was the 2015 ChaLearn AutoML Challenge, organized by the ChaLearn Looks at People initiative in collaboration with the International Joint Conference on Neural Networks (IJCNN), which sought to benchmark systems capable of solving diverse classification and regression problems without any human intervention. Featuring six progressive rounds with 30 real-world datasets from diverse domains, the challenge emphasized time-constrained automation and introduced standardized evaluation metrics, fostering advancements in end-to-end pipelines.

Historical Development

The roots of automated machine learning (AutoML) trace back to 1976, with foundational work on algorithm selection by John Rice, though early efforts in the 1990s focused on hyperparameter tuning methods such as grid search to systematically evaluate combinations of model parameters for improved performance. These techniques addressed the challenge of selecting optimal settings for algorithms, laying groundwork for automation in model configuration. By the early 2000s, meta-learning emerged as a key concept, enabling systems to learn from prior tasks to inform algorithm selection and hyperparameter choices on new problems, thus reducing manual intervention. This period marked the initial shift toward more intelligent, data-driven automation in machine learning workflows.

The 2010s brought pivotal advancements through organized challenges and integrated tools that popularized AutoML. The ChaLearn AutoML challenges, launched in 2014 and culminating in a major competition in 2015, stimulated research by evaluating fully automatic, black-box systems for classification and regression tasks without human input, fostering benchmarks for end-to-end automation. In 2013, Auto-WEKA was introduced as an extension of the WEKA toolkit, automating algorithm selection and hyperparameter tuning via Bayesian methods and making them accessible to non-experts, with version 2.0 released in 2016. Concurrently, auto-sklearn debuted in 2015, building on scikit-learn and incorporating meta-learning for pipeline construction; it achieved top performance in the 2015 ChaLearn challenge by adapting pipelines based on historical dataset performances.

From 2018 to 2020, AutoML experienced a surge in commercial and open-source adoption, driven by scalable frameworks. Google launched Cloud AutoML in 2018, providing cloud-based tools for custom model training in vision, natural language, and translation, aimed at broadening AI accessibility beyond specialists. The Tree-based Pipeline Optimization Tool (TPOT), gaining prominence around this time, used genetic programming to evolve machine learning pipelines, offering an open-source alternative for optimizing complex workflows. Post-2020 developments integrated AutoML with deep learning, exemplified by Google's 2020 AutoML-Zero paper, which employed evolutionary algorithms and neural architecture search (NAS) to evolve complete algorithms from basic mathematical primitives, demonstrating competitive performance on simple benchmarks.

By the 2020s, the proliferation of cloud computing and big data had significantly accelerated AutoML adoption, enabling scalable processing of massive datasets and democratizing access through platforms like AWS SageMaker and Azure AutoML. This synergy has reduced barriers for enterprises and led to widespread integration into production environments for faster model deployment. As of 2025, the AutoML market is projected to grow by USD 13,531.2 million by 2029, expanding at a CAGR of 44.8%.

Comparison to Manual Machine Learning

Traditional Workflow

The traditional workflow in machine learning involves a sequential, manual process that requires substantial expertise from data scientists and domain specialists to develop predictive models. This process begins with problem formulation, where practitioners define the objectives, such as classification or regression tasks, and identify relevant metrics for success. Following this, data collection gathers raw data from various sources like databases or sensors, ensuring it aligns with the problem scope. Data cleaning and preprocessing then address issues such as missing values, outliers, and inconsistencies through techniques like imputation or normalization, a step that often demands careful judgment to avoid introducing bias.

Feature engineering follows, where domain knowledge is applied to create or select informative variables, such as deriving ratios from raw attributes or encoding categorical data; this phase is particularly labor-intensive, frequently requiring weeks of effort from experts to craft effective representations. Subsequently, model selection involves choosing algorithms, such as linear models for simple cases or decision trees for more complex ones, based on the problem type and data characteristics. Hyperparameter tuning refines model settings, often via manual grid search or trial-and-error, to optimize performance. Model training fits the selected algorithm to the prepared data, followed by validation through cross-validation or hold-out sets to assess generalization and detect overfitting. Finally, deployment integrates the trained model into production environments, such as web services, with ongoing monitoring for drift.

This manual approach demands deep expertise in statistics, programming, and the problem domain, with each step potentially consuming days to months depending on complexity and scale. Common tools for implementing these workflows include the scikit-learn library in Python, which supports custom pipelines for chaining preprocessing, modeling, and evaluation steps without built-in automation. For instance, in a simple pipeline for predicting house prices, a practitioner might manually collect data on features like size and location, clean outliers in price values, engineer a new ratio feature, select a random forest regressor, tune its number of trees via repeated experiments, validate on a held-out test set, and deploy the model as a script for real-time predictions.
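A compressed version of that house-price example might look as follows in scikit-learn; the file name and column names are hypothetical, and a ratio of input attributes is engineered (rather than price per square foot, which would leak the target):

# Hypothetical manual workflow for house-price prediction.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("houses.csv")                        # data collection
df = df[df["price"] < df["price"].quantile(0.99)]     # manual outlier removal
df["beds_per_sqft"] = df["bedrooms"] / df["size"]     # hand-crafted feature

X = df[["size", "bedrooms", "beds_per_sqft"]]
y = df["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

best = None
for n in (100, 300, 500):                             # trial-and-error tuning
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    mae = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    if best is None or mae < best[0]:
        best = (mae, n)
print(best)                                           # chosen tree count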

Key Differences and Advantages

Automated machine learning (AutoML) fundamentally differs from manual machine learning in its approach to pipeline construction and optimization. Manual processes rely on expert-driven iteration, where practitioners manually select algorithms, engineer features, and tune hyperparameters through trial-and-error, often requiring domain-specific knowledge and extensive experimentation that can span days or weeks for complex tasks. In contrast, AutoML employs systematic, data-driven search strategies, such as Bayesian optimization, meta-learning, and ensemble construction, to automate these steps, enabling objective decisions without deep human intervention and typically completing tuning in hours rather than days.

A primary advantage of AutoML is its democratization of machine learning, allowing non-experts to achieve competitive results by abstracting away technical complexities and providing accessible interfaces for model building. This lowers barriers for practitioners in fields like healthcare, where ML expertise may be limited. Additionally, AutoML accelerates development cycles, with benchmarks showing speedups of up to 10 times over manual methods through techniques like predictive termination and cell-based search. Reproducibility is enhanced via automated logging of search processes and configurations, ensuring consistent outcomes across runs and teams.

Quantitative studies underscore these benefits; for instance, analyses from the AutoML community report up to an 80% reduction in engineering time for model design compared to traditional workflows. Tools like auto-sklearn have demonstrated performance improvements, such as 10% or greater reductions in cross-validation error on multiple datasets, while matching or exceeding manually tuned models in accuracy. However, comparisons reveal limitations: AutoML often incurs higher initial computational costs due to exhaustive searches over large configuration spaces, which can demand significant GPU or cloud resources, unlike the more targeted efforts in manual tuning.
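For contrast with the manual sketch earlier, a minimal sketch of the automated side of this comparison, written against the open-source auto-sklearn interface (the time budgets here are arbitrary placeholders):

# AutoML counterpart: auto-sklearn searches algorithms, preprocessing,
# and hyperparameters within a fixed time budget, then ensembles the
# best pipelines it found.
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,  # total search budget (seconds)
    per_run_time_limit=60,        # cap per candidate pipeline
)
automl.fit(X_tr, y_tr)
print(automl.score(X_te, y_te))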

Core Components of Automation

Data Preprocessing and Feature Engineering

Automated machine learning (AutoML) systems automate data preprocessing to handle common issues efficiently, reducing manual intervention in preparing datasets for model training. This includes automated imputation for missing values, where methods such as mean or median filling are applied alongside more advanced techniques like generative adversarial imputation networks (GAIN) or variational autoencoders (VAEs) integrated into tools like HyperImpute, which uses AutoML to select optimal imputation strategies based on dataset characteristics. Scaling operations, such as standardization and normalization, are similarly automated to ensure features are on comparable scales, often as part of pipeline optimization in frameworks like Auto-sklearn, which incorporates rescaling as one of its four core data preprocessing methods. Categorical encoding is handled through techniques like one-hot encoding or learned embeddings, with tools such as H2O AutoML applying these transformations automatically during featurization to convert non-numeric data into model-compatible formats.

Feature engineering in AutoML extends this automation to the creation and refinement of input features, enabling the generation of new variables that capture complex relationships in the data. Automatic feature generation includes operations like polynomial expansions and interaction terms; for instance, Auto-sklearn employs polynomial feature expansion and random kitchen sinks (a kernel approximation method) among its 14 feature preprocessing techniques to construct higher-order features without user specification. Tools like Featuretools use deep feature synthesis to automatically produce aggregated features from relational datasets, applying transformations such as sums, means, and counts across temporal or hierarchical structures.

Automated feature selection further streamlines engineering by identifying the most relevant features, often using wrapper or filter methods integrated into the AutoML pipeline. Recursive feature elimination (RFE), which iteratively removes the least important features based on model performance, is commonly employed to reduce dimensionality while preserving predictive accuracy. Entropy-based selection, such as mutual information scoring, evaluates feature relevance by measuring dependency between features and the target variable; this metric is utilized in approaches like SAFE (Synthesis of High-Quality Features) to prioritize features that maximize information gain during automated construction. These methods collectively enhance dataset quality and model efficiency, with evaluation often relying on cross-validated performance metrics to ensure selected features improve downstream tasks.
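A minimal sketch of this kind of pipeline, assembled from scikit-learn building blocks (the column names are hypothetical, and an AutoML system would choose these components automatically rather than by hand):

# Automated-style preprocessing and feature selection, composed manually
# here for illustration.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric, categorical = ["age", "income"], ["city"]

preprocess = ColumnTransformer([
    # numeric columns: mean imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # categorical columns: most-frequent imputation, then one-hot encoding
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

pipe = Pipeline([
    ("prep", preprocess),
    # filter method: keep the features with highest mutual information
    ("filter", SelectKBest(mutual_info_classif, k=5)),
    # wrapper method: recursive feature elimination down to 3 features
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X, y) runs imputation, scaling, encoding, and both selection
# stages before training the final classifier.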

Model Selection and Hyperparameter Tuning

In automated machine learning (AutoML), model selection involves systematically searching over a diverse set of algorithms, such as decision trees, support vector machines (SVMs), and neural networks, to identify the most suitable base learner for a given dataset. This process leverages meta-learning techniques, where prior performance data from similar tasks inform the initial selection, reducing the search space and accelerating convergence to high-performing models. For instance, Auto-sklearn employs meta-learning to warm-start the configuration process by recommending promising algorithm combinations based on dataset characteristics, or meta-features (e.g., the number of instances and features).

Hyperparameter tuning in AutoML extends this automation by optimizing the configuration parameters of selected models, which are defined within a search space encompassing both continuous variables (e.g., learning rates or regularization strengths) and discrete choices (e.g., kernel types in SVMs or the number of hidden layers in neural networks). The search space is typically constructed by enumerating all possible combinations of algorithms, preprocessors, and their respective hyperparameters, forming a combinatorial landscape that manual tuning would explore inefficiently. Search strategies range from random sampling for baseline exploration to more informed methods that iteratively refine candidates based on validation performance, ensuring robustness across varying dataset sizes and complexities.

AutoML integrates model selection and hyperparameter tuning into broader pipelines by combining them with data preprocessing steps through stacked generalization, where multiple candidate pipelines are evaluated and their outputs are ensembled via a meta-learner to produce a final prediction. This approach, as implemented in systems like Auto-sklearn, stacks base models trained on preprocessed data (e.g., after scaling or imputation) to enhance generalization and mitigate overfitting, creating end-to-end workflows that automate the transition from raw data to deployable models. Following data preprocessing, which handles cleaning and transformation, this integration ensures seamless algorithmic optimization without manual intervention.

Benchmarks on platforms like OpenML demonstrate that AutoML systems achieve performance comparable to or exceeding that of human experts on standard tasks. In evaluations across 12 popular OpenML datasets, automated frameworks outperformed the community in 7 cases, particularly on tabular data where ensemble-based selections excelled, highlighting the practical efficacy of these automated processes.
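A toy version of this joint search can be expressed with scikit-learn alone, treating the final pipeline step itself as a searchable choice (the candidate algorithms and grids are illustrative):

# Tiny CASH-style search: the "clf" step is swapped between candidate
# algorithms, each paired with its own hyperparameter grid.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

search_space = [
    {"clf": [SVC()],
     "clf__C": [0.1, 1, 10], "clf__kernel": ["rbf", "linear"]},
    {"clf": [RandomForestClassifier()],
     "clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]},
]
search = GridSearchCV(pipe, search_space, cv=5)
# After search.fit(X, y), search.best_params_ identifies both the winning
# algorithm and its configuration; real AutoML systems navigate far larger
# spaces with Bayesian or evolutionary strategies instead of full enumeration.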

Techniques and Algorithms

Optimization Methods

Optimization methods in automated machine learning (AutoML) primarily address the challenge of efficiently searching large configuration spaces to identify optimal hyperparameters for machine learning models. These methods are essential for balancing computational cost with performance gains, as evaluating each configuration can be expensive due to training times. Baselines like grid search and random search provide straightforward approaches, while advanced techniques such as Bayesian optimization offer greater sample efficiency by modeling the objective function and guiding the search strategically. Multi-fidelity optimization further enhances efficiency by approximating performance at varying resource levels.

Grid search exhaustively evaluates all combinations from a predefined grid of hyperparameter values, ensuring complete coverage but suffering from the curse of dimensionality in high-dimensional spaces. This method requires an exponential number of evaluations as the number of hyperparameters and their ranges increase, making it impractical for complex models. Random search, in contrast, samples hyperparameters uniformly at random from the search space, which proves more effective than grid search because hyperparameter importance varies by dataset, and the response surface often exhibits low effective dimensionality, meaning only a subset of hyperparameters significantly influences performance. As a result, random search allocates trials more evenly across relevant subspaces, outperforming grid search by finding better configurations with the same budget; for instance, on some tuning tasks, random search achieves superior results after 32 trials compared to 100 for grid search.

Bayesian optimization builds on these baselines by constructing a probabilistic surrogate model of the objective function, typically a Gaussian process (GP), to predict performance and uncertainty for unevaluated configurations. The GP prior assumes a mean function and covariance kernel, such as the automatic relevance determination (ARD) Matérn 5/2 kernel, which relaxes the strong smoothness assumptions of the common squared-exponential kernel while remaining twice differentiable:

K_{M5/2}(x, x') = \theta_0 \prod_{d=1}^{D} \left(1 + \sqrt{5} \left| \frac{x_d - x'_d}{\theta_d} \right| + \frac{5}{3} \left( \frac{x_d - x'_d}{\theta_d} \right)^2 \right) \exp\left( -\sqrt{5} \left| \frac{x_d - x'_d}{\theta_d} \right| \right),

where \theta_0 is the covariance amplitude and \theta_d is the characteristic length scale for dimension d.
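The kernel above translates directly into code; a small numpy sketch (the parameter values are arbitrary) computes the covariance between two hyperparameter configurations:

# ARD Matérn 5/2 kernel: theta0 is the amplitude, theta[d] the length
# scale for dimension d, matching the formula above.
import numpy as np

def matern52_ard(x, xp, theta0, theta):
    r = np.abs(x - xp) / theta  # scaled per-dimension distances
    factors = (1 + np.sqrt(5) * r + (5.0 / 3.0) * r**2) * np.exp(-np.sqrt(5) * r)
    return theta0 * np.prod(factors)

# Covariance between two 3-dimensional configurations.
k = matern52_ard(np.array([0.1, 1.0, 5.0]),
                 np.array([0.2, 0.8, 5.0]),
                 theta0=1.0, theta=np.array([0.5, 1.0, 2.0]))
print(k)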