Hyperparameter (machine learning)
from Wikipedia

In machine learning, a hyperparameter is a parameter that can be set in order to define any configurable part of a model's learning process. Hyperparameters can be classified as either model hyperparameters (such as the topology and size of a neural network) or algorithm hyperparameters (such as the learning rate and the batch size of an optimizer). These are named hyperparameters in contrast to parameters, which are characteristics that the model learns from the data.

Hyperparameters are not required by every model or algorithm. Some simple algorithms, such as ordinary least squares regression, require none, whereas LASSO, which adds a regularization hyperparameter to ordinary least squares, requires it to be set before training.[1] Even models and algorithms without a strict requirement to define hyperparameters may not produce meaningful results unless these are carefully chosen. Optimal hyperparameter values, however, are not always easy to predict: some hyperparameters may have no meaningful effect, and one important variable may be conditional upon the value of another. A separate process of hyperparameter tuning is therefore often needed to find a suitable combination for the data and task.

As well as improving model performance, hyperparameters can be used by researchers to introduce robustness and reproducibility into their work, especially if it uses models that incorporate random number generation.

Considerations

The time required to train and test a model can depend upon the choice of its hyperparameters.[2] A hyperparameter is usually of continuous or integer type, leading to mixed-type optimization problems.[2] The existence of some hyperparameters is conditional upon the value of others, e.g. the size of each hidden layer in a neural network can be conditional upon the number of layers.[2]

Difficult-to-learn parameters

The validation objective is typically not differentiable with respect to hyperparameters. As a result, in most instances hyperparameters cannot be learned using the gradient-based optimization methods (such as gradient descent) that are commonly employed to learn model parameters. Such hyperparameters describe aspects of the model or training procedure that cannot be adjusted by common optimization methods yet still affect the loss function; an example is the error-tolerance hyperparameter in support vector machines.

Untrainable parameters

Sometimes, hyperparameters cannot be learned from the training data because they aggressively increase the capacity of a model and can push the loss function to an undesired minimum (overfitting to the data), as opposed to correctly mapping the richness of the structure in the data. For example, if we treat the degree of a polynomial equation fitting a regression model as a trainable parameter, the degree would increase until the model perfectly fit the data, yielding low training error, but poor generalization performance.
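The following minimal sketch (using NumPy, with an illustrative sine-plus-noise dataset and an arbitrary train/test split) shows this effect: as the polynomial degree grows, training error keeps shrinking while held-out error eventually stops improving or worsens.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy targets

# Random split so held-out points are interleaved with the training points.
idx = rng.permutation(x.size)
train, test = idx[:20], idx[20:]

for degree in (1, 3, 9, 15):  # the polynomial degree is treated here as a hyperparameter
    poly = np.polynomial.Polynomial.fit(x[train], y[train], degree)
    train_mse = np.mean((poly(x[train]) - y[train]) ** 2)
    test_mse = np.mean((poly(x[test]) - y[test]) ** 2)
    # Training error shrinks as the degree grows; test error typically stops improving.
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```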

Tunability

Most performance variation can be attributed to just a few hyperparameters.[3][2][4] The tunability of an algorithm, hyperparameter, or interacting hyperparameters is a measure of how much performance can be gained by tuning it.[5] For an LSTM, the learning rate, followed by the network size, is its most crucial hyperparameter,[6] whereas batching and momentum have no significant effect on its performance.[7]

Although some research has advocated the use of mini-batch sizes in the thousands, other work has found the best performance with mini-batch sizes between 2 and 32.[8]

Robustness

The inherent stochasticity of learning implies that the empirical performance of a hyperparameter configuration is not necessarily its true performance.[2] Methods that are not robust to simple changes in hyperparameters, random seeds, or even different implementations of the same algorithm cannot be integrated into mission-critical control systems without significant simplification and robustification.[9]

Reinforcement learning algorithms, in particular, require measuring their performance over a large number of random seeds, and also measuring their sensitivity to choices of hyperparameters.[9] Their evaluation with a small number of random seeds does not capture performance adequately due to high variance.[9] Some reinforcement learning methods, e.g. DDPG (Deep Deterministic Policy Gradient), are more sensitive to hyperparameter choices than others.[9]

Optimization

Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given test data.[2] The objective function takes a tuple of hyperparameters and returns the associated loss.[2] Typically these methods are not gradient based, and instead apply concepts from derivative-free optimization or black box optimization.
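As a minimal illustration of this black-box view, the sketch below runs a random search over a hypothetical two-dimensional hyperparameter space; the `validation_loss` function is a stand-in for an actual train-and-evaluate run, not part of any specific library.

```python
import random

def validation_loss(hyperparams):
    """Black-box objective: train a model with `hyperparams` and return its
    validation loss. A simple quadratic stands in for a real training run here."""
    lr, reg = hyperparams
    return (lr - 0.05) ** 2 + (reg - 1.0) ** 2  # hypothetical surrogate objective

search_space = {
    "lr":  lambda: 10 ** random.uniform(-4, -1),  # log-uniform learning rate
    "reg": lambda: random.uniform(0.0, 10.0),     # uniform regularization strength
}

best, best_loss = None, float("inf")
for _ in range(100):  # fixed evaluation budget; no gradients are used
    candidate = (search_space["lr"](), search_space["reg"]())
    loss = validation_loss(candidate)
    if loss < best_loss:
        best, best_loss = candidate, loss

print(best, best_loss)
```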

Reproducibility

Apart from tuning hyperparameters, machine learning involves storing and organizing the parameters and results, and making sure they are reproducible.[10] In the absence of a robust infrastructure for this purpose, research code often evolves quickly and compromises essential aspects like bookkeeping and reproducibility.[11] Online collaboration platforms for machine learning go further by allowing scientists to automatically share, organize and discuss experiments, data, and algorithms.[12] Reproducibility can be particularly difficult for deep learning models.[13] For example, research has shown that deep learning results can depend heavily even on the choice of random seed for the random number generator.[14]

from Grokipedia
In machine learning, hyperparameters are external configuration settings that must be specified before training a model. They govern aspects such as the model's architecture, learning dynamics, and generalization behavior, in contrast to internal parameters that are optimized directly from the training data. These settings influence how the algorithm processes data and learns patterns, directly impacting predictive performance on unseen examples. For instance, in neural networks, hyperparameters include the learning rate, which determines the step size of parameter updates.

The importance of hyperparameters stems from their role in balancing bias and variance: suboptimal choices can lead to underfitting or overfitting, while effective tuning can yield substantial improvements in accuracy and efficiency. Algorithms such as support vector machines (SVMs) rely on hyperparameters like the regularization parameter C and kernel coefficient gamma to control the trade-off between margin maximization and classification error, often requiring careful selection to achieve good generalization. Similarly, ensemble methods such as random forests use hyperparameters like the number of trees (ntree) and the number of features sampled per split (mtry), which can enhance robustness but demand validation to avoid computational overhead. Research shows that tuning can improve area under the curve (AUC) metrics by up to 0.069 for elastic net models and 0.056 for SVMs across diverse datasets, underscoring the impact of tuning.

Hyperparameter optimization typically involves systematic search strategies, as these values cannot be learned from training data alone without risking overfitting. Common approaches include grid search, which exhaustively evaluates combinations on a predefined grid; random search, which samples values from distributions and often outperforms grid search in high-dimensional spaces; and Bayesian optimization, which uses probabilistic models to explore promising regions based on prior evaluations. Advanced techniques, such as gradient-based methods for reversible learning, further enable efficient tuning of continuous hyperparameters like learning rates by computing gradients through the training process. Overall, effective hyperparameter management remains a critical challenge in deploying machine learning models, driving ongoing research into automated and scalable optimization frameworks.

Fundamentals

Definition

In machine learning, a hyperparameter is a configuration setting that is specified externally before the training phase and that influences the learning algorithm's behavior, such as determining the step size in optimization or the complexity of the architecture. These settings are essential for tailoring the model's capacity to the task at hand, enabling effective learning from data without being derived from the training process itself. The concept draws on Bayesian statistics, where the term originally described parameters of prior distributions, but it gained prominence in machine learning during the 1990s with the emergence of kernel methods and support vector machines (SVMs). In SVMs, for instance, the regularization parameter C, introduced in foundational work on optimal margin classifiers, exemplifies a hyperparameter by balancing model complexity against fitting errors. Early efforts in automatic parameter selection, such as cross-validated tuning for decision tree algorithms like C4.5, further highlighted the role of hyperparameters in practical model development during this period. As meta-parameters, hyperparameters govern algorithmic aspects such as convergence speed and generalization, bridging model design and empirical validation in the training workflow.

Examples

Hyperparameters in machine learning encompass a diverse set of configuration settings that guide the training process across different algorithms. In supervised learning, a prominent example is the learning rate $\alpha$ used in gradient descent optimization, which determines the step size for updating model parameters according to the rule $\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)$, where $\theta_t$ represents the parameters at iteration $t$ and $\nabla J(\theta_t)$ is the gradient of the objective function. This hyperparameter balances convergence speed and stability, with values typically tuned within small positive ranges to avoid overshooting minima.

In unsupervised learning, the number of clusters $k$ in the k-means algorithm serves as a key hyperparameter, specifying the number of groups into which the data points are partitioned to minimize within-cluster variance. The choice of $k$ directly influences the granularity of the clustering structure and is often selected using heuristics such as the elbow method or validation metrics to capture meaningful patterns in unlabeled data.

Deep learning introduces hyperparameters such as the dropout rate $p$, a regularization setting that randomly deactivates a fraction $p$ of neurons during training to prevent overfitting in neural networks. Commonly set between 0.2 and 0.5, this value helps maintain generalization by simulating an ensemble of thinner networks, with higher rates applied to larger hidden layers for more aggressive regularization.

Some hyperparameters are conditional, meaning their relevance or range depends on the values of other hyperparameters; for instance, the number of layers in a neural network architecture is only applicable if the model depth is treated as variable, allowing flexible exploration of deeper or shallower configurations in architecture search. Hyperparameters also vary by type, including continuous ones such as the learning rate, which can take any real value within a bounded interval, and discrete ones such as the maximum depth in decision trees, which is restricted to integers to control model complexity and avoid excessive splitting. These distinctions affect the search strategies employed during tuning, with continuous types enabling fine-grained adjustments and discrete types requiring enumeration over finite options.
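A minimal sketch of the gradient-descent rule above, on a synthetic least-squares problem, makes the split explicit: the learning rate and step count are fixed before training, while the parameter vector is what the loop updates (the data and settings here are illustrative).

```python
import numpy as np

def grad_J(theta, X, y):
    """Gradient of the least-squares objective J(theta) = 0.5 * ||X theta - y||^2 / n."""
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 0.1          # learning rate: chosen before training and never updated by it
n_steps = 200        # another algorithm hyperparameter
theta = np.zeros(3)  # model parameters: these are what training adjusts

for _ in range(n_steps):
    theta = theta - alpha * grad_J(theta, X, y)  # theta_{t+1} = theta_t - alpha * grad J(theta_t)

print(theta)  # approaches the true weights; changing alpha changes speed and stability
```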

Distinctions

From Model Parameters

In machine learning models, parameters refer to the internal variables that are learned directly from the training data through optimization algorithms such as gradient descent, so as to fit the observed patterns. These parameters define the specific mapping from inputs to outputs once trained. For instance, in linear regression, the weights $w$ in the equation $y = wx + b$ serve as model parameters, adjusted iteratively to minimize the difference between predicted and actual values on the training set.

Hyperparameters, by contrast, are external configuration settings provided before training begins, which govern the overall architecture, capacity, and learning dynamics of the model without being updated during the core optimization process. Unlike model parameters, hyperparameters are not refined via backpropagation or similar methods that propagate errors through the model; instead, they remain fixed throughout the training of any given model instance. This separation ensures that hyperparameters shape high-level decisions, such as the complexity of the architecture or the intensity of regularization, influencing how effectively the model parameters can be learned.

A fundamental difference in their roles emerges during training and tuning: model parameters are optimized specifically to reduce the loss on training data, potentially leading to overfitting if unconstrained, while hyperparameters are adjusted using a held-out validation set to promote generalization and mitigate such risks. This validation-based tuning helps select configurations that yield models with strong performance on unseen data, distinct from the data-fitting focus of parameter optimization. The interplay is often formalized in the loss function $J(\theta; \lambda)$, where $\theta$ represents the model parameters optimized within $J$, and $\lambda$ denotes the hyperparameters that define the structure or penalties in $J$ itself.
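A small sketch using ridge regression illustrates this split under the stated formalism: the weights are fit on the training data for each fixed regularization strength, and the regularization strength itself is chosen on a held-out validation set (the synthetic data and candidate values are illustrative).

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize J(w; lam) = ||Xw - y||^2 + lam * ||w||^2 in closed form.
    `lam` is a hyperparameter: it shapes J but is never updated by this fit."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(80, 5)), rng.normal(size=(20, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y_train = X_train @ w_true + rng.normal(scale=0.5, size=80)
y_val = X_val @ w_true + rng.normal(scale=0.5, size=20)

# Model parameters w are fit on training data; the hyperparameter lam is
# selected on the held-out validation set, as described above.
scores = {}
for lam in (0.01, 0.1, 1.0, 10.0):
    w = fit_ridge(X_train, y_train, lam)
    scores[lam] = np.mean((X_val @ w - y_val) ** 2)

print(min(scores, key=scores.get), scores)
```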

From Statistics

In statistical modeling, parameters represent unknown quantities that are estimated from observed data to describe the underlying distribution or process. For instance, in a Gaussian distribution, the mean $\mu$ and variance $\sigma^2$ are statistical parameters inferred via methods such as maximum likelihood estimation to fit the model to the data.

In Bayesian statistics, hyperparameters serve as parameters of the prior distributions placed on these statistical parameters, guiding the inference process without being directly estimated from the data. For example, in Dirichlet process mixture models, the precision parameter $\alpha$ (or $\tau$) acts as a hyperparameter that controls the concentration of the prior $G \sim DP(\alpha, G_0)$, influencing the expected number of mixture components and the smoothing of the mixing distribution $G$ from which the component parameters (e.g., means $\theta_j$ and variances $v_j$) are drawn. This setup allows $\alpha$ to specify the flexibility of the model family, while the statistical parameters within each component are inferred a posteriori given the data and prior.

The key distinction lies in their roles: statistical parameters are inferred quantities tailored to the specific dataset, enabling the model to adapt to observed patterns, whereas hyperparameters define the broader generative structure or prior beliefs about those parameters, remaining fixed or tuned externally to select the appropriate model class. In this framework, hyperparameters operate at a meta-level, specifying the assumptions under which statistical parameters are estimated, rather than being part of the core likelihood. This separation is analogous to the broader differentiation from model parameters in non-Bayesian settings, where hyperparameters configure the learning algorithm itself.

Historically, the concept appears in early statistical approaches such as the naive Bayes classifier, where the smoothing parameter $\alpha$ functions as a hyperparameter controlling the strength of the uniform prior on feature probabilities, mitigating zero-probability issues in categorical data estimation. Introduced to enhance robustness of such estimates, $\alpha$ (often set to 1 for Laplace smoothing) adjusts the effective prior influence on the statistical parameters representing class-conditional probabilities, a practice that underscored the need for hyperparameter specification in statistical models predating modern machine learning.
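As a concrete illustration of the smoothing example, the sketch below applies additive (Laplace) smoothing to raw category counts; the counts are made up, and the helper function is hypothetical rather than drawn from any library.

```python
import numpy as np

def smoothed_category_probs(counts, alpha=1.0):
    """Estimate categorical probabilities with additive (Laplace) smoothing.
    `alpha` acts as a hyperparameter: it encodes a symmetric prior pseudo-count
    fixed by the user, while the probabilities themselves are the statistical
    parameters estimated from the data."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

word_counts = [5, 0, 2]  # one category was never observed
print(smoothed_category_probs(word_counts, alpha=0.0))  # unsmoothed: a zero probability survives
print(smoothed_category_probs(word_counts, alpha=1.0))  # Laplace smoothing removes the zero
```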

Optimization

Search Methods

Search methods for hyperparameter optimization encompass systematic strategies to explore the hyperparameter space, focusing on exhaustive and probabilistic sampling techniques to identify configurations that yield optimal model performance. These methods evaluate candidate hyperparameters by training models and assessing their efficacy on validation data, balancing exploration breadth with computational feasibility. Foundational approaches prioritize simplicity and reliability, forming the basis for more sophisticated techniques.

Grid search represents the most straightforward exhaustive method, systematically evaluating all possible combinations from a predefined discrete grid of hyperparameter values. For instance, if tuning the learning rate across [0.01, 0.1] and the batch size across [32, 64], grid search would test four configurations: (0.01, 32), (0.01, 64), (0.1, 32), and (0.1, 64). This approach ensures comprehensive coverage within the specified grid but becomes impractical as the number of hyperparameters or values per hyperparameter increases, due to its exponential scaling with dimensionality. Specifically, for d values per dimension across n dimensions, the computational cost is O(d^n), rendering it inefficient for high-dimensional spaces.

Random search, in contrast, mitigates the curse of dimensionality by sampling random combinations from the hyperparameter space rather than enumerating all possibilities. This method draws a fixed number of samples (often comparable to the grid size) and evaluates them independently, allowing broader exploration in continuous or high-dimensional settings. Empirical studies demonstrate that random search outperforms grid search in many scenarios, particularly when only a few hyperparameters significantly influence performance, as it allocates resources more efficiently across the space. Bergstra and Bengio (2012) showed, through experiments on neural network architectures, that random search can achieve better results with the same budget by covering impactful dimensions more densely than exhaustive enumeration does.

To ensure robust evaluation, both grid and random search integrate cross-validation, typically k-fold, to assess each hyperparameter configuration's generalization on held-out data. In k-fold cross-validation, the dataset is partitioned into k subsets, with k-1 used for training and one for validation, rotating across folds to compute an average performance metric such as accuracy or F1 score. This mitigates overfitting to a single validation split and provides a more reliable estimate of hyperparameter quality, especially in data-scarce regimes. The trade-off is that evaluations are repeated within the cross-validation framework, multiplying computational demands by k. Standard references emphasize this integration as essential for unbiased hyperparameter selection.
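A short sketch of both strategies, assuming scikit-learn and SciPy are available, tunes an SVM on the digits dataset with 5-fold cross-validation; the particular grid and distributions are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Grid search: every combination of the listed values, each scored with 5-fold CV.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-4]}, cv=5)
grid.fit(X, y)

# Random search: 12 configurations sampled from distributions over the same space.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-5, 1e-2)},
    n_iter=12, cv=5, random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```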

Advanced Techniques

Bayesian optimization is a probabilistic approach to hyperparameter optimization that models the objective function with a surrogate, typically a Gaussian process, to guide the search efficiently in high-dimensional spaces. This method constructs a posterior distribution over the hyperparameter performance landscape based on prior evaluations, enabling the selection of promising candidates that balance exploration of uncertain regions and exploitation of high-performing areas. A key component is the acquisition function, which quantifies the utility of evaluating a point; one widely used example is the expected improvement,

$$EI(\mathbf{x}) = \mathbb{E}\left[\max\left(f(\mathbf{x}) - f(\mathbf{x}^+),\, 0\right)\right],$$

where $f(\mathbf{x})$ is the objective at hyperparameter configuration $\mathbf{x}$ and $f(\mathbf{x}^+)$ is the current best observed value, with the expectation taken over the posterior predictive distribution. This formulation, originally developed for global optimization of expensive black-box functions, has been adapted for hyperparameter tuning and often outperforms grid or random search by requiring fewer evaluations, especially when model training is costly.

Evolutionary algorithms apply principles of natural selection to evolve populations of hyperparameter configurations, making them suitable for discrete or mixed search spaces where gradient information is unavailable. In genetic algorithms, a variant commonly used for this purpose, an initial population of candidate hyperparameters is generated, and subsequent generations are produced through operations such as crossover (combining features from parent configurations) and mutation (random perturbations), with selection favoring those yielding superior validation performance. This population-based search enhances robustness to noisy evaluations and multimodal objectives, as demonstrated in pipelines where it optimizes both hyperparameters and model structures simultaneously. Covariance matrix adaptation evolution strategy (CMA-ES), another evolutionary method, adapts the search distribution online and shows strong performance on continuous hyperparameter spaces compared to simpler heuristics.

Gradient-based methods leverage differentiability to optimize hyperparameters directly, particularly when the validation loss $L$ can be expressed as a function of the hyperparameters $\lambda$, approximating the hypergradient $\partial L / \partial \lambda$ through the chain rule applied to the inner optimization loop. Hypergradient descent, for instance, computes these derivatives by unrolling the optimization steps for the model parameters, enabling efficient updates to $\lambda$ during training without full re-optimization from scratch. This approach is especially valuable at large scale, where traditional methods falter due to computational expense; recent implicit differentiation techniques further scale it to millions of hyperparameters by solving linear systems to avoid explicit unrolling.

Recent developments in multi-fidelity optimization include model-assisted search with early-stopping mechanisms that reduce evaluation costs by approximating full performance from partial training runs. BOHB, which combines Bayesian optimization with the Hyperband successive halving algorithm, uses kernel density estimators for the surrogate model and dynamically allocates resources to promising configurations, achieving reported speedups of roughly 3-5 times over competing methods on benchmarks such as neural network tuning.
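The resource schedule underlying Hyperband-style methods such as BOHB can be sketched as plain successive halving; in the sketch below, `train_for` is a hypothetical stand-in for partially training a configuration and returning a validation score (a real implementation would resume training at each rung rather than restart).

```python
import random

def train_for(config, budget):
    """Hypothetical stand-in: train `config` for `budget` epochs and return a
    validation score. A real implementation would train and evaluate a model."""
    return random.random() * budget / (budget + 1)

def successive_halving(configs, min_budget=1, eta=3):
    """Keep the top 1/eta of configurations at each rung, multiplying their budget by eta."""
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: train_for(c, budget), reverse=True)
        configs = scored[: max(1, len(configs) // eta)]  # survivors advance
        budget *= eta                                    # with more resources each rung
    return configs[0]

candidates = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(27)]
print(successive_halving(candidates))
```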
DEHB builds on the same Hyperband bracketing but replaces the Bayesian component with differential evolution, an evolutionary algorithm, to handle discrete hyperparameters more effectively while retaining the bracketing for fidelity control, demonstrating superior sample efficiency on high-dimensional tasks such as image classification. More recent advances as of 2025 include tuning techniques for large language models that use population-based and adaptive methods, as well as dynamic HPO for streaming settings with concept drift, integrated into modern AutoML frameworks.
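Returning to the expected-improvement acquisition defined at the start of this subsection, it has a closed form when the surrogate posterior at a candidate point is Gaussian; the sketch below evaluates that form, with the posterior means and standard deviations given as made-up illustrative values rather than outputs of a fitted Gaussian process.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Expected improvement for maximizing f, given the surrogate's posterior mean
    `mu` and standard deviation `sigma` at candidate points and the best observed
    value `f_best`. `xi` is an optional exploration margin."""
    sigma = np.maximum(sigma, 1e-12)  # avoid division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Candidate configurations scored by a hypothetical Gaussian-process surrogate.
mu = np.array([0.80, 0.84, 0.79])     # posterior mean validation accuracy
sigma = np.array([0.01, 0.05, 0.10])  # posterior uncertainty
print(expected_improvement(mu, sigma, f_best=0.82))  # evaluate the argmax next
```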

Considerations

Tunability

Tunability refers to the extent to which variations in a hyperparameter influence overall model performance and training behavior in machine learning systems. Hyperparameters with high tunability, such as the learning rate in stochastic gradient descent (SGD), exhibit pronounced effects; even minor adjustments can dramatically alter convergence rates and final accuracy, often by several percentage points on benchmarks such as CIFAR-10. In contrast, parameters like the momentum in batch normalization layers demonstrate lower tunability, where changes typically yield marginal impacts on outcomes due to their stabilizing role in normalizing activations across mini-batches.

Many hyperparameters exhibit conditional dependencies, meaning their effective tunability is contingent on the values of other hyperparameters. For example, the optimal range for the learning rate is highly dependent on the choice of optimizer: SGD often requires learning rates around 0.1 for effective training, while Adam typically performs best with rates near 0.001, since Adam's adaptive updates adjust step sizes internally. Such dependencies call for sequential or hierarchical tuning approaches to avoid suboptimal configurations that diminish overall model efficacy.

Inadequate tuning of highly tunable hyperparameters can severely impair training dynamics, leading to prolonged convergence times or outright divergence. A notably low learning rate, for instance, results in excessively small parameter updates, causing slow progress toward the loss minimum and potentially exacerbating issues such as vanishing gradients in deep networks by failing to propagate meaningful changes through layers. Empirical analyses in image classification show that hyperparameter optimization can improve accuracy by up to 6% on datasets such as CIFAR-100, underscoring the need for deliberate adjustment to achieve robust results.
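One way to handle such conditional dependencies in practice is to make the sampled range of one hyperparameter depend on another, as in the rough sketch below; the specific optimizers and ranges are illustrative assumptions, not prescriptions.

```python
import random

def sample_config():
    """Sample from a conditional search space: the learning-rate range depends on
    the optimizer, mirroring the SGD-vs-Adam example above."""
    optimizer = random.choice(["sgd", "adam"])
    if optimizer == "sgd":
        lr = 10 ** random.uniform(-2, 0)   # roughly centered near 0.1
    else:
        lr = 10 ** random.uniform(-4, -2)  # roughly centered near 0.001
    return {"optimizer": optimizer, "lr": lr}

print([sample_config() for _ in range(3)])
```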

Robustness

In machine learning, hyperparameter robustness refers to the degree to which a model's performance remains stable despite small perturbations in hyperparameter values, such as variations in the learning rate or regularization strength. For instance, a robust model exhibits minimal degradation in accuracy or loss when the learning rate is altered by ±10%, ensuring reliable behavior across minor configuration changes. This property is crucial for practical deployment, as exact hyperparameter replication may be challenging due to implementation variations or resource constraints.

Several factors can diminish hyperparameter robustness. High-dimensional models, such as deep neural networks, often exhibit increased sensitivity because the expanded parameter space heightens the risk of instability, where small hyperparameter tweaks lead to substantial performance swings. Similarly, dataset shifts (changes in data distribution between training and deployment) can amplify this sensitivity by altering the optimal hyperparameter landscape, causing previously tuned models to underperform.

To quantify robustness, researchers commonly measure the variance in cross-validation (CV) scores when hyperparameters are systematically perturbed around their optimal values. This approach captures instability by computing the spread in evaluation metrics, such as accuracy or negative log-likelihood, across multiple CV folds and hyperparameter samples; low variance indicates high robustness.

Ensemble methods enhance hyperparameter robustness by averaging predictions from models trained with diverse hyperparameter sets, thereby mitigating the impact of any single suboptimal choice. For example, random forests achieve this stability through bagging and feature subsampling, which make their performance relatively insensitive to specific hyperparameter values such as the number of trees or maximum depth. Such techniques promote consistent outcomes even under hyperparameter noise, improving overall model reliability.
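A rough sketch of this measurement, assuming scikit-learn is available, perturbs a hypothetically pre-tuned regularization strength by up to about ±10% and reports the spread of cross-validation scores.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
base_C = 1.0  # a previously tuned regularization strength (illustrative)

rng = np.random.default_rng(0)
scores = []
for _ in range(10):
    C = base_C * (1 + rng.uniform(-0.1, 0.1))  # perturb the hyperparameter by up to +/-10%
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    scores.append(cross_val_score(model, X, y, cv=5).mean())

# A small standard deviation relative to the mean suggests a robust configuration.
print(f"mean CV accuracy: {np.mean(scores):.4f}  std: {np.std(scores):.4f}")
```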

Reproducibility

Challenges

Achieving reproducibility in hyperparameter optimization (HPO) for machine learning models is complicated by several challenges that introduce variability across repeated experiments. A primary issue is the inherent stochasticity of many optimization processes and model trainings, arising from the random seeds used in data shuffling, weight initialization, or sampling methods such as mini-batching. This noise can lead to different optimal hyperparameters even under identical configurations, making it difficult to replicate exact performance metrics without controlling all sources of randomness. For instance, in deep neural networks, variations in initial weights or dropout masks can cause significant fluctuations in validation scores during tuning.

Another challenge is the lack of standardization across experimental environments, where differences in hardware (e.g., GPU vs. CPU), software versions (e.g., updates to the underlying libraries), or even floating-point precision across systems can alter outcomes. This non-stationarity in the computational setup means that hyperparameters tuned on one machine may not yield the same results elsewhere, exacerbating issues in collaborative or distributed research settings. Such variability is particularly pronounced in resource-intensive HPO for large models, where parallel evaluations on clusters introduce additional inconsistencies from load balancing or network latency.

The opaque nature of HPO pipelines further hinders reproducibility, as full configurations (including search spaces, evaluation protocols, and stopping criteria) are often not comprehensively documented. Without such detail, reconstructing the tuning process becomes impossible, contributing to the reproducibility crisis observed in the machine learning literature, where reported results fail to replicate in independent studies. Additionally, the high computational demands of HPO amplify these issues, as limited access to identical resources prevents thorough verification of findings.

Best Practices

When initiating hyperparameter tuning for machine learning models, it is advisable to begin with the default values provided by established libraries, as these are chosen empirically for robust performance on common tasks. For instance, scikit-learn's estimators ship with defaults that yield good initial results on standard problems. A fundamental practice is to reserve validation sets exclusively for hyperparameter selection and optimization, while holding out independent test sets for unbiased final model evaluation, preventing overfitting to the tuning signal and ensuring generalizability. This separation mitigates data leakage, where information from the test set inadvertently influences tuning decisions, as emphasized in guidelines for model validation.

For efficient and scalable hyperparameter searches, especially in large-scale or resource-intensive settings, automation via specialized frameworks is recommended: Optuna employs define-by-run optimization with pruning mechanisms that terminate unpromising trials early, while Ray Tune supports distributed execution across clusters for parallel evaluations. These tools facilitate Bayesian or evolutionary strategies without manual intervention, improving upon exhaustive methods like grid search.

Conducting sensitivity analysis early in the process helps prioritize critical hyperparameters by evaluating their impact on model performance across defined ranges, using measures such as the Hilbert-Schmidt Independence Criterion to quantify dependencies and interactions. This identifies influential parameters, such as learning rates or regularization strengths, allowing focused tuning over a reduced search space, as demonstrated in applications on benchmarks like MNIST and CIFAR-10.

To address reproducibility challenges, best practices include logging all tuning configurations (e.g., search spaces, trial results, and metadata) using tools such as MLflow or Weights & Biases, and fixing random seeds during validation-based searches to minimize variability. Additionally, dependencies should be managed with version pinning (e.g., via pip requirements files) and environments containerized using Docker to ensure consistent setups across runs. Adhering to reporting checklists, such as those from NeurIPS, further promotes transparent documentation of hardware, software versions, and full code availability.
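A compact sketch combining several of these practices, assuming Optuna and scikit-learn are installed, fixes the sampler and model seeds, evaluates candidates only with cross-validation, and writes a per-trial log; the search space and model choice are illustrative.

```python
import optuna
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def objective(trial):
    # Illustrative search space; start near library defaults.
    n_estimators = trial.suggest_int("n_estimators", 50, 200)
    max_depth = trial.suggest_int("max_depth", 3, 20)
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0  # fixed model seed
    )
    # Validation-only scoring; keep an untouched test set for the final evaluation.
    return cross_val_score(model, X, y, cv=3).mean()

# A fixed sampler seed makes the search itself repeatable.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=15)

print(study.best_params, study.best_value)
study.trials_dataframe().to_csv("tuning_log.csv")  # per-trial log (requires pandas)
```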
