Learning curve (machine learning)
In machine learning (ML), a learning curve (or training curve) is a graphical representation that shows how a model's performance on a training set (and usually a validation set) changes with the number of training iterations (epochs) or the amount of training data.[1] Typically, the number of training epochs or training set size is plotted on the x-axis, and the value of the loss function (and possibly some other metric such as the cross-validation score) on the y-axis.
Synonyms include error curve, experience curve, improvement curve and generalization curve.[2]
More abstractly, learning curves plot the difference between learning effort and predictive performance, where "learning effort" usually means the number of training samples, and "predictive performance" means accuracy on testing samples.[3]
Learning curves have many useful purposes in ML, including:[4][5][6]
- choosing model parameters during design,
- adjusting optimization to improve convergence,
- and diagnosing problems such as overfitting (or underfitting).
Learning curves can also be tools for determining how much a model benefits from adding more training data, and whether the model suffers more from a variance error or a bias error. If both the validation score and the training score converge to a certain value, then the model will no longer significantly benefit from more training data.[7]
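For instance, scikit-learn's learning_curve utility (cited below[7]) automates this procedure; the following minimal sketch, assuming an illustrative synthetic classification dataset, plots training and validation scores against training set size:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification data (illustrative assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-validated scores at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)

# Plot mean accuracy per training size for both curves.
plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```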
Formal definition
When creating a function $f_\theta$ to approximate the distribution of some data, it is necessary to define a loss function $L(f_\theta(X), Y)$ to measure how good the model output is (e.g., accuracy for classification tasks or mean squared error for regression). We then define an optimization process which finds model parameters $\theta$ such that $L(f_\theta(X), Y)$ is minimized, referred to as $\theta^*(X, Y)$.
Training curve for amount of data
If the training data is $X_1, Y_1, \ldots, X_n, Y_n$ and the validation data is $X_1', Y_1', \ldots, X_m', Y_m'$, a learning curve is the plot of the two curves

$$i \mapsto L\!\left(f_{\theta^*(X_{1:i},\,Y_{1:i})}(X_{1:i}),\; Y_{1:i}\right)$$
$$i \mapsto L\!\left(f_{\theta^*(X_{1:i},\,Y_{1:i})}(X'_{1:m}),\; Y'_{1:m}\right)$$

where $X_{1:i} = (X_1, \ldots, X_i)$.
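A minimal sketch of this definition in code, assuming a scikit-learn-style estimator and mean squared error as the loss $L$, refits the model on growing prefixes of the training data and evaluates both curves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def data_learning_curve(model, X_train, Y_train, X_val, Y_val, sizes):
    """Return (training losses, validation losses) for each prefix size i."""
    train_losses, val_losses = [], []
    for i in sizes:
        # theta*(X_{1:i}, Y_{1:i}): fit on the first i training examples.
        model.fit(X_train[:i], Y_train[:i])
        # L on the data the model was fitted to, and on held-out data.
        train_losses.append(mean_squared_error(Y_train[:i], model.predict(X_train[:i])))
        val_losses.append(mean_squared_error(Y_val, model.predict(X_val)))
    return train_losses, val_losses

# Example usage with synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 5))
Y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1200)
tr, va = data_learning_curve(LinearRegression(), X[:1000], Y[:1000],
                             X[1000:], Y[1000:], sizes=range(10, 1001, 50))
```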
Training curve for number of iterations
Many optimization algorithms are iterative, repeating the same step (such as backpropagation) until the process converges to an optimal value. Gradient descent is one such algorithm. If $\theta_t^*$ is the approximation of the optimal $\theta$ after $t$ steps, a learning curve is the plot of

$$t \mapsto L\!\left(f_{\theta_t^*}(X),\; Y\right)$$
$$t \mapsto L\!\left(f_{\theta_t^*}(X'),\; Y'\right)$$
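As a sketch of this iteration-based curve, assuming full-batch gradient descent on a linear least-squares model (rather than backpropagation on a neural network), one can record both losses after every update step $t$:

```python
import numpy as np

def iteration_learning_curve(X_tr, y_tr, X_va, y_va, steps=200, lr=0.01):
    """Gradient descent on MSE; returns per-step training/validation losses."""
    theta = np.zeros(X_tr.shape[1])
    train_losses, val_losses = [], []
    for t in range(steps):
        # Gradient of mean squared error: (2/n) * X^T (X theta - y).
        grad = 2 * X_tr.T @ (X_tr @ theta - y_tr) / len(y_tr)
        theta -= lr * grad
        train_losses.append(np.mean((X_tr @ theta - y_tr) ** 2))
        val_losses.append(np.mean((X_va @ theta - y_va) ** 2))
    return train_losses, val_losses
```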
References
[edit]- ^ "Mohr, Felix and van Rijn, Jan N. "Learning Curves for Decision Making in Supervised Machine Learning - A Survey." arXiv preprint arXiv:2201.12150 (2022)". arXiv:2201.12150.
- ^ Viering, Tom; Loog, Marco (2023-06-01). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (6): 7799–7819. arXiv:2103.10948. Bibcode:2023ITPAM..45.7799V. doi:10.1109/TPAMI.2022.3220744. ISSN 0162-8828. PMID 36350870.
- ^ Perlich, Claudia (2010), "Learning Curves in Machine Learning", in Sammut, Claude; Webb, Geoffrey I. (eds.), Encyclopedia of Machine Learning, Boston, MA: Springer US, pp. 577–580, doi:10.1007/978-0-387-30164-8_452, ISBN 978-0-387-30164-8, retrieved 2023-07-06
- ^ Madhavan, P.G. (1997). "A New Recurrent Neural Network Learning Algorithm for Time Series Prediction" (PDF). Journal of Intelligent Systems. p. 113 Fig. 3.
- ^ "Machine Learning 102: Practical Advice". Tutorial: Machine Learning for Astronomy with Scikit-learn. Archived from the original on 2012-07-30. Retrieved 2019-02-15.
- ^ Meek, Christopher; Thiesson, Bo; Heckerman, David (Summer 2002). "The Learning-Curve Sampling Method Applied to Model-Based Clustering". Journal of Machine Learning Research. 2 (3): 397. Archived from the original on 2013-07-15.
- ^ scikit-learn developers. "Validation curves: plotting scores to evaluate models — scikit-learn 0.20.2 documentation". Retrieved February 15, 2019.
Fundamentals
Definition and Purpose
In machine learning, a learning curve is a graphical representation that plots a model's performance metric, such as error rate or accuracy, against variables like the size of the training dataset or the number of training iterations, thereby illustrating how the model's ability to generalize from data evolves over time and potentially reaches a plateau.[2] This visualization captures the progression of learning dynamics, showing initial rapid improvement followed by diminishing returns as the model saturates its capacity to extract patterns from the available data. The primary purpose of a learning curve is to evaluate the efficiency with which a model acquires knowledge from data, diagnose limitations in its representational capacity, and inform practical decisions about acquiring additional training examples or modifying the model architecture.[2] By revealing whether performance gains are constrained by data scarcity or by model complexity, learning curves serve as a diagnostic tool for optimizing resource allocation in machine learning workflows, focusing effort on the bottlenecks that hinder generalization. Subtypes such as training curves, which track performance on the training set, and validation curves, which monitor performance on unseen data, provide complementary insights into this process.[2]

The concept of learning curves originated in psychological research on human memory and skill acquisition, notably Hermann Ebbinghaus's 1885 work Über das Gedächtnis.[2] It was later applied in industrial engineering, as in Theodore Wright's 1936 analysis of labor efficiency in aircraft manufacturing, where he observed that unit production costs decrease predictably as cumulative experience grows.[3] The idea was adapted to machine learning in the mid-20th century, with early experimental uses by Foley in pattern recognition, theoretical developments by Rosenblatt in his 1958 perceptron model, and later foundations from statistical learning theory in the 1960s.[2]

In the 1980s and 1990s, particularly in the context of neural networks, researchers such as David Rumelhart and James McClelland employed learning curves to demonstrate how connectionist models progressively refine internal representations through iterative exposure to examples. Their work on parallel distributed processing highlighted the curve's utility in modeling the acquisition of complex skills, bridging psychological principles with computational learning.[2][4]

Key Components
The horizontal axis of a learning curve in machine learning typically represents the amount of training experience, which can be measured as the size of the training dataset (e.g., the number of samples $n$) or the number of training iterations (e.g., epochs or steps). The vertical axis depicts a performance metric, such as error rate or accuracy, quantifying how well the model performs on the data. This setup allows visualization of how performance evolves with increasing resources dedicated to training.[5][6]

Performance metrics on the vertical axis are chosen based on the task: for regression problems, mean squared error (MSE) is common, defined as $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ are true values and $\hat{y}_i$ are predictions; for classification, cross-entropy loss is frequently used, given by $-\sum_i y_i \log \hat{y}_i$ for probabilistic outputs. These metrics distinguish between empirical risk, which is the training error computed on seen data, and generalization error, estimated via validation or test error on unseen data to reflect real-world performance. Accuracy, as the proportion of correct predictions, serves as an alternative score for classification, though it is less informative for imbalanced datasets.

Datasets underpinning learning curves rely on splits into training and validation (or hold-out) sets to prevent data leakage, where the training set is used to fit the model and the validation set evaluates generalization without influencing parameter updates. Cross-validation techniques, such as k-fold, often generate multiple such splits to compute robust estimates, ensuring the curve reflects average behavior across folds rather than a single partition. This separation is crucial for reliable metric computation, as reusing training data for evaluation would inflate performance unrealistically.[5]

Learning curves comprise distinct types: the training curve tracks performance on the training set, typically showing rapid improvement as the model memorizes seen data, while the validation curve monitors held-out data, decreasing more slowly to indicate generalization progress. Training error generally declines faster than validation error due to the model's increasing fit to the specific training samples, highlighting the gap between memorization and true learning. These curves are plotted separately but often overlaid for comparison, using the same axes to reveal discrepancies.[5][6]
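As a minimal illustration of these components, assuming NumPy arrays of labels and predictions, the two metrics above and a simple hold-out split can be sketched as:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: (1/n) * sum((y_i - yhat_i)^2), used for regression.
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy for probabilistic outputs, clipped for stability.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def holdout_split(X, y, val_fraction=0.2, seed=0):
    # Random train/validation split so evaluation never sees training data.
    idx = np.random.default_rng(seed).permutation(len(y))
    cut = int(len(y) * (1 - val_fraction))
    tr, va = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[va], y[va]
```

Construction and Visualization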
Plotting Against Training Data Size
To construct a learning curve by plotting against training data size, the standard procedure involves incrementally subsampling the available training dataset to create subsets of increasing size, typically from a small fraction (e.g., 10%) up to the full dataset (100%). For each subset of size $n$, a model is trained, and both training error and validation error are computed; these errors are then averaged across multiple runs, often using k-fold cross-validation to ensure robustness against specific data splits. This approach allows visualization of how model performance evolves as more data is incorporated, with the x-axis representing $n$ (often on a logarithmic scale) and the y-axis showing error metrics such as mean squared error (MSE).

Formally, for a hypothesis class $\mathcal{H}$ and loss function $\ell$, the training error is defined as the empirical risk $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h(x_i), y_i)$, where $S = ((x_1, y_1), \ldots, (x_n, y_n))$ is the training sample drawn i.i.d. from the data distribution $\mathcal{D}$, and $h \in \mathcal{H}$ is the learned hypothesis. The true risk, or expected risk, is $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]$, which the validation error approximates by evaluating $h$ on held-out data. The learning curve then traces the expected empirical risk $\mathbb{E}_S[\hat{R}_n(A(S))]$ or the problem-averaged true risk $\mathbb{E}_S[R(A(S))]$ as a function of $n$, where $A$ denotes the learning algorithm.

In statistical learning theory, the learning curve is often modeled parametrically as a power-law form $R(n) \approx a\,n^{-\alpha} + R_\infty$, where $a$ captures the initial decay amplitude, $\alpha$ is the learning rate (steeper for faster convergence), and $R_\infty$ is the irreducible error (Bayes risk). This form arises from generalization bounds tied to the VC dimension $d$ of $\mathcal{H}$; a sketch of the derivation follows from the fundamental theorem of statistical learning, which bounds the deviation $\sup_{h \in \mathcal{H}} \lvert R(h) - \hat{R}_n(h) \rvert$ with probability at least $1 - \delta$ (uniformly over $\mathcal{H}$) via the growth function $\Pi_{\mathcal{H}}(n)$. For empirical risk minimization, the excess risk thus decays as $O(\sqrt{d/n})$ (ignoring logs), implying a power law with exponent $\alpha = 1/2$ in the large-$n$ regime, though empirical curves may exhibit faster decay (e.g., $n^{-1}$) depending on the problem.[7]

Practical implementation must address computational demands, as training scales roughly linearly with $n$ ($O(n)$ per subset, compounded by k-fold repetition), often mitigated by progressive sampling that halts early if the curve plateaus. Variance in curve estimates can be quantified using bootstrapping, where multiple resamples (with replacement) from the dataset generate error bars as the standard deviation across bootstrap replicates, providing uncertainty intervals without assuming normality.

For illustration, consider linear regression on synthetic data generated from a known linear relationship with Gaussian noise (e.g., $y = w^\top x + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$); subsampling from 100 to 10,000 points yields a validation MSE whose excess over the noise floor drops rapidly, roughly as $n^{-1}$ (matching the parametric form with $\alpha = 1$), converging to the noise variance of about 1 and demonstrating the data-limited regime for small $n$.[8]
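The power-law form above can be fitted to measured errors with nonlinear least squares; a minimal sketch using SciPy's curve_fit, where the subset sizes and error values are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, err_floor):
    # R(n) = a * n^(-alpha) + err_floor (irreducible error).
    return a * n ** (-alpha) + err_floor

# Illustrative measurements (assumed): subset sizes and validation errors.
sizes = np.array([100, 200, 400, 800, 1600, 3200])
errors = np.array([0.42, 0.31, 0.24, 0.19, 0.165, 0.15])

params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1))
a, alpha, err_floor = params
print(f"amplitude={a:.3f}, exponent={alpha:.3f}, floor={err_floor:.3f}")
```

Plotting Against Training Iterations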
In machine learning, learning curves plotted against training iterations track the evolution of a model's performance metrics, such as loss or accuracy, as the optimization algorithm progresses through update steps. This approach focuses on the temporal dynamics of training, where a model is iteratively updated using an optimizer like gradient descent on a fixed dataset. The procedure involves initializing model parameters, then performing repeated passes over the training data (often organized into epochs) while computing gradients and updating parameters at each step. Errors are recorded at fixed intervals, such as every epoch or a specified number of iterations, to capture both training and validation performance. These points are then plotted with iterations $t$ on the x-axis and loss values on the y-axis, revealing patterns like initial rapid improvement followed by stabilization or fluctuations.

Formally, an iteration-based learning curve is defined by the training loss $L_{\text{train}}(t)$ and validation loss $L_{\text{val}}(t)$, where $t$ denotes the number of parameter update steps. These curves arise from the trajectories of stochastic optimization algorithms, such as stochastic gradient descent (SGD), which update the model parameters according to the rule $\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L_{B_t}(\theta_t)$, with $\eta$ as the learning rate and $B_t$ a random minibatch. Convergence properties of these trajectories are analyzed in optimization theory, particularly for smooth and strongly convex loss landscapes, where the expected loss decreases at rates depending on the learning rate schedule and the noise in gradient estimates.[9][10]

For well-behaved convergence under gradient-based methods, the loss often follows an exponential decay model approximated as $L(t) \approx (L_0 - L_\infty)\,e^{-\lambda t} + L_\infty$, where $L_0$ is the initial loss, $\lambda$ is the decay rate influenced by the learning rate and strong convexity parameter, and $L_\infty$ is the irreducible minimum loss. This form captures the linear convergence typical in strongly convex settings, where the distance to the optimum shrinks exponentially with iterations.[11]

Practical implementation requires monitoring to avoid issues like divergence from large learning rates or prolonged training without improvement. Early stopping is a key technique, halting iterations when validation loss ceases to decrease for a patience threshold (e.g., 10 epochs), thereby preventing overfitting while saving computational resources. Logging tools such as TensorBoard facilitate visualization by recording metrics at intervals and generating interactive plots of loss curves during training. For instance, training a multilayer perceptron on the MNIST dataset via SGD typically shows training loss dropping rapidly in the first 20-50 epochs before plateauing around 0.1 cross-entropy loss, with validation curves closely tracking until potential divergence if unchecked.[12]
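The early-stopping rule described above can be sketched with a simple patience counter; train_one_epoch and validation_loss are hypothetical callables standing in for a real training loop:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=200, patience=10):
    """Halt when validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        train_one_epoch()                     # one pass of parameter updates
        val_loss = validation_loss()          # evaluate on held-out data
        history.append(val_loss)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:  # no improvement for `patience` epochs
            break
    return history
```

Interpretation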
Indicators of Model Performance
In learning curves, convergence refers to the point where both training and validation error rates stabilize, approaching a low error floor that indicates the model has adequately captured the underlying patterns in the data with sufficient capacity and training resources. This stabilization typically manifests as the curves flattening out, signaling that further training yields minimal improvement in performance. Slow convergence, characterized by prolonged high error rates before stabilization, often points to high model variance or suboptimal optimization, such as inadequate learning rates or poor initialization.[13]

Saturation occurs when the learning curve reaches a horizontal asymptote, demonstrating diminishing returns where additional data or iterations provide negligible gains in accuracy. This pattern is commonly quantified by observing when the slope of the curve falls below a small threshold, beyond which resource allocation becomes inefficient.[13] For instance, in empirical studies of supervised learners, saturation points are identified using progressive sampling techniques to predict the onset of this plateau without exhaustive computation.

Variance estimation in learning curves involves computing confidence intervals around the error estimates, typically derived from multiple training runs or k-fold cross-validation, to assess the stability of the model's performance across different data subsets. High variance in the validation curve, evidenced by wide confidence intervals that do not narrow with more data, suggests model instability, often due to sensitivity to initial conditions or noisy features.[13] These intervals are particularly useful for distinguishing reliable convergence from erratic fluctuations in real-world datasets.

From a theoretical perspective, the asymptotic behavior of learning curves is governed by bounds in statistical learning theory, where the generalization error decreases at a rate of $O(\sqrt{d_{\mathrm{VC}}/n})$ for hypothesis classes with finite VC dimension $d_{\mathrm{VC}}$, reflecting the model's ability to converge to the optimal error as the training sample size grows.[14] Ideal curves exhibit smooth decay following this rate under noise-free assumptions, while noisy or ill-behaved curves may show deviations such as temporary peaks before settling into the asymptote, as observed in reviews of diverse empirical and theoretical curve shapes.
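Slope-based saturation detection, as described above, can be sketched as follows; the slope tolerance is an illustrative assumption rather than a standard value:

```python
import numpy as np

def saturation_point(sizes, errors, slope_tol=1e-4):
    """Return the first size at which the curve's local slope magnitude
    falls below slope_tol, or None if the curve has not yet plateaued."""
    slopes = np.diff(errors) / np.diff(sizes)  # finite-difference slopes
    for size, slope in zip(sizes[1:], slopes):
        if abs(slope) < slope_tol:
            return size
    return None
```

Diagnosing Overfitting and Underfitting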
In machine learning, learning curves serve as a diagnostic tool to identify overfitting, where a model performs well on training data but poorly on unseen validation data due to excessive memorization of noise and idiosyncrasies in the training set. This manifests as a low and steadily decreasing training error curve, contrasted with a validation error curve that initially decreases but then plateaus or increases after a certain point, creating a widening gap between the two curves. Such divergence indicates that the model has high variance, capturing random fluctuations rather than underlying patterns, as detailed in the bias-variance decomposition framework.

Underfitting, conversely, occurs when the model fails to capture the underlying structure of the data, resulting in high bias and poor performance on both training and validation sets. In learning curves, this appears as both error curves remaining high and relatively flat, with little improvement as training progresses, signaling insufficient model complexity, inadequate features, or improper hyperparameters. This high-bias scenario prevents the model from achieving low error even on familiar data, limiting its predictive capability.[5]

The bias-variance decomposition provides a theoretical foundation for interpreting these curve patterns, decomposing the expected prediction error into three components: squared bias (systematic error from erroneous assumptions), variance (sensitivity to fluctuations in the training data), and irreducible noise (inherent randomness in the data). Early in the learning process, bias dominates, leading to underfitting with parallel high error curves; as model complexity increases, bias decreases but variance rises, potentially causing overfitting and divergence in later stages. The decomposition is expressed as

$$\mathbb{E}\!\left[(y - \hat{f}(x))^2\right] = \mathrm{Bias}\!\left[\hat{f}(x)\right]^2 + \mathrm{Var}\!\left[\hat{f}(x)\right] + \sigma^2,$$

where $\hat{f}$ is the learned function and $\sigma^2$ is the noise variance. This tradeoff, first formalized for neural networks, underscores how learning curves reveal shifts from bias-dominated to variance-dominated regimes.

To diagnose these issues quantitatively, practitioners often examine the gap between training and validation errors; a persistent gap signals overfitting, while consistently parallel high errors (e.g., both above a baseline performance metric) indicate underfitting. Remedies include regularization techniques like L2 penalties to curb variance in overfitting cases, or increasing model capacity to reduce bias in underfitting, though full implementation details extend beyond diagnosis. These insights from learning curves enable targeted adjustments that improve generalization.[5]
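The gap-versus-level diagnosis can be sketched as a crude rule of thumb; the thresholds here are illustrative assumptions, not established cutoffs:

```python
def diagnose(train_error, val_error, baseline_error,
             gap_tol=0.05, high_tol=0.9):
    """Crude bias/variance diagnosis from final learning-curve values."""
    gap = val_error - train_error
    if gap > gap_tol:
        return "overfitting (high variance): persistent train/val gap"
    if train_error > high_tol * baseline_error:
        return "underfitting (high bias): both errors near the baseline"
    return "reasonable fit: low, converged errors"
```

Applications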
Model Selection and Hyperparameter Tuning
Learning curves play a crucial role in model selection by enabling the comparison of multiple architectures based on their convergence rates and asymptotic performance on validation data. For instance, plotting learning curves for linear models versus deep neural networks reveals that simpler models may saturate quickly with high error floors, while deeper architectures continue improving with more data, guiding the choice toward complex models when sufficient training resources are available. This approach, known as learning curve cross-validation (LCCV), iteratively evaluates models by incrementally increasing training set sizes and terminating underperformers early, achieving accuracy comparable to traditional k-fold cross-validation while reducing runtime by over 20% across diverse benchmarks.[15]

In hyperparameter tuning, learning curves illustrate the impact of parameters such as learning rate, batch size, and regularization strength on training dynamics. A higher learning rate often accelerates initial convergence but risks instability, as shown by steeper early slopes followed by divergence in validation error; optimal rates yield balanced curves with a minimal gap between training and validation losses. Similarly, varying regularization strength affects overfitting, with curves for stronger penalties exhibiting slower convergence but lower final validation errors in high-variance settings. Techniques like curve extrapolation, using parametric models (e.g., power-law forms $L(t) = a\,t^{-b} + c$) fitted to partial training trajectories, predict full performance to expedite tuning, integrating with methods like Bayesian optimization or SMAC by terminating suboptimal configurations early and yielding up to 2x speedups in deep network optimization.[16]

The tuning process is inherently iterative, leveraging learning curve saturation (where validation error plateaus) as a signal to halt further adjustments, thereby avoiding unnecessary computation. When combined with grid search, curves prioritize promising hyperparameter regions by forecasting convergence; in Bayesian optimization, they inform acquisition functions by estimating uncertainty in extrapolated performance. Overfitting indicators from curves, such as widening train-validation gaps, further refine this process by prompting regularization tweaks.[16]

A practical case study on dropout prediction in digital mental health interventions, using data from 3,654 users, demonstrates curve-based selection between support vector machines (SVM) and random forests (RF). Learning curves showed RF converging at around 1,000 samples with superior validation AUC (0.81 on mixed features) compared to SVM's 0.75-0.80, particularly on high-information feature sets, leading to the selection of RF and an approximately 10% relative improvement in predictive accuracy over SVM baselines. This choice was informed by RF's lower overfitting at larger dataset sizes and faster saturation, enhancing deployment efficiency.[17]
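A schematic of the early-discarding idea behind learning-curve based selection (in the spirit of LCCV, not its published algorithm), assuming each candidate exposes a hypothetical evaluate(n) method returning validation error at training size n:

```python
def select_by_learning_curve(candidates, sizes, margin=0.0):
    """Grow the training set; drop candidates whose validation error trails
    the current best by more than `margin` at each anchor size."""
    alive = dict(candidates)                  # name -> model wrapper
    scores = {}
    for n in sizes:
        scores = {name: m.evaluate(n) for name, m in alive.items()}
        best = min(scores.values())
        alive = {name: alive[name] for name in scores
                 if scores[name] <= best + margin}
        if len(alive) == 1:
            break
    return min(scores, key=scores.get)        # best surviving candidate
```

Resource Estimation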
Learning curves enable the forecasting of data requirements by extrapolating fitted models to determine the training set size needed to achieve a target error rate $\varepsilon$. A common approach involves fitting a power-law form to observed performance data, such as $R(n) = R_\infty + a\,n^{-\alpha}$, where $R(n)$ is the risk or error as a function of sample size $n$, $R_\infty$ represents the irreducible error, and $a$ and $\alpha$ are fitted parameters capturing the rate of improvement, with $\alpha$ indicating the curve's steepness. To estimate the size needed for $R(n) \le \varepsilon$, one solves for $n \ge \left(a / (\varepsilon - R_\infty)\right)^{1/\alpha}$, assuming $\varepsilon > R_\infty$; this extrapolation has been applied in domains like machine translation to predict data needs without exhaustive collection.

Computational scaling predictions leverage learning curves by relating training iterations or epochs to floating-point operations (FLOPs), where total compute often scales linearly with data size and model parameters, $C \propto N D$. The curve's slope informs efficiency: for instance, in power-law regimes, doubling the data size typically halves the error but doubles the training time, guiding resource allocation for target performance.[18] In large language models, early empirical scaling laws (Kaplan et al., 2020) suggested compute-optimal configurations where loss falls as $L(C) \propto C^{-0.05}$, recommending investment in model size ($N \propto C^{0.73}$) over data ($D \propto C^{0.27}$) to minimize loss under fixed budgets; however, later work such as the Chinchilla study (Hoffmann et al., 2022) revised the data scaling exponent to align with model size, yielding more balanced allocations ($N \propto C^{0.5}$, $D \propto C^{0.5}$) and an effective data budget of roughly 20 tokens per parameter.[18][19]

These methods rely on assumptions such as independent and identically distributed (i.i.d.) training data from a fixed distribution, which underpins theoretical guarantees for monotonic convergence. However, violations like non-stationarity, where the data distribution shifts over time, can lead to ill-behaved curves, such as unexpected dips in performance, complicating reliable extrapolation and requiring adaptations like continual learning techniques. Software tools facilitate automated resource estimation; for example, scikit-learn's learning_curve function computes cross-validated scores across varying training sizes, enabling fits and predictions of data needs for specified accuracy thresholds.[20]
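Inverting the fitted power law gives the required sample size directly; a minimal sketch under the stated assumption that the target error exceeds the irreducible error:

```python
import math

def samples_for_target(a, alpha, irreducible, target_error):
    """Invert R(n) = irreducible + a * n^(-alpha) for the smallest n
    achieving R(n) <= target_error; requires target_error > irreducible."""
    if target_error <= irreducible:
        raise ValueError("target below the irreducible error is unreachable")
    return math.ceil((a / (target_error - irreducible)) ** (1 / alpha))

# Example: fitted curve R(n) = 0.10 + 2.0 * n^(-0.5); target error 0.15.
print(samples_for_target(a=2.0, alpha=0.5, irreducible=0.10, target_error=0.15))
# -> 1600 samples needed
```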