Learning curve (machine learning)
from Wikipedia
Learning curve plot of training set size vs training score (loss) and cross-validation score

In machine learning (ML), a learning curve (or training curve) is a graphical representation that shows how a model's performance on a training set (and usually a validation set) changes with the number of training iterations (epochs) or the amount of training data.[1] Typically, the number of training epochs or training set size is plotted on the x-axis, and the value of the loss function (and possibly some other metric such as the cross-validation score) on the y-axis.

Synonyms include error curve, experience curve, improvement curve and generalization curve.[2]

More abstractly, learning curves plot the difference between learning effort and predictive performance, where "learning effort" usually means the number of training samples, and "predictive performance" means accuracy on testing samples.[3]

Learning curves have many useful purposes in ML, including:[4][5][6]

  • choosing model parameters during design,
  • adjusting optimization to improve convergence,
  • and diagnosing problems such as overfitting (or underfitting).

Learning curves can also be tools for determining how much a model benefits from adding more training data, and whether the model suffers more from a variance error or a bias error. If both the validation score and the training score converge to a certain value, then the model will no longer significantly benefit from more training data.[7]
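The diagnostic described above can be sketched with scikit-learn's `learning_curve` utility; the dataset and model below are illustrative choices, not part of the original text.

```python
# Sketch: plotting a learning curve against training set size with
# scikit-learn (synthetic data and logistic regression are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# train_sizes: fractions of the training data to evaluate; cv=5 averages
# the scores over 5 cross-validation folds at each size.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
# If train_mean and val_mean converge to similar values, more data is
# unlikely to help much; a persistent gap suggests a variance (overfitting)
# problem rather than a bias problem.
```

Plotting `train_mean` and `val_mean` against `train_sizes` (e.g., with matplotlib) yields the two curves discussed above.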

Formal definition

When creating a function f_\theta to approximate the distribution of some data, it is necessary to define a loss function L(f_\theta(X), Y) to measure how good the model output is (e.g., accuracy for classification tasks or mean squared error for regression). We then define an optimization process which finds model parameters \theta such that L(f_\theta(X), Y) is minimized, referred to as \theta^*.

Training curve for amount of data

If the training data is

X = \{x_1, x_2, \dots, x_n\}, \quad Y = \{y_1, y_2, \dots, y_n\},

and the validation data is

X', Y',

a learning curve is the plot of the two curves

i \mapsto L(f_{\theta^*(X_i, Y_i)}(X_i), Y_i) \quad \text{and} \quad i \mapsto L(f_{\theta^*(X_i, Y_i)}(X'), Y'),

where

X_i = \{x_1, x_2, \dots, x_i\}, \quad Y_i = \{y_1, y_2, \dots, y_i\}.
Training curve for number of iterations

Many optimization algorithms are iterative, repeating the same step (such as backpropagation) until the process converges to an optimal value. Gradient descent is one such algorithm. If \theta_i^* is the approximation of the optimal \theta^* after i steps, a learning curve is the plot of

i \mapsto L(f_{\theta_i^*}(X), Y) \quad \text{and} \quad i \mapsto L(f_{\theta_i^*}(X'), Y').
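An iteration-indexed curve of this kind can be sketched with plain gradient descent on a least-squares objective; the step size and iteration count below are illustrative:

```python
# Sketch: record the training loss after each gradient-descent step,
# tracing i -> L(f_{theta_i}(X), Y) for an MSE loss.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # theta_0: initial parameters
lr = 0.05         # step size (illustrative)
losses = []
for step in range(200):
    resid = X @ w - Y
    losses.append(np.mean(resid ** 2))       # L(f_{theta_i}(X), Y)
    w -= lr * (2 / len(Y)) * X.T @ resid     # gradient step on the MSE
# losses should decrease toward the noise floor of the data.
```

Evaluating the same loss on held-out data inside the loop would give the corresponding validation curve.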

from Grokipedia
In machine learning, a learning curve is a graphical representation that depicts how the expected risk (or generalization error) of a learning algorithm changes as the size of the training set increases. Formally, it is defined as the plot of \bar{R}_n(A) = \mathbb{E}_{S_n \sim P^n}[R(A(S_n))] against the training set size n, where S_n consists of n independent and identically distributed samples drawn from an unknown joint distribution P_{XY}, A denotes the learning algorithm trained on S_n, and R measures the generalization error, or expected loss, of the resulting model over the true distribution P_{XY}. This curve provides insight into the algorithm's sample efficiency and convergence behavior under the assumption of i.i.d. data.

The term "learning curve" originated in psychological research on human memory and skill acquisition, notably in Hermann Ebbinghaus's 1885 work Über das Gedächtnis, which described how repetition reduces forgetting over time. It was later adapted to machine learning contexts in the mid-20th century, with early experimental uses by Foley and theoretical developments by Rosenblatt in his 1958 perceptron model.

In practice, learning curves are estimated empirically through methods like k-fold cross-validation or hold-out validation, where performance is averaged over multiple random subsets of varying sizes, often with hyperparameter tuning adjusted for each size to ensure reliable approximations. Synonyms in the literature include error curve, experience curve, and generalization curve, reflecting the concept's interdisciplinary roots in statistics and psychology. Learning curves are essential for practical machine learning tasks, such as selecting algorithms by identifying where their performance curves intersect, predicting the data requirements needed to achieve target accuracy, and optimizing resource allocation in data-scarce scenarios.
A key application is detecting overfitting, achieved by plotting the learning curve alongside the training error: divergence between the two indicates memorization of training data without generalization. Typically, well-behaved learning curves show monotonic improvement, often following power-law (\bar{R}_n \propto n^{-\alpha}) or exponential decay patterns supported by probably approximately correct (PAC) learning bounds and Gaussian process models, as observed across diverse datasets in deep learning and classical algorithms. However, ill-behaved curves, exhibiting peaking (temporary worsening before improvement), dipping (initial degradation), or phase transitions, can arise due to model misspecification, optimization instability, or non-i.i.d. data effects, as documented in empirical studies on over 200 datasets. These shapes inform advanced techniques like progressive sampling and meta-learning to extrapolate performance and mitigate inefficiencies.

Fundamentals

Definition and Purpose

In machine learning, a learning curve is a graphical representation that plots a model's performance metric, such as error rate or accuracy, against variables like the size of the training set or the number of training iterations, thereby illustrating how the model's ability to generalize from data evolves over time and potentially reaches a plateau. This visualization captures the progression of learning dynamics, showing initial rapid improvements followed by diminishing returns as the model saturates its capacity to extract patterns from the available data.

The primary purpose of a learning curve is to evaluate the efficiency with which a model acquires knowledge from data, diagnose limitations in its representational capacity, and inform practical decisions regarding the acquisition of additional training examples or modifications to the model architecture. By revealing whether performance gains are constrained by data scarcity or model complexity, learning curves serve as a diagnostic tool to optimize resource allocation in machine learning workflows, ensuring that efforts focus on the bottlenecks that hinder generalization. Subtypes such as training curves, which track performance on the training set, and validation curves, which monitor performance on unseen data, provide complementary insights into this process.

The concept of learning curves originated in psychological research on human memory and skill acquisition, notably in Hermann Ebbinghaus's 1885 work Über das Gedächtnis. It was later applied in industrial settings, such as Theodore Wright's 1936 analysis of labor efficiency in aircraft manufacturing, where he observed that unit production costs decrease predictably as cumulative experience grows. This idea was adapted to machine learning contexts in the mid-20th century, with early experimental uses by Foley and theoretical developments by Rosenblatt in his 1958 perceptron model.
In the 1980s and 1990s, particularly in the context of neural networks, researchers like David Rumelhart and James McClelland employed learning curves to demonstrate how connectionist models progressively refine internal representations through iterative exposure to examples. Their work in parallel distributed processing highlighted the curve's utility in modeling the acquisition of complex skills, bridging psychological principles with computational learning.

Key Components

The horizontal axis of a learning curve in machine learning typically represents the amount of training experience, which can be measured as the size of the training dataset (e.g., number of samples n) or the number of training iterations (e.g., epochs or steps). The vertical axis depicts a performance metric, such as error rate or accuracy, quantifying how well the model performs on the data. This setup allows visualization of how performance evolves with increasing resources dedicated to training.

Performance metrics on the vertical axis are chosen based on the task: for regression problems, mean squared error (MSE) is common, defined as \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i are true values and \hat{y}_i are predictions; for classification, cross-entropy loss is frequently used, given by -\sum_{i} y_i \log(\hat{y}_i) for probabilistic outputs. These metrics distinguish between empirical risk, which is the training error computed on seen data, and true risk, estimated via validation or test error on unseen data to reflect real-world performance. Accuracy, as the proportion of correct predictions, serves as an alternative score for classification, though it is less informative for imbalanced datasets.

Datasets underpinning learning curves rely on splits into training and validation (or hold-out) sets to prevent leakage, where the training set is used to fit the model and the validation set evaluates generalization without influencing parameter updates. Cross-validation techniques, such as k-fold, often generate multiple such splits to compute robust estimates, ensuring the curve reflects average behavior across folds rather than a single partition. This separation is crucial for reliable metric computation, as reusing data for evaluation would inflate performance unrealistically.
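The two metrics above can be written out directly in numpy; the function names and test values are illustrative:

```python
# Minimal numpy implementations of the two vertical-axis metrics above.
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum_i (y_i - yhat_i)^2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy: -sum_i y_i * log(yhat_i), for a one-hot label vector
    y_true and predicted class probabilities y_prob (clipped to avoid log 0)."""
    y_prob = np.clip(np.asarray(y_prob), eps, 1.0)
    return -np.sum(np.asarray(y_true) * np.log(y_prob))

print(mse([1.0, 2.0], [1.0, 4.0]))        # ((0)^2 + (2)^2) / 2 = 2.0
print(cross_entropy([0, 1], [0.5, 0.5]))  # -log(0.5), about 0.693
```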
Learning curves comprise distinct types: the training curve tracks performance on the training set, typically showing rapid improvement as the model memorizes seen data, while the validation curve monitors held-out data, decreasing more slowly to indicate generalization progress. Training error generally declines faster than validation error due to the model's increasing fit to the specific training samples, highlighting the gap between memorization and true learning. These curves are plotted separately but often overlaid for comparison, using the same axes to reveal discrepancies.

Construction and Visualization

Plotting Against Training Data Size

To construct a learning curve by plotting against training data size, the standard procedure involves incrementally subsampling the available training dataset to create subsets of increasing size, typically from a small fraction (e.g., 10%) up to the full dataset (100%). For each subset of size n, a model is trained, and both training error and validation error are computed; these errors are then averaged across multiple runs, often using k-fold cross-validation to ensure robustness against specific data splits. This approach allows visualization of how model performance evolves as more data is incorporated, with the x-axis representing n (often on a logarithmic scale) and the y-axis showing error metrics such as mean squared error (MSE).

Formally, for a hypothesis class \mathcal{H} and loss function L, the training error is defined as the empirical risk R_{\text{emp}}(h; S_n) = \frac{1}{n} \sum_{i=1}^n L(y_i, h(x_i)), where S_n = \{(x_i, y_i)\}_{i=1}^n is the training sample drawn i.i.d. from the data distribution P, and h \in \mathcal{H} is the learned hypothesis. The true risk, or expected risk, is R(h) = \mathbb{E}_{(x,y) \sim P}[L(y, h(x))], which the validation error approximates by evaluating on held-out data. The learning curve then traces the expected risk \bar{R}_n(A) = \mathbb{E}_{S_n \sim P^n}[R(A(S_n))] or the problem-averaged version \bar{R}_{PA,n}(A) = \mathbb{E}_P[\bar{R}_n(A)] as a function of n, where A denotes the learning algorithm.

In practice, the learning curve is often modeled parametrically in the power-law form R(n) \approx a n^{-b} + c, where a > 0 captures the initial decay magnitude, b > 0 is the decay exponent (steeper for faster convergence), and c \geq 0 is the irreducible error (Bayes risk).
This form arises from generalization bounds tied to the VC dimension d of \mathcal{H}; a sketch of the derivation follows from the fundamental theorem of statistical learning, which bounds the deviation |R(h) - R_{\text{emp}}(h)| \leq \sqrt{\frac{2}{n} \left( d \ln \frac{2n}{d} + \ln \frac{4}{\delta} \right)} with probability at least 1 - \delta over the draw of S_n.
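Fitting the power-law form R(n) ≈ a·n⁻ᵇ + c to measured errors can be sketched with `scipy.optimize.curve_fit`; the synthetic data points and parameter values below are illustrative, not measurements from any real model:

```python
# Sketch: fit R(n) = a * n**(-b) + c to an observed learning curve and
# extrapolate it to a larger training-set size.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

# Synthetic "measured" errors at increasing training-set sizes
# (generated noiselessly from known parameters for illustration).
sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
errors = power_law(sizes, a=2.0, b=0.5, c=0.1)

(a_hat, b_hat, c_hat), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.0))

# Extrapolate: predicted error with 10x more data than the largest subset.
predicted = power_law(16000.0, a_hat, b_hat, c_hat)
```

Here `c_hat` estimates the irreducible error the curve plateaus at, and `b_hat` the decay exponent; on noisy real curves the fit is less clean and confidence intervals from the covariance output become important.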