Online machine learning
from Wikipedia

In computer science, online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique in areas of machine learning where it is computationally infeasible to train over the entire dataset, necessitating out-of-core algorithms. It is also used in situations where the algorithm must dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., price prediction in international financial markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

Online machine learning algorithms find applications in a wide variety of fields, such as sponsored search to maximize ad revenue, portfolio optimization, shortest-path prediction (with stochastic weights, e.g., road traffic for a maps application), spam filtering, real-time fraud detection, and dynamic pricing for e-commerce. There is also growing interest in the use of online learning paradigms for LLMs to enable continuous, real-time adaptation after initial training.[1]

Introduction

In the setting of supervised learning, a function $f : X \to Y$ is to be learned, where $X$ is thought of as a space of inputs and $Y$ as a space of outputs, that predicts well on instances drawn from a joint probability distribution $p(x, y)$ on $X \times Y$. In reality, the learner never knows the true distribution $p(x, y)$ over instances. Instead, the learner usually has access to a training set of examples $(x_1, y_1), \ldots, (x_n, y_n)$. In this setting, the loss function is given as $V : Y \times Y \to \mathbb{R}$, such that $V(f(x), y)$ measures the difference between the predicted value $f(x)$ and the true value $y$. The ideal goal is to select a function $f \in \mathcal{H}$, where $\mathcal{H}$ is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.

Statistical view of online learning

In statistical learning models, the training samples $(x_i, y_i)$ are assumed to have been drawn from the true distribution $p(x, y)$ and the objective is to minimize the expected "risk" $$I[f] = \mathbb{E}[V(f(x), y)] = \int V(f(x), y)\, dp(x, y).$$ A common paradigm in this situation is to estimate a function $\hat{f}$ through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input $(x_{t+1}, y_{t+1})$, the current best predictor $f_t$ and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used, where $f_{t+1}$ is permitted to depend on $f_t$ and all previous data points $x_1, \ldots, x_t$. In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point, as compared to batch learning techniques.

A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of $b \ge 1$ data points at a time. This can be considered as pseudo-online learning for $b$ much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.
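
As a concrete sketch, the following snippet implements mini-batch SGD for the linear least squares problem treated below; the batch size, learning rate, and epoch count are illustrative choices, not values fixed by the text.

```python
import numpy as np

def minibatch_sgd(X, y, b=32, lr=0.01, epochs=5):
    """Mini-batch SGD for linear least squares (illustrative sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):                # repeated passes over the training data
        idx = np.random.permutation(n)
        for start in range(0, n, b):       # process b points at a time
            batch = idx[start:start + b]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient of the mean squared loss
            w -= lr * grad
    return w
```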

Example: linear least squares

The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions.

Batch learning

Consider the setting of supervised learning with $f$ being a linear function to be learned: $$f(x_j) = \langle w, x_j \rangle,$$ where $x_j \in \mathbb{R}^d$ is a vector of inputs (data points) and $w \in \mathbb{R}^d$ is a linear filter vector. The goal is to compute the filter vector $w$. To this end, a square loss function $$V(f(x_j), y_j) = (f(x_j) - y_j)^2 = (\langle w, x_j \rangle - y_j)^2$$ is used to compute the vector $w$ that minimizes the empirical loss $$I_n[w] = \sum_{j=1}^{n} V(\langle w, x_j \rangle, y_j) = \sum_{j=1}^{n} (x_j^T w - y_j)^2,$$ where $y_j \in \mathbb{R}$.

Let $X$ be the $i \times d$ data matrix and $y \in \mathbb{R}^i$ the column vector of target values after the arrival of the first $i$ data points. Assuming that the covariance matrix $\Sigma_i = X^T X$ is invertible (otherwise it is preferable to proceed in a similar fashion with Tikhonov regularization), the best solution $f^*(x) = \langle w^*, x \rangle$ to the linear least squares problem is given by $$w^* = (X^T X)^{-1} X^T y = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j.$$

Now, calculating the covariance matrix $\Sigma_i = \sum_{j=1}^{i} x_j x_j^T$ takes time $O(id^2)$, inverting the $d \times d$ matrix takes time $O(d^3)$, while the rest of the multiplication takes time $O(d^2)$, giving a total time of $O(id^2 + d^3)$. When there are $n$ total points in the dataset, to recompute the solution after the arrival of every datapoint $i = 1, \ldots, n$, the naive approach will have a total complexity $O(n^2 d^2 + n d^3)$. Note that when storing the matrix $\Sigma_i$, updating it at each step needs only adding $x_{i+1} x_{i+1}^T$, which takes $O(d^2)$ time, reducing the total time to $O(nd^2 + nd^3) = O(nd^3)$, but with an additional storage space of $O(d^2)$ to store $\Sigma_i$.[2]
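
To make the accounting above concrete, here is a minimal sketch of the accelerated naive approach: $\Sigma_i$ is maintained with an $O(d^2)$ rank-one update and the solution is recomputed after each arrival (a tiny ridge term is added in this sketch so the early solves are well-posed).

```python
import numpy as np

def refit_per_point(points, targets):
    """Recompute the least squares solution after each arrival, keeping a
    running Sigma = X^T X so each update costs O(d^2) plus an O(d^3) solve."""
    n, d = points.shape
    Sigma = 1e-8 * np.eye(d)      # tiny ridge so Sigma is invertible early on (sketch only)
    xy = np.zeros(d)              # running sum of x_j * y_j
    solutions = []
    for x, y in zip(points, targets):
        Sigma += np.outer(x, x)   # O(d^2) rank-one update instead of a full recompute
        xy += y * x
        solutions.append(np.linalg.solve(Sigma, xy))  # O(d^3) solve per step
    return solutions
```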

Online learning: recursive least squares

The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising $w_0 = 0 \in \mathbb{R}^d$ and $\Gamma_0 = I \in \mathbb{R}^{d \times d}$, the solution of the linear least squares problem given in the previous section can be computed by the following iteration: $$\Gamma_i = \Gamma_{i-1} - \frac{\Gamma_{i-1} x_i x_i^T \Gamma_{i-1}}{1 + x_i^T \Gamma_{i-1} x_i},$$ $$w_i = w_{i-1} - \Gamma_i x_i (x_i^T w_{i-1} - y_i).$$ The above iteration algorithm can be proved using induction on $i$.[3] The proof also shows that $\Gamma_i = \Sigma_i^{-1}$. One can also look at RLS in the context of adaptive filters (see RLS).

The complexity for $n$ steps of this algorithm is $O(nd^2)$, which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirement at every step $i$ is that of storing the matrix $\Gamma_i$, which is constant at $O(d^2)$. For the case when $\Sigma_i$ is not invertible, consider the regularised version of the problem with loss function $\sum_{j=1}^{n} (x_j^T w - y_j)^2 + \lambda \|w\|_2^2$. Then, it is easy to show that the same algorithm works with $\Gamma_0 = (\lambda I)^{-1}$, and the iterations proceed to give $\Gamma_i = (\Sigma_i + \lambda I)^{-1}$.[2]
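
A minimal sketch of the RLS iteration above; passing $\lambda > 0$ uses the regularised initialisation $\Gamma_0 = (\lambda I)^{-1}$, while $\lambda = 0$ corresponds to $\Gamma_0 = I$.

```python
import numpy as np

def recursive_least_squares(points, targets, lam=0.0):
    """RLS: O(d^2) time and storage per step, matching the iteration above."""
    d = points.shape[1]
    w = np.zeros(d)
    Gamma = np.eye(d) if lam == 0.0 else np.eye(d) / lam
    for x, y in zip(points, targets):
        Gx = Gamma @ x
        Gamma -= np.outer(Gx, Gx) / (1.0 + x @ Gx)  # rank-one update of Gamma_i
        w -= Gamma @ x * (x @ w - y)                # w_i = w_{i-1} - Gamma_i x_i (x_i^T w_{i-1} - y_i)
    return w
```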

Stochastic gradient descent

When the update $$w_i = w_{i-1} - \Gamma_i x_i (x_i^T w_{i-1} - y_i)$$ is replaced by $$w_i = w_{i-1} - \gamma_i x_i (x_i^T w_{i-1} - y_i) = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_i \rangle, y_i),$$ i.e. the matrix $\Gamma_i \in \mathbb{R}^{d \times d}$ is replaced by a scalar $\gamma_i \in \mathbb{R}$, this becomes the stochastic gradient descent algorithm. In this case, the complexity for $n$ steps of this algorithm reduces to $O(nd)$. The storage requirement at every step $i$ is constant at $O(d)$.

However, the stepsize $\gamma_i$ needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size $\gamma_i \approx \frac{1}{\sqrt{i}}$, one can prove the convergence of the average iterate $\overline{w}_n = \frac{1}{n} \sum_{i=1}^{n} w_i$. This setting is a special case of stochastic optimization, a well-known problem in optimization.[2]
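
A sketch of this one-pass SGD with the decaying step size $\gamma_i \approx 1/\sqrt{i}$, returning the averaged iterate for which the convergence statement applies:

```python
import numpy as np

def sgd_least_squares(points, targets):
    """One-pass SGD for least squares: O(d) time and storage per step."""
    d = points.shape[1]
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for i, (x, y) in enumerate(zip(points, targets), start=1):
        gamma = 1.0 / np.sqrt(i)        # decaying step size
        w -= gamma * x * (x @ w - y)    # gradient step on the new point only
        w_avg += (w - w_avg) / i        # running average of the iterates
    return w_avg
```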

Incremental stochastic gradient descent

In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called the incremental gradient method and corresponds to the iteration $$w_i = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_{t_i} \rangle, y_{t_i}).$$ The main difference with the stochastic gradient method is that here a sequence $t_i$ is chosen to decide which training point is visited in the $i$-th step. Such a sequence can be stochastic or deterministic. The number of iterations is thus decoupled from the number of points (each point can be considered more than once). The incremental gradient method can be shown to provide a minimizer to the empirical risk.[4] Incremental techniques can be advantageous when considering objective functions made up of a sum of many terms, e.g. an empirical error corresponding to a very large dataset.[2]
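
A sketch of the incremental gradient method; the `shuffle` flag switches between a deterministic (cyclic) and a stochastic visiting sequence $t_i$, and the constant step size is an illustrative choice.

```python
import numpy as np

def incremental_gradient(points, targets, passes=10, gamma=0.01, shuffle=False):
    """Multiple passes over the data; iterations are decoupled from n."""
    n, d = points.shape
    w = np.zeros(d)
    for _ in range(passes):
        order = np.random.permutation(n) if shuffle else np.arange(n)  # the sequence t_i
        for t in order:
            x, y = points[t], targets[t]
            w -= gamma * x * (x @ w - y)  # gradient step on one term of the empirical sum
    return w
```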

Kernel methods

Kernels can be used to extend the above algorithms to non-parametric models (or models where the parameters form an infinite-dimensional space). The corresponding procedure will no longer be truly online and instead involves storing all the data points, but is still faster than the brute force method. This discussion is restricted to the case of the square loss, though it can be extended to any convex loss. It can be shown by an easy induction[2] that if $X_i$ is the data matrix and $w_i$ is the output after $i$ steps of the SGD algorithm, then $$w_i = X_i^T c_i,$$ where $c_i = ((c_i)_1, (c_i)_2, \ldots, (c_i)_i) \in \mathbb{R}^i$ and the sequence $c_i$ satisfies the recursion: $$c_0 = 0,$$ $$(c_i)_j = (c_{i-1})_j, \quad j = 1, \ldots, i-1,$$ $$(c_i)_i = \gamma_i \Big( y_i - \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j, x_i \rangle \Big).$$ Notice that here $\langle x_j, x_i \rangle$ is just the standard kernel on $\mathbb{R}^d$, and the predictor is of the form $$f_i(x) = \langle w_{i-1}, x \rangle = \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j, x \rangle.$$


Now, if a general kernel $K$ is introduced instead and the predictor is $$f_i(x) = \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j, x),$$ then the same proof will also show that the predictor minimising the least squares loss is obtained by changing the above recursion to $$(c_i)_i = \gamma_i \Big( y_i - \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j, x_i) \Big).$$ The above expression requires storing all the data for updating $c_i$. The total time complexity for the recursion when evaluating for the $n$-th datapoint is $O(n^2 d k)$, where $k$ is the cost of evaluating the kernel on a single pair of points.[2] Thus, the use of the kernel has allowed the movement from a finite-dimensional parameter space to a possibly infinite-dimensional feature space represented by a kernel, by instead performing the recursion on the space of parameters $c_i \in \mathbb{R}^i$, whose dimension is the same as the size of the training dataset. In general, this is a consequence of the representer theorem.[2]
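
A sketch of the kernelised recursion: each new point appends one coefficient $(c_i)_i$ while earlier coefficients are left untouched, and all points must be stored. The Gaussian kernel and constant step size are illustrative choices.

```python
import numpy as np

def kernel_online_least_squares(points, targets, kernel, gamma=0.5):
    """Online least squares in coefficient space: stores every point seen."""
    coeffs, stored = [], []
    for x, y in zip(points, targets):
        # prediction of the current model on the incoming point
        pred = sum(c * kernel(xj, x) for c, xj in zip(coeffs, stored))
        coeffs.append(gamma * (y - pred))  # new coefficient (c_i)_i; older ones unchanged
        stored.append(x)
    return coeffs, stored

# illustrative kernel choice: Gaussian (RBF)
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
```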

Online convex optimization

Online convex optimization (OCO)[5] is a general framework for decision making that leverages convex optimization to allow for efficient algorithms. The framework is that of repeated game playing as follows:

For $t = 1, 2, \ldots, T$:

  • Learner receives input $x_t$
  • Learner outputs $w_t$ from a fixed convex set $S$
  • Nature sends back a convex loss function $v_t : S \to \mathbb{R}$.
  • Learner suffers loss $v_t(w_t)$ and updates its model

The goal is to minimize regret, or the difference between the cumulative loss and the loss of the best fixed point in hindsight. As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set $S = \mathbb{R}^d$, and nature sends back the convex loss function $v_t(w) = (\langle w, x_t \rangle - y_t)^2$. Note here that $y_t$ is implicitly sent with $v_t$.
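
The repeated game can be written as a short protocol loop; `learner` and `nature` are hypothetical objects standing in for the two players, with the methods below assumed purely for illustration.

```python
def oco_protocol(learner, nature, T):
    """Schematic OCO loop: predict, receive a convex loss, suffer it, update."""
    total_loss = 0.0
    for t in range(T):
        w_t = learner.predict()        # learner outputs a point from the convex set S
        v_t = nature.loss_function(t)  # nature reveals a convex loss function v_t
        total_loss += v_t(w_t)         # learner suffers loss v_t(w_t)
        learner.update(v_t)            # ... and updates its model
    return total_loss
```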

Some online prediction problems, however, cannot fit within the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are used: randomisation and surrogate loss functions.[citation needed]

Some simple online convex optimisation algorithms are:

Follow the leader (FTL)

The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the Leader, and round $t$ is simply given by: $$w_t = \operatorname*{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w).$$ This method can thus be viewed as a greedy algorithm. For the case of online quadratic optimization (where the loss function is $v_t(w) = \|w - x_t\|_2^2$), one can show a regret bound that grows as $\log(T)$. However, similar bounds cannot be obtained for the FTL algorithm for other important families of models like online linear optimization. To do so, one modifies FTL by adding regularisation.
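
For the quadratic losses $v_t(w) = \|w - x_t\|_2^2$, the FTL argmin over past rounds is simply the mean of the points seen so far, so the algorithm reduces to a running average, as in this sketch:

```python
import numpy as np

def follow_the_leader_quadratic(points):
    """FTL for online quadratic optimization: play the mean of past points."""
    w = np.zeros_like(points[0], dtype=float)
    played = []
    for t, x in enumerate(points, start=1):
        played.append(w.copy())  # play the current leader before seeing x_t
        w += (x - w) / t         # running mean = argmin of the past quadratic losses
    return played
```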

Follow the regularised leader (FTRL)

This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. A regularisation function $R : S \to \mathbb{R}$ is chosen and learning is performed in round $t$ as follows: $$w_t = \operatorname*{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w) + R(w).$$ As a special example, consider the case of online linear optimisation, i.e. where nature sends back loss functions of the form $v_t(w) = \langle w, z_t \rangle$. Also, let $S = \mathbb{R}^d$. Suppose the regularisation function $R(w) = \frac{1}{2\eta} \|w\|_2^2$ is chosen for some positive number $\eta$. Then, one can show that the regret-minimising iteration becomes $$w_{t+1} = -\eta \sum_{i=1}^{t} z_i = w_t - \eta z_t.$$ Note that this can be rewritten as $w_{t+1} = w_t - \eta \nabla v_t(w_t)$, which looks exactly like online gradient descent.

If $S$ is instead some convex subspace of $\mathbb{R}^d$, $S$ would need to be projected onto, leading to the modified update rule $$w_{t+1} = \Pi_S \Big( -\eta \sum_{i=1}^{t} z_i \Big).$$ This algorithm is known as lazy projection, as the vector $\sum_{i=1}^{t} z_i$ accumulates the gradients. It is also known as Nesterov's dual averaging algorithm. In this scenario of linear loss functions and quadratic regularisation, the regret is bounded by $O(\sqrt{T})$, and thus the average regret goes to 0, as desired.
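
A sketch of lazy projection for linear losses: the gradients $z_t$ are accumulated and only the scaled sum is projected onto $S$. The unit-ball projection is an illustrative choice of $S$.

```python
import numpy as np

def project_unit_ball(w):
    """Euclidean projection onto the unit ball (illustrative choice of S)."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def lazy_projection(gradients, eta, project=project_unit_ball):
    """Nesterov's dual averaging: w_{t+1} = project(-eta * sum of gradients)."""
    z_sum = np.zeros_like(gradients[0], dtype=float)
    iterates = []
    for z in gradients:
        z_sum += z                             # the vector that accumulates the gradients
        iterates.append(project(-eta * z_sum))
    return iterates
```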

Online subgradient descent (OSD)

The above proved a regret bound for linear loss functions $v_t(w) = \langle w, z_t \rangle$. To generalise the algorithm to any convex loss function, the subgradient $\partial v_t(w_t)$ of $v_t$ is used as a linear approximation to $v_t$ near $w_t$, leading to the online subgradient descent algorithm:

Initialise parameter $\eta$, $w_1 = 0$

For $t = 1, 2, \ldots, T$:

  • Predict using $w_t$, receive $v_t$ from nature.
  • Choose $z_t \in \partial v_t(w_t)$
  • If $S = \mathbb{R}^d$, update as $w_{t+1} = w_t - \eta z_t$
  • If $S \subset \mathbb{R}^d$, project cumulative gradients onto $S$, i.e. $\theta_{t+1} = \theta_t + z_t$, $w_{t+1} = \Pi_S(-\eta\, \theta_{t+1})$

One can use the OSD algorithm to derive regret bounds for the online version of SVMs for classification, which use the hinge loss $$v_t(w) = \max\{0, 1 - y_t \langle w, x_t \rangle\}.$$
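
A sketch of OSD with the hinge loss, i.e. an online SVM in the unconstrained case $S = \mathbb{R}^d$; the fixed step size is an illustrative choice.

```python
import numpy as np

def online_svm(stream, d, eta=0.1):
    """OSD with hinge loss; `stream` yields (x_t, y_t) with y_t in {-1, +1}."""
    w = np.zeros(d)
    for x, y in stream:
        if y * (w @ x) < 1:      # margin violated: a subgradient is z_t = -y_t x_t
            w -= eta * (-y) * x  # w_{t+1} = w_t - eta z_t
        # otherwise 0 is a subgradient and w is left unchanged
    return w
```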

Other algorithms

Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. The optimal regularization in hindsight can be derived for linear loss functions; this leads to the AdaGrad algorithm. For the Euclidean regularisation, one can show a regret bound of $O(\sqrt{T})$, which can be improved further to $O(\log T)$ for strongly convex and exp-concave loss functions.
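
A sketch of diagonal AdaGrad in the unconstrained linear-loss case, where the gradient at round $t$ is just $z_t$; per-coordinate step sizes shrink with the accumulated squared gradients.

```python
import numpy as np

def adagrad(gradients, eta=0.1, eps=1e-8):
    """Diagonal AdaGrad for linear losses (illustrative, unconstrained)."""
    w = np.zeros_like(gradients[0], dtype=float)
    g_sq = np.zeros_like(w)                   # running sum of squared gradients
    for g in gradients:
        g_sq += g ** 2
        w -= eta * g / (np.sqrt(g_sq) + eps)  # smaller steps on frequently-moving coordinates
    return w
```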

Continual learning

Continual learning means constantly improving the learned model by processing continuous streams of information.[6] Continual learning capabilities are essential for software systems and autonomous agents interacting in an ever-changing real world. However, continual learning is a challenge for machine learning and neural network models, since the continual acquisition of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting.

Interpretations of online learning

The paradigm of online learning has different interpretations depending on the choice of the learning model, each of which has distinct implications about the predictive quality of the sequence of functions $f_1, f_2, \ldots, f_n$. The prototypical stochastic gradient descent algorithm is used for this discussion. As noted above, its recursion is given by $$w_t = w_{t-1} - \gamma_t \nabla V(\langle w_{t-1}, x_t \rangle, y_t).$$

The first interpretation considers the stochastic gradient descent method as applied to the problem of minimizing the expected risk $I[w]$ defined above.[7] Indeed, in the case of an infinite stream of data, since the examples $(x_1, y_1), (x_2, y_2), \ldots$ are assumed to be drawn i.i.d. from the distribution $p(x, y)$, the sequence of gradients of $V(\cdot, \cdot)$ in the above iteration is an i.i.d. sample of stochastic estimates of the gradient of the expected risk $I[w]$, and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation $I[w_t] - I[w^*]$, where $w^*$ is the minimizer of $I[w]$.[8] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, complexity results can still be obtained in special cases.

The second interpretation applies to the case of a finite training set and considers the SGD algorithm as an instance of the incremental gradient descent method.[4] In this case, one instead looks at the empirical risk: $$I_n[w] = \frac{1}{n} \sum_{i=1}^{n} V(\langle w, x_i \rangle, y_i).$$ Since the gradients of $V(\cdot, \cdot)$ in the incremental gradient descent iterations are also stochastic estimates of the gradient of $I_n[w]$, this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations $I_n[w_t] - I_n[w_n^*]$, where $w_n^*$ is the minimizer of $I_n[w]$.

from Grokipedia
Online machine learning, also known as online learning, is a subfield of machine learning that enables algorithms to learn incrementally from a continuous stream of data points arriving sequentially in real time, updating the model after each observation to make predictions or decisions without revisiting previous data or requiring a complete dataset upfront. This approach contrasts with traditional batch learning, where models are trained offline on a fixed, static dataset all at once, making online methods particularly suited for dynamic environments with non-stationary data distributions. Unlike batch methods, online learning emphasizes efficiency, adaptability, and low computational overhead, as it processes data one instance at a time and incorporates feedback immediately to refine performance. Historically, online learning traces back to early algorithms like the perceptron in the 1950s and gained prominence in the 2000s with advancements in online convex optimization.

A core aspect of online machine learning is its handling of feedback mechanisms, which can be full (supervised, where true labels are revealed after each prediction), partial (bandit feedback, revealing only the outcome for the chosen action), or none (unsupervised, focusing on pattern discovery without labels). Key challenges include managing concept drift, where the underlying data distribution changes over time, potentially degrading model accuracy, and avoiding catastrophic forgetting while adapting to new information. To address these, online algorithms often employ regret minimization frameworks, measuring cumulative loss relative to the best fixed comparator in hindsight, ensuring sublinear regret bounds for long-term performance guarantees.

Prominent algorithms in online machine learning include stochastic gradient descent variants for convex optimization, such as the perceptron algorithm for binary classification and follow-the-regularized-leader (FTRL) for generalized linear models, which achieve efficient updates with provable convergence rates. More advanced methods integrate kernel tricks for non-linear problems or incorporate meta-learning to accelerate adaptation across tasks, as seen in online meta-learning approaches like model-agnostic meta-learning (MAML) adaptations for sequential episodes.

Applications span diverse domains, including real-time fraud detection in healthcare and finance, where models must evolve with emerging patterns (e.g., detecting an estimated $130–430 billion in annual U.S. healthcare fraud, based on 3–10% of total expenditures); network intrusion detection; spam filtering; and adaptive robotics for environment-specific tasks like manipulation. Developments since 2020 emphasize integration with deep learning and continual learning paradigms to handle large-scale, non-stationary streams in areas like large language model fine-tuning and reinforcement learning.

Introduction

Definition and Core Principles

Online machine learning is a paradigm in which models are trained incrementally on a stream of data arriving sequentially, updating predictions or decisions based on each new instance without requiring access to the full dataset in advance. This approach enables the learner to process data one instance at a time, incorporating immediate feedback (such as true labels in supervised settings) to refine the hypothesis continuously. Unlike traditional batch learning, which processes a fixed dataset holistically, online machine learning emphasizes real-time adaptation to evolving information.

At its core, online machine learning operates on principles of sequential decision-making, where the model generates a prediction for each incoming instance before receiving feedback and adjusting accordingly to minimize cumulative error over time. This involves immediate feedback loops that allow the system to learn from discrepancies between predictions and outcomes, fostering efficiency in resource-constrained environments like streaming data applications. Additionally, the paradigm prioritizes adaptability to non-stationary data, where underlying distributions may shift, enabling models to evolve without retraining from scratch. These principles underpin the framework's suitability for dynamic scenarios, such as real-time recommendation systems or sensor networks.

Key characteristics of online machine learning include single-pass learning, where each data instance is observed only once, and real-time model updates that support scalability for unbounded streams. It also inherently addresses concept drift (the gradual or abrupt changes in data relationships) at a high level by facilitating continual refinement rather than static models.

For instance, a simple online averaging algorithm for mean prediction updates the estimate $\hat{\mu}_t$ sequentially as follows: $$\hat{\mu}_t = \hat{\mu}_{t-1} + \frac{1}{t}\,(x_t - \hat{\mu}_{t-1}),$$ where $x_t$ is the new observation at time $t$, and the learning rate $\eta = 1/t$ ensures convergence to the true mean under stationary conditions. This basic update exemplifies how online methods maintain a lightweight, adaptive predictor with minimal storage.
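
The averaging update above fits in a few lines; a minimal sketch:

```python
def online_mean(stream):
    """Online averaging: mu_t = mu_{t-1} + (x_t - mu_{t-1}) / t."""
    mu, t = 0.0, 0
    for x in stream:
        t += 1
        mu += (x - mu) / t  # learning rate 1/t; converges to the true mean if stationary
    return mu
```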

Historical Development and Motivations

The roots of online machine learning can be traced to the 1960s in the field of adaptive signal processing, where researchers developed algorithms to enable systems to learn incrementally from sequential data without requiring full retraining. A seminal contribution was the least mean squares (LMS) algorithm introduced by Bernard Widrow and Marcian E. Hoff in 1960, which provided a practical method for adjusting model parameters in real-time based on error feedback, laying the groundwork for adaptive filtering and early forms of online adaptation. This work emerged from efforts to build self-adjusting systems, such as the ADALINE neural network, and marked the initial shift toward learning paradigms that could handle dynamic inputs efficiently.

The formalization of online machine learning accelerated in the late 1990s and early 2000s through advancements in theoretical frameworks, particularly in online convex optimization. A key milestone was Martin Zinkevich's 2003 paper, which introduced online gradient descent as a method for solving sequential convex optimization problems with sublinear regret bounds, enabling robust performance in adversarial settings. Building on this, Nicolo Cesa-Bianchi and Gabor Lugosi's 2006 book Prediction, Learning, and Games synthesized the field by exploring online prediction strategies, expert advice aggregation, and game-theoretic perspectives, providing a comprehensive theoretical foundation that influenced subsequent algorithm design. These developments established online learning as a distinct subfield, emphasizing regret minimization and no-regret learning over traditional batch approaches.

The rapid growth of online machine learning in the 2000s was propelled by the explosion of big data and the need to process streaming information in real-time, motivating its adoption in scenarios where batch retraining was infeasible due to computational constraints. Key drivers included resource-limited environments, such as embedded systems, and applications requiring immediate responsiveness, like recommendation engines that update user preferences on-the-fly to deliver personalized suggestions. Additionally, the ability to manage infinite or non-stationary data streams, common in sensor networks and financial modeling, highlighted online methods' advantages in scalability and adaptability, fostering their integration into practical systems. This evolution also intersected with continual learning challenges, where models must accumulate knowledge over time without forgetting prior adaptations.

Background and Foundations

Batch Learning versus Online Learning

Batch learning, also known as offline learning, involves training a machine learning model on the entire available dataset at once, enabling global optimization of the model's parameters. In this paradigm, the full dataset is accessible upfront, allowing algorithms to compute solutions that minimize a loss function over all data points simultaneously. For instance, ordinary least squares (OLS) regression exemplifies batch learning through its closed-form solution, $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, where $\mathbf{X}$ is the full design matrix and $\mathbf{y}$ the response vector, requiring inversion of the Gram matrix derived from the complete dataset. This approach suits static environments where data is fixed and retraining occurs periodically after accumulating new samples.

In contrast, online learning processes data incrementally as it arrives in a stream, updating the model sequentially without storing or revisiting the full dataset. Each new instance triggers an immediate parameter adjustment, making it ideal for scenarios with continuous data flows, such as real-time sensor inputs or web traffic monitoring. Unlike batch methods, online learning avoids the need for large-scale matrix operations, relying instead on lightweight, per-instance computations to maintain efficiency.

The paradigms differ fundamentally in data handling, computational demands, and adaptability to changing conditions. Batch learning demands full dataset access for holistic optimization, often leading to higher computational costs from operations like matrix inversions, but it yields precise models by considering global patterns. Online learning, however, excels in computational efficiency by sidestepping such inversions through incremental updates, though it may introduce minor accuracy trade-offs due to sequential processing. In terms of hardware requirements, batch learning often relies on GPUs for their parallel processing capabilities to handle large-scale matrix operations efficiently, whereas online learning's incremental nature requires lower parallelization, making it suitable for CPUs and NPUs on edge devices, which improves overall compute economics without fully replacing GPUs. Regarding adaptability, batch methods struggle with evolving data distributions, requiring full retraining to incorporate changes, whereas online approaches handle non-stationary streams dynamically, responding instantly to new information. These differences make batch learning preferable for offline analysis on bounded datasets, while online learning is suited to resource-constrained, streaming applications.

Trade-offs between the two include batch learning's tendency for greater accuracy at the expense of slower updates and higher resource use, versus online learning's speed and low memory footprint, potentially offset by cumulative errors in long sequences if updates are not robust. The following table summarizes key pros and cons:
| Aspect | Batch Learning | Online Learning |
| --- | --- | --- |
| Memory usage | High (requires full dataset storage) | Low (processes one instance at a time) |
| Update speed | Slow (full retraining needed) | Fast (incremental updates) |
| Error accumulation | Minimal (global optimization) | Possible (sequential updates may propagate errors) |
These contrasts lay the groundwork for statistical frameworks that formalize online learning's guarantees under sequential data assumptions.
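
The contrast can be made concrete with two short sketches: a closed-form batch OLS fit against a lightweight per-instance update (here an LMS-style rule, an illustrative choice).

```python
import numpy as np

def ols_batch(X, y):
    """Batch: one global solve over the full dataset (assumes X^T X invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def lms_online(stream, d, eta=0.01):
    """Online: O(d) per-instance updates, no dataset storage or matrix inversion."""
    w = np.zeros(d)
    for x, y in stream:
        w += eta * (y - w @ x) * x  # nudge weights toward the new observation
    return w
```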

Statistical Framework for Online Learning

Online learning is framed statistically as a sequential process in which a learner makes predictions over a series of rounds $t = 1, \dots, T$, aiming to minimize the cumulative expected loss incurred against an unknown sequence of outcomes. In each round, the learner selects a decision $w_t$ from a decision set $S$, observes an example $\psi_t$, and suffers a loss $\ell(w_t, \psi_t)$ based on a predefined loss function $\ell$. The expected loss for a fixed decision $w$ is defined as $C(w) = \mathbb{E}_{\psi \sim Q}[\ell(w, \psi)]$, where $Q$ is the (possibly adversarial or stochastic) distribution over examples; the learner's objective is to ensure that the average expected loss $\frac{1}{T} \sum_{t=1}^T C(w_t)$ approaches the minimum possible expected loss $\min_{u \in S} C(u)$.

A central metric in this framework is regret, which quantifies the learner's suboptimality relative to the best fixed decision in hindsight: $$\text{Regret}_T(S) = \sum_{t=1}^T \ell(w_t, \psi_t) - \min_{u \in S} \sum_{t=1}^T \ell(u, \psi_t).$$ Sublinear regret growth in $T$ implies that the learner's average loss converges to that of the optimal decision, providing a performance guarantee even without prior knowledge of the data distribution. This setup contrasts with batch learning, where least squares minimization is performed over a complete dataset to estimate parameters globally.

Loss functions in online learning are typically convex to ensure tractable optimization and theoretical guarantees. For regression problems, the squared loss $L(y, \hat{y}) = (y - \hat{y})^2$ measures the deviation between the true outcome $y$ and prediction $\hat{y}$, facilitating analysis under mean-squared error criteria. In classification settings, the hinge loss $L(y, \hat{y}) = \max(0, 1 - y\hat{y})$, where $y \in \{-1, 1\}$, promotes margin-based separation and is widely used in online variants of support vector machines.

From a probabilistic viewpoint, online learning aligns with Bayesian inference, where parameters $\theta$ (e.g., model weights) are updated sequentially to approximate the posterior distribution given streaming data $D_t = \{(x_1, y_1), \dots, (x_t, y_t)\}$. The update follows $p(\theta \mid D_t) \propto p(y_t \mid x_t, \theta)\, p(\theta \mid D_{t-1})$, enabling incremental methods to approximate the posterior mode or distribution without full recomputation, which is particularly useful for handling non-stationary environments.

A canonical example is online linear regression, where data arrives as sequential pairs $(\mathbf{x}_t, y_t) \in \mathbb{R}^d \times \mathbb{R}$, and the learner maintains a parameter vector $\mathbf{w}$ to predict $\hat{y}_t = \mathbf{w}^T \mathbf{x}_t$. The online risk minimization problem seeks to reduce the cumulative expected squared loss $\sum_{t=1}^T \mathbb{E}[(y_t - \mathbf{w}^T \mathbf{x}_t)^2]$ by updating $\mathbf{w}$ after each observation, adapting to potential shifts in the underlying data distribution.
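
Empirical regret against a finite pool of comparators can be computed directly; in this sketch the pool `candidates` stands in for the minimum over the full decision set $S$.

```python
def regret(decisions, examples, loss, candidates):
    """Cumulative loss of the played decisions minus that of the best
    fixed candidate in hindsight."""
    learner_loss = sum(loss(w, psi) for w, psi in zip(decisions, examples))
    best_fixed = min(sum(loss(u, psi) for psi in examples) for u in candidates)
    return learner_loss - best_fixed
```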

Fundamental Algorithms

Stochastic Gradient Descent

Stochastic gradient descent (SGD) serves as a cornerstone algorithm in online machine learning, enabling iterative parameter updates based on individual data points as they arrive, thereby adapting to streaming data without requiring full dataset storage. Introduced as a stochastic approximation method, SGD approximates the gradient of the empirical risk by using the gradient computed on a single example at each step, which facilitates efficient processing in resource-constrained environments. The core update rule is given by $$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla L(\mathbf{w}_t; \mathbf{x}_t, y_t),$$ where $\mathbf{w}_t$ denotes the parameter vector at time $t$, $\eta_t$ is the learning rate (often diminishing, such as $\eta_t = \eta_0 / \sqrt{t}$), and $\nabla L(\mathbf{w}_t; \mathbf{x}_t, y_t)$ is the gradient of the loss on the example $(\mathbf{x}_t, y_t)$ observed at time $t$.