Sequential minimal optimization
|   |   |
|---|---|
| Class | Optimization algorithm for training support vector machines |
| Worst-case performance | O(n³) |
Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM). It was invented by John Platt in 1998 at Microsoft Research.[1] SMO is widely used for training support vector machines and is implemented by the popular LIBSVM tool.[2][3] The publication of the SMO algorithm in 1998 has generated a lot of excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party QP solvers.[4]
Optimization problem
Consider a binary classification problem with a dataset $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is an input vector and $y_i \in \{-1, +1\}$ is a binary label corresponding to it. A soft-margin support vector machine is trained by solving a quadratic programming problem, which is expressed in the dual form as follows:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j K(x_i, x_j) \alpha_i \alpha_j,$$

subject to:

$$0 \le \alpha_i \le C, \quad \text{for } i = 1, 2, \ldots, n,$$
$$\sum_{i=1}^{n} y_i \alpha_i = 0,$$

where $C$ is an SVM hyperparameter and $K(x_i, x_j)$ is the kernel function, both supplied by the user; and the variables $\alpha_i$ are Lagrange multipliers.
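As an illustrative aside (not part of the original article), the dual objective can be evaluated directly from a precomputed Gram matrix; a minimal NumPy sketch, with toy data chosen only so that the equality constraint holds:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Evaluate the SVM dual objective
    sum_i alpha_i - (1/2) sum_ij y_i y_j K_ij alpha_i alpha_j."""
    Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j K(x_i, x_j)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# Toy linear-kernel example; the alpha values are illustrative and
# chosen to satisfy the constraint sum_i y_i alpha_i = 0.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0])
K = X @ X.T                          # linear kernel Gram matrix
alpha = np.array([0.1, 0.0, 0.1])
print(dual_objective(alpha, y, K))
```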
Algorithm
SMO is an iterative algorithm for solving the optimization problem described above. SMO breaks this problem into a series of smallest possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers $\alpha_i$, the smallest possible problem involves two such multipliers. Then, for any two multipliers $\alpha_1$ and $\alpha_2$, the constraints are reduced to:

$$0 \le \alpha_1, \alpha_2 \le C,$$
$$y_1 \alpha_1 + y_2 \alpha_2 = k,$$

and this reduced problem can be solved analytically: one needs to find a minimum of a one-dimensional quadratic function. $k$ is the negative of the sum over the rest of the terms in the equality constraint, which is fixed in each iteration.
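Concretely, the one-dimensional minimum has a closed form (derived in Platt's paper and spelled out in the step-by-step procedure later in this article): with prediction errors $E_i = f(x_i) - y_i$ and curvature $\eta = K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)$,

$$\alpha_2^{\text{new}} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta},$$

after which $\alpha_2^{\text{new}}$ is clipped to the intersection of the box $[0, C]$ with the constraint line, and $\alpha_1^{\text{new}} = \alpha_1 + y_1 y_2 (\alpha_2 - \alpha_2^{\text{new}})$.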
The algorithm proceeds as follows:
- Find a Lagrange multiplier $\alpha_1$ that violates the Karush–Kuhn–Tucker (KKT) conditions for the optimization problem.
- Pick a second multiplier $\alpha_2$ and optimize the pair $(\alpha_1, \alpha_2)$.
- Repeat steps 1 and 2 until convergence.
When all the Lagrange multipliers satisfy the KKT conditions (within a user-defined tolerance), the problem has been solved. Although this algorithm is guaranteed to converge, heuristics are used to choose the pair of multipliers so as to accelerate the rate of convergence. This is critical for large data sets, since there are $n(n-1)/2$ possible choices for the pair $\alpha_1, \alpha_2$.
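Step 1 amounts to testing, per example, whether its multiplier is consistent with the KKT conditions. A minimal sketch of that test (an illustrative fragment, not Platt's reference code), assuming the current output $f(x_i)$ is available:

```python
def violates_kkt(alpha_i, y_i, f_i, C, tol=1e-3):
    """True if example i's multiplier violates the KKT conditions
    within tolerance tol, using the margin residual
    r_i = y_i * f(x_i) - 1."""
    r_i = y_i * f_i - 1.0
    if r_i < -tol and alpha_i < C:   # margin violated but alpha can still grow
        return True
    if r_i > tol and alpha_i > 0:    # margin slack but alpha can still shrink
        return True
    return False
```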
Related work
The first approach to splitting large SVM learning problems into a series of smaller optimization tasks was proposed by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik.[5] It is known as the "chunking algorithm". The algorithm starts with a random subset of the data, solves this subproblem, and iteratively adds examples which violate the optimality conditions. One disadvantage of this algorithm is that it must solve QP problems that scale with the number of support vectors. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.[1]
In 1997, E. Osuna, R. Freund, and F. Girosi proved a theorem which suggests a whole new set of QP algorithms for SVMs.[6] By virtue of this theorem, a large QP problem can be broken down into a series of smaller QP sub-problems. A sequence of QP sub-problems that always add at least one violator of the Karush–Kuhn–Tucker (KKT) conditions is guaranteed to converge. The chunking algorithm obeys the conditions of the theorem, and hence will converge.[1] The SMO algorithm can be considered a special case of the Osuna algorithm, where the size of the optimization is two and both Lagrange multipliers are replaced at every step with new multipliers that are chosen via good heuristics.[1]
The SMO algorithm is closely related to a family of optimization algorithms called Bregman methods or row-action methods. These methods solve convex programming problems with linear constraints. They are iterative methods where each step projects the current primal point onto each constraint.[1]
References
[edit]- ^ a b c d e Platt, John (1998). "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines" (PDF). CiteSeerX 10.1.1.43.4376.
- ^ Chang, Chih-Chung; Lin, Chih-Jen (2011). "LIBSVM: A library for support vector machines". ACM Transactions on Intelligent Systems and Technology. 2 (3). doi:10.1145/1961189.1961199. S2CID 961425.
- ^ Zanni, Luca (2006). "Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems" (PDF).
- ^ Rifkin, Ryan (2002). Everything Old is New Again: a Fresh Look at Historical Approaches in Machine Learning (Ph.D. Thesis). Massachusetts Institute of Technology. p. 18. hdl:1721.1/17549.
- ^ Boser, B. E.; Guyon, I. M.; Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers". Proceedings of the fifth annual workshop on Computational learning theory - COLT '92. p. 144. CiteSeerX 10.1.1.21.3818. doi:10.1145/130385.130401. ISBN 978-0897914970. S2CID 207165665.
- ^ Osuna, E.; Freund, R.; Girosi, F. (1997). "An improved training algorithm for support vector machines". Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop. pp. 276–285. CiteSeerX 10.1.1.392.7405. doi:10.1109/NNSP.1997.622408. ISBN 978-0-7803-4256-9. S2CID 5667586.
Background
Support Vector Machines
Support vector machines (SVMs) are supervised learning algorithms primarily used for binary classification tasks, where the goal is to find an optimal hyperplane that separates data points of two classes while maximizing the margin between them.[4] This maximum-margin hyperplane is defined by a weight vector $w$ and bias $b$ such that the distance from the hyperplane to the nearest data point (the margin) is maximized, leading to better generalization on unseen data.[4] The formulation assumes linearly separable data in the hard-margin case, where all training examples $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$ satisfy $y_i (w \cdot x_i + b) \ge 1$.[4]

In practice, data is often not perfectly separable, so the soft-margin SVM introduces slack variables $\xi_i \ge 0$ to allow some misclassifications or points within the margin, controlled by a regularization parameter $C$.[4] The hard-margin approach corresponds to $C \to \infty$, enforcing strict separation, while finite $C$ trades off margin maximization against classification errors, enabling handling of noisy or overlapping datasets.[4] This is formulated as the primal optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n.[4]$$

To solve this constrained optimization, the primal is converted to its dual form using Lagrange multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$, forming the Lagrangian and applying the Karush-Kuhn-Tucker (KKT) conditions.[5] After eliminating the primal variables $w$ and $b$, the dual problem emerges as a quadratic program that maximizes

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$$

subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} y_i \alpha_i = 0$, highlighting the quadratic programming structure central to SVM training.[4][5]

In the dual formulation, support vectors are the training points where $\alpha_i > 0$, as these are the only ones contributing to the decision boundary; points with $\alpha_i = 0$ lie outside the margin and do not influence $w$.[4] The classification decision function is then $f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)$, where $K$ is a kernel function (e.g., linear or radial basis) that implicitly maps data to higher dimensions for non-linear separation via the kernel trick.[5] The bias $b$ is typically computed using any support vector $x_i$ on the margin ($0 < \alpha_i < C$) as $b = y_i - \sum_{j=1}^{n} \alpha_j y_j K(x_j, x_i)$.[4]

Quadratic Programming Formulation
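As an illustration (not part of the original text), a minimal NumPy sketch of the decision function and the bias computation described above, assuming trained multipliers `alpha` are available; the RBF kernel and its `gamma` value are arbitrary choices for the example:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_function(x, X, y, alpha, b, kernel=rbf_kernel):
    """f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    s = sum(a * t * kernel(xi, x) for a, t, xi in zip(alpha, y, X))
    return np.sign(s + b)

def bias_from_support_vector(i, X, y, alpha, kernel=rbf_kernel):
    """b = y_i - sum_j alpha_j y_j K(x_j, x_i), valid for a support
    vector i on the margin (0 < alpha_i < C)."""
    return y[i] - sum(a * t * kernel(xj, X[i])
                      for a, t, xj in zip(alpha, y, X))
```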
The soft-margin support vector machine (SVM) formulation seeks to minimize the primal objective function

$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i,$$

subject to the constraints $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1, \ldots, n$, where $w$ is the weight vector, $b$ is the bias, $C$ is the regularization parameter, $x_i$ are the input features, $y_i \in \{-1, +1\}$ are the labels, and $\xi_i$ are the slack variables allowing for misclassifications.[6]

To derive the dual problem, introduce the Lagrange multipliers $\alpha_i \ge 0$ for the inequality constraints and $\mu_i \ge 0$ for $\xi_i \ge 0$. The Lagrangian is then

$$L = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \mu_i \xi_i.$$

Setting the partial derivatives with respect to the primal variables to zero yields the stationarity conditions: $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, $\sum_{i=1}^{n} \alpha_i y_i = 0$, and $C - \alpha_i - \mu_i = 0$ for each $i$. Substituting these back into the Lagrangian eliminates $w$, $b$, $\xi_i$, and $\mu_i$, resulting in the dual optimization problem of maximizing

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$$

subject to $\sum_{i=1}^{n} y_i \alpha_i = 0$ and $0 \le \alpha_i \le C$ for all $i$.[6]

This dual is a quadratic program (QP) with a quadratic objective function in the variables $\alpha_i$. The Hessian matrix of the objective is $Q$, where $Q_{ij} = y_i y_j K(x_i, x_j)$ and $K$ is the kernel function (initially the linear kernel $K(x_i, x_j) = x_i \cdot x_j$). The QP is convex because $Q$ is positive semi-definite, as kernels are defined to ensure this property, guaranteeing that any local maximum is also global. The constraints consist of box bounds $0 \le \alpha_i \le C$ and a single linear equality $\sum_{i=1}^{n} y_i \alpha_i = 0$.[6]

The kernel trick extends this formulation to non-linear SVMs by replacing the dot product $x_i \cdot x_j$ with a kernel $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$, where $\varphi$ maps inputs to a high-dimensional feature space, without explicitly computing or storing the feature map, thus enabling efficient computation of the decision function $f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)$.[6]

Solving this QP directly poses significant computational challenges for large $n$, the number of training examples. General QP solvers, such as interior-point or active-set methods, typically require $O(n^2)$ to $O(n^3)$ time due to the need to form, store, and decompose or invert the dense Hessian $Q$, leading to prohibitive memory usage (up to $O(n^2)$ space) and runtime for large datasets. For instance, active-set methods iteratively solve reduced subproblems but struggle with the dense Hessian in high dimensions, often failing to scale beyond moderate-sized problems due to caching inefficiencies and numerical instability in matrix operations.[7]

Algorithm Description
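A short sketch (illustrative, not from the cited sources) of constructing this Hessian and checking its positive semi-definiteness for the linear kernel; note that even this toy construction materializes the full $n \times n$ matrix, which is exactly the $O(n^2)$ memory cost that motivates decomposition methods like SMO:

```python
import numpy as np

def dual_hessian(X, y, kernel=lambda a, b: a @ b):
    """Build Q_ij = y_i y_j K(x_i, x_j); defaults to the linear kernel."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return (y[:, None] * y[None, :]) * K

X = np.random.randn(6, 3)
y = np.where(np.random.randn(6) > 0, 1.0, -1.0)
Q = dual_hessian(X, y)
# A valid kernel makes Q positive semi-definite: all eigenvalues are
# >= 0 up to floating-point round-off.
print(np.linalg.eigvalsh(Q).min() >= -1e-9)
```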
Core Principles
Sequential minimal optimization (SMO) addresses the scalability challenges of training support vector machines (SVMs) by decomposing the underlying quadratic programming (QP) problem into a series of smaller, analytically solvable subproblems involving only two Lagrange multipliers at a time. This approach avoids the need for general-purpose QP solvers, which are computationally intensive for large datasets, by iteratively updating pairs of multipliers $\alpha_i$ and $\alpha_j$ while exploiting the closed-form solution for two-variable QPs.[1] The algorithm was introduced by John C. Platt in 1998 to enable efficient SVM training on datasets with thousands of examples, overcoming limitations in prior methods like chunking that still required significant computational resources.[1]

Central to SMO's foundation are the Karush-Kuhn-Tucker (KKT) conditions, which define optimality in the dual formulation of the SVM QP. These conditions ensure that the solution satisfies stationarity (the gradient of the Lagrangian is zero), primal and dual feasibility (constraints like $0 \le \alpha_i \le C$ and $\sum_i y_i \alpha_i = 0$ hold), and complementarity. Specifically, for each training example $i$, the complementary slackness conditions are:

$$\alpha_i = 0 \implies y_i f(x_i) \ge 1, \qquad 0 < \alpha_i < C \implies y_i f(x_i) = 1, \qquad \alpha_i = C \implies y_i f(x_i) \le 1,$$

where $f(x_i)$ is the SVM output for example $i$, $y_i$ is the label, and $C$ is the regularization parameter.[1] SMO selects pairs of multipliers that violate these KKT conditions to drive the solution toward optimality.[1]

SMO guarantees progress toward the global optimum due to the convexity of the QP objective function; each update of a two-variable subproblem maintains feasibility and strictly improves the objective value, as ensured by Osuna's theorem, which states that optimizing any subset containing at least one KKT violator improves the objective.[1] Compared to full QP solvers, SMO achieves linear time complexity per iteration in the number of training examples $n$, requiring no additional storage beyond the kernel matrix, and demonstrates empirical speedups of 10 to 100 times or more on real-world datasets.[1]

Step-by-Step Procedure
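A vectorized sketch (illustrative only) of checking all three complementary slackness cases at once for a candidate solution, assuming arrays of multipliers `alpha` and outputs `f` are at hand:

```python
import numpy as np

def kkt_satisfied(alpha, y, f, C, tol=1e-3):
    """Check complementary slackness for every example, within tol:
    alpha_i = 0      -> y_i f(x_i) >= 1
    0 < alpha_i < C  -> y_i f(x_i) == 1
    alpha_i = C      -> y_i f(x_i) <= 1
    """
    margin = y * f                     # y_i f(x_i) for each example
    at_lower = alpha <= tol
    at_upper = alpha >= C - tol
    interior = ~at_lower & ~at_upper
    return (np.all(margin[at_lower] >= 1 - tol)
            and np.all(np.abs(margin[interior] - 1) <= tol)
            and np.all(margin[at_upper] <= 1 + tol))
```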
The Sequential Minimal Optimization (SMO) algorithm begins by initializing all Lagrange multipliers $\alpha_i = 0$ for $i = 1, \ldots, n$, where $n$ is the number of training examples, and the bias $b = 0$. The initial prediction errors are then computed as $E_i = f(x_i) - y_i = -y_i$ for all $i$, since the initial output $f$ is identically zero.[1] This setup ensures the starting point satisfies the box constraints $0 \le \alpha_i \le C$ and the equality constraint $\sum_i y_i \alpha_i = 0$, and prepares for iterative optimization of the dual quadratic programming problem.[1]

The core iterative process forms the main loop of SMO, which continues until the Karush-Kuhn-Tucker (KKT) conditions are satisfied within a tolerance $\varepsilon$ (typically $10^{-3}$) for all training examples.[1] In each iteration, two indices $i$ and $j$ are selected to form a two-variable subproblem, which is solved analytically to update $\alpha_i$ and $\alpha_j$.[1] Following the update, the bias $b$ is recomputed, and the algorithm checks for convergence by evaluating prediction errors $E_k = f(x_k) - y_k$ for all $k$, where $f(x_k) = \sum_{l=1}^{n} \alpha_l y_l K(x_l, x_k) + b$.[1] In practice the process typically converges in far fewer iterations than the worst case, though worst-case complexity remains quadratic.[1]

Solving the two-variable subproblem analytically involves deriving new values for $\alpha_i$ and $\alpha_j$ while respecting the box constraints $0 \le \alpha_i, \alpha_j \le C$ (where $C$ is the regularization parameter) and the equality constraint $y_i \alpha_i + y_j \alpha_j = \text{const}$.[1] First, compute the second derivative term $\eta = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$, and if $\eta > 0$, the unconstrained update for $\alpha_j$ is

$$\alpha_j^{\text{new}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta},$$

with $E_k = f(x_k) - y_k$.[1] This value is then clipped to its feasible bounds $[L, H]$, where (assuming $\alpha_j$ is the multiplier primarily updated): if $y_i \ne y_j$, $L = \max(0, \alpha_j - \alpha_i)$ and $H = \min(C, C + \alpha_j - \alpha_i)$; if $y_i = y_j$, $L = \max(0, \alpha_i + \alpha_j - C)$ and $H = \min(C, \alpha_i + \alpha_j)$, ensuring feasibility after the update. Clip $\alpha_j^{\text{new}}$ to $[L, H]$, then compute $\alpha_i^{\text{new}} = \alpha_i + y_i y_j (\alpha_j - \alpha_j^{\text{new}})$.[1]

After updating $\alpha_i$ and $\alpha_j$, the bias $b$ is adjusted to maintain consistency with the KKT conditions. Compute

$$b_1 = b - E_i - y_i (\alpha_i^{\text{new}} - \alpha_i) K(x_i, x_i) - y_j (\alpha_j^{\text{new}} - \alpha_j) K(x_i, x_j)$$

and

$$b_2 = b - E_j - y_i (\alpha_i^{\text{new}} - \alpha_i) K(x_i, x_j) - y_j (\alpha_j^{\text{new}} - \alpha_j) K(x_j, x_j).$$

If $0 < \alpha_i^{\text{new}} < C$, set $b = b_1$; if $0 < \alpha_j^{\text{new}} < C$, set $b = b_2$ (when both are interior, $b_1 = b_2$ and either choice is consistent); if neither is interior (both multipliers at bounds), set $b = (b_1 + b_2)/2$.[1]

Convergence is assessed by verifying that for every example $i$, the KKT conditions hold approximately: $y_i f(x_i) \ge 1 - \varepsilon$ if $\alpha_i = 0$, $|y_i f(x_i) - 1| \le \varepsilon$ if $0 < \alpha_i < C$, and $y_i f(x_i) \le 1 + \varepsilon$ if $\alpha_i = C$.[1] The algorithm terminates when no $\alpha_i$ violates these within $\varepsilon$, ensuring the solution is $\varepsilon$-optimal for the dual problem.[1]

The high-level pseudocode for SMO can be outlined as follows:

Initialize α = [0, ..., 0], b = 0
Compute initial errors E_k = -y_k for all k
While some example violates the KKT conditions within ε:
    Select indices i and j
    Compute the bounds L and H from y_i, y_j, α_i, α_j, C:
        if y_i ≠ y_j:
            L = max(0, α_j - α_i)
            H = min(C, C + α_j - α_i)
        else:
            L = max(0, α_i + α_j - C)
            H = min(C, α_i + α_j)
    Compute η = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j)
    If η > 0:
        α_j_new = clip(α_j + y_j (E_i - E_j) / η, L, H)
        α_i_new = α_i + y_i y_j (α_j - α_j_new)
    Else:
        Skip this pair (degenerate case)
    Compute b_1 = b - E_i - y_i (α_i_new - α_i) K(x_i, x_i) - y_j (α_j_new - α_j) K(x_i, x_j)
    Compute b_2 = b - E_j - y_i (α_i_new - α_i) K(x_i, x_j) - y_j (α_j_new - α_j) K(x_j, x_j)
    If 0 < α_i_new < C:
        b = b_1
    Elif 0 < α_j_new < C:
        b = b_2
    Else:
        b = (b_1 + b_2) / 2
    Set α_i = α_i_new, α_j = α_j_new
    Update the errors E_k for affected examples
Return α, b
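For concreteness, here is a compact runnable Python rendering of the procedure above. It is a simplified variant (a random second index $j$ rather than Platt's second-choice heuristic and error cache), so it is an illustrative sketch, not the reference implementation:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=10,
              kernel=lambda a, b: a @ b):
    """Simplified SMO for the soft-margin SVM dual."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    b = 0.0

    def f(k):  # current SVM output for example k
        return np.sum(alpha * y * K[:, k]) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            # Pick i only if it violates the KKT conditions within tol
            if (y[i] * E_i < -tol and alpha[i] < C) or \
               (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                E_j = f(j) - y[j]
                a_i, a_j = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                eta = K[i, i] + K[j, j] - 2 * K[i, j]
                if L == H or eta <= 0:
                    continue  # degenerate direction: skip this pair
                alpha[j] = np.clip(a_j + y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-8:
                    continue  # no meaningful progress on this pair
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                # Recompute the bias using the pre-update errors
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] \
                     - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] \
                     - y[j] * (alpha[j] - a_j) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

Platt's full algorithm replaces the random choice of $j$ with a second-choice heuristic that maximizes $|E_i - E_j|$ and maintains a cache of errors for non-bound examples; those details are described in the referenced paper.[1]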