Subgradient method

from Wikipedia
Subgradient methods are convex optimization methods which use subderivatives. Originally developed by Naum Z. Shor and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, subgradient methods for unconstrained problems use the same search direction as the method of gradient descent.

Subgradient methods are slower than Newton's method when applied to minimize twice continuously differentiable convex functions. However, Newton's method fails to converge on problems that have non-differentiable kinks.

In recent years, some interior-point methods have been suggested for convex minimization problems, but subgradient projection methods and related bundle methods of descent remain competitive. For convex minimization problems with a very large number of dimensions, subgradient-projection methods are suitable, because they require little storage.

Subgradient projection methods are often applied to large-scale problems with decomposition techniques. Such decomposition methods often allow a simple distributed method for a problem.

Classical subgradient rules


Let $f:\mathbb{R}^n \to \mathbb{R}$ be a convex function with domain $\mathbb{R}^n$. A classical subgradient method iterates
x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},
where $g^{(k)}$ denotes any subgradient of $f$ at $x^{(k)}$, and $x^{(k)}$ is the $k$-th iterate of $x$. If $f$ is differentiable, then its only subgradient is the gradient vector $\nabla f$ itself. It may happen that $-g^{(k)}$ is not a descent direction for $f$ at $x^{(k)}$. We therefore maintain a list $f_{\text{best}}$ that keeps track of the lowest objective function value found so far, i.e.
f_{\text{best}}^{(k)} = \min\left\{ f_{\text{best}}^{(k-1)}, f(x^{(k)}) \right\}.

Step size rules


Many different types of step-size rules are used by subgradient methods. This article notes five classical step-size rules for which convergence proofs are known:

  • Constant step size, $\alpha_k = \alpha$.
  • Constant step length, $\alpha_k = \gamma / \lVert g^{(k)} \rVert_2$, which gives $\lVert x^{(k+1)} - x^{(k)} \rVert_2 = \gamma$.
  • Square summable but not summable step size, i.e. any step sizes satisfying $\alpha_k \geq 0$, $\sum_{k=1}^\infty \alpha_k^2 < \infty$, $\sum_{k=1}^\infty \alpha_k = \infty$.
  • Nonsummable diminishing, i.e. any step sizes satisfying $\alpha_k \geq 0$, $\lim_{k\to\infty} \alpha_k = 0$, $\sum_{k=1}^\infty \alpha_k = \infty$.
  • Nonsummable diminishing step lengths, i.e. $\alpha_k = \gamma_k / \lVert g^{(k)} \rVert_2$, where $\gamma_k \geq 0$, $\lim_{k\to\infty} \gamma_k = 0$, $\sum_{k=1}^\infty \gamma_k = \infty$.
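As a concrete sketch of the classical iteration (illustrative code, not from the article), the following Python applies the subgradient update with a nonsummable diminishing step size $\alpha_k = 1/\sqrt{k+1}$ to the nondifferentiable function $f(x) = \lVert x \rVert_1$, tracking the best value found so far:

```python
import numpy as np

def f(x):
    return np.sum(np.abs(x))        # f(x) = ||x||_1, nondifferentiable at 0, minimum 0

def subgrad(x):
    return np.sign(x)               # a valid subgradient at every point (0 at kinks)

x = np.array([3.0, -2.0])
f_best = f(x)
for k in range(2000):
    alpha = 1.0 / np.sqrt(k + 1)    # nonsummable diminishing step size
    x = x - alpha * subgrad(x)      # the step need not decrease f ...
    f_best = min(f_best, f(x))      # ... so keep the best value seen so far
```

Because individual steps can increase $f$, only the running minimum `f_best` is guaranteed to approach the optimal value.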

For all five rules, the step-sizes are determined "off-line", before the method is iterated; the step-sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step-size rules used for descent methods for differentiable functions: Many methods for minimizing differentiable functions satisfy Wolfe's sufficient conditions for convergence, where step-sizes typically depend on the current point and the current search-direction. An extensive discussion of stepsize rules for subgradient methods, including incremental versions, is given in the books by Bertsekas[1] and by Bertsekas, Nedic, and Ozdaglar.[2]

Convergence results


For constant step-length and scaled subgradients having Euclidean norm equal to one, the subgradient method converges to an arbitrarily close approximation to the minimum value, that is

\lim_{k\to\infty} f_{\text{best}}^{(k)} - f^* < \epsilon

by a result of Shor.[3]

These classical subgradient methods have poor performance and are no longer recommended for general use.[4][5] However, they are still used widely in specialized applications because they are simple and they can be easily adapted to take advantage of the special structure of the problem at hand.

Subgradient-projection and bundle methods


During the 1970s, Claude Lemaréchal and Phil Wolfe proposed "bundle methods" of descent for problems of convex minimization.[6] The meaning of the term "bundle methods" has changed significantly since that time. Modern versions and full convergence analysis were provided by Kiwiel.[7] Contemporary bundle-methods often use "level control" rules for choosing step-sizes, developing techniques from the "subgradient-projection" method of Boris T. Polyak (1969). However, there are problems on which bundle methods offer little advantage over subgradient-projection methods.[4][5]

Constrained optimization


Projected subgradient


One extension of the subgradient method is the projected subgradient method, which solves the constrained optimization problem

minimize $f(x)$ subject to $x \in \mathcal{C}$,

where $\mathcal{C}$ is a convex set. The projected subgradient method uses the iteration

x^{(k+1)} = P\left(x^{(k)} - \alpha_k g^{(k)}\right),

where $P$ is projection on $\mathcal{C}$ and $g^{(k)}$ is any subgradient of $f$ at $x^{(k)}$.

General constraints


The subgradient method can be extended to solve the inequality constrained problem

minimize $f_0(x)$ subject to $f_i(x) \leq 0$, $i = 1, \dots, m$,

where $f_0, \dots, f_m$ are convex. The algorithm takes the same form as the unconstrained case

x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},

where $\alpha_k > 0$ is a step size, and $g^{(k)}$ is a subgradient of the objective or one of the constraint functions at $x^{(k)}$. Take

g^{(k)} \in \begin{cases} \partial f_0(x^{(k)}) & \text{if } f_i(x^{(k)}) \leq 0 \text{ for all } i = 1,\dots,m \\ \partial f_j(x^{(k)}) & \text{for some } j \text{ such that } f_j(x^{(k)}) > 0, \end{cases}

where $\partial f$ denotes the subdifferential of $f$. If the current point is feasible, the algorithm uses an objective subgradient; if the current point is infeasible, the algorithm chooses a subgradient of any violated constraint.
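The feasible/infeasible switching rule above can be sketched on a small example (an illustrative problem, not from the article): minimize $|x_1| + |x_2|$ subject to $x_1 \geq 1$, whose solution is $x^* = (1, 0)$ with optimal value $f^* = 1$:

```python
import numpy as np

def f(x):
    return abs(x[0]) + abs(x[1])          # objective f_0

def df(x):
    return np.array([np.sign(x[0]), np.sign(x[1])])

def f1(x):
    return 1.0 - x[0]                     # constraint f_1(x) <= 0, i.e. x_1 >= 1

df1 = np.array([-1.0, 0.0])               # gradient of f_1, constant here

x = np.array([0.0, 2.0])                  # infeasible starting point
f_best = float("inf")
for k in range(5000):
    alpha = 1.0 / np.sqrt(k + 1)          # diminishing step size
    if f1(x) <= 0:                        # feasible: step along an objective subgradient
        f_best = min(f_best, f(x))
        g = df(x)
    else:                                 # infeasible: step along the violated constraint
        g = df1
    x = x - alpha * g
```

The iterates alternate between restoring feasibility and reducing the objective; the best feasible objective value approaches $f^* = 1$.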

from Grokipedia
The subgradient method is an iterative optimization algorithm designed to minimize nondifferentiable convex functions by selecting a subgradient from the function's subdifferential at each step, generalizing the classical gradient descent method for smooth functions.[1] In its basic form, the method updates the iterate as $ x^{k+1} = x^k - \alpha_k g^k $, where $ g^k $ is a subgradient of the objective function $ f $ at $ x^k $, and $ \alpha_k > 0 $ is a step size chosen via rules such as constant, diminishing (e.g., $ \alpha_k = a / \sqrt{k} $), or Polyak's method.[1] Unlike gradient descent, it is not guaranteed to decrease the function value at every iteration but converges to the global minimum for convex objectives under appropriate step-size policies, assuming bounded subgradients and knowledge of problem parameters like the distance to the optimum.[1] Originally developed by Naum Z. Shor and collaborators in the Soviet Union during the 1960s and 1970s, the subgradient method addressed the need for optimization techniques applicable to nonsmooth problems arising in areas like operations research and engineering.[1] Shor's foundational work, including his 1962 publication on transportation problems and later monograph Minimization Methods for Nondifferentiable Functions (1985), established the core principles, emphasizing its utility for functions where derivatives do not exist, such as norms or indicator functions of convex sets. 
Subsequent refinements, such as Polyak's step-size rules in 1969, improved practical performance by adapting to the problem's geometry without requiring exact knowledge of the optimum value.[1] The method's key strengths lie in its simplicity and broad applicability to constrained optimization via projections, dual decomposition, and feasibility problems, making it a cornerstone for large-scale convex programming in machine learning, signal processing, and network design.[1] For instance, it enables solving problems like matrix completion or support vector machine training where the objective involves nonsmooth regularizers.[1] Despite slower convergence compared to interior-point methods for smooth cases—typically $ O(1/\sqrt{k}) $ for the best function value after $ k $ iterations—variants like bundle methods or accelerated schemes have extended its efficiency for modern high-dimensional challenges.[1]

Fundamentals

Subgradients

In convex optimization, the subgradient generalizes the notion of the gradient to nonsmooth convex functions. For a convex function $f: \mathbb{R}^n \to \mathbb{R}$ and a point $x \in \operatorname{dom} f$, a vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if it satisfies the subgradient inequality
f(y) \geq f(x) + \langle g, y - x \rangle \quad \forall y \in \operatorname{dom} f. \tag{1}
This inequality provides a linear lower bound on $f$ that supports the graph of the function from below at $x$. The set of all such subgradients at $x$, denoted by the subdifferential $\partial f(x)$, is nonempty for any convex $f$ at every point in the relative interior of its domain.[2] Geometrically, a subgradient $g$ corresponds to a supporting hyperplane to the epigraph of $f$ at the point $(x, f(x))$, where the epigraph $\operatorname{epi} f = \{(z, t) \in \mathbb{R}^n \times \mathbb{R} \mid t \geq f(z)\}$ is a convex set. Inequality (1) ensures that this hyperplane lies below the epigraph everywhere, touching it at $(x, f(x))$, thus generalizing the supporting role of the gradient for differentiable convex functions.[2]

A classic example is the absolute value function $f(x) = |x|$ on $\mathbb{R}$, which is convex but nondifferentiable at $x = 0$. Here, $\partial f(x) = \{\operatorname{sign}(x)\}$ for $x \neq 0$, while $\partial f(0) = [-1, 1]$, reflecting the range of possible slopes at the kink. Another example is the indicator function $\delta_C(x)$ of a nonempty convex set $C \subseteq \mathbb{R}^n$, defined as $\delta_C(x) = 0$ if $x \in C$ and $+\infty$ otherwise; its subdifferential at $x \in C$ is the normal cone $N_C(x) = \{g \in \mathbb{R}^n \mid \langle g, y - x \rangle \leq 0 \ \forall y \in C\}$.[2][3]

Key properties of the subdifferential follow from the convexity of $f$: $\partial f(x)$ is a nonempty, convex, and closed set. If $f$ is Lipschitz continuous near $x$, then $\partial f(x)$ is compact and bounded. Moreover, the subdifferential relates to one-sided derivatives; for instance, the directional derivative $f'(x; d) = \lim_{t \to 0^+} \frac{f(x + t d) - f(x)}{t}$ satisfies $f'(x; d) = \sup_{g \in \partial f(x)} \langle g, d \rangle$.[2] The concept of subgradients traces its origins to the work of Lev Pontryagin in the 1940s on variational problems in optimal control, where similar supporting hyperplane ideas emerged, and was formally developed within convex analysis by R. Tyrrell Rockafellar in his 1970 monograph.
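A quick numerical check (illustrative, not from the source) of the subgradient inequality (1) for $f(x) = |x|$ at the kink $x = 0$, where $\partial f(0) = [-1, 1]$:

```python
import numpy as np

ys = np.linspace(-2.0, 2.0, 401)           # test points y

def check(g):
    # does f(y) >= f(0) + g*(y - 0) hold for all y, with f(y) = |y| and f(0) = 0?
    return bool(np.all(np.abs(ys) >= g * ys - 1e-12))

valid = all(check(g) for g in np.linspace(-1.0, 1.0, 21))   # every g in [-1, 1] works
invalid = check(1.5)                                        # slopes outside [-1, 1] fail
```

Every slope in $[-1, 1]$ satisfies the inequality at the kink, while any slope outside that interval is cut off by the graph on one side.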

Basic Algorithm

The subgradient method is an iterative algorithm for minimizing a convex, possibly nondifferentiable function $ f: \mathbb{R}^n \to \mathbb{R} $ in an unconstrained setting. It generalizes the gradient descent method by using a subgradient—a vector $ g \in \partial f(x) $ satisfying $ f(y) \geq f(x) + g^T (y - x) $ for all $ y $—in place of the gradient at points where $ f $ is not differentiable.[4] This approach was originally developed by Naum Z. Shor in the early 1960s as a means to handle nondifferentiable optimization problems, such as transportation tasks.[5] Unlike gradient descent, the subgradient method does not guarantee a decrease in function value at each step, but it leverages the convexity of $ f $ to drive iterates toward a minimizer.[1] The core procedure initializes an arbitrary point $ x_0 \in \mathbb{R}^n $ and generates a sequence of iterates via the update rule $ x_{k+1} = x_k - \alpha_k g_k $, where $ g_k \in \partial f(x_k) $ is a selected subgradient and $ \alpha_k > 0 $ is a step size. To monitor progress, the algorithm tracks the best (lowest-function-value) iterate encountered so far, as the sequence $ \{f(x_k)\} $ may not be monotonically decreasing.[1] The following pseudocode outlines the basic iteration:
pick x_0 ∈ ℝⁿ
for k = 0, 1, 2, ...
    select g_k ∈ ∂f(x_k)
    choose step size α_k > 0
    set x_{k+1} = x_k - α_k g_k
    (optionally track x̄_k = arg min_{i=0,...,k} f(x_i))
This update mimics the negative direction of steepest descent but accommodates nondifferentiability by allowing any valid subgradient at each step.[1] If $ f $ is Lipschitz continuous with constant $ G > 0 $, meaning $ |f(y) - f(x)| \leq G \| y - x \|_2 $ for all $ x, y $, then every subgradient satisfies $ \| g \|_2 \leq G $. This bound ensures that subgradients remain controlled in magnitude, facilitating practical implementation.[1]

A simple one-dimensional example illustrates the method: consider minimizing $ f(x) = |x - 1| + |x + 1| $, a convex piecewise linear function with minimum value 2 attained on the interval $ [-1, 1] $. The subdifferential is $ \partial f(x) = \{ \operatorname{sign}(x - 1) + \operatorname{sign}(x + 1) \} $ away from the kinks at $ x = \pm 1 $, yielding $ g = 2 $ for $ x > 1 $, $ g = -2 $ for $ x < -1 $, and $ g = 0 $ for $ -1 < x < 1 $. At the kink $ x = 1 $, $ \partial f(1) = [0, 2] $; at $ x = -1 $, $ \partial f(-1) = [-2, 0] $. Starting from $ x_0 = 2 $ (where $ g_0 = 2 $), the update moves leftward; if an iterate lands at a kink, any value in the interval (e.g., the zero subgradient) can be selected to continue.[4]

In practice, computing subgradients relies on oracles for common nonsmooth convex functions. For the Euclidean norm $ f(x) = \| x \|_2 $, the oracle returns $ g = x / \| x \|_2 $ if $ x \neq 0 $ (with $ \| g \|_2 = 1 $), or any vector in the unit ball if $ x = 0 $. For a pointwise maximum $ f(x) = \max_{i=1,\dots,m} (a_i^T x + b_i) $, an oracle selects an active index $ j $ where the maximum is attained and returns $ g = a_j $. These oracles enable efficient evaluation even for high-dimensional problems.[1]
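To make the one-dimensional example concrete, here is a minimal Python version of the pseudocode above (an illustrative sketch, with one particular subgradient selection) applied to $ f(x) = |x - 1| + |x + 1| $ from $ x_0 = 2 $:

```python
def f(x):
    return abs(x - 1) + abs(x + 1)       # minimum value 2, attained on [-1, 1]

def subgrad(x):
    # one valid selection from the subdifferential at each point
    if x > 1:
        return 2.0                       # sign(x-1) + sign(x+1) = 2
    if x < -1:
        return -2.0
    return 0.0                           # 0 is a valid subgradient on [-1, 1], kinks included

x = 2.0
f_best = f(x)
for k in range(100):
    alpha = 0.1                          # constant step size
    x = x - alpha * subgrad(x)
    f_best = min(f_best, f(x))
```

Once an iterate lands in $[-1, 1]$, the selected subgradient is 0 and the method stops moving, having reached the optimal value 2.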

Unconstrained Optimization

Step-Size Rules

In the subgradient method for unconstrained optimization, the choice of step size $\alpha_k$ in the update $x_{k+1} = x_k - \alpha_k g_k$, where $g_k$ is a subgradient of the objective $f$ at $x_k$, critically influences the algorithm's progress toward a minimizer.[1] A simple approach is the constant step size $\alpha_k = \alpha$ for all iterations $k$, where $\alpha > 0$ is fixed and selected sufficiently small relative to the objective's properties to promote descent. This rule ensures bounded updates but can cause the iterates to oscillate around the optimum without converging to it exactly, as the fixed magnitude prevents finer adjustments over time.[1][6]

To overcome this limitation and drive convergence to the optimal value, diminishing step sizes are employed, such as $\alpha_k = a/(b + k)$ for constants $a > 0$ and $b \geq 0$. These sequences satisfy the conditions $\sum_k \alpha_k = \infty$ (ensuring the iterates explore sufficiently far) and $\sum_k \alpha_k^2 < \infty$ (bounding the variance of updates), allowing the method to approach the minimum asymptotically.[1][7] A common variant is $\alpha_k = a/\sqrt{k}$, which balances exploration and refinement effectively in practice.[1]

When the optimal objective value $f^*$ is known a priori, Polyak's step size provides an exact line-search-like choice: $\alpha_k = (f(x_k) - f^*)/\|g_k\|^2$. This rule minimizes the upper bound on the distance to the optimum in each step, yielding rapid progress, though its requirement for $f^*$ limits applicability unless an estimate is available.[1][8] The formula originates from Polyak's foundational work on subgradient techniques for nonsmooth minimization.[9]

In practical implementations, especially without prior knowledge of $f^*$ or precise function properties, heuristics guide step-size selection. One standard tuning uses an upper bound $G$ on the subgradient norm $\|g_k\| \leq G$, initializing $\alpha \approx R/G$ where $R$ estimates the distance to a minimizer, or simply $\alpha \approx 1/G$ for unit-scale problems.[1][10] A robust adaptive strategy starts with a moderate $\alpha$ and halves it iteratively if the function value does not decrease sufficiently (e.g., by a factor related to $\|g_k\|$) over a few steps, balancing speed and stability.[11]
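As an illustration of Polyak's rule (a hypothetical example, not from the source), consider the sharp function $f(x) = |x_1| + 2|x_2|$, whose optimal value $f^* = 0$ is known in advance:

```python
import numpy as np

def f(x):
    return abs(x[0]) + 2 * abs(x[1])     # convex, nonsmooth; f* = 0 at x* = (0, 0)

def subgrad(x):
    return np.array([np.sign(x[0]), 2 * np.sign(x[1])])

f_star = 0.0                              # Polyak's rule requires the optimal value
x = np.array([1.0, 1.0])
for k in range(200):
    gap = f(x) - f_star
    if gap < 1e-12:                       # already (numerically) optimal
        break
    g = subgrad(x)
    alpha = gap / (g @ g)                 # Polyak step size
    x = x - alpha * g
```

Because $f(x) \geq \|x\|_2$ here and $\|g_k\|^2 \leq 5$, each Polyak step contracts the squared distance to the optimum by at least a factor $1 - 1/5$, so the iterates converge linearly on this example.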

Convergence Analysis

The convergence properties of the classical subgradient method are established under the assumption that the objective function $f$ is convex, a minimizer $x^*$ exists with $f(x^*) = f^*$, the initial point satisfies $\|x_0 - x^*\| \leq D$, and subgradients are uniformly bounded, $\|g_k\| \leq G$ for all $k$. These results hold without requiring differentiability or strong convexity, relying instead on the subgradient inequality $f(y) \geq f(x_k) + g_k^\top (y - x_k)$ for all $y$.[12][1]

The foundational convergence theorem states that if the step sizes $\alpha_k > 0$ satisfy $\sum_{k=0}^\infty \alpha_k = \infty$ and $\sum_{k=0}^\infty \alpha_k^2 < \infty$, then the method drives the function values to the optimum in the sense that
\lim_{k \to \infty} \min_{0 \leq j \leq k} f(x_j) - f^* = 0.
More quantitatively, for the best iterate up to iteration $k$, the error is bounded by
\min_{0 \leq j \leq k} f(x_j) - f^* \leq \frac{D^2 + G^2 \sum_{i=0}^k \alpha_i^2}{2 \sum_{i=0}^k \alpha_i}.
This bound tends to zero as $k \to \infty$ due to the step-size conditions, with ergodic convergence holding for the average iterate $\bar{x}_k = \frac{1}{k+1} \sum_{j=0}^k x_j$, where $f(\bar{x}_k) - f^* = O\left( \frac{1}{\sum_{i=0}^k \alpha_i} \right)$. The proof derives from telescoping the squared distance $\|x_{k+1} - x^*\|^2 \leq \|x_k - x^*\|^2 - 2\alpha_k (f(x_k) - f^*) + \alpha_k^2 \|g_k\|^2$ and averaging over iterations.[12]

For specific step-size rules, the choice $\alpha_k = \frac{c}{\sqrt{k+1}}$ for constant $c > 0$ yields a sublinear convergence rate of $O(1/\sqrt{k})$ for the average or best iterate in a bounded domain, matching the bound above with $\sum \alpha_i \approx 2c\sqrt{k}$ and $\sum \alpha_i^2 \approx 2c^2 \log k$. In contrast, a truly constant step size $\alpha_k = \alpha$ for all $k$ achieves only a bounded error $f(\bar{x}_k) - f^* \leq \frac{G^2 \alpha}{2} + \frac{D^2}{2\alpha k}$, which does not converge to zero but can be tuned for practical early stopping. These rates are optimal for nonsmooth convex functions in the worst case, as established by lower bounds on first-order methods.[1]

The Polyak step-size rule, $\alpha_k = \frac{f(x_k) - f^*}{\|g_k\|^2}$, requires knowledge of $f^*$ and accelerates convergence when it is available, yielding
\|x_{k+1} - x^*\|^2 \leq \|x_k - x^*\|^2 - \frac{(f(x_k) - f^*)^2}{\|g_k\|^2} \leq \|x_k - x^*\|^2 - \frac{(f(x_k) - f^*)^2}{G^2}.
This implies linear convergence if $f(x_k) - f^* \geq \mu \|x_k - x^*\|^2$ for some $\mu > 0$ (e.g., under strong convexity or sharpness), with the number of iterations to $\epsilon$-accuracy scaling as $O((RG/\epsilon)^2)$; otherwise, it still ensures $f(x_k) - f^* \to 0$. However, estimating $f^*$ in practice often undermines these gains, leading to reliance on diminishing rules.[12][1]

While these results require only convexity rather than strong convexity, the absence of smoothness or second-order information results in slower $O(1/\sqrt{k})$ rates compared to $O(1/k)$ for smooth gradient descent. Ergodic averaging mitigates oscillation but does not improve the asymptotic rate, highlighting the method's robustness at the cost of efficiency for large-scale problems.
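The telescoping argument behind these bounds can be written out explicitly (a standard derivation, using the same $D$, $G$, and $\alpha_i$ as above):

```latex
% One step of the subgradient method, expanded:
\|x_{k+1}-x^*\|^2 = \|x_k - \alpha_k g_k - x^*\|^2
  = \|x_k-x^*\|^2 - 2\alpha_k \langle g_k,\, x_k - x^*\rangle + \alpha_k^2\|g_k\|^2 .

% The subgradient inequality at y = x^* gives
% \langle g_k, x_k - x^*\rangle \ge f(x_k) - f^*, hence
\|x_{k+1}-x^*\|^2 \le \|x_k-x^*\|^2 - 2\alpha_k\,(f(x_k)-f^*) + \alpha_k^2 G^2 .

% Summing from i = 0 to k telescopes the left-hand side:
2\sum_{i=0}^{k} \alpha_i\,(f(x_i)-f^*)
  \le \|x_0-x^*\|^2 + G^2 \sum_{i=0}^{k}\alpha_i^2
  \le D^2 + G^2 \sum_{i=0}^{k}\alpha_i^2 .

% Bounding each f(x_i) - f^* below by the best value seen so far yields
\min_{0\le j\le k} f(x_j) - f^*
  \le \frac{D^2 + G^2\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i}.
```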

Constrained Optimization

Projected Subgradient

The projected subgradient method extends the basic subgradient algorithm to solve convex optimization problems of the form $\min \{ f(x) \mid x \in C \}$, where $f$ is a convex (possibly nonsmooth) function and $C \subseteq \mathbb{R}^n$ is a nonempty, closed, convex set.[1] In each iteration, starting from an initial point $x_0 \in C$, the method first computes a tentative update $x_{k+1}' = x_k - \alpha_k g_k$, where $g_k \in \partial f(x_k)$ is a subgradient of $f$ at $x_k$ and $\alpha_k > 0$ is a step size; this is then followed by a projection step onto the feasible set, yielding $x_{k+1} = P_C(x_{k+1}')$, where $P_C(y) = \arg\min_{z \in C} \|z - y\|_2^2$ denotes the Euclidean projection onto $C$.[1][13] This projection ensures that all iterates remain feasible: if $x_0 \in C$, then $x_k \in C$ for all $k$, since the projection maps any point back to $C$ while minimizing the Euclidean distance to the tentative update.[1] The method thus handles constraints implicitly through the projection oracle, making it suitable for problems where the feasible set $C$ admits an efficient projection, such as polyhedral or ball constraints.[1]

Step-size selection follows rules analogous to those in the unconstrained subgradient method, such as constant steps $\alpha_k = \alpha$, diminishing steps $\alpha_k = a/\sqrt{k}$ for a constant $a > 0$, or Polyak's rule $\alpha_k = (f(x_k) - f^*)/\|g_k\|_2^2$ (assuming a known lower bound $f^*$ on the optimal value).[1] In some implementations, the step size may be scaled by a factor related to the distance from $x_k$ to the boundary of $C$ to mitigate aggressive steps near constraints, though standard rules often suffice without such adaptation.[1]

A representative example arises in minimizing a nonsmooth convex function over box constraints, such as $\min \{ f(x) \mid \ell \leq x \leq u \}$, where $\ell, u \in \mathbb{R}^n$ with $\ell \leq u$ componentwise define the box $C = [\ell, u]$. Here, the projection $P_C(y)$ is computed via simple clipping: $[P_C(y)]_i = \max(\ell_i, \min(u_i, y_i))$ for each component $i = 1, \dots, n$, which enforces the bounds elementwise and requires only $O(n)$ operations.[13]

The method requires access to a projection oracle for $C$, whose computational cost depends on the structure of the set; it is efficient for simple convex sets like boxes (as above, $O(n)$), probability simplices (achievable in $O(n)$ expected time via sorting-based algorithms), or Euclidean balls (scaling the tentative point to the ball's radius if outside, also $O(n)$).[1][13][14] For more complex sets, the projection step may dominate the per-iteration cost, but the overall method remains first-order and scalable when projections are tractable.[1]
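A minimal sketch of the box-constrained case (illustrative data; the target point `c` and the box bounds are assumptions of this example, not from the source):

```python
import numpy as np

c = np.array([2.0, -1.0, 0.5])            # data for the objective; partly outside the box
lo, hi = 0.0, 1.0                         # box C = [0, 1]^3

def f(x):
    return np.sum(np.abs(x - c))          # nonsmooth convex; constrained minimum is 2

def subgrad(x):
    return np.sign(x - c)

def project_box(y):
    return np.clip(y, lo, hi)             # O(n) componentwise clipping

x = np.zeros(3)                           # feasible starting point
f_best = f(x)
for k in range(5000):
    alpha = 1.0 / np.sqrt(k + 1)          # diminishing step size
    x = project_box(x - alpha * subgrad(x))   # subgradient step, then projection
    f_best = min(f_best, f(x))
```

The constrained optimum clips `c` onto the box, $x^* = (1, 0, 0.5)$ with $f(x^*) = 2$; `f_best` approaches this value while every iterate stays feasible.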

General Constraints

In subgradient methods for constrained optimization, penalty approaches incorporate general constraints into the objective function to transform the problem into an unconstrained one amenable to standard subgradient descent. For a convex set $C$, one common formulation minimizes $f(x) + c \cdot \operatorname{dist}(x, C)^2$, where $f$ is the nonsmooth objective, $c > 0$ is a penalty parameter, and $\operatorname{dist}(x, C)$ measures the Euclidean distance to the constraint set; subgradients of this augmented function are computed as $\partial f(x) + 2c(x - \operatorname{proj}_C(x))$, with $\operatorname{proj}_C$ denoting the projection onto $C$.[13] For hard constraints, the indicator function $I_C(x)$ (zero on $C$, infinite elsewhere) can be used, but its subgradients are undefined outside $C$, leading practical implementations to rely on smoothed quadratic penalties or exact $\ell_1$-style penalties like $c \cdot \max(0, g_i(x))$ for inequality constraints $g_i(x) \leq 0$, which become exact for sufficiently large $c$ under constraint qualifications such as Slater's condition.[15] These methods ensure feasibility as $c$ increases, though tuning $c$ dynamically (e.g., starting small and doubling) is often required to balance progress on $f$ and constraint satisfaction.[16]

Lagrangian-based subgradient methods address general inequality constraints $\min f(x) \text{ s.t. } g_i(x) \leq 0$ by forming the Lagrangian $L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$, where $\lambda_i \geq 0$ are dual multipliers, and applying subgradient steps to both primal and dual variables. A subgradient of $L$ at $(x, \lambda)$ is $(\tilde{g}, g(x))$, where $\tilde{g} \in \partial f(x) + \sum_i \lambda_i \partial g_i(x)$ and $g(x)$ collects the constraint violations; the update proceeds as $x^{k+1} = x^k - \alpha_k \tilde{g}^k$ and $\lambda_i^{k+1} = \max(0, \lambda_i^k + \beta_k g_i(x^k))$, with step sizes $\alpha_k, \beta_k$ chosen via diminishing rules to converge to a saddle point under convexity.[1] This primal-dual approach exploits duality to handle coupling constraints, with convergence rates of $O(1/\sqrt{k})$ for the dual function gap. Augmented variants add penalty terms like $(\rho/2) \sum_i [g_i(x)]_+^2$ to $L$ for better numerical stability and faster primal recovery.

For feasibility restoration in these methods, especially when initial points are infeasible, alternating projections onto the constraint set and the level sets of the objective can be interleaved with subgradient steps, though under the assumption of convex constraints, exact penalty functions (e.g., nonsmooth $\ell_1$ penalties) suffice to enforce feasibility without projections by rendering the penalized problem equivalent to the original for large penalties.[17]

An illustrative example arises in linear programming relaxations with nonsmooth objective costs, such as $\min \{ c^\top x + h(Ax) \text{ s.t. } x \geq 0,\ Ax = b \}$, where $h$ is convex nonsmooth; the dual problem $\max \{ b^\top \nu - h^*(\nu) \text{ s.t. } A^\top \nu \leq c \}$ is solved via subgradients of the dual objective, yielding primal solutions through complementary slackness.[1]

Key challenges in these extensions include controlling the duality gap, which may persist due to inexact dual ascent, and the need for computable subgradients of the constraint functions $g_i$, which must be available or approximable alongside those of $f$; moreover, nonconvex constraints (though typically excluded under convexity assumptions) can undermine exactness, necessitating hybrid restoration techniques.[15]
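The exact penalty idea can be sketched on a tiny one-dimensional problem (an illustrative construction, not from the source): minimize $|x|$ subject to $x \geq 1$, using the penalty $F(x) = |x| + c\max(0, 1 - x)$ with $c = 2$, which exceeds the optimal multiplier $\lambda^* = 1$ and therefore makes the penalty exact at $x^* = 1$:

```python
import math

def f(x):
    return abs(x)

def g(x):
    return 1.0 - x                        # constraint g(x) <= 0, i.e. x >= 1

c = 2.0                                   # penalty weight; exact once c > optimal multiplier (1)

def F(x):
    return f(x) + c * max(0.0, g(x))      # exact l1-style penalty function

def subgrad_F(x):
    s = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)   # subgradient of |x|
    if g(x) > 0:                          # constraint violated: add c * g'(x) = -c
        s -= c
    return s

x = -1.0                                  # infeasible starting point
F_best = F(x)
for k in range(4000):
    alpha = 1.0 / math.sqrt(k + 1)        # diminishing step size
    x = x - alpha * subgrad_F(x)
    F_best = min(F_best, F(x))
```

The penalized iterates are pulled toward $x^* = 1$ from either side, and the best penalized value approaches $F(x^*) = 1$, matching the constrained optimum.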

Advanced Variants

Subgradient-Projection Methods

Subgradient-projection methods extend the projected subgradient approach by replacing Euclidean projections with more general Bregman projections, allowing the exploitation of specific geometric structures in the constraint set $C$. Unlike the standard projected subgradient method, which relies on $\ell_2$-norm projections suitable for general convex sets, these variants use Bregman distances to tailor the projection to the problem's intrinsic geometry, such as nonnegativity or sparsity constraints.[18] The Bregman distance generated by a strictly convex differentiable function $P$ with domain containing $C$ is defined as
D_P(z, y) = P(z) - P(y) - \langle \nabla P(y), z - y \rangle,
which measures the divergence from $y$ to $z$ and reduces to the squared Euclidean distance when $P(y) = \frac{1}{2}\|y\|^2$.[19]

The core algorithm for Bregman-projected subgradient descent proceeds as follows: given a convex function $f$ and a subgradient $g_k \in \partial f(x_k)$ at the current iterate $x_k \in C$, the next iterate is obtained via
x_{k+1} = \arg\min_{x \in C} D_P\left(x, x_k - \alpha_k g_k\right),
where $\alpha_k > 0$ is a step size. This update preserves desirable structures in $C$, such as nonnegativity, by choosing $P$ appropriately; for instance, the negative entropy $P(x) = \sum_i x_i \log x_i$ for the probability simplex yields the Kullback-Leibler (KL) divergence as the Bregman distance. Under standard assumptions like bounded subgradients and diminishing step sizes, the method achieves $O(1/\sqrt{k})$ convergence to the optimal value in terms of function evaluations.[19]

These methods find applications in sparse optimization and entropic mirror descent, where subgradients steer solutions toward sparsity while respecting constraints like nonnegativity. In entropic mirror descent over the simplex, the updates take a multiplicative form that naturally enforces positivity and sums-to-one conditions, making it efficient for large-scale problems such as portfolio optimization or topic modeling. The advantages include improved conditioning for ill-posed constraints, where Euclidean projections may distort the feasible set, and faster relative convergence rates, often scaling with $\sqrt{\log n}$ instead of $\sqrt{n}$ in high dimensions.[18]

A representative example is nonnegative least squares with a nonsmooth regularizer, such as $\min_{x \geq 0} \|Ax - b\|_2^2 + \lambda \|x\|_1$, solved using KL-divergence projections. Here, the Bregman update computes
x_{k+1,j} = x_{k,j} \exp\left(-\alpha_k \left( (A^\top (Ax_k - b) + \lambda \operatorname{sign}(x_{k,j}))_j - 1 \right)\right) / Z_k,
where $Z_k$ normalizes to maintain nonnegativity and feasibility; this avoids explicit projections and empirically converges faster than Euclidean variants on sparse signals, as demonstrated in compressed sensing tasks.[20]
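The multiplicative form of entropic mirror descent on the simplex can be sketched as follows (a hypothetical linear objective is used for brevity in place of the NNLS example above):

```python
import numpy as np

c = np.array([0.5, 0.2, 0.9])        # costs; min of c @ x over the simplex is 0.2, at index 1
def f(x):
    return c @ x                     # linear objective; its (sub)gradient is c

x = np.ones(3) / 3                   # start at the center of the probability simplex
alpha = 0.5                          # constant step size (illustrative)
for k in range(500):
    w = x * np.exp(-alpha * c)       # multiplicative update from the entropy mirror map
    x = w / w.sum()                  # normalization = KL (Bregman) projection onto the simplex
```

Each iterate stays strictly positive and sums to one by construction; over the iterations the mass concentrates on the lowest-cost coordinate.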

Bundle Methods

Bundle methods represent an advancement over basic subgradient descent by aggregating multiple subgradients from prior iterations to build a piecewise linear lower approximation of the nonsmooth objective function, augmented with a proximal quadratic term for stability and to encourage steps toward the current iterate. This bundle-based model facilitates solving a quadratic subproblem that yields more reliable descent directions, particularly effective for convex nonsmooth optimization problems where single-subgradient updates may exhibit erratic behavior. The approach originated in the 1970s with foundational contributions that demonstrated its potential for handling nondifferentiable functions through iterative linearizations.[21][22]

In bundle construction, a collection (or bundle) of past information is maintained, consisting of points $ x_i $, function values $ f(x_i) $, and corresponding subgradients $ g_i \in \partial f(x_i) $ for $ i = 1, \dots, m $, where $ m $ is the bundle size, often limited to prevent excessive computational cost. The cutting-plane model is then formed as $ m_k(y) = \max_{i=1}^m \left[ f(x_i) + \langle g_i, y - x_i \rangle \right] $, which underestimates the convex function $ f $, and is stabilized by adding a proximal term: $ m_k(y) + \frac{1}{2 t_k} \| y - x_k \|^2 $, where $ x_k $ is the current stability center (the best known point) and $ t_k > 0 $ is a proximity parameter controlling the quadratic regularization. This model approximates $ f $ locally around $ x_k $ while ensuring the subproblem remains strongly convex and solvable efficiently, typically via quadratic programming solvers.[23]

The algorithm proceeds by solving the proximal bundle subproblem $ \min_y \, m_k(y) + \frac{1}{2 t_k} \| y - x_k \|^2 $ to obtain a trial point $ \tilde{x}_k = x_k + d_k $, where $ d_k $ is the search direction. A function evaluation at $ \tilde{x}_k $ is then performed to compute a new subgradient $ g(\tilde{x}_k) $. If sufficient descent is achieved—specifically, if $ f(\tilde{x}_k) \leq f(x_k) + \delta (m_k(\tilde{x}_k) - f(x_k)) $ for some $ 0 < \delta < 1 $, known as the serious step condition—the trial point is accepted as the new stability center $ x_{k+1} = \tilde{x}_k $, and the bundle is updated by possibly discarding outdated elements to maintain relevance. Otherwise, a null step occurs: the bundle is augmented with the new linearization from $ \tilde{x}_k $, $ t_k $ may be adjusted (e.g., decreased for stricter proximity), and the process repeats from $ x_k $. The serious step condition guarantees measurable progress in the objective, with the bundle size typically capped at a small number like 10 to balance approximation quality and solvability.[24]

Convergence of bundle methods is ensured for convex, locally Lipschitz functions, with the stability center sequence $ \{x_k\} $ approaching a minimizer as the predicted reduction diminishes, analyzed via cases of infinitely many serious steps or finite serious steps followed by infinitely many null steps, under bounded subgradients. The method achieves $ O(1/\sqrt{k}) $ convergence for the function value gap in the convex case, but under strong convexity of $ f $ with modulus $ \mu > 0 $, it achieves improved rates, such as $ O(1/k^2) $ for the squared distance to the optimum when leveraging the quadratic growth property. The number of iterations to reach $ \epsilon $-accuracy is bounded by $ O(\ln(1/\epsilon)/\epsilon) $ in some proximal variants with adaptive parameters, outperforming the $ O(1/\epsilon^2) $ of plain subgradient methods.[25][26]

Variants include proximal bundle methods, which emphasize the quadratic stabilization term for robustness in ill-conditioned problems, and level bundle methods, which incorporate level-set information to enhance stability by restricting the model to a trust region around the current level set of $ f $, reducing null steps in practice. These adaptations maintain the core bundling idea while addressing specific challenges like inexact oracles or nonconvexity extensions.[23]

For example, in trust-region styled bundle methods applied to nonsmooth logistic regression—minimizing $ \min_w \frac{1}{m} \sum_{i=1}^m \log(1 + \exp(-y_i \langle w, x_i \rangle)) + \lambda \|w\|_1 $—the approach constructs bundle approximations of the nonsmooth loss, solving proximal subproblems within a trust region to yield descent directions. Experimental results on benchmark datasets demonstrate that such methods require substantially fewer iterations than plain subgradient descent to achieve comparable accuracy, often converging in $ O(1/\epsilon) $ steps versus $ O(1/\epsilon^2) $ theoretically, due to the aggregated subgradient information providing better local models.[27]