Broyden–Fletcher–Goldfarb–Shanno algorithm
In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems.[1] Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations (or approximate gradient evaluations) via a generalized secant method.[2]
Since the updates of the BFGS curvature matrix do not require matrix inversion, its computational complexity is only $\mathcal{O}(n^2)$, compared to $\mathcal{O}(n^3)$ in Newton's method. Also in common use is L-BFGS, which is a limited-memory version of BFGS that is particularly suited to problems with very large numbers of variables (e.g., >1000). The BFGS-B variant handles simple box constraints.[3] The BFGS matrix also admits a compact representation, which makes it better suited for large constrained problems.
The algorithm is named after Charles George Broyden, Roger Fletcher, Donald Goldfarb and David Shanno.[4][5][6][7] It is an instance of a more general algorithm by John Greenstadt.[8]
Rationale
The optimization problem is to minimize $f(\mathbf{x})$, where $\mathbf{x}$ is a vector in $\mathbb{R}^n$, and $f$ is a differentiable scalar function. There are no constraints on the values that $\mathbf{x}$ can take.
The algorithm begins at an initial estimate for the optimal value and proceeds iteratively to get a better estimate at each stage.
The search direction $\mathbf{p}_k$ at stage $k$ is given by the solution of the analogue of the Newton equation:
$$B_k \mathbf{p}_k = -\nabla f(\mathbf{x}_k),$$
where $B_k$ is an approximation to the Hessian matrix at $\mathbf{x}_k$, which is updated iteratively at each stage, and $\nabla f(\mathbf{x}_k)$ is the gradient of the function evaluated at $\mathbf{x}_k$. A line search in the direction $\mathbf{p}_k$ is then used to find the next point $\mathbf{x}_{k+1}$ by minimizing $f(\mathbf{x}_k + \gamma \mathbf{p}_k)$ over the scalar $\gamma > 0$.
The quasi-Newton condition imposed on the update of $B_k$ is
$$B_{k+1}(\mathbf{x}_{k+1} - \mathbf{x}_k) = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k).$$
Let $\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)$ and $\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$; then $B_{k+1}$ satisfies
$$B_{k+1} \mathbf{s}_k = \mathbf{y}_k,$$
which is the secant equation.
The curvature condition $\mathbf{s}_k^\top \mathbf{y}_k > 0$ should be satisfied for $B_{k+1}$ to be positive definite, which can be verified by pre-multiplying the secant equation with $\mathbf{s}_k^\top$. If the function is not strongly convex, then the condition has to be enforced explicitly, e.g. by finding a point $\mathbf{x}_{k+1}$ satisfying the Wolfe conditions, which entail the curvature condition, using line search.
Instead of requiring the full Hessian matrix at the point $\mathbf{x}_{k+1}$ to be computed as $B_{k+1}$, the approximate Hessian at stage $k$ is updated by the addition of two matrices:
$$B_{k+1} = B_k + U_k + V_k.$$
Both $U_k$ and $V_k$ are symmetric rank-one matrices, but their sum is a rank-two update matrix. The BFGS and DFP updating matrices both differ from their predecessor by a rank-two matrix. Another, simpler rank-one method is known as the symmetric rank-one method, which does not guarantee positive definiteness. In order to maintain the symmetry and positive definiteness of $B_{k+1}$, the update form can be chosen as $B_{k+1} = B_k + \alpha \mathbf{u}\mathbf{u}^\top + \beta \mathbf{v}\mathbf{v}^\top$. Imposing the secant condition, $B_{k+1}\mathbf{s}_k = \mathbf{y}_k$. Choosing $\mathbf{u} = \mathbf{y}_k$ and $\mathbf{v} = B_k \mathbf{s}_k$, we can obtain:[9]
$$\alpha = \frac{1}{\mathbf{y}_k^\top \mathbf{s}_k}, \qquad \beta = -\frac{1}{\mathbf{s}_k^\top B_k \mathbf{s}_k}.$$
Finally, we substitute $\alpha$ and $\beta$ into $B_{k+1} = B_k + \alpha \mathbf{u}\mathbf{u}^\top + \beta \mathbf{v}\mathbf{v}^\top$ and get the update equation of $B_{k+1}$:
$$B_{k+1} = B_k + \frac{\mathbf{y}_k \mathbf{y}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k} - \frac{B_k \mathbf{s}_k \mathbf{s}_k^\top B_k}{\mathbf{s}_k^\top B_k \mathbf{s}_k}.$$
Algorithm
Consider the following unconstrained optimization problem
$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}),$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is a nonlinear, twice-differentiable objective function.

From an initial guess $\mathbf{x}_0 \in \mathbb{R}^n$ and an initial guess of the Hessian matrix $B_0 \in \mathbb{R}^{n \times n}$, the following steps are repeated as $\mathbf{x}_k$ converges to the solution:

1. Obtain a direction $\mathbf{p}_k$ by solving $B_k \mathbf{p}_k = -\nabla f(\mathbf{x}_k)$.
2. Perform a one-dimensional optimization (line search) to find an acceptable stepsize $\alpha_k$ in the direction found in the first step. If an exact line search is performed, then $\alpha_k = \arg\min f(\mathbf{x}_k + \alpha \mathbf{p}_k)$. In practice, an inexact line search usually suffices, with an acceptable $\alpha_k$ satisfying the Wolfe conditions.
3. Set $\mathbf{s}_k = \alpha_k \mathbf{p}_k$ and update $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k$.
4. $\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)$.
5. $B_{k+1} = B_k + \dfrac{\mathbf{y}_k \mathbf{y}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k} - \dfrac{B_k \mathbf{s}_k \mathbf{s}_k^\top B_k}{\mathbf{s}_k^\top B_k \mathbf{s}_k}$.
Convergence can be determined by observing the norm of the gradient; given some $\epsilon > 0$, one may stop the algorithm when $\|\nabla f(\mathbf{x}_k)\| \le \epsilon$. If $B_0$ is initialized with $B_0 = I$, the first step will be equivalent to a gradient descent, but further steps are more and more refined by $B_k$, the approximation to the Hessian.
The first step of the algorithm is carried out using the inverse of the matrix $B_k$, which can be obtained efficiently by applying the Sherman–Morrison formula to step 5 of the algorithm, giving
$$B_{k+1}^{-1} = \left(I - \frac{\mathbf{s}_k \mathbf{y}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}\right) B_k^{-1} \left(I - \frac{\mathbf{y}_k \mathbf{s}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}\right) + \frac{\mathbf{s}_k \mathbf{s}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}.$$
This can be computed efficiently without temporary matrices, recognizing that $B_k^{-1}$ is symmetric, and that $\mathbf{y}_k^\top B_k^{-1} \mathbf{y}_k$ and $\mathbf{s}_k^\top \mathbf{y}_k$ are scalars, using an expansion such as
$$B_{k+1}^{-1} = B_k^{-1} + \frac{(\mathbf{s}_k^\top \mathbf{y}_k + \mathbf{y}_k^\top B_k^{-1} \mathbf{y}_k)(\mathbf{s}_k \mathbf{s}_k^\top)}{(\mathbf{s}_k^\top \mathbf{y}_k)^2} - \frac{B_k^{-1} \mathbf{y}_k \mathbf{s}_k^\top + \mathbf{s}_k \mathbf{y}_k^\top B_k^{-1}}{\mathbf{s}_k^\top \mathbf{y}_k}.$$
Therefore, in order to avoid any matrix inversion, the inverse of the Hessian can be approximated instead of the Hessian itself: $H_k \overset{\operatorname{def}}{=} B_k^{-1}$.[10]

From an initial guess $\mathbf{x}_0$ and an approximate inverted Hessian matrix $H_0$, the following steps are repeated as $\mathbf{x}_k$ converges to the solution:
1. Obtain a direction $\mathbf{p}_k$ by solving $\mathbf{p}_k = -H_k \nabla f(\mathbf{x}_k)$.
2. Perform a one-dimensional optimization (line search) to find an acceptable stepsize $\alpha_k$ in the direction found in the first step. If an exact line search is performed, then $\alpha_k = \arg\min f(\mathbf{x}_k + \alpha \mathbf{p}_k)$. In practice, an inexact line search usually suffices, with an acceptable $\alpha_k$ satisfying the Wolfe conditions.
3. Set $\mathbf{s}_k = \alpha_k \mathbf{p}_k$ and update $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k$.
4. $\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)$.
5. $H_{k+1} = H_k + \dfrac{(\mathbf{s}_k^\top \mathbf{y}_k + \mathbf{y}_k^\top H_k \mathbf{y}_k)(\mathbf{s}_k \mathbf{s}_k^\top)}{(\mathbf{s}_k^\top \mathbf{y}_k)^2} - \dfrac{H_k \mathbf{y}_k \mathbf{s}_k^\top + \mathbf{s}_k \mathbf{y}_k^\top H_k}{\mathbf{s}_k^\top \mathbf{y}_k}$.
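The steps above translate directly into code. The following is a minimal sketch in Python/NumPy of the inverse-Hessian form of the algorithm; it uses a simple backtracking (Armijo) line search for brevity, whereas a full Wolfe line search would be needed to guarantee the curvature condition in general. All names are illustrative and not taken from any particular library.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=500):
    """Minimal BFGS sketch maintaining the inverse Hessian approximation H_k."""
    n = x0.size
    x = np.asarray(x0, dtype=float)
    H = np.eye(n)                      # H_0 = I: the first step is gradient descent
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:    # stop when ||grad f(x_k)|| < epsilon
            break
        p = -H @ g                     # search direction p_k = -H_k grad f(x_k)
        # Backtracking (Armijo) line search; a Wolfe search is preferable in practice.
        alpha, c1 = 1.0, 1e-4
        while alpha > 1e-12 and f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-10:              # update only if the curvature condition holds
            rho = 1.0 / (s @ y)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize the Rosenbrock function from the standard starting point.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad_f = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                             200 * (x[1] - x[0]**2)])
print(bfgs(f, grad_f, np.array([-1.2, 1.0])))   # approaches [1, 1]
```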
In statistical estimation problems (such as maximum likelihood or Bayesian inference), credible intervals or confidence intervals for the solution can be estimated from the inverse of the final Hessian matrix [citation needed]. However, these quantities are technically defined by the true Hessian matrix, and the BFGS approximation may not converge to the true Hessian matrix.[11]
Further developments
The BFGS update formula heavily relies on the curvature $\mathbf{s}_k^\top \mathbf{y}_k$ being strictly positive and bounded away from zero. This condition is satisfied when we perform a line search with Wolfe conditions on a convex target. However, some real-life applications (like sequential quadratic programming methods) routinely produce negative or nearly-zero curvatures. This can occur when optimizing a nonconvex target or when employing a trust-region approach instead of a line search. It is also possible to produce spurious values due to noise in the target.

In such cases, one of the so-called damped BFGS updates can be used (see [12]), which modify $\mathbf{s}_k$ and/or $\mathbf{y}_k$ in order to obtain a more robust update.
Notable implementations
Notable open source implementations are:
- ALGLIB implements BFGS and its limited-memory version in C++ and C#
- GNU Octave uses a form of BFGS in its fsolve function, with trust region extensions.
- The GSL implements BFGS as gsl_multimin_fdfminimizer_vector_bfgs2.[13]
- In R, the BFGS algorithm (and the L-BFGS-B version that allows box constraints) is implemented as an option of the base function optim().[14]
- In SciPy, the scipy.optimize.fmin_bfgs function implements BFGS.[15] It is also possible to run BFGS using any of the L-BFGS algorithms by setting the parameter L to a very large number. It is also one of the default methods used when running scipy.optimize.minimize with no constraints.[16]
- In Julia, the Optim.jl package implements BFGS and L-BFGS as a solver option to the optimize() function (among other options).[17]
Notable proprietary implementations include:
- The large scale nonlinear optimization software Artelys Knitro implements, among others, both BFGS and L-BFGS algorithms.
- In the MATLAB Optimization Toolbox, the fminunc function uses BFGS with cubic line search when the problem size is set to "medium scale."
- Mathematica includes BFGS.
- LS-DYNA also uses BFGS to solve implicit problems.
References
[edit]- ^ Fletcher, Roger (1987), Practical Methods of Optimization (2nd ed.), New York: John Wiley & Sons, ISBN 978-0-471-91547-8
- ^ Dennis, J. E. Jr.; Schnabel, Robert B. (1983), "Secant Methods for Unconstrained Minimization", Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Englewood Cliffs, NJ: Prentice-Hall, pp. 194–215, ISBN 0-13-627216-9
- ^ Byrd, Richard H.; Lu, Peihuang; Nocedal, Jorge; Zhu, Ciyou (1995), "A Limited Memory Algorithm for Bound Constrained Optimization", SIAM Journal on Scientific Computing, 16 (5): 1190–1208, CiteSeerX 10.1.1.645.5814, doi:10.1137/0916069
- ^ Broyden, C. G. (1970), "The convergence of a class of double-rank minimization algorithms", Journal of the Institute of Mathematics and Its Applications, 6: 76–90, doi:10.1093/imamat/6.1.76
- ^ Fletcher, R. (1970), "A New Approach to Variable Metric Algorithms", Computer Journal, 13 (3): 317–322, doi:10.1093/comjnl/13.3.317
- ^ Goldfarb, D. (1970), "A Family of Variable Metric Updates Derived by Variational Means", Mathematics of Computation, 24 (109): 23–26, doi:10.1090/S0025-5718-1970-0258249-6
- ^ Shanno, David F. (July 1970), "Conditioning of quasi-Newton methods for function minimization", Mathematics of Computation, 24 (111): 647–656, doi:10.1090/S0025-5718-1970-0274029-X, MR 0274029
- ^ Greenstadt, J. (1970). "Variations on variable-metric methods. (With discussion)". Mathematics of Computation. 24 (109): 1–22. doi:10.1090/S0025-5718-1970-0258248-4. ISSN 0025-5718.
- ^ Fletcher, Roger (1987), Practical methods of optimization (2nd ed.), New York: John Wiley & Sons, ISBN 978-0-471-91547-8
- ^ Nocedal, Jorge; Wright, Stephen J. (2006), Numerical Optimization (2nd ed.), Berlin, New York: Springer-Verlag, ISBN 978-0-387-30303-1
- ^ Ge, Ren-pu; Powell, M. J. D. (1983). "The Convergence of Variable Metric Matrices in Unconstrained Optimization". Mathematical Programming. 27 (2). 123. doi:10.1007/BF02591941. S2CID 8113073.
- ^ Jorge Nocedal; Stephen J. Wright (2006), Numerical Optimization
- ^ "GNU Scientific Library — GSL 2.6 documentation". www.gnu.org. Retrieved 2020-11-22.
- ^ "R: General-purpose Optimization". stat.ethz.ch. Retrieved 2020-11-22.
- ^ "scipy.optimize.fmin_bfgs — SciPy v1.5.4 Reference Guide". docs.scipy.org. Retrieved 2020-11-22.
- ^ "scipy.optimize.minimize — SciPy v1.5.4 Reference Guide". docs.scipy.org. Retrieved 2025-01-22.
- ^ "Optim.jl Configurable options". julianlsolvers.
Further reading
- Avriel, Mordecai (2003), Nonlinear Programming: Analysis and Methods, Dover Publishing, ISBN 978-0-486-43227-4
- Bonnans, J. Frédéric; Gilbert, J. Charles; Lemaréchal, Claude; Sagastizábal, Claudia A. (2006), "Newtonian Methods", Numerical Optimization: Theoretical and Practical Aspects (Second ed.), Berlin: Springer, pp. 51–66, ISBN 3-540-35445-X
- Fletcher, Roger (1987), Practical Methods of Optimization (2nd ed.), New York: John Wiley & Sons, ISBN 978-0-471-91547-8
- Luenberger, David G.; Ye, Yinyu (2008), Linear and nonlinear programming, International Series in Operations Research & Management Science, vol. 116 (Third ed.), New York: Springer, pp. xiv+546, ISBN 978-0-387-74502-2, MR 2423726
- Kelley, C. T. (1999), Iterative Methods for Optimization, Philadelphia: Society for Industrial and Applied Mathematics, pp. 71–86, ISBN 0-89871-433-8
Background
Unconstrained Optimization Problems
The unconstrained minimization problem involves finding a point $x^*$ that minimizes a smooth nonlinear function $f: \mathbb{R}^n \to \mathbb{R}$, expressed as $\min_{x \in \mathbb{R}^n} f(x)$.[8] Here, $f$ is typically assumed to be continuously differentiable and bounded below, with no restrictions imposed on the decision variables $x$.[8] This formulation arises in numerous applications, including parameter estimation in machine learning, structural design in engineering, and resource allocation in economics, where the objective is to identify local or global minima without boundary or equality constraints.[9]

Optimization algorithms for these problems rely on first-order information from the gradient $\nabla f(x)$, which indicates the direction of steepest ascent at $x$; the negative gradient thus provides a descent direction.[8] At a local minimizer $x^*$, the first-order necessary condition requires $\nabla f(x^*) = 0$.[8] Second-order information is captured by the Hessian matrix $\nabla^2 f(x)$, a symmetric $n \times n$ matrix of second partial derivatives that encodes the local curvature of $f$. For $x^*$ to qualify as a local minimizer, $\nabla^2 f(x^*)$ must be positive semidefinite, ensuring the function increases in all directions from that point; if it is positive definite, $x^*$ is a strict local minimizer.[8]

The classical Newton method, which solves the system $\nabla^2 f(x_k) p_k = -\nabla f(x_k)$ for the search direction, exemplifies the use of exact second-order information for rapid convergence near solutions.[8] To address the high computational cost of evaluating and inverting the full Hessian, particularly for high-dimensional problems, iterative methods approximate the Hessian or its inverse using only gradient differences from successive iterations.[8] These approximations enable second-order-like efficiency without repeated second-derivative computations, making them suitable for large-scale nonlinear optimization.[10]

The development of such methods gained momentum in the 1960s and 1970s, driven by the advent of digital computers and the need for computationally efficient algorithms in fields like operations research and scientific computing, where full Hessian evaluations were often infeasible due to limited resources.[8] This era saw significant advancements in the United Kingdom, including early quasi-Newton techniques that prioritized gradient-based updates to balance accuracy and speed.[11]

Newton and Quasi-Newton Methods
Newton's method is an iterative second-order algorithm designed to find local minima of twice-differentiable functions in unconstrained optimization. The method generates the next iterate via the formula
$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),$$
where $\nabla f(x_k)$ denotes the gradient vector and $\nabla^2 f(x_k)$ the Hessian matrix evaluated at the current point $x_k$.[12] This approach leverages a quadratic model of the objective function, leading to quadratic convergence rates in a neighborhood of the solution, where the distance to the optimum roughly squares with each successful iteration.[12]

Despite its rapid local convergence, Newton's method faces significant computational challenges, particularly for high-dimensional problems. Forming the Hessian typically requires evaluating $\mathcal{O}(n^2)$ second derivatives, which can be costly both analytically and numerically, while solving the associated linear system or performing matrix inversion demands $\mathcal{O}(n^3)$ operations per iteration in $n$ dimensions.[12] These overheads limit its practicality for large-scale optimization, where even a single iteration may become prohibitively expensive.[12]

Quasi-Newton methods emerged as an efficient alternative, approximating the Hessian (or its inverse) through low-rank updates based solely on gradient evaluations at successive points, thereby avoiding explicit second-derivative computations.[13] These updates incorporate differences in gradients to refine the curvature estimate iteratively, often guided by the secant condition to ensure the approximation aligns with observed function behavior along search directions.[13] Such methods attain superlinear convergence, faster than the linear rate of first-order techniques like gradient descent, while inheriting much of Newton's efficiency near the solution.[13]

A key advantage of quasi-Newton approaches lies in their reduced complexity: they maintain and update an approximation matrix using $\mathcal{O}(n^2)$ storage and time per iteration, in contrast to the $\mathcal{O}(n^3)$ demands of full Newton steps.[13] This trade-off enables broader applicability in problems where $n$ is moderately large, balancing speed and accuracy without the full second-order burden.[13]
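For contrast with the quasi-Newton updates discussed below, a single Newton step is just a linear solve. The sketch below (Python/NumPy, illustrative names) shows that on a quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ one Newton step from any starting point reaches the minimizer exactly.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton iteration: solve the linear system (hess f)(x) p = -(grad f)(x)."""
    p = np.linalg.solve(hess(x), -grad(x))   # O(n^3) dense solve
    return x + p

# Quadratic example f(x) = 0.5 x^T A x - b^T x: one step reaches the minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
print(newton_step(grad, hess, np.zeros(2)))  # equals np.linalg.solve(A, b)
```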
Mathematical Foundation

Secant Condition and Rank-Two Updates
The secant condition forms the cornerstone of quasi-Newton methods for approximating the Hessian matrix in unconstrained optimization. It requires that the updated Hessian approximation $B_{k+1}$ satisfies
$$B_{k+1} s_k = y_k,$$
where $s_k = x_{k+1} - x_k$ is the step taken from iteration $k$ to $k+1$, and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ is the corresponding change in the gradient.[14] This equation ensures that the approximation exactly reproduces the gradient difference for a quadratic function, thereby capturing the local curvature along the search direction without requiring second derivatives.

To satisfy the secant condition while maintaining the symmetry of the Hessian approximation, quasi-Newton updates are typically constructed as rank-two modifications to the previous approximation $B_k$. A rank-two update takes the general form
$$B_{k+1} = B_k + \alpha u u^\top + \beta v v^\top,$$
where $u$ and $v$ are vectors chosen to enforce the secant equation and symmetry, minimizing the deviation from $B_k$ in a suitable norm such as the Frobenius norm.[15] This low-rank adjustment allows efficient computation and storage, as only vector operations are needed rather than full matrix inversions.

Early quasi-Newton methods, such as the Davidon–Fletcher–Powell (DFP) update introduced in the 1960s, exemplify this rank-two structure for the Hessian approximation.[14] The DFP method, originally developed by Davidon and refined by Fletcher and Powell, updates $B_k$ by emphasizing the gradient change in a way that generates conjugate directions for quadratics, achieving exact convergence in at most $n$ steps for $n$-dimensional problems.[14] However, DFP exhibits limitations in practice, particularly sensitivity to inexact line searches or small curvature values, which can lead to unstable approximations that overly weight recent curvature information and degrade performance on ill-conditioned problems.

The BFGS algorithm improves upon DFP by employing a rank-two update that more robustly balances step and gradient information, effectively satisfying the secant condition for both the Hessian and its inverse in a complementary manner.[16] Independently proposed by Broyden, Fletcher, Goldfarb, and Shanno in 1970, this formulation derives from variational principles that minimize weighted Frobenius norm changes, resulting in greater numerical stability and reduced sensitivity to curvature variations compared to DFP.[15] These enhancements make BFGS particularly effective for maintaining desirable properties like positive definiteness under the curvature condition $s_k^\top y_k > 0$.
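For concreteness, the DFP update of the Hessian approximation, which the text discusses but does not reproduce, can be written explicitly in its standard product form (supplied here from the standard literature, with the notation above):

$$B_{k+1}^{\mathrm{DFP}} = \left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) B_k \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right) + \frac{y_k y_k^\top}{y_k^\top s_k}.$$

Direct multiplication shows $B_{k+1}^{\mathrm{DFP}} s_k = y_k$, so the secant equation holds by construction.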
Positive Definiteness Preservation

In the BFGS algorithm, maintaining positive definiteness of the Hessian approximation $B_k$ is essential to ensure that the computed search direction $p_k = -B_k^{-1} \nabla f(x_k)$ is a descent direction for the objective function $f$. This follows because the inner product $\nabla f(x_k)^\top p_k = -\nabla f(x_k)^\top B_k^{-1} \nabla f(x_k) < 0$ holds whenever $\nabla f(x_k) \neq 0$, as $B_k^{-1}$ inherits positive definiteness from $B_k$.[2]

The preservation of this property relies on the curvature condition $s_k^\top y_k > 0$, where $s_k$ denotes the step taken and $y_k$ is the gradient difference. This condition is typically enforced through a line search that satisfies the Wolfe conditions, ensuring the function decreases sufficiently and exhibits adequate curvature along $p_k$, thereby preventing the approximation from becoming indefinite.[17]

Under the BFGS update, if $B_k$ is positive definite and $s_k^\top y_k > 0$, then the next approximation $B_{k+1}$ is positive definite. A sketch of the proof proceeds by showing that for any nonzero vector $z$, the quadratic form $z^\top B_{k+1} z > 0$: the terms of the update combine into a part that is nonnegative by the Cauchy–Schwarz inequality in the $B_k$-inner product, and zero only if $z$ is parallel to $s_k$; in that case the remaining term $(y_k^\top z)^2 / (y_k^\top s_k)$ is strictly positive, since $y_k^\top s_k > 0$.[7]

Compared to the Davidon–Fletcher–Powell (DFP) method, which preserves positive definiteness under the identical curvature condition, BFGS demonstrates superior robustness in numerical practice, particularly in self-correcting poor initial approximations and maintaining stability across iterations.[2] This property of positive definiteness preservation underpins the algorithm's global convergence guarantees when paired with appropriate line searches.[17]
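Spelled out, the quadratic form in the proof sketch splits as follows (a standard argument, reconstructed here in the notation above):

$$z^\top B_{k+1} z \;=\; \underbrace{z^\top B_k z - \frac{(s_k^\top B_k z)^2}{s_k^\top B_k s_k}}_{\ge\, 0 \text{ by Cauchy–Schwarz in the } B_k\text{-inner product}} \;+\; \underbrace{\frac{(y_k^\top z)^2}{y_k^\top s_k}}_{\ge\, 0 \text{ since } y_k^\top s_k > 0}.$$

The first group vanishes only when $z$ is parallel to $s_k$; for such $z = \lambda s_k$ with $\lambda \neq 0$, the second term equals $\lambda^2 y_k^\top s_k > 0$, so $B_{k+1}$ is positive definite.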
The BFGS Update

Hessian Approximation Formula
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) update provides a rank-two modification to the current Hessian approximation $B_k$ to obtain $B_{k+1}$, ensuring it remains symmetric and positive definite under suitable conditions. The explicit formula is given by
$$B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k},$$
where $s_k = x_{k+1} - x_k$ denotes the step taken from the current iterate $x_k$ to the next $x_{k+1}$, and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ is the corresponding change in the gradient of the objective function $f$. This update satisfies the secant condition $B_{k+1} s_k = y_k$, which ensures that the approximation matches the observed curvature along the step direction.[15][18]

The derivation of this formula proceeds by seeking a minimal perturbation to $B_k$ that fulfills the secant condition while preserving desirable properties such as positive definiteness. Specifically, it minimizes the change from $B_k$ in a least-squares sense, formulated as minimizing a weighted Frobenius norm subject to the secant equation and symmetry constraints, where the weighting is chosen to align with the inverse Hessian structure for numerical stability. This variational approach, solved using Lagrange multipliers, yields the rank-two correction terms that balance fidelity to prior information with new curvature data from $s_k$ and $y_k$. The resulting update is unique within the Broyden family for maintaining positive definiteness when $s_k^\top y_k > 0$, which holds under standard line search conditions ensuring descent.[15][4]

This specific form of the update was independently derived in 1970 by four researchers: Charles G. Broyden in his analysis of double-rank minimization algorithms, Roger Fletcher through a novel variable metric framework, Donald Goldfarb via variational principles, and David F. Shanno by focusing on optimal conditioning of quasi-Newton methods. Each contribution emphasized different aspects, such as convergence guarantees or numerical conditioning, but converged on the same formula, establishing it as a cornerstone of quasi-Newton optimization.[19][4][15][18]

In practice, $B_k$ is stored and updated as a dense symmetric matrix at each iteration, requiring $\mathcal{O}(n^2)$ storage and $\mathcal{O}(n^2)$ time per update due to the matrix-vector products and outer products involved in the correction terms; this makes it suitable for problems where $n$ is moderate, typically up to a few thousand.[1]
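As an illustration, the update is only a few lines of NumPy. The helper below is hypothetical (not from any library), and the quick check confirms the secant equation $B_{k+1} s_k = y_k$ on random data satisfying $s_k^\top y_k > 0$.

```python
import numpy as np

def bfgs_update_B(B, s, y):
    """Rank-two BFGS update of the Hessian approximation B_k; O(n^2) work."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# Quick check of the secant equation on random data with s^T y > 0.
rng = np.random.default_rng(0)
B = np.eye(4)                             # a symmetric positive definite B_k
s = rng.standard_normal(4)
y = s + 0.1 * rng.standard_normal(4)      # constructed so curvature is positive
assert s @ y > 0                          # curvature condition
B_new = bfgs_update_B(B, s, y)
print(np.allclose(B_new @ s, y))          # True: B_{k+1} s = y
```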
Inverse Hessian Formula

The inverse Hessian approximation $H_k$ in the BFGS algorithm is updated using a rank-two modification that satisfies the secant condition $H_{k+1} y_k = s_k$, where $s_k$ is the step vector and $y_k$ is the gradient difference, while preserving symmetry and positive definiteness provided $s_k^\top y_k > 0$.[18] This update is particularly efficient for computing the search direction directly as $p_k = -H_k \nabla f(x_k)$, avoiding the need to form and invert the Hessian approximation at each iteration.

One common expression for the BFGS inverse update employs the scalar $\rho_k = 1/(y_k^\top s_k)$ and takes the product form
$$H_{k+1} = (I - \rho_k s_k y_k^\top) H_k (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top.$$
This form leverages the Sherman–Morrison–Woodbury formula implicitly to ensure computational efficiency, requiring only matrix-vector products and outer products that scale as $\mathcal{O}(n^2)$ per update for $n$-dimensional problems.[18]

An equivalent two-term representation, often used in implementations for its direct expansion without auxiliary variables, is
$$H_{k+1} = H_k + \frac{(s_k^\top y_k + y_k^\top H_k y_k)(s_k s_k^\top)}{(s_k^\top y_k)^2} - \frac{H_k y_k s_k^\top + s_k y_k^\top H_k}{s_k^\top y_k}.$$
This expression derives from expanding the product form and is symmetric by construction.[18] Both forms maintain the property that the updated $H_{k+1}$ is positive definite if $H_k$ is positive definite and the curvature condition $s_k^\top y_k > 0$ holds, which is typically enforced via line search procedures.

In practice, the inverse approximation is initialized as $H_0 = I$, the identity matrix, providing a unit scaling that assumes the Hessian has eigenvalues near 1, or more adaptively as $H_0 = \gamma I$, where $\gamma$ scales based on initial gradient information, such as $\gamma = \frac{s_0^\top y_0}{y_0^\top y_0}$ after the first step, to better match the problem's curvature.[18] This initialization promotes stable early iterations and aligns with the algorithm's goal of approximating the inverse Hessian near the solution.
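A small numerical check (illustrative, not from any library) confirms that the product form and the expanded two-term form produce the same matrix:

```python
import numpy as np

# Compare the product form and the expanded form of the inverse-Hessian update
# on random symmetric positive definite data with s^T y > 0.
rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)          # symmetric positive definite H_k
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)
assert s @ y > 0                     # curvature condition
rho = 1.0 / (s @ y)

I = np.eye(n)
H1 = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
     + rho * np.outer(s, s)          # product form
H2 = (H
      + (s @ y + y @ H @ y) * np.outer(s, s) / (s @ y)**2
      - (np.outer(H @ y, s) + np.outer(s, H @ y)) / (s @ y))  # expanded form
print(np.allclose(H1, H2))           # True: the two forms agree
```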
Algorithm Procedure

Initialization and Iteration Steps
The BFGS algorithm begins with the initialization of key parameters to set up the optimization process for an unconstrained minimization problem $\min_x f(x)$, where $f: \mathbb{R}^n \to \mathbb{R}$ is a smooth function. An initial iterate $x_0$ is selected, often based on domain knowledge or a simple starting guess. An initial approximation to the Hessian is chosen as a symmetric positive definite matrix, commonly the identity matrix $B_0 = I$ for simplicity, ensuring the method starts with a well-behaved quadratic model. Alternatively, the inverse Hessian approximation may be initialized directly as $H_0 = I$. A tolerance $\epsilon > 0$ is specified to control convergence, typically on the order of $10^{-5}$ to $10^{-8}$ depending on the problem's scale and required precision. The iteration counter is set to $k = 0$.

In each iteration, the algorithm computes the gradient $\nabla f(x_k)$ at the current point. If the norm $\|\nabla f(x_k)\|$ falls below $\epsilon$, the process terminates with $x_k$ as the approximate minimizer. Otherwise, a search direction is determined using the current Hessian approximation: $p_k = -B_k^{-1} \nabla f(x_k)$, or equivalently $p_k = -H_k \nabla f(x_k)$ when working with the inverse. A step length $\alpha_k$ is then found via a line search procedure that ensures sufficient decrease in $f$. The next iterate is updated as $x_{k+1} = x_k + \alpha_k p_k$. Curvature information is gathered by computing the displacement $s_k = x_{k+1} - x_k$ and the gradient change $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. The approximation is then updated to $B_{k+1}$ (or $H_{k+1}$) using the BFGS formula, incorporating $s_k$ and $y_k$ to satisfy the secant condition while preserving positive definiteness, assuming $s_k^\top y_k > 0$. The counter is incremented to $k+1$, and the loop repeats.

The following pseudocode outlines the standard BFGS procedure using the inverse Hessian approximation for computational efficiency:

Algorithm: BFGS for Unconstrained Minimization
```
Input: Initial x_0, H_0 ≻ 0 (e.g., I), tolerance ε > 0
Set k = 0
While ||∇f(x_k)|| ≥ ε do
    p_k = -H_k ∇f(x_k)                 // Search direction
    α_k = LineSearch(f, x_k, p_k)      // Step length satisfying Wolfe conditions
    x_{k+1} = x_k + α_k p_k
    s_k = x_{k+1} - x_k
    y_k = ∇f(x_{k+1}) - ∇f(x_k)
    ρ_k = 1 / (y_k^T s_k)
    H_{k+1} = (I - ρ_k s_k y_k^T) H_k (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T
    k = k + 1
End While
Output: Approximate minimizer x_k
```
Line Search Requirements
In the BFGS algorithm, the line search phase selects a step length $\alpha_k$ along the search direction $p_k$ to ensure meaningful progress in reducing the objective function $f$, while maintaining properties essential for the method's convergence and Hessian approximation updates. This inexact line search replaces the exact minimization over $\alpha$ to reduce computational cost, but must satisfy specific conditions to guarantee descent and compatibility with the quasi-Newton update.[8]

The Armijo condition, or sufficient decrease condition, forms the core requirement for ensuring that the step provides an adequate reduction in the function value relative to the linear approximation from the gradient. It states that
$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top p_k,$$
where $c_1$ is a small constant, typically around 0.0001, controlling the fraction of the predicted decrease that must be achieved. This condition prevents excessively large steps that might overshoot the minimum and is derived from the assumption of Lipschitz continuity in the gradient, ensuring the inequality holds for sufficiently small $\alpha_k$.[20][8]

To avoid steps that are too short, which could lead to inefficient progress or violate update assumptions, the Wolfe curvature condition complements the Armijo rule by enforcing that the directional derivative at the new point does not decrease too sharply. This requires
$$\nabla f(x_k + \alpha_k p_k)^\top p_k \ge c_2 \nabla f(x_k)^\top p_k,$$
with $c_1 < c_2 < 1$, often $c_2 = 0.9$, ensuring the step captures sufficient curvature information from the function. The combined Armijo and curvature conditions, known as the strong Wolfe conditions when the curvature requirement is imposed in absolute value, yield more reliable step sizes by jointly controlling both function decrease and gradient alignment, particularly beneficial in nonconvex settings where pure Armijo might accept overly conservative steps.[21][8]

Practical algorithms for satisfying these conditions include backtracking line search, which initializes $\alpha = 1$ and iteratively reduces it by a fixed factor, such as 0.5, until the conditions hold; this simple yet effective approach typically requires few evaluations of $f$ and $\nabla f$. Alternatively, cubic interpolation fits a cubic polynomial to available function and derivative information to estimate a promising $\alpha$, accelerating convergence in smooth problems while still verifying the conditions. These line search strategies ensure the displacement $s_k$ and gradient change $y_k$ satisfy $s_k^\top y_k > 0$, preserving the positive definiteness of the Hessian approximation in subsequent BFGS updates.[8]
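A minimal backtracking sketch in Python/NumPy is shown below; it enforces only the Armijo condition by shrinking $\alpha$ and then reports whether the curvature condition happens to hold at the accepted step, since a true Wolfe line search (e.g., with bracketing and interpolation) must also be able to increase $\alpha$. All names are illustrative.

```python
import numpy as np

def backtracking(f, grad, x, p, c1=1e-4, c2=0.9, alpha=1.0, shrink=0.5, max_iter=50):
    """Shrink alpha until the Armijo condition holds, then report whether the
    Wolfe curvature condition is also satisfied at the accepted step.
    Assumes p is a descent direction (grad(x) @ p < 0)."""
    f0 = f(x)
    slope = grad(x) @ p                  # directional derivative, negative for descent
    for _ in range(max_iter):
        if f(x + alpha * p) <= f0 + c1 * alpha * slope:   # Armijo (sufficient decrease)
            break
        alpha *= shrink
    curvature_ok = grad(x + alpha * p) @ p >= c2 * slope  # Wolfe curvature condition
    return alpha, curvature_ok
```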
Theoretical Properties

Local Superlinear Convergence
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm exhibits local superlinear convergence near a stationary point $x^*$ of the objective function $f$, meaning that the iterates satisfy
$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0,$$
assuming $x_k \neq x^*$ for sufficiently large $k$.[22] This Q-superlinear rate implies that the error diminishes faster than any linear rate but does not achieve the quadratic speed of pure Newton's method.[22]

Under standard assumptions, the BFGS method achieves this local Q-superlinear convergence. Specifically, if $f$ is twice continuously differentiable in a neighborhood of $x^*$, the Hessian is Lipschitz continuous there, $x^*$ is a strong local minimizer with $\nabla^2 f(x^*)$ positive definite, and the initial Hessian approximation is positive definite with eigenvalues bounded away from zero and infinity, then starting sufficiently close to $x^*$, the BFGS iterates converge Q-superlinearly to $x^*$.[22] The preservation of positive definiteness in BFGS updates plays a crucial role in maintaining descent directions within this local neighborhood.[22]

The proof relies on a characterization of superlinear convergence for quasi-Newton methods (the Dennis–Moré condition): the iterates converge Q-superlinearly if and only if
$$\lim_{k \to \infty} \frac{\|(B_k - \nabla^2 f(x^*)) s_k\|}{\|s_k\|} = 0.$$
For BFGS, the secant condition and rank-two updates ensure that the approximation error along the step directions tends to zero near $x^*$, satisfying the characterization and yielding Newton-like steps that drive superlinear error reduction.[22] This bounded deterioration property holds under the Lipschitz assumption on the Hessian.

In comparison, linear convergence has a constant ratio bounded away from zero, while quadratic convergence (as in Newton's method) satisfies $\|x_{k+1} - x^*\| = \mathcal{O}(\|x_k - x^*\|^2)$; BFGS achieves the former improvement over steepest descent but falls short of the latter due to its approximate second-order information.[22]

Global Convergence Conditions
The BFGS algorithm demonstrates global convergence to a stationary point of the objective function $f$ from arbitrary starting points under appropriate conditions on the line search and update mechanism. Specifically, when implemented with a Wolfe line search that satisfies both the Armijo sufficient decrease condition and the curvature condition, and provided that the sequence of Hessian approximations $\{B_k\}$ remains uniformly bounded, the method generates a sequence of iterates converging to a point where the gradient vanishes.[23] This boundedness of the approximations is ensured if the objective function is twice continuously differentiable and the eigenvalues of the Hessian matrices are bounded away from zero and infinity along the iteration path.[23]

A key tool in establishing this global convergence is the Zoutendijk condition, which applies to descent methods satisfying Wolfe line searches. The condition states that
$$\sum_{k \ge 0} \cos^2 \theta_k \, \|\nabla f(x_k)\|^2 < \infty,$$
where $\theta_k$ is the angle between the negative gradient $-\nabla f(x_k)$ and the search direction $p_k$. This finite sum implies that $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$, ensuring the iterates approach a stationary point. In the context of BFGS, the quasi-Newton updates and the line search guarantee that the search directions are descent directions with $\cos \theta_k$ bounded away from zero sufficiently often, satisfying the Zoutendijk condition and yielding global convergence even for nonconvex functions.[23]

To maintain the positive definiteness of the Hessian approximations and ensure descent, damping strategies are incorporated when the curvature condition fails, i.e., when $s_k^\top y_k \le 0$, where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. In such cases, the update is modified by introducing a damping parameter that replaces $y_k$ with a convex combination of $y_k$ and $B_k s_k$, restoring $s_k^\top y_k > 0$ while preserving the secant condition approximately.[24] This damping prevents indefinite updates and ensures the method remains globally convergent by keeping the approximations positive definite.[24]

These global convergence guarantees rely on standard assumptions about the objective function $f$: it must be coercive, satisfying $f(x) \to \infty$ as $\|x\| \to \infty$, which bounds the sublevel sets and keeps iterates in a compact region; additionally, the gradient must be Lipschitz continuous, ensuring the line search parameters are well-defined and the function decrease is controlled.[23] Under these conditions, the BFGS method not only converges globally but also achieves local superlinear convergence rates once sufficiently close to the solution.[23]

Variants
Limited-Memory BFGS
The limited-memory BFGS (L-BFGS) algorithm addresses the storage limitations of the full BFGS method for large-scale unconstrained optimization problems, where the dimension $n$ is very high (typically $n > 1000$). Instead of maintaining the full $n \times n$ approximation of the inverse Hessian, L-BFGS stores only the $m$ most recent pairs of vectors $(s_i, y_i)$, where $s_i = x_{i+1} - x_i$ and $y_i = \nabla f(x_{i+1}) - \nabla f(x_i)$, with $m$ being a small fixed integer (often 3 to 20). This implicit representation allows the algorithm to compute the search direction efficiently without explicitly forming $H_k$.

The core of L-BFGS is a two-loop recursion procedure that approximates the action of $H_k$ on the gradient $\nabla f(x_k)$; a sketch is given after this paragraph. In the first loop (backward loop), starting from $q = \nabla f(x_k)$, the algorithm iteratively updates $q$ using the stored pairs in reverse chronological order to incorporate curvature information, computing scalar coefficients along the way. The second loop (forward loop) then reverses the process, accumulating the search direction $p_k = -H_k \nabla f(x_k)$ so that the implicit approximation satisfies the secant conditions for the stored pairs. This recursion avoids direct matrix operations beyond the stored vectors, resulting in a computational cost of $\mathcal{O}(mn)$ per iteration.

In terms of storage, L-BFGS requires only $\mathcal{O}(mn)$ space to hold the vectors and auxiliary scalars, compared to $\mathcal{O}(n^2)$ for the full BFGS approximation, making it practical for problems with millions of variables where full storage is infeasible. The trade-off is a potentially slower convergence rate, as the limited history may lead to fewer iterations of superlinear convergence near the solution, but the overall efficiency gains in memory-constrained environments outweigh this for high-dimensional applications. The method was first introduced by Nocedal in 1980 as a limited-storage update for quasi-Newton matrices, and later refined into the full L-BFGS algorithm by Liu and Nocedal in 1989, with extensive numerical validation on large-scale test problems.
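The two-loop recursion is compact enough to state directly. The following Python/NumPy sketch (function name illustrative) computes the search direction $p_k = -H_k \nabla f(x_k)$ from the stored pairs, using the common initial scaling $H_k^0 = \gamma_k I$ with $\gamma_k = s^\top y / y^\top y$ taken from the newest pair.

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Two-loop recursion: return p_k = -H_k grad_k implicitly from the m most
    recent (s, y) pairs, using O(mn) work and storage instead of O(n^2)."""
    q = np.asarray(grad_k, dtype=float).copy()
    alphas = []
    # First (backward) loop: newest pair to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial scaling H_k^0 = gamma * I with gamma = s^T y / y^T y (newest pair).
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    # Second (forward) loop: oldest pair to newest.
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        beta = rho * (y @ q)
        q += (a - beta) * s
    return -q
```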
Constrained and Damped Variants

The BFGS-B algorithm addresses bound-constrained optimization problems by extending the quasi-Newton framework to incorporate simple bounds on the variables. It employs a gradient projection method to project the search direction onto the feasible region defined by the bounds, ensuring that iterates remain within the bounds during line searches. When the projected direction encounters an active bound, recovery steps are activated to adjust the approximation and restore feasibility, preventing indefinite Hessian updates. This approach maintains the superlinear convergence properties of BFGS in the unconstrained case while handling bounds efficiently, as demonstrated in numerical experiments on large-scale problems.[25]

The damped BFGS variant introduces modifications to the standard update formula to ensure the Hessian approximation remains positive definite, particularly when the curvature condition $s_k^\top y_k \ge \sigma \, s_k^\top B_k s_k$ (with $\sigma$ typically set to 0.2) is violated due to insufficient decrease in the objective function. In such cases, the gradient difference is scaled by a factor $\theta_k \in (0, 1)$, effectively replacing $y_k$ with $\theta_k y_k + (1 - \theta_k) B_k s_k$ in the secant equation, where $B_k$ is the current Hessian approximation. This damping preserves the positive definiteness required for descent directions and has been shown to enhance robustness in nonconvex or ill-conditioned problems; see the sketch at the end of this section. The technique originated in constrained optimization contexts but applies broadly to unconstrained settings.[26]

Stochastic BFGS (SBFGS) adapts the method for large-scale machine learning tasks involving noisy or stochastic gradients, where exact gradients are unavailable or computationally expensive. It incorporates regularization into the Hessian update to mitigate the variance introduced by gradient noise, often by adding a proximal term or averaging multiple stochastic gradient differences to form a more stable $y_k$. This averaging stabilizes the quasi-Newton approximation without requiring full-batch computations, enabling efficient training of models like logistic regression on high-dimensional data. Theoretical analysis confirms almost-sure convergence under mild assumptions on the stochastic oracle.[27]

Post-2020 research has explored hybrid BFGS variants that integrate trust-region strategies to bolster global convergence guarantees, particularly for nonconvex objectives where line search alone may fail. These hybrids solve trust-region subproblems using the BFGS approximation within a factorization framework, balancing local superlinearity with broader exploration of the search space. While no transformative breakthroughs have emerged, such methods demonstrate improved reliability in parameter estimation for differential equation models.[28]
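A sketch of the damped modification described above (a Powell-style damping rule; function name illustrative, constants following the $\sigma = 0.2$ convention mentioned in the text):

```python
import numpy as np

def damped_y(B, s, y, sigma=0.2):
    """Powell-style damping: if the curvature s^T y falls below sigma * s^T B s,
    replace y by theta*y + (1 - theta)*B s so the damped curvature equals
    sigma * s^T B s, keeping the BFGS update positive definite."""
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    if sy >= sigma * sBs:
        return y                               # curvature adequate, no damping
    theta = (1.0 - sigma) * sBs / (sBs - sy)   # lies in (0, 1) when sy < sigma * sBs
    return theta * y + (1.0 - theta) * Bs
```

The damped vector is then used in place of $y_k$ in the standard BFGS update, which degenerates gracefully toward leaving $B_k$ unchanged as $\theta_k \to 0$.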
Implementations and Applications

Software Libraries
The SciPy library in Python implements the BFGS algorithm via the scipy.optimize.minimize function using the 'BFGS' method, which performs unconstrained minimization by approximating the inverse Hessian with BFGS updates and supports automatic numerical Jacobian computation if no analytical gradient is provided.[29] It also includes the limited-memory variant through the 'L-BFGS-B' method, which handles bound constraints and uses a compact representation of the Hessian approximation to reduce memory usage for large-scale problems.[30] Key parameters include gtol for the gradient norm tolerance (defaulting to 1e-5) and maxiter for the maximum number of iterations (default 200 times the number of variables); for ill-conditioned cases, the documentation recommends increasing gtol (e.g., to 1e-3) to avoid precision loss in finite-difference approximations.[29]
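A typical call looks like the following (a minimal usage sketch against the documented SciPy API, using SciPy's built-in Rosenbrock test function and its analytic gradient):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# Unconstrained BFGS with an analytic gradient and explicit tolerances.
res = minimize(rosen, x0, jac=rosen_der, method='BFGS',
               options={'gtol': 1e-5, 'maxiter': 1000})
print(res.x, res.nit)          # converges to [1, 1]

# Bound-constrained limited-memory variant.
res_b = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B',
                 bounds=[(-2.0, 2.0), (-2.0, 2.0)])
print(res_b.x)
```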
In Julia, the Optim.jl package provides both BFGS and L-BFGS solvers as part of its optimization toolkit, with built-in support for automatic differentiation through compatible packages like ForwardDiff.jl to compute gradients efficiently. The standard BFGS uses full quasi-Newton updates for the Hessian approximation, while L-BFGS employs a limited-memory approach with a default of 10 correction vectors and an option to scale the initial inverse Hessian for better stability. Convergence tolerances default to a gradient norm of approximately 1e-8, and preconditioning via user-supplied functions can address ill-conditioning by improving the conditioning of the approximate Hessian during iterations.
NLopt, a cross-language library with C++ bindings, incorporates a low-storage variant of L-BFGS (algorithm NLOPT_LD_LBFGS) for gradient-based local optimization of nonlinear functions, either unconstrained or bound-constrained.[32] This implementation, derived from Fortran code by Ladislav Luksan and adapted with NLopt's termination criteria, applies variable-metric updates via Strang's recurrences and allows adjustment of the vector storage size (default determined heuristically) to balance memory and accuracy.[32] Tolerances must be explicitly set by the user, as NLopt defaults all to 0 (no enforcement); common choices include ftol_rel=1e-4 for relative function value changes and xtol_rel=1e-8 for parameters, without specific built-in mechanisms for ill-conditioning beyond the limited-memory design that inherently mitigates full Hessian storage issues.[32]
MATLAB's Optimization Toolbox implements BFGS within the fminunc function for unconstrained multivariable minimization, using quasi-Newton updates to approximate the Hessian (default 'bfgs' option) and supporting user-provided gradients for faster convergence.[33] For large-scale or memory-constrained problems, the limited-memory L-BFGS variant ('lbfgs') can be selected with a configurable number of corrections (e.g., 10 by default), which helps manage ill-conditioning by avoiding dense matrix storage.[33] Default tolerances include an optimality threshold of 1e-6 for the projected gradient norm and a step tolerance of 1e-6, with a maximum of 400 iterations; the function also employs cubic or quadratic line searches to ensure sufficient decrease and handle potential Hessian ill-conditioning.[33]
Across these libraries, default tolerances like gradient norms around 1e-5 to 1e-8 promote reliable convergence for well-conditioned problems, but users must often tune them higher for ill-conditioned objectives to prevent stalling due to numerical instability in Hessian approximations.[29][33] Limited-memory variants, such as L-BFGS-B and L-BFGS, are particularly recommended for high-dimensional cases where full BFGS updates could exacerbate conditioning issues through accumulated errors.[30][34]
Real-World Uses
The BFGS algorithm finds extensive application in machine learning for parameter estimation tasks, particularly in maximum likelihood frameworks such as logistic regression, where it efficiently optimizes non-convex loss functions by approximating the Hessian matrix to guide descent directions.[35] In neural network training, variants like L-BFGS accelerate convergence for medium-scale problems by leveraging limited-memory updates, enabling faster iteration over stochastic gradients in backpropagation.[36]

In operations research, BFGS supports supply chain optimization by solving nonlinear programs that balance inventory, transportation, and risk factors, often integrated with meta-heuristics to handle multi-objective scenarios like network design under failure risks.[37] For portfolio management, it enhances mean-variance optimization models by incorporating higher-order constraints and information discrepancies, with L-BFGS-B variants providing robust solutions for large asset sets through bounded minimization.[38]

Physics simulations, especially molecular dynamics, employ L-BFGS within software like GROMACS for energy minimization, where it iteratively reduces potential energy landscapes in biomolecular structures by approximating curvature without full Hessian computation, ensuring stable configurations for subsequent dynamics runs.[39] In engineering, BFGS addresses structural design optimization under nonlinear constraints, such as reliability analysis in civil and mechanical systems, by hybridizing with methods like HLRF to compute failure probabilities and iterate toward feasible designs that minimize weight while satisfying stress and displacement limits.[40] Economic modeling leverages BFGS for estimating vector autoregression (VAR) models, particularly time-varying variants, where it optimizes likelihood functions over dynamic parameters to capture evolving relationships in macroeconomic data like GDP fluctuations and policy impacts.[41]

References
- https://julianlsolvers.github.io/Optim.jl/stable/

