Broyden–Fletcher–Goldfarb–Shanno algorithm

from Wikipedia

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems.[1] Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations (or approximate gradient evaluations) via a generalized secant method.[2]

Since the updates of the BFGS curvature matrix do not require matrix inversion, its computational complexity is only O(n^2), compared to O(n^3) in Newton's method. Also in common use is L-BFGS, which is a limited-memory version of BFGS that is particularly suited to problems with very large numbers of variables (e.g., >1000). The BFGS-B variant handles simple box constraints.[3] The BFGS matrix also admits a compact representation, which makes it better suited for large constrained problems.

The algorithm is named after Charles George Broyden, Roger Fletcher, Donald Goldfarb and David Shanno.[4][5][6][7] It is an instance of a more general algorithm by John Greenstadt.[8]

Rationale


The optimization problem is to minimize f(x), where x is a vector in R^n, and f is a differentiable scalar function. There are no constraints on the values that x can take.

The algorithm begins at an initial estimate x_0 for the optimal value and proceeds iteratively to get a better estimate at each stage.

The search direction p_k at stage k is given by the solution of the analogue of the Newton equation:

B_k p_k = -\nabla f(x_k),

where B_k is an approximation to the Hessian matrix at x_k, which is updated iteratively at each stage, and \nabla f(x_k) is the gradient of the function evaluated at x_k. A line search in the direction p_k is then used to find the next point x_{k+1} by minimizing f(x_k + \alpha p_k) over the scalar \alpha > 0.

The quasi-Newton condition imposed on the update of B_{k+1} is

B_{k+1} (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k).

Let y_k = \nabla f(x_{k+1}) - \nabla f(x_k) and s_k = x_{k+1} - x_k, then B_{k+1} satisfies

B_{k+1} s_k = y_k,

which is the secant equation.

The curvature condition s_k^T y_k > 0 should be satisfied for B_{k+1} to be positive definite, which can be verified by pre-multiplying the secant equation with s_k^T. If the function is not strongly convex, then the condition has to be enforced explicitly, e.g. by finding a point x_{k+1} satisfying the Wolfe conditions, which entail the curvature condition, using line search.

Instead of requiring the full Hessian matrix at the point x_{k+1} to be computed as B_{k+1}, the approximate Hessian at stage k is updated by the addition of two matrices:

B_{k+1} = B_k + U_k + V_k.

Both U_k and V_k are symmetric rank-one matrices, but their sum is a rank-two update matrix. The BFGS and DFP updating matrices both differ from their predecessor by a rank-two matrix. Another, simpler rank-one method is known as the symmetric rank-one method, which does not guarantee positive definiteness. In order to maintain the symmetry and positive definiteness of B_{k+1}, the update form can be chosen as B_{k+1} = B_k + \alpha u u^T + \beta v v^T. Imposing the secant condition, B_{k+1} s_k = y_k. Choosing u = y_k and v = B_k s_k, we can obtain:[9]

\alpha = \frac{1}{y_k^T s_k}, \qquad \beta = -\frac{1}{s_k^T B_k s_k}.

Finally, we substitute \alpha and \beta into B_{k+1} = B_k + \alpha u u^T + \beta v v^T and get the update equation of B_{k+1}:

B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}.
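The update equation above can be written out directly in NumPy; `bfgs_update_B` is a name chosen here for illustration. A useful sanity check is that the updated matrix satisfies the secant equation exactly, by construction.

```python
import numpy as np

def bfgs_update_B(B, s, y):
    """Rank-two BFGS update of the Hessian approximation:
    B_{k+1} = B_k + y y^T / (y^T s) - B s s^T B / (s^T B s)."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# Sanity check: the updated matrix reproduces the secant equation B_{k+1} s = y.
rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)   # keeps the curvature y^T s positive here
B1 = bfgs_update_B(B, s, y)
print(np.allclose(B1 @ s, y))          # True: secant condition holds
```

The secant property is exact by algebra: the y-term contributes y after cancellation, while the B_k s_k term cancels against B_k s_k.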

Algorithm


Consider the following unconstrained optimization problem

\min_x f(x),

where f: R^n \to R is a nonlinear, twice-differentiable objective function.

From an initial guess x_0 and an initial guess of the Hessian matrix B_0, the following steps are repeated as x_k converges to the solution:

  1. Obtain a direction p_k by solving B_k p_k = -\nabla f(x_k).
  2. Perform a one-dimensional optimization (line search) to find an acceptable stepsize \alpha_k in the direction found in the first step. If an exact line search is performed, then \alpha_k = \arg\min_\alpha f(x_k + \alpha p_k). In practice, an inexact line search usually suffices, with an acceptable \alpha_k satisfying the Wolfe conditions.
  3. Set s_k = \alpha_k p_k and update x_{k+1} = x_k + s_k.
  4. y_k = \nabla f(x_{k+1}) - \nabla f(x_k).
  5. B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}.

Convergence can be determined by observing the norm of the gradient; given some \epsilon > 0, one may stop the algorithm when \|\nabla f(x_k)\| \leq \epsilon. If B_0 is initialized with B_0 = I, the first step will be equivalent to a gradient descent, but further steps are more and more refined by B_k, the approximation to the Hessian.

The first step of the algorithm is carried out using the inverse of the matrix B_k, which can be obtained efficiently by applying the Sherman–Morrison formula to the step 5 of the algorithm, giving

H_{k+1} = B_{k+1}^{-1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}.

This can be computed efficiently without temporary matrices, recognizing that H_k is symmetric, and that y_k^T H_k y_k and s_k^T y_k are scalars, using an expansion such as

H_{k+1} = H_k + \frac{(s_k^T y_k + y_k^T H_k y_k)(s_k s_k^T)}{(s_k^T y_k)^2} - \frac{H_k y_k s_k^T + s_k y_k^T H_k}{s_k^T y_k}.

Therefore, in order to avoid any matrix inversion, the inverse of the Hessian can be approximated instead of the Hessian itself: H_k \approx [\nabla^2 f(x_k)]^{-1}.[10]
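The equivalence between the direct and inverse updates can be checked numerically. The sketch below (function names are my own) applies the product-form inverse update and verifies that it matches the explicit inverse of the B-update, as the Sherman–Morrison derivation predicts.

```python
import numpy as np

def bfgs_update_H(H, s, y):
    """Product-form BFGS update of the inverse-Hessian approximation:
    H_{k+1} = (I - s y^T / y^T s) H (I - y s^T / y^T s) + s s^T / (y^T s)."""
    rho = 1.0 / (y @ s)
    n = len(s)
    V = np.eye(n) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def bfgs_update_B(B, s, y):
    """Direct BFGS update of the Hessian approximation (step 5 above)."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# The H-update tracks the inverse of the B-update exactly:
rng = np.random.default_rng(1)
B = np.eye(3) + 0.1 * np.diag([1.0, 2.0, 3.0])
H = np.linalg.inv(B)
s = rng.standard_normal(3)
y = B @ s + 0.05 * rng.standard_normal(3)  # curvature y^T s stays positive here
print(np.allclose(bfgs_update_H(H, s, y), np.linalg.inv(bfgs_update_B(B, s, y))))
```

Maintaining H_k directly turns the per-iteration cost into a few O(n^2) outer products, with no linear solve.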

From an initial guess x_0 and an approximate inverse Hessian matrix H_0, the following steps are repeated as x_k converges to the solution:

  1. Obtain a direction p_k by computing p_k = -H_k \nabla f(x_k).
  2. Perform a one-dimensional optimization (line search) to find an acceptable stepsize \alpha_k in the direction found in the first step. If an exact line search is performed, then \alpha_k = \arg\min_\alpha f(x_k + \alpha p_k). In practice, an inexact line search usually suffices, with an acceptable \alpha_k satisfying the Wolfe conditions.
  3. Set s_k = \alpha_k p_k and update x_{k+1} = x_k + s_k.
  4. y_k = \nabla f(x_{k+1}) - \nabla f(x_k).
  5. H_{k+1} = H_k + \frac{(s_k^T y_k + y_k^T H_k y_k)(s_k s_k^T)}{(s_k^T y_k)^2} - \frac{H_k y_k s_k^T + s_k y_k^T H_k}{s_k^T y_k}.

In statistical estimation problems (such as maximum likelihood or Bayesian inference), credible intervals or confidence intervals for the solution can be estimated from the inverse of the final Hessian matrix [citation needed]. However, these quantities are technically defined by the true Hessian matrix, and the BFGS approximation may not converge to the true Hessian matrix.[11]

Further developments


The BFGS update formula heavily relies on the curvature s_k^T y_k being strictly positive and bounded away from zero. This condition is satisfied when we perform a line search with Wolfe conditions on a convex target. However, some real-life applications (like Sequential Quadratic Programming methods) routinely produce negative or nearly-zero curvatures. This can occur when optimizing a nonconvex target or when employing a trust-region approach instead of a line search. It is also possible to produce spurious values due to noise in the target.

In such cases, one of the so-called damped BFGS updates can be used (see [12]), which modify s_k and/or y_k in order to obtain a more robust update.

Notable implementations


Notable open source implementations are:

  • ALGLIB implements BFGS and its limited-memory version in C++ and C#
  • GNU Octave uses a form of BFGS in its fsolve function, with trust region extensions.
  • The GSL implements BFGS as gsl_multimin_fdfminimizer_vector_bfgs2.[13]
  • In R, the BFGS algorithm (and the L-BFGS-B version that allows box constraints) is implemented as an option of the base function optim().[14]
  • In SciPy, the scipy.optimize.fmin_bfgs function implements BFGS.[15] It is also possible to run BFGS using any of the L-BFGS algorithms by setting the parameter L to a very large number. It is also one of the default methods used when running scipy.optimize.minimize with no constraints.[16]
  • In Julia, the Optim.jl package implements BFGS and L-BFGS as a solver option to the optimize() function (among other options).[17]
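As a usage sketch for the SciPy entry above (assuming SciPy is installed), the modern interface is `scipy.optimize.minimize` with `method='BFGS'`, shown here on the Rosenbrock test function that SciPy ships with:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the 5-D Rosenbrock function with BFGS and an analytic gradient.
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='BFGS', jac=rosen_der,
               options={'gtol': 1e-8})
print(res.x)  # close to the minimizer [1, 1, 1, 1, 1]
```

Supplying the gradient via `jac` avoids finite-difference approximations; without it, SciPy estimates the gradient numerically at extra function-evaluation cost.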

Notable proprietary implementations include:

  • The large scale nonlinear optimization software Artelys Knitro implements, among others, both BFGS and L-BFGS algorithms.
  • In the MATLAB Optimization Toolbox, the fminunc function uses BFGS with cubic line search when the problem size is set to "medium scale."
  • Mathematica includes BFGS.
  • LS-DYNA also uses BFGS to solve implicit problems.

from Grokipedia
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative procedure for solving unconstrained nonlinear optimization problems of the form \min_x f(x), where f is a smooth function. It operates as a quasi-Newton method, approximating the inverse Hessian H_k \approx \nabla^2 f(x_k)^{-1} through low-rank updates based on gradient differences, thereby avoiding the direct computation of second derivatives while achieving superlinear local convergence rates under suitable conditions. The method generates search directions p_k = -H_k \nabla f(x_k) and advances via line searches that satisfy the Wolfe conditions to ensure descent and preservation of positive definiteness. Independently developed in 1970, the algorithm derives its name from four researchers who proposed equivalent update formulas in separate publications: C. G. Broyden in The Convergence of a Class of Double-rank Minimization Algorithms, R. Fletcher in A New Approach to Variable Metric Algorithms, D. Goldfarb in A Family of Variable-Metric Methods Derived by Variational Means, and D. F. Shanno in Conditioning of Quasi-Newton Methods for Function Minimization. The core BFGS update for the inverse Hessian is given by

H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k},

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k), satisfying the secant condition H_{k+1} y_k = s_k while maintaining symmetry and positive definiteness if H_0 is positive definite and y_k^T s_k > 0. This update combines elements from the earlier Davidon–Fletcher–Powell (DFP) method but demonstrates superior robustness and performance in practice. The BFGS algorithm's efficiency stems from its O(n^2) storage and computational cost per iteration for n-dimensional problems, making it suitable for medium-scale optimization, though its strongest theoretical guarantees rely on line searches satisfying the Wolfe conditions.
For large-scale applications, variants like limited-memory BFGS (L-BFGS) reduce storage to O(nm) using a fixed number m of past (s_k, y_k) pairs, enabling scalability to high dimensions while retaining much of the convergence speed. BFGS has become a cornerstone in numerical optimization, underpinning widely used software libraries and finding broad application, including extensions for constrained problems. Its global convergence can be ensured with modifications like Powell damping or restart strategies when line searches are inexact.

Background

Unconstrained Optimization Problems

The unconstrained minimization problem involves finding a point x^* \in R^n that minimizes a smooth nonlinear function f: R^n \to R, expressed as \min_{x \in R^n} f(x). Here, f is typically assumed to be continuously differentiable and bounded below, with no restrictions imposed on the decision variables x. This formulation arises in numerous applications, including parameter estimation, structural design, and model training, where the objective is to identify local or global minima without boundary or equality constraints. Optimization algorithms for these problems rely on first-order information from the gradient \nabla f(x), which indicates the direction of steepest ascent at x; the negative gradient thus provides a descent direction. At a local minimizer x^*, the first-order necessary condition requires \nabla f(x^*) = 0. Second-order information is captured by the Hessian \nabla^2 f(x), a symmetric n \times n matrix of second partial derivatives that encodes the local curvature of f. For x^* to qualify as a local minimizer, \nabla^2 f(x^*) must be positive semidefinite, ensuring the function increases in all directions from that point; if positive definite, x^* is a strict local minimizer. The classical Newton method, which solves the system \nabla^2 f(x_k) s_k = -\nabla f(x_k) for the search direction, exemplifies the use of exact second-order information for rapid convergence near solutions. To address the high computational cost of evaluating and inverting the full Hessian, particularly for high-dimensional problems, iterative methods approximate the Hessian or its inverse using only gradient differences from successive iterations. These approximations enable second-order-like efficiency without repeated second-derivative computations, making them suitable for large-scale nonlinear optimization.
The development of such methods gained momentum in the 1960s and 1970s, driven by the advent of digital computers and the need for computationally efficient algorithms in fields like engineering and scientific computing, where full Hessian evaluations were often infeasible due to limited resources. This era saw significant advancements in numerical optimization, including early quasi-Newton techniques that prioritized gradient-based updates to balance accuracy and speed.

Newton and Quasi-Newton Methods

Newton's method is an iterative second-order algorithm designed to find local minima of twice-differentiable functions in unconstrained optimization. The method generates the next iterate via the formula

x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),

where \nabla f(x_k) denotes the gradient vector and \nabla^2 f(x_k) the Hessian matrix evaluated at the current point x_k. This approach leverages a quadratic model of the objective function, leading to quadratic convergence rates in a neighborhood of the solution, where the distance to the optimum roughly squares with each successful iteration. Despite its rapid local convergence, Newton's method faces significant computational challenges, particularly for high-dimensional problems. Forming the Hessian typically requires evaluating second derivatives, which can be costly both analytically and numerically, while solving the associated linear system or performing matrix inversion demands O(n^3) operations per iteration in n dimensions. These overheads limit its practicality for large-scale optimization, where even a single iteration may become prohibitively expensive. Quasi-Newton methods emerged as an efficient alternative, approximating the Hessian (or its inverse) through low-rank updates based solely on gradient evaluations at successive points, thereby avoiding explicit second-derivative computations. These updates incorporate differences in gradients to refine the curvature estimate iteratively, often guided by the secant condition to ensure the approximation aligns with observed function behavior along search directions. Such methods attain superlinear convergence, faster than the linear rate of first-order techniques like gradient descent, while inheriting much of Newton's efficiency near the solution.
A key advantage of quasi-Newton approaches lies in their reduced complexity: they maintain and update an n \times n approximation matrix using O(n^2) storage and time per iteration, in contrast to the O(n^3) demands of full Newton steps. This trade-off enables broader applicability in problems where n is moderately large, balancing speed and accuracy without the full second-order burden.

Mathematical Foundation

Secant Condition and Rank-Two Updates

The secant condition forms the cornerstone of quasi-Newton methods for approximating the Hessian in unconstrained optimization. It requires that the updated Hessian approximation B_{k+1} satisfies B_{k+1} s_k = y_k, where s_k = x_{k+1} - x_k is the step taken from x_k to x_{k+1}, and y_k = \nabla f(x_{k+1}) - \nabla f(x_k) is the corresponding change in the gradient. This ensures that the approximation exactly reproduces the gradient difference along the step, thereby capturing the local curvature along the search direction without requiring second derivatives. To satisfy the secant condition while maintaining the symmetry of the Hessian approximation, quasi-Newton updates are typically constructed as rank-two modifications to the previous approximation B_k. A rank-two update takes the general form B_{k+1} = B_k + u v^T + v u^T, where u and v are vectors chosen to enforce the secant equation and symmetry, minimizing the deviation from B_k in a suitable norm such as the Frobenius norm. This low-rank adjustment allows efficient computation and storage, as only vector operations are needed rather than full matrix inversions. Early quasi-Newton methods, such as the Davidon–Fletcher–Powell (DFP) update introduced in the 1960s, exemplify this rank-two structure for the Hessian approximation. The DFP method, originally developed by Davidon and refined by Fletcher and Powell, updates B_k by emphasizing the gradient change y_k in a way that generates conjugate directions for quadratics, achieving exact convergence in at most n steps for n-dimensional problems. However, DFP exhibits limitations in practice, particularly sensitivity to inexact line searches or small y_k^T s_k values, which can lead to unstable approximations that overly weight recent curvature information and degrade performance on ill-conditioned problems.
The BFGS algorithm improves upon DFP by employing a rank-two update that more robustly balances step and gradient-change information, effectively satisfying the secant condition for both the Hessian and its inverse in a complementary manner. Independently proposed by Broyden, Fletcher, Goldfarb, and Shanno in 1970, this formulation derives from variational principles that minimize weighted Frobenius norm changes, resulting in greater robustness and reduced sensitivity to line-search inaccuracies compared to DFP. These enhancements make BFGS particularly effective for maintaining desirable properties like positive definiteness under the curvature condition y_k^T s_k > 0.

Positive Definiteness Preservation

In the BFGS algorithm, maintaining positive definiteness of the Hessian approximation B_k \succ 0 is essential to ensure that the computed search direction p_k = -B_k^{-1} \nabla f(x_k) is a descent direction for the objective function f. This follows because the inner product p_k^T \nabla f(x_k) = -\nabla f(x_k)^T B_k^{-1} \nabla f(x_k) < 0 holds whenever \nabla f(x_k) \neq 0, as B_k^{-1} inherits positive definiteness from B_k. The preservation of this property relies on the curvature condition y_k^T s_k > 0, where s_k = x_{k+1} - x_k denotes the step taken and y_k = \nabla f(x_{k+1}) - \nabla f(x_k) is the gradient difference. This condition is typically enforced through a line search that satisfies the Wolfe conditions, ensuring the function decreases sufficiently and exhibits adequate curvature along s_k, thereby preventing the approximation from becoming indefinite. Under the BFGS update, if B_k \succ 0 and y_k^T s_k > 0, then the next approximation B_{k+1} \succ 0. A sketch of the proof proceeds by showing that for any nonzero vector v, the quadratic form v^T B_{k+1} v > 0; the update can be expressed as B_{k+1} = W^T B_k W + \frac{y_k y_k^T}{y_k^T s_k}, where W = I - \frac{s_k s_k^T B_k}{s_k^T B_k s_k}, so v^T B_{k+1} v = (W v)^T B_k (W v) + \frac{(y_k^T v)^2}{y_k^T s_k}. Since B_k \succ 0, the first term is nonnegative and zero only if W v = 0, i.e., v is parallel to s_k. In that case, if y_k^T v = 0, then v = 0 (as y_k^T s_k > 0); otherwise, the second term is positive. Compared to the Davidon–Fletcher–Powell (DFP) method, which preserves positive definiteness under the identical curvature condition, BFGS demonstrates superior robustness in numerical practice, particularly in self-correcting poor initial approximations and maintaining stability across iterations.
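The preservation argument can be illustrated numerically: starting from an ill-conditioned but positive definite matrix and applying updates whose curvature y^T s is positive, every iterate remains positive definite. The function name below is my own.

```python
import numpy as np

def bfgs_update_B(B, s, y):
    """Direct BFGS update: B + y y^T/(y^T s) - B s s^T B/(s^T B s)."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# Apply a sequence of random updates with positive curvature to an
# ill-conditioned starting matrix; the smallest eigenvalue never crosses zero.
rng = np.random.default_rng(7)
B = np.diag([1e-2, 1.0, 1e2])
for _ in range(10):
    s = rng.standard_normal(3)
    y = s + 0.2 * rng.standard_normal(3)   # y^T s ~ ||s||^2, robustly positive
    if y @ s <= 1e-8:                      # skip any step violating the curvature condition
        continue
    B = bfgs_update_B(B, s, y)
print(np.linalg.eigvalsh(B).min() > 0)     # True: positive definiteness preserved
```

Dropping the curvature guard (or feeding y^T s < 0) breaks the guarantee, which is exactly why Wolfe line searches or damping are used in practice.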
This preservation of positive definiteness underpins the algorithm's global convergence guarantees when paired with appropriate line searches.

The BFGS Update

Hessian Approximation Formula

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) update provides a rank-two modification to the current Hessian approximation B_k to obtain B_{k+1}, ensuring it remains symmetric and positive definite under suitable conditions. The explicit formula is given by

B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k},

where s_k = x_{k+1} - x_k denotes the step taken from the current iterate x_k to the next x_{k+1}, and y_k = \nabla f(x_{k+1}) - \nabla f(x_k) is the corresponding change in the gradient of the objective function f. This update satisfies the secant condition B_{k+1} s_k = y_k, which ensures that the approximation matches the observed curvature along the step direction. The derivation of this formula proceeds by seeking a minimal perturbation to B_k that fulfills the secant condition while preserving desirable properties such as symmetry and positive definiteness. Specifically, it minimizes the change from B_k in a least-squares sense, formulated as minimizing the weighted Frobenius norm \|B_{k+1} - B_k\|_W subject to the secant and symmetry constraints, where the weighting W is chosen to align with the inverse Hessian structure for scale invariance. This variational approach, solved using Lagrange multipliers, yields the rank-two correction terms that balance fidelity to prior information with new curvature information from s_k and y_k. The resulting update is unique within the Broyden family for maintaining positive definiteness when y_k^T s_k > 0, which holds under standard line search conditions ensuring descent. This specific form of the update was independently derived in 1970 by four researchers: Charles G. Broyden in his analysis of double-rank minimization algorithms, Roger Fletcher through a novel variable metric framework, Donald Goldfarb via variational principles, and David F. Shanno by focusing on optimal conditioning of quasi-Newton methods.
Each contribution emphasized different aspects, such as convergence guarantees or numerical conditioning, but converged on the same formula, establishing it as a cornerstone of quasi-Newton optimization. In practice, B_k is stored and updated as a dense n \times n matrix at each iteration, requiring O(n^2) storage and O(n^2) time per update due to the matrix-vector products and outer products involved in the correction terms; this makes it suitable for problems where n is moderate, typically up to a few thousand.

Inverse Hessian Formula

The inverse Hessian approximation in the BFGS algorithm is updated using a rank-two modification that satisfies the secant condition H_{k+1} y_k = s_k, where s_k = x_{k+1} - x_k is the step vector and y_k = \nabla f(x_{k+1}) - \nabla f(x_k) is the gradient difference, while preserving symmetry and positive definiteness provided y_k^T s_k > 0. This update is particularly efficient for computing the search direction directly as p_k = -H_k \nabla f(x_k), avoiding the need to form and invert the Hessian approximation B_k = H_k^{-1} at each iteration. One common expression for the BFGS inverse update employs an auxiliary matrix V_k and scalar \rho_k:

V_k = I - \frac{y_k s_k^T}{\rho_k}, \qquad H_{k+1} = V_k^T H_k V_k + \frac{s_k s_k^T}{\rho_k},

where \rho_k = y_k^T s_k. This form leverages the Sherman–Morrison–Woodbury formula implicitly to ensure computational efficiency, requiring only matrix-vector products and outer products that scale as O(n^2) per update for n-dimensional problems. An equivalent two-term representation, often used in implementations for its direct expansion without auxiliary variables, is

H_{k+1} = H_k + \frac{(s_k^T y_k + y_k^T H_k y_k)(s_k s_k^T)}{(s_k^T y_k)^2} - \frac{H_k y_k s_k^T + s_k y_k^T H_k}{s_k^T y_k}.

This expression derives from expanding the V_k form and is symmetric by construction. Both forms maintain the property that the updated H_{k+1} is positive definite if H_k is positive definite and the curvature condition \rho_k > 0 holds, which is typically enforced via the line search.
In practice, the inverse approximation is initialized as H_0 = I, the identity matrix, providing a unit scaling that assumes the Hessian has eigenvalues near 1, or more adaptively as H_0 = \gamma I, where \gamma scales based on initial curvature information, such as \gamma = \frac{s_0^T y_0}{y_0^T y_0} after the first step, to better match the problem's scaling. This initialization promotes stable early iterations and aligns with the algorithm's goal of approximating the inverse Hessian near the solution.
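The scaled initialization is a one-liner; the helper name below is my own. For a quadratic f(x) = x^T A x / 2 the gradient change is y_0 = A s_0, so \gamma is a Rayleigh-quotient-style estimate that lands inside the spectrum of A^{-1}:

```python
import numpy as np

def scaled_initial_H(s0, y0):
    """Scaled initial inverse-Hessian H_0 = gamma * I with
    gamma = (s0^T y0) / (y0^T y0), applied before the first BFGS update."""
    gamma = (s0 @ y0) / (y0 @ y0)
    return gamma * np.eye(len(s0))

# Quadratic example with Hessian A = diag(1, 10): gamma falls between the
# eigenvalues 1/10 and 1 of A^{-1}.
A = np.diag([1.0, 10.0])
s0 = np.array([1.0, 1.0])
y0 = A @ s0
H0 = scaled_initial_H(s0, y0)
print(H0[0, 0])  # gamma = (1 + 10) / (1 + 100) = 11/101 ≈ 0.1089
```

Compared with H_0 = I, this scaling tends to produce a better-sized first step on badly scaled problems.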

Algorithm Procedure

Initialization and Iteration Steps

The BFGS algorithm begins with the initialization of key parameters to set up the optimization process for an unconstrained minimization problem \min_x f(x), where f is a smooth function. An initial iterate x_0 \in R^n is selected, often based on problem knowledge or a simple starting guess. An initial approximation B_0 to the Hessian \nabla^2 f(x_0) is chosen as a symmetric positive definite matrix, commonly the identity I for simplicity, ensuring the method starts with a well-behaved quadratic model. Alternatively, the inverse Hessian approximation H_0 = B_0^{-1} may be initialized directly as I. A tolerance \epsilon > 0 is specified to control convergence, typically on the order of 10^{-5} to 10^{-8} depending on the problem's scale and required precision. The iteration counter is set to k = 0. In each iteration, the algorithm computes the gradient \nabla f(x_k) at the current point. If the norm \|\nabla f(x_k)\| falls below \epsilon, the process terminates with x_k as the approximate minimizer. Otherwise, a search direction p_k is determined using the current Hessian approximation: p_k = -B_k^{-1} \nabla f(x_k), or equivalently p_k = -H_k \nabla f(x_k) when working with the inverse. A step length \alpha_k > 0 is then found via a line search procedure that ensures sufficient decrease in f. The next iterate is updated as x_{k+1} = x_k + \alpha_k p_k. Curvature information is gathered by computing the displacement s_k = x_{k+1} - x_k and the gradient change y_k = \nabla f(x_{k+1}) - \nabla f(x_k). The approximation is then updated to B_{k+1} (or H_{k+1}) using the BFGS formula, incorporating s_k and y_k to satisfy the secant condition while preserving positive definiteness, assuming y_k^T s_k > 0. The counter is incremented to k+1, and the loop repeats.
The following pseudocode outlines the standard BFGS procedure using the inverse Hessian approximation for computational efficiency:

Algorithm: BFGS for Unconstrained Minimization
Input: Initial x_0, H_0 ≻ 0 (e.g., I), tolerance ε > 0
Set k = 0
While ||∇f(x_k)|| ≥ ε do
    p_k = -H_k ∇f(x_k)                // Search direction
    α_k = LineSearch(f, x_k, p_k)     // Step length satisfying Wolfe conditions
    x_{k+1} = x_k + α_k p_k
    s_k = x_{k+1} - x_k
    y_k = ∇f(x_{k+1}) - ∇f(x_k)
    ρ_k = 1 / (y_k^T s_k)
    H_{k+1} = (I - ρ_k s_k y_k^T) H_k (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T
    k = k + 1
End While
Output: Approximate minimizer x_k

This implementation avoids explicit matrix inversion in each step by maintaining and updating H_k. Termination occurs primarily when the gradient norm \|\nabla f(x_k)\| < \epsilon, indicating a stationary point where \nabla f(x_k) \approx 0. Additional criteria may include stagnation in the function value, such as |f(x_{k+1}) - f(x_k)| < \delta for some small \delta > 0, or a maximum iteration limit to prevent infinite loops in practice.
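The pseudocode above can be turned into a minimal runnable sketch. The name `bfgs_minimize` is my own, and the simple Armijo backtracking used here is a simplification of the full Wolfe line search the pseudocode calls for (the curvature guard compensates by skipping updates when y^T s is not positive):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS sketch: inverse-Hessian product-form updates with a
    backtracking Armijo line search (a full Wolfe search is more robust)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                      # H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:    # gradient-norm stopping test
            break
        p = -H @ g                     # search direction
        alpha, c1 = 1.0, 1e-4          # backtracking until Armijo holds
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:              # update only under positive curvature
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize f(x, y) = (x - 3)^2 + 10 (y + 1)^2.
f = lambda v: (v[0] - 3) ** 2 + 10 * (v[1] + 1) ** 2
grad = lambda v: np.array([2 * (v[0] - 3), 20 * (v[1] + 1)])
print(bfgs_minimize(f, grad, [0.0, 0.0]))  # close to [3, -1]
```

On this quadratic the Hessian approximation becomes exact after a few steps, after which the unit step is accepted and convergence is rapid.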

Line Search Requirements

In the BFGS algorithm, the line search phase selects a step length \alpha_k > 0 along the search direction p_k to ensure meaningful progress in reducing the objective function f, while maintaining properties essential for the method's convergence and Hessian approximation updates. This inexact search replaces the exact minimization over \alpha to reduce computational cost, but must satisfy specific conditions to guarantee descent and compatibility with the quasi-Newton update. The Armijo condition, or sufficient decrease condition, forms the core requirement for ensuring that the step provides an adequate reduction in the function value relative to the decrease predicted by the directional derivative. It states that

f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k,

where 0 < c_1 < 1 is a small constant, typically around 0.0001, controlling the fraction of the predicted decrease that must be achieved. This condition prevents excessively large steps that might overshoot the minimum and is derived from the assumption of Lipschitz continuity of the gradient, ensuring the inequality holds for sufficiently small \alpha_k. To avoid steps that are too short, which could lead to inefficient progress or violate update assumptions, the Wolfe curvature condition complements the Armijo rule by enforcing that the directional derivative at the new point does not remain too negative. This requires

\nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k,

with c_1 < c_2 < 1, often c_2 = 0.9, ensuring the step captures sufficient curvature information from the function. Together, the Armijo and curvature conditions constitute the Wolfe conditions; the strong Wolfe conditions additionally bound the magnitude of the directional derivative, |\nabla f(x_k + \alpha_k p_k)^T p_k| \leq c_2 |\nabla f(x_k)^T p_k|, yielding more reliable step sizes by jointly controlling both function decrease and gradient alignment, particularly beneficial in nonconvex settings.
Practical algorithms for satisfying these conditions include backtracking line search, which initializes \alpha_k = 1 and iteratively reduces it by a fixed factor \beta \in (0,1), such as 0.5, until the conditions hold; this simple yet effective approach typically requires few evaluations of f and \nabla f. Alternatively, cubic interpolation fits a cubic polynomial to available function and derivative information to estimate a promising \alpha_k, accelerating convergence in smooth problems while still verifying the conditions. These line search strategies ensure the displacement s_k = \alpha_k p_k and gradient change y_k = \nabla f(x_{k+1}) - \nabla f(x_k) satisfy y_k^T s_k > 0, preserving the positive definiteness of the Hessian approximation in subsequent BFGS updates.
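A backtracking search that checks both Wolfe conditions can be sketched as follows (function name my own; production routines typically add a bracketing/interpolation phase, since pure backtracking can fail to satisfy the curvature condition for some problems):

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, max_halvings=50):
    """Backtracking search for a step length satisfying the two Wolfe
    conditions (sketch; libraries use bracketing plus interpolation)."""
    g0_p = grad(x) @ p                 # directional derivative at x (< 0 for descent p)
    fx = f(x)
    alpha = 1.0
    for _ in range(max_halvings):
        armijo = f(x + alpha * p) <= fx + c1 * alpha * g0_p
        curvature = grad(x + alpha * p) @ p >= c2 * g0_p
        if armijo and curvature:
            return alpha
        alpha *= 0.5
    return alpha                       # fallback if no acceptable step was found

# On f(x) = x^2 from x = 1 along p = -grad(1) = -2, the unit step overshoots
# (Armijo fails), while alpha = 0.5 lands at the minimizer and passes both tests.
f = lambda v: float(v[0] ** 2)
grad = lambda v: np.array([2.0 * v[0]])
alpha = wolfe_line_search(f, grad, np.array([1.0]), np.array([-2.0]))
print(alpha)  # 0.5
```

The returned alpha then yields s_k = alpha * p with y_k^T s_k > 0, as the surrounding text requires for the BFGS update.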

Theoretical Properties

Local Superlinear Convergence

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm exhibits local superlinear convergence near a local minimizer x^* of the objective function f, meaning that the iterates satisfy

\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0,

assuming x_k \neq x^* for sufficiently large k. This Q-superlinear rate implies that the error diminishes faster than any linear rate but does not achieve the quadratic speed of pure Newton's method. Under standard assumptions, the BFGS method achieves this local Q-superlinear convergence. Specifically, if f is twice continuously differentiable in a neighborhood of x^*, the Hessian \nabla^2 f is Lipschitz continuous there (i.e., \|\nabla^2 f(x) - \nabla^2 f(x^*)\| \leq L \|x - x^*\| for some L > 0), x^* is a strong local minimizer with \nabla^2 f(x^*) positive definite, and the initial Hessian approximation B_0 is positive definite with eigenvalues bounded away from zero and infinity, then starting sufficiently close to x^* with line searches that eventually accept the unit step, the BFGS iterates converge Q-superlinearly to x^*. The preservation of positive definiteness in BFGS updates plays a crucial role in maintaining descent directions within this local neighborhood. The proof relies on the Dennis–Moré characterization of superlinear convergence for quasi-Newton methods: the iterates converge Q-superlinearly if and only if

\lim_{k \to \infty} \frac{\| [B_k - \nabla^2 f(x^*)] (x_{k+1} - x_k) \|}{\|x_{k+1} - x_k\|} = 0.

For BFGS, the secant condition and rank-two updates ensure that the approximation error along the step directions tends to zero as k \to \infty near x^*, satisfying the characterization and yielding Newton-like steps that drive superlinear error reduction. This bounded deterioration property holds under the Lipschitz assumption on \nabla^2 f.
In comparison, linear convergence has a constant error ratio bounded away from zero, while quadratic convergence (as in Newton's method) satisfies $\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2)$; BFGS improves on the linear rate of steepest descent but falls short of the quadratic rate because it relies on approximate second-order information.
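The superlinear error ratios can be observed numerically. The following sketch (assuming SciPy is installed) runs SciPy's BFGS on the Rosenbrock function, whose minimizer $x^* = (1, 1)$ is known, and records the ratios $\|x_{k+1} - x^*\| / \|x_k - x^*\|$; near the solution these ratios shrink toward zero rather than settling at a constant, the hallmark of Q-superlinear behavior.

```python
import numpy as np
from scipy.optimize import rosen, rosen_der, minimize

# Record the iterates via the callback to observe the error ratios
xs = []
res = minimize(rosen, np.array([-1.2, 1.0]), jac=rosen_der,
               method="BFGS", callback=lambda xk: xs.append(xk.copy()))

x_star = np.ones(2)  # known minimizer of the Rosenbrock function
errs = [np.linalg.norm(x - x_star) for x in xs]
ratios = [e2 / e1 for e1, e2 in zip(errs, errs[1:]) if e1 > 0]
# In the final iterations the ratios drop sharply toward 0
```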

Global Convergence Conditions

The BFGS algorithm demonstrates global convergence to a stationary point of the objective function from arbitrary starting points under appropriate conditions on the line search and the update mechanism. Specifically, when implemented with a Wolfe line search that satisfies both the Armijo sufficient-decrease condition and the curvature condition, and provided that the sequence of Hessian approximations remains uniformly bounded, the method generates a sequence of iterates converging to a point where the gradient vanishes. This boundedness of the approximations is ensured if the objective function is twice continuously differentiable and the eigenvalues of the Hessian matrices are bounded away from zero and infinity along the iteration path. A key tool in establishing global convergence is the Zoutendijk condition, which applies to descent methods with Wolfe line searches. The condition states that $\sum_{k=0}^{\infty} \cos^2 \theta_k \|\nabla f(x_k)\|^2 < \infty$, where $\theta_k$ is the angle between the negative gradient $-\nabla f(x_k)$ and the search direction $p_k$. This finite sum implies that $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$, ensuring the iterates approach a stationary point. In the context of BFGS, the quasi-Newton updates and the line search guarantee that the search directions are descent directions with $\cos \theta_k$ bounded away from zero sufficiently often, satisfying the Zoutendijk condition and yielding global convergence even for some nonconvex functions. To maintain the positive definiteness of the Hessian approximations and ensure descent, damping strategies are incorporated when the curvature condition fails, i.e., when $y_k^T s_k \leq 0$, where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
In such cases, the update is modified by introducing a parameter $\theta_k \in (0,1]$ that replaces $y_k$ with $\theta_k y_k + (1 - \theta_k) \frac{y_k^T s_k}{s_k^T s_k} s_k$, a convex combination chosen to restore $y_k^T s_k > 0$ while satisfying the secant condition approximately. This prevents indefinite updates and keeps the approximations positive definite, so the method remains globally convergent. These global convergence guarantees rely on standard assumptions about the objective function $f$: it must be coercive, satisfying $f(x) \to \infty$ as $\|x\| \to \infty$, which bounds the sublevel sets and keeps the iterates in a compact set; additionally, the gradient $\nabla f$ must be Lipschitz continuous, ensuring the line-search parameters are well defined and the function decrease is controlled. Under these conditions, the BFGS method not only converges globally but also achieves local superlinear convergence once sufficiently close to the solution.

Variants

Limited-Memory BFGS

The limited-memory BFGS (L-BFGS) algorithm addresses the storage limitations of the full BFGS method for large-scale unconstrained optimization, where the dimension $n$ is very high (typically $n > 10^3$). Instead of maintaining the full $n \times n$ approximation $H_k$ of the inverse Hessian, L-BFGS stores only the $m$ most recent pairs of vectors $(s_j, y_j)$ for $j = k-m, \dots, k-1$, where $s_j = x_{j+1} - x_j$ and $y_j = \nabla f(x_{j+1}) - \nabla f(x_j)$, with $m$ a small fixed integer (often 3 to 20). This implicit representation allows the algorithm to compute the search direction $p_k = -H_k \nabla f(x_k)$ efficiently without ever forming $H_k$. The core of L-BFGS is a two-loop recursion that applies the action of $H_k$ to the gradient $\nabla f(x_k)$. The first (backward) loop starts from $q = \nabla f(x_k)$ and traverses the stored pairs in reverse chronological order, computing scalar coefficients $\alpha_j = \rho_j s_j^T q$ with $\rho_j = 1/(y_j^T s_j)$ and updating $q \leftarrow q - \alpha_j y_j$. After multiplying by an initial matrix $H_k^0$ (commonly a scaled identity), the second (forward) loop traverses the pairs in chronological order, computing $\beta_j = \rho_j y_j^T r$ and updating $r \leftarrow r + (\alpha_j - \beta_j) s_j$, which yields $r = H_k \nabla f(x_k)$ and hence the search direction $p_k = -r$. This avoids all $n \times n$ matrix operations, resulting in a computational cost of $O(mn)$ per iteration. In terms of storage, L-BFGS requires only $O(mn)$ space to hold the vectors and auxiliary scalars, compared to $O(n^2)$ for the full BFGS approximation, making it practical for problems with millions of variables where full storage is infeasible.
The trade-off is a potentially slower convergence rate: because the limited history discards older curvature information, L-BFGS typically converges only linearly near the solution rather than superlinearly, but the efficiency gains in memory-constrained environments outweigh this for high-dimensional applications. The method was first introduced by Nocedal in 1980 as a limited-storage update for quasi-Newton matrices and later refined into the full L-BFGS algorithm by Liu and Nocedal in 1989, with extensive numerical validation on large-scale test problems.
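The two-loop recursion described above can be sketched as follows (assuming NumPy; function and variable names are illustrative). For a single stored pair taken from a quadratic, the recursion reproduces the exact Newton direction, which makes it easy to check by hand.

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Two-loop recursion: returns p_k = -H_k @ grad_k without forming H_k.

    s_list, y_list: the m most recent correction pairs, oldest first.
    """
    q = grad_k.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    # Backward loop: newest pair first
    for j in reversed(range(len(s_list))):
        alpha[j] = rho[j] * (s_list[j] @ q)
        q -= alpha[j] * y_list[j]
    # Initial matrix H_k^0 = gamma * I, a common scaling choice
    s, y = s_list[-1], y_list[-1]
    gamma = (s @ y) / (y @ y)
    r = gamma * q
    # Forward loop: oldest pair first
    for j in range(len(s_list)):
        beta = rho[j] * (y_list[j] @ r)
        r += (alpha[j] - beta) * s_list[j]
    return -r  # search direction p_k

# usage: one pair from a quadratic with Hessian 2I, so y = 2s
p = lbfgs_direction(np.array([2.0, 2.0]),
                    [np.array([1.0, 0.0])], [np.array([2.0, 0.0])])
# p equals the Newton direction -(2I)^{-1} grad = [-1, -1]
```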

Constrained and Damped Variants

The BFGS-B algorithm addresses bound-constrained optimization problems by extending the quasi-Newton framework to incorporate simple bounds on the variables. It employs a gradient projection method to project the search direction onto the feasible region defined by the bounds, ensuring that iterates remain feasible during line searches. When the projected direction encounters an active bound, recovery steps adjust the approximation and restore feasibility, preventing indefinite Hessian updates. This approach retains the superlinear convergence of BFGS in the unconstrained case while handling bounds efficiently, as demonstrated in numerical experiments on large-scale problems. The damped BFGS variant modifies the standard update formula to keep the Hessian approximation positive definite, particularly when the curvature condition $y_k^T s_k \geq \eta \|s_k\|^2$ (with $\eta$ typically set to $10^{-4}$) fails because the step produces insufficient curvature. In such cases, $y_k$ is replaced by $\theta_k y_k + (1 - \theta_k) B_k s_k$ in the secant equation, where $B_k$ is the current Hessian approximation and the damping parameter $\theta_k \in (0,1]$ is chosen just large enough to restore sufficient curvature. This damping preserves the positive definiteness required for descent directions and has been shown to enhance robustness on nonconvex or ill-conditioned problems. The technique originated in sequential quadratic programming but applies equally to unconstrained settings. Stochastic BFGS (S-BFGS) adapts the method for large-scale machine learning tasks involving stochastic or subsampled gradients, where exact gradients are unavailable or computationally expensive. It incorporates regularization into the Hessian update to mitigate the variance introduced by sampling, often by adding a proximal term or averaging multiple gradient differences to form a more stable $y_k$.
This averaging stabilizes the quasi-Newton approximation without requiring full-batch computations, enabling efficient training of models such as logistic regression on high-dimensional data. Theoretical analysis confirms almost-sure convergence under mild assumptions on the stochastic gradients. Post-2020 research has explored hybrid BFGS variants that integrate trust-region strategies to bolster global convergence guarantees, particularly for nonconvex objectives where line search alone may fail. These hybrids solve trust-region subproblems using the BFGS approximation within an $LDL^T$ factorization framework, balancing local superlinearity with broader exploration of the search space. While no transformative breakthroughs have emerged, such methods demonstrate improved reliability in parameter estimation for nonlinear models.
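A Powell-style damped update of the kind described above can be sketched as follows (assuming NumPy; the threshold `delta = 0.2` is a conventional choice, and the helper name is illustrative). The correction replaces $y_k$ with a convex combination of $y_k$ and $B_k s_k$ whenever the curvature $s_k^T y_k$ is too small relative to $s_k^T B_k s_k$.

```python
import numpy as np

def damped_y(s, y, B, delta=0.2):
    """Return a damped y guaranteeing s^T y_new >= delta * s^T B s.

    s: step vector; y: gradient difference; B: current Hessian approximation.
    """
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    if sy >= delta * sBs:
        return y  # enough curvature: use the undamped update
    # Convex-combination weight restoring sufficient curvature
    theta = (1.0 - delta) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * Bs

# usage: a step with negative curvature (s^T y = -1) gets damped
y_new = damped_y(np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.eye(2))
# s^T y_new now equals delta * s^T B s = 0.2, so the update stays definite
```

A short calculation confirms the guarantee: $s^T y_{\text{new}} = \theta\, s^T y + (1-\theta)\, s^T B s = \delta\, s^T B s > 0$, so the subsequent BFGS update preserves positive definiteness.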

Implementations and Applications

Software Libraries

The SciPy library in Python implements the BFGS algorithm via the scipy.optimize.minimize function with method='BFGS', which performs unconstrained minimization by approximating the inverse Hessian with BFGS updates and falls back to numerical finite-difference gradients if no analytical gradient is provided. It also includes the limited-memory variant through the 'L-BFGS-B' method, which handles bound constraints and uses a compact representation of the Hessian approximation to reduce memory usage for large-scale problems. Key parameters include gtol for the gradient-norm tolerance (defaulting to 1e-5) and maxiter for the maximum number of iterations (default 200 times the number of variables); for ill-conditioned cases, the documentation recommends increasing gtol (e.g., to 1e-3) to avoid precision loss in finite-difference approximations. In Julia, the Optim.jl package provides both BFGS and L-BFGS solvers as part of its optimization toolkit, with support for automatic differentiation through compatible packages such as ForwardDiff.jl to compute gradients efficiently. The standard BFGS solver uses full quasi-Newton updates of the Hessian approximation, while L-BFGS employs a limited-memory approach with a default of 10 correction vectors and an option to scale the initial inverse Hessian for better stability. Convergence tolerances default to a gradient norm of approximately 1e-8, and user-supplied preconditioners can address ill-conditioning by improving the conditioning of the approximate Hessian during iterations. NLopt, a cross-language library with C and C++ bindings, incorporates a low-storage variant of L-BFGS (algorithm NLOPT_LD_LBFGS) for gradient-based local optimization of nonlinear functions, either unconstrained or bound-constrained. This implementation, derived from Fortran code by Ladislav Luksan and adapted to NLopt's termination criteria, applies variable-metric updates via Strang's recurrences and allows adjustment of the stored vector count (default determined heuristically) to balance memory and accuracy.
Tolerances must be explicitly set by the user, as NLopt defaults them all to 0 (no enforcement); common choices include ftol_rel = 1e-4 for relative changes in the function value and xtol_rel = 1e-8 for the parameters, with no specific built-in mechanism for ill-conditioning beyond the limited-memory design, which inherently avoids the issues of full Hessian storage. MATLAB's Optimization Toolbox implements BFGS within the fminunc function for unconstrained multivariable minimization, using quasi-Newton updates to approximate the Hessian (the default 'bfgs' option) and supporting user-supplied gradients for faster convergence. For large-scale or memory-constrained problems, the limited-memory variant ('lbfgs') can be selected with a configurable number of corrections (10 by default), which helps manage ill-conditioning by avoiding dense matrix storage. Default tolerances include a threshold of 1e-6 on the first-order optimality measure and a step tolerance of 1e-6, with a maximum of 400 iterations; the function also employs cubic or quadratic line searches to ensure sufficient decrease and to cope with potential Hessian ill-conditioning. Across these libraries, default gradient-norm tolerances of roughly 1e-5 to 1e-8 promote reliable convergence for well-conditioned problems, but users must often loosen them for ill-conditioned objectives to prevent stalling caused by numerical instability in the Hessian approximations. Limited-memory variants such as L-BFGS and L-BFGS-B are particularly recommended for high-dimensional cases, where full BFGS updates could exacerbate conditioning issues through accumulated errors.
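The SciPy interface described above can be exercised as follows on the Rosenbrock test function (assuming SciPy is installed); the bound values in the L-BFGS-B call are arbitrary illustrative choices that happen to contain the minimizer $(1, 1)$.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# Full BFGS with an analytical gradient and a tightened gtol
res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS",
                    options={"gtol": 1e-6})

# L-BFGS-B with simple box constraints on both variables
res_lbfgsb = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                      bounds=[(-2, 2), (-2, 2)])

print(res_bfgs.x, res_lbfgsb.x)  # both land near the minimizer [1, 1]
```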

Real-World Uses

The BFGS algorithm finds extensive application in machine learning for parameter estimation, particularly in maximum likelihood frameworks such as logistic regression, where it efficiently optimizes non-convex loss functions by approximating the inverse Hessian to guide descent directions. In neural network training, variants such as L-BFGS accelerate convergence for medium-scale problems by leveraging limited-memory updates, gaining more progress per gradient evaluation than first-order methods. In supply chain management, BFGS supports network optimization by solving nonlinear programs that balance inventory, transportation, and risk factors, often integrated with meta-heuristics to handle multi-objective scenarios such as network design under failure risks. For portfolio management, it enhances mean-variance optimization models that incorporate additional constraints, with L-BFGS-B variants providing robust solutions for large asset sets through bounded minimization. Physics simulations, especially molecular dynamics, employ L-BFGS within packages such as GROMACS for energy minimization, where it iteratively descends the potential-energy landscape of biomolecular structures by approximating curvature without full Hessian computation, producing stable configurations for subsequent dynamics runs. In engineering design, BFGS addresses structural optimization under nonlinear constraints, such as reliability-based design in civil and mechanical systems, by hybridizing with methods like HL-RF to compute failure probabilities and iterate toward feasible designs that minimize weight while satisfying stress and displacement limits. Economic modeling leverages BFGS for estimating vector autoregression (VAR) models, particularly time-varying variants, where it optimizes likelihood functions over dynamic parameters to capture evolving relationships in macroeconomic data such as GDP fluctuations and policy impacts.
