Newton's method [@Simpson1740] for the unconstrained optimization of a twice-differentiable real-valued function applies Newton's method for root finding to the gradient of the function, which requires computing the gradient and the Hessian: given $f \in C^2(\mathbb{R}^n)$, then $\nabla f \in C^1(\mathbb{R}^n, \mathbb{R}^n)$; update by $x_{k+1} = x_k - [H_f(x_k)]^{-1} \nabla f(x_k)$. In practice, matrix inversion is converted to solving linear equations to reduce computational complexity: $H_f(x_k) \Delta x_k = -\nabla f(x_k)$. With a initial value near a local extreme where the Hessian is invertible and Lipschitz continuous, Newton's method converges towards a local extreme quadratically, which is much faster than gradient descent. Relaxing the step size by a factor less than one improves convergence of Newton's method, but converges linearly.

Quasi-Newton methods approximate the gradient and Hessian in Newton's method by differences of cached function values.

Algorithm (BFGS): initial guess $\mathbf{x}_0$, Hessian matrix $B_0 = I$.

  1. Obtain a direction $\mathbf{p}_k$: $B_k \mathbf{p}_k = -\nabla f(\mathbf{x}_k)$;
  2. Line search for an acceptable step size $\alpha_k$;
  3. $\mathbf{s}_k = \alpha_k \mathbf{p}_k$;
  4. $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k$;
  5. $\mathbf{y}_k = {\nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)}$;
  6. Update formula for $B_k$;
  7. Iterate until the norm of the gradient falls under a threshold.


Secant method (割线法, 弦位法) can be thought of as a finite difference approximation of Newton's method, although developed long before the latter: $$x_k = x_{k-1} - f(x_{k-1}) \frac{x_{k-1} - x_{k-2}}{f(x_{k-1}) - f(x_{k-2})}$$ The secant method converges if the initial values are sufficiently close to the root. If the function is twice continuously differentiable and the root in question has multiplicity 1, the order of convergence is the golden ratio $(1+\sqrt{5}) / 2$, between linear and quadratic.

Davidon–Fletcher–Powell formula (DFP) is the first quasi-Newton method to generalize the secant method to a multidimensional problem. [@Davidon1959; @Fletcher and Powell, 1963]

Update formula: $$B_{k+1} = \left (I-\frac {y_k \, \Delta x_k^T} {y_k^T \, \Delta x_k} \right ) B_k \left (I-\frac {\Delta x_k y_k^T} {y_k^T \, \Delta x_k} \right )+\frac{y_k y_k^T} {y_k^T \, \Delta x_k}$$ Or with $H_{k+1} = B_{k+1}^{-1}$: $$H_{k+1} = H_k + \frac {\Delta x_k \Delta x_k^T}{\Delta x_k^{T} \, y_k} - \frac {H_k y_k y_k^T H_k} {y_k^T H_k y_k}$$

Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm

Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) [@Broyden1970; @Fletcher1970; @Goldfarb1970; @Shanno1970]

Update formula: $$B_{k+1} = B_k + \frac {y_k y_k^T}{y_k^{T} \Delta x_k} - \frac {B_k \Delta x_k (B_k \Delta x_k)^T} {\Delta x_k^{T} B_k \, \Delta x_k}$$ Or with $H_{k+1} = B_{k+1}^{-1}$: $$H_{k+1} = \left (I-\frac {\Delta x_k y_k^T } {y_k^T \Delta x_k} \right ) H_k \left (I-\frac { y_k \Delta x_k^T} {y_k^T \Delta x_k} \right )+\frac{\Delta x_k \Delta x_k^T} {y_k^T \, \Delta x_k}$$

Limited-memory BFGS (L-BFGS)

Stochastic Quasi-Newton Methods


Quasi-Newton Methods

Nonlinear Conjugate Gradient Methods

🏷 Category=Computation Category=Optimization