Newton's method, the Newton–Raphson method, or the method of tangents [@Simpson1740] is an iterative algorithm for finding a root of a differentiable real function. At each iteration, the derivative is evaluated at the current iterate to form a linear (tangent-line) approximation of the function, whose root serves as the next approximation to a root of the original function.
Given a nonlinear function $f \in C^1(\mathbb{R})$, we would like to find a solution to the equation $f(x) = 0$. With an initial approximate solution $x_0$, Newton's method updates by $x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$, $k \in \mathbb{N}$.
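As a minimal sketch in Python (not tied to any particular library; `newton`, `f`, and `df` are illustrative names), the scalar iteration can be written as:

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Scalar Newton iteration: x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:       # stop when the residual is small enough
            break
        x -= fx / df(x)
    return x

# Example: a root of f(x) = x^2 - 2, i.e. sqrt(2), starting from x0 = 1.5
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.5)
```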
If $f \in C^2(\mathbb{R})$ and $x_0$ is close to a simple root $x^*$, i.e. $f'(x^*) \ne 0$, then Newton's method converges quadratically: $|x_{k+1} - x^*| \le c |x_k - x^*|^2$ for some constant $c > 0$. The initial approximate solution can be obtained by methods with better convergence guarantees but slower convergence rates, e.g. the bisection method or a gradient method.
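A possible way to combine a globally safe method with Newton's fast local convergence, sketched under the assumption of a hand-rolled `bisect` helper (not a library routine) that supplies a crude initial guess:

```python
import math

def bisect(f, a, b, iters=5):
    """Coarse bisection on a sign-changing bracket [a, b]: slow but globally safe."""
    fa = f(a)
    for _ in range(iters):
        m = 0.5 * (a + b)
        if fa * f(m) <= 0:      # root lies in [a, m]
            b = m
        else:                   # root lies in [m, b]
            a, fa = m, f(m)
    return 0.5 * (a + b)

f = lambda x: math.cos(x) - x   # simple root near 0.739
df = lambda x: -math.sin(x) - 1.0
x = bisect(f, 0.0, 1.0)         # cheap initial approximation x_0
for _ in range(5):              # Newton: the error roughly squares at each step
    x -= f(x) / df(x)
```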
(Proof of local convergence rate.)
The modified Newton method computes the derivative only at the initial point and uses it in all iterations: $x_{k+1} = x_k - \frac{f(x_k)}{f'(x_0)}$. This saves derivative computations in the ensuing iterations, but the convergence is linear instead of quadratic.
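A sketch of the modified iteration (same illustrative names as above):

```python
def modified_newton(f, df, x0, tol=1e-12, max_iter=200):
    """Modified Newton: the derivative is evaluated once, at x0, and reused."""
    d0 = df(x0)                 # single derivative evaluation
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / d0            # x_{k+1} = x_k - f(x_k) / f'(x_0)
    return x

# Converges to sqrt(2) from x0 = 1.5, but only linearly
root = modified_newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.5)
```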
Given a nonlinear transformation of the Euclidean n-space, $f \in C^1(\mathbb{R}^n, \mathbb{R}^n)$, we would like to find a solution to the system of nonlinear equations $f(x) = 0$. With an initial approximate solution $x_0$, Newton's method updates by $x_{k+1} = x_k - [\nabla f(x_k)]^{-1} f(x_k)$. To reduce computational complexity, the explicit matrix inversion is replaced by solving the linear system $\nabla f(x_k) \delta_k = - f(x_k)$ by Gaussian elimination and updating $x_{k+1} = x_k + \delta_k$:
LU decomposition can be computed in the same asymptotic time as the algorithm used to multiply two square matrices [@BunchHopcroft1974]. Matrix multiplication algorithms have time complexity $O(n^a)$ for matrices of order $n$, where $a \in (2, 3]$: the Strassen algorithm (1969) has exponent $a = \log_2 7 \approx 2.807$; the Coppersmith–Winograd algorithm (1990) has exponent $a \approx 2.376$, although it is not used for matrices of practical size. Solving a system of linear equations with a triangular coefficient matrix (forward or back substitution) takes $O(n^2)$ scalar multiplications and additions.
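A NumPy sketch of the vector-valued iteration; `np.linalg.solve` performs an LU-based solve of $\nabla f(x_k)\,\delta_k = -f(x_k)$ rather than forming the inverse Jacobian (the names `newton_system`, `f`, `jac` and the small example are illustrative):

```python
import numpy as np

def newton_system(f, jac, x0, tol=1e-10, max_iter=50):
    """Newton's method for f: R^n -> R^n.
    Each step solves J(x_k) delta_k = -f(x_k) via LU (np.linalg.solve)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        delta = np.linalg.solve(jac(x), -fx)   # Gaussian elimination, O(n^3)
        x = x + delta
    return x

# Example: intersection of the unit circle x^2 + y^2 = 1 with the line y = x
f   = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[1] - v[0]])
jac = lambda v: np.array([[2 * v[0], 2 * v[1]], [-1.0, 1.0]])
sol = newton_system(f, jac, [1.0, 0.0])        # converges to (sqrt(2)/2, sqrt(2)/2)
```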
(Local convergence rate.)
Given an immersion from the Euclidean n-space into the Euclidean m-space, $m > n$, i.e. a differentiable map whose differential is injective at every point: $f \in C^1(\mathbb{R}^n, \mathbb{R}^m)$, $\text{rank}~\nabla f(x) = n$. This gives an overdetermined system of nonlinear equations $f(x) = 0$, which may not have a solution. But it can be turned into a nonlinear least squares problem: $\min_x \|f(x)\|^2$. This problem can be solved by the Gauss–Newton algorithm [@Gauss1809], which replaces the inverse of the Jacobian in the update formula with the Moore–Penrose generalized inverse (in this case, a left pseudoinverse): $x_{k+1} = x_k - J_k^\dagger f(x_k)$, where $J^\dagger = (J^T J)^{-1} J^T$ and $J_k = \nabla f(x_k)$. Note that $\text{rank}~(J^T J) = n$, and thus $J^T J$ is invertible. In practice, it is implemented by solving the linear equations (normal equations) $J_k^T J_k \delta_k = - J_k^T f(x_k)$.
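A NumPy sketch of Gauss–Newton via the normal equations, applied to a small made-up exponential-fitting problem (the data, names, and starting point are purely illustrative):

```python
import numpy as np

def gauss_newton(r, jac, x0, tol=1e-10, max_iter=100):
    """Gauss-Newton for min_x ||r(x)||^2, r: R^n -> R^m with m > n.
    Each step solves the normal equations J^T J delta = -J^T r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J, rx = jac(x), r(x)
        delta = np.linalg.solve(J.T @ J, -J.T @ rx)
        x = x + delta
        if np.linalg.norm(delta) < tol:
            break
    return x

# Example: fit y = a * exp(b * t) to four data points; unknowns x = (a, b)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])
r   = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
a, b = gauss_newton(r, jac, [1.0, 0.1])
```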
The rate of convergence of the Gauss–Newton algorithm is at most quadratic; it approaches quadratic as the residual at the solution approaches zero [@Björck1996].
Given a submersion from the Euclidean n-space into the Euclidean m-space, $m < n$, i.e. a differentiable map whose differential is surjective at every point: $f \in C^1(\mathbb{R}^n, \mathbb{R}^m)$, $\text{rank}~\nabla f(x) = m$.
This gives an underdetermined system of nonlinear equations: $f(x) = 0$, which may have infinitely many solutions. In fact, by the submersion level set theorem (see Topology), the zero set $f^{-1}(0)$ is a submanifold of dimension $n - m$.
A unique update is singled out by the "least-change" (minimum-norm) criterion, which again replaces the inverse of the Jacobian in the update formula with the Moore–Penrose generalized inverse (in this case, a right pseudoinverse): $x_{k+1} = x_k - J_k^\dagger f(x_k)$, where $J^\dagger = J^T (J J^T)^{-1}$ and $J_k = \nabla f(x_k)$.
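A NumPy sketch of this underdetermined case, where each step is the minimum-norm ("least-change") solution of the linearized system; the single-equation example and the names are illustrative:

```python
import numpy as np

def newton_underdetermined(f, jac, x0, tol=1e-10, max_iter=50):
    """Newton step with the right pseudoinverse J^T (J J^T)^{-1}, f: R^n -> R^m, m < n.
    delta = -J^+ f(x) is the minimum-norm solution of J delta = -f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J, fx = jac(x), f(x)
        if np.linalg.norm(fx) < tol:
            break
        delta = -J.T @ np.linalg.solve(J @ J.T, fx)   # least-change update
        x = x + delta
    return x

# Example: one equation in two unknowns, the unit circle x^2 + y^2 - 1 = 0
f   = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0])
jac = lambda v: np.array([[2 * v[0], 2 * v[1]]])
p = newton_underdetermined(f, jac, [2.0, 1.0])   # lands on a nearby point of the circle
```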