Optimization

Optimization is the pursuit of optimal points in practical/computational problems. Although calculus easily solves this task when the function is differentiable and unconstrained, optimization mostly deal with constrained problems and continuity is not guaranteed.

Dimensions of optimization settings: (simple vs. complex)

unconstrained vs. constrained;
continuous vs. discrete vs. mixed;
deterministic vs. stochastic (infinite dimensional, dynamic);
single-objective vs. multi-objective vs. multi-agent;

A quantity can be modeled as continuous if your instrument can take on any value its resolution allows. A quantity can be modeled as stochastic if its generating mechanism is inherently stochastic, or you acknowledge ignorance towards its mechanism, or you intentionally introduce ignorance by reducing information.

Subdivision of continuous, deterministic, single-objective optimization problems:

Convex Optimization: linear programming;
smooth optimization: gradient dominated;
non-smooth, non-convex optimization: locally Lipschitz; piecewise smooth;

Concepts

Optimization problem is any problem of minimizing an objective function over a feasible set. Given optimization variable $x \in \mathbb{R}^n$ and real-valued functions: $f_0$, the objective function (cost function); inequality constraint functions $f_i$ and equality constraint functions $h_i$, the standard form of an optimization problem is: $$\begin{align} \min \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \le 0 && i = 1,\dots,m \\ & h_i(x) = 0 && i = 1,\dots,p \end{align}$$ The epigraph form of an optimization problem is: $$\begin{align} \min \quad & t \\ \text{s.t.} \quad & f_0(x) - t \le 0 \\ & f_i(x) \le 0 && i = 1,\dots,m \\ & h_i(x) = 0 && i = 1,\dots,p \end{align}$$

Domain of an optimization problem is intersection of the domains of the objective and all constraints: $D = (\cap_i \text{dom}f_i) ∩ (\cap_j \text{dom}h_j)$. Feasible set or constraint set $X$ of an optimization problem is the subset of its domain that satisfy all the constraints. An inequality constraint is active at a feasible point, if the inequality binds: $\exists x \in X: f_i(x) = 0$. Otherwise it is inactive at that point. A constraint is redundant if it doesn't change the feasible set. Active set $\mathcal{A}(x)$ at a feasible point is the set of indices of the equality constraints and the active inequality constraints: $\mathcal{A}(x) = \mathcal{E} \cup \{i \in \mathcal{I} : c_i(x) = 0\}$.

Locally optimal point is a feasible point that optimizes the objective function over the feasible set within one of its neighborhood: $x = \inf { f_0(X \cap B_x(R)) }$.

Optimal value of an optimization problem is the infimum of the image of the feasible set, which takes value of extended real numbers: $p^∗ = \inf \{f_0(X)\}$, where $p$ is for primal. By convention, the infimum of an empty set is infinity. If the image of feasible points can be arbitrarily low, the optimal value is negative infinity. Optimal set of an optimization problem is the intersection of the preimage of its optimal value and its feasible set: $X_{\text{opt}} = f_0^{-1}( p^∗ ) \cap X$. Optimal point is a point of the optimal set.

For $\varepsilon>0$, the ε-suboptimal set of an optimization problem is defined as $X_{\text{opt}} = f_0^{-1}[p^∗ , p^∗ + \varepsilon] \cap X$. An $\varepsilon$-suboptimal point is a point of the $\varepsilon$-suboptimal set.

An optimization problem is unconstrained if there is no constraints: $m = p = 0$. An optimization problem is infeasible, if its feasible set is empty. Informally, two optimization problems are equivalent if from a solution of one, a solution of the other is readily found, and vice versa. It is possible but complicated to give a formal definition of such equivalence.

Ascent direction of a function at a point is a direction in which the function locally strictly increases. Direction of steepest ascent of a function at a point is an ascent direction with the largest directional derivative.

Line search methods select a descent direction and then a step size at each iteration. Trust region methods approximate the objective in a region around the current iterate with a (usually quadratic) model, and then select a direction according to the model; the region size is increased if objective decrease is satisfactory, otherwise it is decreased.

Unconstrained Optimization

Unconstrained optimization methods:

gradient/steepest descent [@Cauchy1847];
gradient methods with momentum: conjugate gradient (CG), heavy ball method [@Polyak1964]; accelerated gradient methods [@Nesterov1983];
coordinate descent (CD);

Unconstrained optimization of finite sums / expectations (see e.g. [@Sra2011; @Bottou2018]):

stochastic gradient descent (SGD) [@Robbins1951];
noise reduction methods:
- iterate averaging [@Polyak1992];
- gradient aggregation: stochastic average gradient (SAG) [@LeRoux2012]; stochastic variance-reduced gradient (SVRG) [@Johnson2013]; SAGA (stochastic average gradient augmented) [@Defazio2014];
- dynamic sample size [@Byrd2012; @Hashemi2014];
second-order methods:
- natural gradient method, Hessian-free inexact Newton methods, Gauss-Newton methods, quasi-Newton methods;
- diagonal scaling: adaptive gradient (AdaGrad) [@Duchi2011], root mean square propagation (RMSprop) [@Tieleman2012], adaptive moment estimation (Adam) [@Kingma2015];
randomized CD [@Nesterov2012], parallel stochastic CD [@LiuJ2013], stochastic dual coordinate ascent (SDCA) [@Shalev-Shwartz2013];

Convex Optimization

Convex function has identical local and global minimum sets. There are NP-hard convex optimization problems. The simplex method has exponential complexity. Interior-point methods can be proved to solve (to a specified accuracy) some convex optimization problems in polynomial time.

Convex optimization methods:

Simplex algorithm, aka simplex method [@Dantzig1947] for linear programs;
Interior-point methods, aka barrier methods [@Karmarkar1984];
Alternating direction method of multipliers (ADMM) (see e.g. [@Boyd2011]);

Conditional gradient, aka Frank-Wolfe [@Frank and Wolfe, 1956], for convex optimization: move along the maximum descent vector of the linearized objective over the constraint set, with a diminishing step size.

Other methods for convex programs: subgradient methods (descent methods); cutting plane methods (outer approximation); simplicial decomposition (inner approximation); proximal methods (regularization).

Interior-point Method

Interior-point methods solve large linear programs with techniques for nonlinear programs and nonlinear equations, where all iterates satisfy the inequality constraints strictly. Primal-dual method is a subclass of interior-point method that solves the primal-dual system of the original problem, which has the most efficient practical approaches. Consider the linear programming problem in standard form: $$\begin{align} \min \quad & c^T x && c \in \mathbb{R}^n \\ \text{s.t.} \quad & A x = b && b \in \mathbb{R}^m \\ & x \ge 0 \end{align}$$ Its dual problem is: $$\begin{align} \max \quad & b^T \lambda && b \in \mathbb{R}^m \\ \text{s.t.} \quad & A^T \lambda + s = c && s \in \mathbb{R}^n \\ & s \ge 0 \end{align}$$ Solutions of the primal and the dual problems are characterized by the KKT conditions: $A^T \lambda + s = c$, $A x = b$, $x \circ s = 0$, and $x^T s \ge 0$. To find solutions $(x^∗, \lambda^∗, s^∗ )$ of the KKT system, primal-dual methods apply variants of Newton's method to the equalities, and modify the search directions and step lengths so that the inequalities are satisfied strictly at every iteration.

Primal-dual feasible set is defined as: $\mathcal{F} = \{(x, \lambda, s) : A x = b, A^T \lambda + s = c, x^T s \ge 0 \}$. Primal-dual strictly feasible set satisfies the inequality strictly: $\mathcal{F}^o = \{(x, \lambda, s) : A x = b, A^T \lambda + s = c, x^T s > 0 \}$. Central path $C(\tau)$ is a parameterized curve of strictly feasible points that satisfies a modified version of the third equality in the KKT conditions: $x \circ s = \tau \mathbf{1}$, where $\tau \in (0, \infty)$. Path-following primal-dual methods restrict the iterates to a neighborhood of the central path and follow the central path to a solution of the linear program. These methods are similar to path-following (aka continuation or homotopy) methods for tracing solution curves of nonlinear equations.

Smooth Optimization

Smooth optimization refers to programs where the objective satisfies certain differentiability property. A general nonlinear constrained optimization problem can be written as [@Nocedal2006, Ch. 15]: $$\begin{align} \min_{x \in \mathbb{R}^n} \quad & f(x) \\ \text{s.t.} \quad & c_i(x) = 0 && i \in \mathcal{E} \\ \quad & c_j(x) \le 0 && j \in \mathcal{I} \end{align}$$ Here, the objective $f(x)$ and the constraints $c_i(x)$ are smooth real-valued functions, and $\mathcal{E}$ and $\mathcal{I}$ are finite index sets of equality and inequality constraints. Quadratic program (QP), for exmaple, is such an optimization problem where the objective function is quadratic and the constraints are linear.

Smooth optimization methods:

Penalty methods: augmented Lagrangian method, aka method of multipliers [@Hestenes1969; @Powell1969].
Active-set methods (~1970s), for small- and medium-sized problems with inequality constraints:
- Gradient projection method, or projected gradient descent (PGD) [@Goldstein1964; @Levitin and Polyak, 1966] for convex sets, e.g. bound constraints.
- Sequential quadratic programming (SQP) methods.
Interior-point methods, aka barrier methods, for nonlinear programming (~1990s).

Techniques for Infeasible Methods

Feasible methods for constrained optimization are algorithms where the starting point and all subsequent iterates satisfy all or some of the constraints. Such methods are necessary when the objective is defined only when some of the constraints are satisfied. Infeasible methods for constrained optimization allow iterates to violate the constraints. These methods often require a balance between reducing the objective and satisfying the constraints, e.g. by using merit function or filter.

Merit function $\phi(x)$ for infeasible methods of constrained optimization combines the objective with measures of constraint violation. A step in an infeasible method will be accepted only if it leads to a sufficient reduction in the merit function. Exact merit function $\phi(x; \mu)$ is a merit function whose local minimizers contains those of the original problem, once the penalty parameter exceeds a certain value. The $l_1$ penalty function is an exact merit function, but is not smooth. Fletcher's augmented Lagrangian for equality-constrained problems is defined as: $\phi_F(x; \mu) = f(x) - \lambda(x)^T c(x) + \frac{\mu}{2}\sum_{i \in \mathcal{E}} c_i(x)^2$, where Lagrange multiplier estimate $\lambda(x) = (J J^T)^{-1} J \nabla f$ and Jacobian of equality constraints $J = \nabla c$. Fletcher's augmented Lagrangian is a smooth and exact merit function, but is expensive to compute. Filter methods accept a step in an infeasible method if the pair of the objective and a measure of infeasibility is not dominated by a previous pair.

Maratos effect [@Maratos1978] refers to the failure to converge rapidly of some algorithms based on merit functions or filters, e.g. an SQP method, because they reject steps that make good progress toward a solution but increase both the objective and the constraint violation. To overcome the Maratos effect, one may add a correction term to decrease the constraint violation. Second-order correction step $\hat p_k$ for an equality-constrained problem, give a step $p_k$ at iterate $x_k$, is the minimum-norm root of the linearized constraints at the current iterate from the proposed new iterate: $\hat p_k = -J_k^\dagger c(x_k + p_k)$, where $J_k^\dagger = J_k^T (J_k J_k^T)^{-1}$ and $J_k = \nabla c(x_k)$. The second-order correction step can reduce the constraint violation at the corrected new iterate to the third order of the distance between the current iterate and the local optimum: if $J_k p_k + c(x_k) = 0$, then $|c(x_k + p_k + \hat p_k)| = \mathcal{O}(|x_k - x^∗|^3)$; note that the current iterate does not need to satisfy the constraints numerically. The cost of computing a second-order correction step includes constraint function evaluation at the proposed new iterate, $c(x_k + p_k)$, and backsolve $(J_k J_k^T) \delta_k = c(x_k + p_k)$; such cost is outweighed by added robustness and efficiency.

Penalty Methods

For constrained optimization, one may replace the original problem by another or a sequence of unconstrained problems, where the objective includes additive terms representing the constraints. Penalty function $\phi(x; \mu)$ for constrained optimization is the original objective plus one additional penalty term for each constraint, and all penalty terms are multiplied by a penalty parameter: $\phi(x; \mu) = f(x) + \mu \sum_{i \in \mathcal{E \cup I}} \varphi(c_i(x))$. Penalty term $\varphi(c_i)$ for a constraint is positive where the constraint is violated and zero otherwise: for $i \in \mathcal{E}$, $\varphi(0) = 0$, $\forall x \ne 0$, $\varphi(x) > 0$; for $i \in \mathcal{I}$, $\forall x \le 0$, $\varphi(x) = 0$, $\forall x > 0$, $\varphi(x) > 0$. Penalty parameter $\mu$ is a positive scalar that penalizes constraint violations with increasing severity, by forcing the minimizer of the penalty function closer to the original feasible region. For example, quadratic penalty uses $\varphi(x) = x^2$ for equality constrains and $\varphi(x) = \max(x, 0)^2$ for inequality constrains; l1 penalty uses $\varphi(x) = |x|$ for equality constrains and $\varphi(x) = \max(x, 0)$ for inequality constrains.

Sequential Quadratic Programming

Sequential quadratic programming (SQP) methods model the original problem by a quadratic programming subproblem at each iterate and define the search direction to be the solution of this subproblem.

Miscellaneous

History of optimization theory:

early 1900s - 1949, prehistory;
1949 - mid 1980s, Fenchel-Rockafellar era: duality theory;
mid 1980s - present, modern era: algorithms; a change in the assumptions underlying the field.

Milestones in optimization.

对近代最优化(通用算法)发展，我经常讲几个里程碑，上世纪40年代中期的线性规划的单纯形算法 (simplex algorithm [@Dantzig1947])， 60年代的拟牛顿法 (quasi-Newton methods, e.g. DFP [@Davidon1959])， 80年代的内点法 (interior-point methods, e.g. [@Karmarkar1984])，我总是把它归结为20年一个阶段。 2005年后，ADMM (alternating direction method of multipliers, see [@Boyd et al., 2011]) 是被学术界最认可的一个重要优化方法之一。 (何炳生)

Duality in general means two different views of the same object. The dual problem of a discrete problem is continuous/convex, which provides a lower bound and other important information for the solution of the discrete primal.

Problem ranking in increasing practical difficulty:

Linear programming (LP) and (convex) quadratic programming (QP): network flows;
Second order cone programming (SOCP);
Semi-definite programming (SDP);
Convex programming (CP): separable, large sum; network flows, monotropic programming, geometric programming;
Non-convex/non-continuous programming (NCP): twice differentiable, quasi-convex programming;
Integer programming (IP);

Others:

Sampling-based, zeroth-order, derivative-free methods: genetic algorithms, simulated annealing.
Bayesian optimization (see e.g. [@Snoek2012]).

References

Nocedal and Wright, 1999, 2006. Numerical Optimization.
Boyd and Vandenberghe, 2004. Convex Optimization.

🏷 Category=Optimization