Optimization is the pursuit of optimal points in practical/computational problems. Although calculus easily solves this task when the function is differentiable and unconstrained, optimization mostly deal with constrained problems and continuity is not guaranteed.

Dimensions of optimization settings: (simple vs. complex)

- unconstrained vs. constrained;
- continuous vs. discrete vs. mixed;
- deterministic vs. stochastic (infinite dimensional, dynamic);
- single-objective vs. multi-objective vs. multi-agent;

A quantity can be modeled as **continuous**
if your instrument can take on any value its resolution allows.
A quantity can be modeled as **stochastic**
if its generating mechanism is inherently stochastic,
or you acknowledge ignorance towards its mechanism,
or you intentionally introduce ignorance by reducing information.

Subdivision of continuous, deterministic, single-objective optimization problems:

- Convex Optimization: linear programming;
- smooth optimization: gradient dominated;
- non-smooth, non-convex optimization: locally Lipschitz; piecewise smooth;

**Optimization problem** is any problem of minimizing an objective function over a feasible set.
Given **optimization variable** $x \in \mathbb{R}^n$ and real-valued functions:
$f_0$, the **objective function** (cost function);
**inequality constraint functions** $f_i$ and **equality constraint functions** $h_i$,
the **standard form** of an optimization problem is:
$$\begin{align}
\min \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \le 0 && i = 1,\dots,m \\
& h_i(x) = 0 && i = 1,\dots,p
\end{align}$$
The **epigraph form** of an optimization problem is:
$$\begin{align}
\min \quad & t \\
\text{s.t.} \quad & f_0(x) - t \le 0 \\
& f_i(x) \le 0 && i = 1,\dots,m \\
& h_i(x) = 0 && i = 1,\dots,p
\end{align}$$

**Domain** of an optimization problem
is intersection of the domains of the objective and all constraints:
$D = (\cap_i \text{dom}f_i) ∩ (\cap_j \text{dom}h_j)$.
**Feasible set** or **constraint set** $X$ of an optimization problem
is the subset of its domain that satisfy all the constraints.
An inequality constraint is **active** at a feasible point,
if the inequality binds: $\exists x \in X: f_i(x) = 0$.
Otherwise it is **inactive** at that point.
A constraint is **redundant** if it doesn't change the feasible set.
**Active set** $\mathcal{A}(x)$ at a feasible point
is the set of indices of the equality constraints and the active inequality constraints:
$\mathcal{A}(x) = \mathcal{E} \cup \{i \in \mathcal{I} : c_i(x) = 0\}$.

**Locally optimal point** is a feasible point that optimizes the objective function
over the feasible set within one of its neighborhood: $x = \inf { f_0(X \cap B_x(R)) }$.

**Optimal value** of an optimization problem is the infimum of the image of the feasible set,
which takes value of extended real numbers: $p^∗ = \inf \{f_0(X)\}$, where $p$ is for primal.
By convention, the infimum of an empty set is infinity.
If the image of feasible points can be arbitrarily low, the optimal value is negative infinity.
**Optimal set** of an optimization problem
is the intersection of the preimage of its optimal value and its feasible set:
$X_{\text{opt}} = f_0^{-1}( p^∗ ) \cap X$.
**Optimal point** is a point of the optimal set.

For $\varepsilon>0$, the **ε-suboptimal set** of an optimization problem
is defined as $X_{\text{opt}} = f_0^{-1}[p^∗ , p^∗ + \varepsilon] \cap X$.
An **$\varepsilon$-suboptimal point** is a point of the $\varepsilon$-suboptimal set.

An optimization problem is **unconstrained** if there is no constraints: $m = p = 0$.
An optimization problem is **infeasible**, if its feasible set is empty.
Informally, two optimization problems are **equivalent**
if from a solution of one, a solution of the other is readily found, and vice versa.
It is possible but complicated to give a formal definition of such equivalence.

**Ascent direction** of a function at a point
is a direction in which the function locally strictly increases.
**Direction of steepest ascent** of a function at a point
is an ascent direction with the largest directional derivative.

**Line search** methods select a descent direction and then a step size at each iteration.
**Trust region** methods approximate the objective in a region around the current iterate
with a (usually quadratic) model, and then select a direction according to the model;
the region size is increased if objective decrease is satisfactory, otherwise it is decreased.

Unconstrained optimization methods:

- gradient/steepest descent [@Cauchy1847];
- gradient methods with momentum: conjugate gradient (CG), heavy ball method [@Polyak1964]; accelerated gradient methods [@Nesterov1983];
- coordinate descent (CD);

Unconstrained optimization of finite sums / expectations (see e.g. [@Sra2011; @Bottou2018]):

- stochastic gradient descent (SGD) [@Robbins1951];
- noise reduction methods:
- iterate averaging [@Polyak1992];
- gradient aggregation: stochastic average gradient (SAG) [@LeRoux2012]; stochastic variance-reduced gradient (SVRG) [@Johnson2013]; SAGA (stochastic average gradient augmented) [@Defazio2014];
- dynamic sample size [@Byrd2012; @Hashemi2014];

- second-order methods:
- natural gradient method, Hessian-free inexact Newton methods, Gauss-Newton methods, quasi-Newton methods;
- diagonal scaling: adaptive gradient (AdaGrad) [@Duchi2011], root mean square propagation (RMSprop) [@Tieleman2012], adaptive moment estimation (Adam) [@Kingma2015];

- randomized CD [@Nesterov2012], parallel stochastic CD [@LiuJ2013], stochastic dual coordinate ascent (SDCA) [@Shalev-Shwartz2013];

Convex function has identical local and global minimum sets. There are NP-hard convex optimization problems. The simplex method has exponential complexity. Interior-point methods can be proved to solve (to a specified accuracy) some convex optimization problems in polynomial time.

Convex optimization methods:

- Simplex algorithm, aka simplex method [@Dantzig1947] for linear programs;
- Interior-point methods, aka barrier methods [@Karmarkar1984];
- Alternating direction method of multipliers (ADMM) (see e.g. [@Boyd2011]);

Conditional gradient, aka Frank-Wolfe [@Frank and Wolfe, 1956], for convex optimization: move along the maximum descent vector of the linearized objective over the constraint set, with a diminishing step size.

Other methods for convex programs: subgradient methods (descent methods); cutting plane methods (outer approximation); simplicial decomposition (inner approximation); proximal methods (regularization).

**Interior-point methods** solve large linear programs
with techniques for nonlinear programs and nonlinear equations,
where all iterates satisfy the inequality constraints strictly.
**Primal-dual method** is a subclass of interior-point method
that solves the primal-dual system of the original problem,
which has the most efficient practical approaches.
Consider the linear programming problem in standard form:
$$\begin{align}
\min \quad & c^T x && c \in \mathbb{R}^n \\
\text{s.t.} \quad & A x = b && b \in \mathbb{R}^m \\
& x \ge 0
\end{align}$$
Its dual problem is:
$$\begin{align}
\max \quad & b^T \lambda && b \in \mathbb{R}^m \\
\text{s.t.} \quad & A^T \lambda + s = c && s \in \mathbb{R}^n \\
& s \ge 0
\end{align}$$
Solutions of the primal and the dual problems are characterized by the KKT conditions:
$A^T \lambda + s = c$, $A x = b$, $x \circ s = 0$, and $x^T s \ge 0$.
To find solutions $(x^∗, \lambda^∗, s^∗ )$ of the KKT system,
primal-dual methods apply variants of Newton's method to the equalities,
and modify the search directions and step lengths so that
the inequalities are satisfied strictly at every iteration.

**Primal-dual feasible set** is defined as:
$\mathcal{F} = \{(x, \lambda, s) : A x = b, A^T \lambda + s = c, x^T s \ge 0 \}$.
**Primal-dual strictly feasible set** satisfies the inequality strictly:
$\mathcal{F}^o = \{(x, \lambda, s) : A x = b, A^T \lambda + s = c, x^T s > 0 \}$.
**Central path** $C(\tau)$ is a parameterized curve of strictly feasible points
that satisfies a modified version of the third equality in the KKT conditions:
$x \circ s = \tau \mathbf{1}$, where $\tau \in (0, \infty)$.
**Path-following primal-dual methods** restrict the iterates to a neighborhood of the central path
and follow the central path to a solution of the linear program.
These methods are similar to path-following (aka continuation or homotopy) methods
for tracing solution curves of nonlinear equations.

**Smooth optimization** refers to programs
where the objective satisfies certain differentiability property.
A general nonlinear constrained optimization problem can be written as [@Nocedal2006, Ch. 15]:
$$\begin{align}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{s.t.} \quad & c_i(x) = 0 && i \in \mathcal{E} \\
\quad & c_j(x) \le 0 && j \in \mathcal{I}
\end{align}$$
Here, the objective $f(x)$ and the constraints $c_i(x)$ are smooth real-valued functions,
and $\mathcal{E}$ and $\mathcal{I}$ are finite index sets of equality and inequality constraints.
**Quadratic program** (QP), for exmaple, is such an optimization problem
where the objective function is quadratic and the constraints are linear.

Smooth optimization methods:

- Penalty methods: augmented Lagrangian method, aka method of multipliers [@Hestenes1969; @Powell1969].
- Active-set methods (~1970s), for small- and medium-sized problems with inequality constraints:
- Gradient projection method, or projected gradient descent (PGD) [@Goldstein1964; @Levitin and Polyak, 1966] for convex sets, e.g. bound constraints.
- Sequential quadratic programming (SQP) methods.

- Interior-point methods, aka barrier methods, for nonlinear programming (~1990s).

**Feasible methods** for constrained optimization are algorithms where
the starting point and all subsequent iterates satisfy all or some of the constraints.
Such methods are necessary when the objective is defined only
when some of the constraints are satisfied.
**Infeasible methods** for constrained optimization allow iterates to violate the constraints.
These methods often require a balance between reducing the objective and satisfying the constraints,
e.g. by using merit function or filter.

**Merit function** $\phi(x)$ for infeasible methods of constrained optimization
combines the objective with measures of constraint violation.
A step in an infeasible method will be accepted
only if it leads to a sufficient reduction in the merit function.
**Exact merit function** $\phi(x; \mu)$ is a merit function
whose local minimizers contains those of the original problem,
once the penalty parameter exceeds a certain value.
The $l_1$ penalty function is an exact merit function, but is not smooth.
**Fletcher's augmented Lagrangian** for equality-constrained problems is defined as:
$\phi_F(x; \mu) = f(x) - \lambda(x)^T c(x) + \frac{\mu}{2}\sum_{i \in \mathcal{E}} c_i(x)^2$,
where Lagrange multiplier estimate $\lambda(x) = (J J^T)^{-1} J \nabla f$ and
Jacobian of equality constraints $J = \nabla c$.
Fletcher's augmented Lagrangian is a smooth and exact merit function, but is expensive to compute.
**Filter methods** accept a step in an infeasible method
if the pair of the objective and a measure of infeasibility is not dominated by a previous pair.

**Maratos effect** [@Maratos1978] refers to the failure to converge rapidly
of some algorithms based on merit functions or filters, e.g. an SQP method,
because they reject steps that make good progress toward a solution
but increase both the objective and the constraint violation.
To overcome the Maratos effect, one may add a correction term to decrease the constraint violation.
**Second-order correction step** $\hat p_k$ for an equality-constrained problem,
give a step $p_k$ at iterate $x_k$, is the minimum-norm root of the linearized constraints
at the current iterate from the proposed new iterate: $\hat p_k = -J_k^\dagger c(x_k + p_k)$,
where $J_k^\dagger = J_k^T (J_k J_k^T)^{-1}$ and $J_k = \nabla c(x_k)$.
The second-order correction step can reduce the constraint violation at the corrected new iterate
to the third order of the distance between the current iterate and the local optimum:
if $J_k p_k + c(x_k) = 0$, then $|c(x_k + p_k + \hat p_k)| = \mathcal{O}(|x_k - x^∗|^3)$;
note that the current iterate does not need to satisfy the constraints numerically.
The cost of computing a second-order correction step includes
constraint function evaluation at the proposed new iterate, $c(x_k + p_k)$,
and backsolve $(J_k J_k^T) \delta_k = c(x_k + p_k)$;
such cost is outweighed by added robustness and efficiency.

For constrained optimization, one may replace the original problem by another or a sequence of
unconstrained problems, where the objective includes additive terms representing the constraints.
**Penalty function** $\phi(x; \mu)$ for constrained optimization
is the original objective plus one additional penalty term for each constraint,
and all penalty terms are multiplied by a penalty parameter:
$\phi(x; \mu) = f(x) + \mu \sum_{i \in \mathcal{E \cup I}} \varphi(c_i(x))$.
**Penalty term** $\varphi(c_i)$ for a constraint
is positive where the constraint is violated and zero otherwise:
for $i \in \mathcal{E}$, $\varphi(0) = 0$, $\forall x \ne 0$, $\varphi(x) > 0$;
for $i \in \mathcal{I}$, $\forall x \le 0$, $\varphi(x) = 0$, $\forall x > 0$, $\varphi(x) > 0$.
**Penalty parameter** $\mu$ is a positive scalar
that penalizes constraint violations with increasing severity,
by forcing the minimizer of the penalty function closer to the original feasible region.
For example, **quadratic penalty** uses $\varphi(x) = x^2$ for equality constrains
and $\varphi(x) = \max(x, 0)^2$ for inequality constrains;
**l1 penalty** uses $\varphi(x) = |x|$ for equality constrains
and $\varphi(x) = \max(x, 0)$ for inequality constrains.

Sequential quadratic programming (SQP) methods model the original problem by a quadratic programming subproblem at each iterate and define the search direction to be the solution of this subproblem.

History of optimization theory:

- early 1900s - 1949, prehistory;
- 1949 - mid 1980s, Fenchel-Rockafellar era: duality theory;
- mid 1980s - present, modern era: algorithms; a change in the assumptions underlying the field.

Milestones in optimization.

对近代最优化(通用算法)发展，我经常讲几个里程碑， 上世纪40年代中期的线性规划的单纯形算法 (simplex algorithm [@Dantzig1947])， 60年代的拟牛顿法 (quasi-Newton methods, e.g. DFP [@Davidon1959])， 80年代的内点法 (interior-point methods, e.g. [@Karmarkar1984])， 我总是把它归结为20年一个阶段。 2005年后，ADMM (alternating direction method of multipliers, see [@Boyd et al., 2011]) 是被学术界最认可的一个重要优化方法之一。 (何炳生)

Duality in general means two different views of the same object. The dual problem of a discrete problem is continuous/convex, which provides a lower bound and other important information for the solution of the discrete primal.

Problem ranking in increasing practical difficulty:

- Linear programming (LP) and (convex) quadratic programming (QP): network flows;
- Second order cone programming (SOCP);
- Semi-definite programming (SDP);
- Convex programming (CP): separable, large sum; network flows, monotropic programming, geometric programming;
- Non-convex/non-continuous programming (NCP): twice differentiable, quasi-convex programming;
- Integer programming (IP);

Others:

- Sampling-based, zeroth-order, derivative-free methods: genetic algorithms, simulated annealing.
- Bayesian optimization (see e.g. [@Snoek2012]).

- Nocedal and Wright, 1999, 2006. Numerical Optimization.
- Boyd and Vandenberghe, 2004. Convex Optimization.