General form of optimization on manifolds.
General form of iterative optimization methods: Euclidean, $x_{t+1} = x_t + v_t$, where $v_t$ is the update/step vector; Riemannian, $x_{t+1} = R(x_t, v_t)$, where $R$ is a retraction and $v_t \in T_{x_t} M$. Riemannian gradient descent on a submanifold with projection as the retraction: $x_{t+1} = \Pi(x_t - c_t \nabla f(x_t))$, where $c_t > 0$ is the stepsize coefficient and $\Pi$ is the projection onto the manifold.
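A minimal sketch of the projected update on a concrete submanifold. The example problem (minimizing $f(x) = x^\top A x$ over the unit sphere, which converges to an eigenvector of the smallest eigenvalue of $A$), the matrix, the stepsize, and the iteration count are all assumptions of this sketch, not fixed by the text:

```python
import numpy as np

# Hypothetical example: minimize f(x) = x^T A x on the unit sphere S^{n-1}
# with the projected update x_{t+1} = Pi(x_t - c_t * grad f(x_t)),
# where Pi(y) = y / ||y|| is the projection onto the sphere.
rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
A = B + B.T                      # symmetric; min of f on the sphere = lambda_min(A)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)           # start on the manifold

c = 0.05                         # fixed stepsize coefficient c_t (assumed)
for _ in range(2000):
    g = 2 * A @ x                # Euclidean gradient of f(x) = x^T A x
    x = x - c * g                # Euclidean step v_t = -c_t * grad f(x_t)
    x /= np.linalg.norm(x)       # projection Pi back onto the sphere (retraction)

f_val = x @ A @ x
lam_min = np.linalg.eigvalsh(A)[0]   # eigvalsh returns eigenvalues in ascending order
print(f_val, lam_min)                # f_val should be close to lam_min
```

Note that the iterate stays on the manifold at every step by construction; the Euclidean step is free to leave it, and the projection pulls it back.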
Retraction $R(x, v)$ around a point $x_0$ in a $C^k$ manifold $M$, $k \ge 2$, is a $C^{k-1}$ mapping from a neighborhood $U$ of $(x_0, 0)$ in the tangent bundle to the manifold, such that (a) it maps every zero tangent vector to its point of tangency, and (b) the differential of every restricted retraction evaluated at the zero tangent vector is the identity map of the tangent space: $\forall (x, 0) \in U$, (a) $R(x, 0) = x$; (b) $d(R_x)_0 = \text{Id}$, where $R_x = R|_{U \cap T_x M}$ and $d(R_x)_0 : T_0 (T_x M) \cong T_x M \to T_x M$, under the usual identification of $T_0 (T_x M)$ with $T_x M$. Retraction $R(x, v)$ on a $C^k$ manifold $M$, $k \ge 2$, is a $C^{k-1}$ mapping from an open subset of the tangent bundle to the manifold that is a retraction around every point of the manifold: $R: \mathscr{D} \to M$, $\mathscr{D} \subset T M$, $\{(x, 0) : x \in M\} \subset \mathscr{D}$, and there exist open sets $(U_i)_{i \in I}$ with $\mathscr{D} = \cup_{i \in I} U_i$ such that for every $i \in I$, $R|_{U_i}$ is a retraction around some point $x \in M$. Restricted retraction $R_x(v)$ at a point $x \in M$ is the restriction of the retraction to the intersection of its domain with the tangent space at the point: $R_x = R|_{\mathscr{D} \cap T_x M}$.
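Both axioms can be checked numerically for a concrete retraction. The choice of the normalization retraction $R_x(v) = (x + v)/\|x + v\|$ on the unit sphere is an assumption of this sketch; axiom (b) is checked via a finite-difference approximation of $d(R_x)_0$:

```python
import numpy as np

# Numerical check of the two retraction axioms, using (as an assumed example)
# the normalization retraction R_x(v) = (x + v) / ||x + v|| on the unit sphere.
def R(x, v):
    y = x + v
    return y / np.linalg.norm(y)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
x /= np.linalg.norm(x)           # a point on the sphere

# (a) the zero tangent vector maps to its point of tangency: R(x, 0) = x
assert np.allclose(R(x, np.zeros_like(x)), x)

# (b) d(R_x)_0 = Id on T_x M: for a tangent vector v (v orthogonal to x),
# (R(x, t v) - x) / t -> v as t -> 0.
v = rng.standard_normal(4)
v -= (v @ x) * x                 # project onto the tangent space T_x S^{n-1}
t = 1e-6
finite_diff = (R(x, t * v) - x) / t
print(np.linalg.norm(finite_diff - v))   # error shrinks like O(t)
```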
Retraction of order $k$: a retraction whose curves agree with geodesics up to order $k$, i.e. $\operatorname{dist}(R_x(tv), \exp_x(tv)) = O(t^{k+1})$ as $t \to 0$, where $\exp$ is the exponential map.
The retraction as defined earlier is a first-order retraction: axioms (a) and (b) give agreement with the geodesic to first order in $t$.
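A hedged numerical illustration of the order of approximation, assuming the usual convention that order $k$ means $\|R_x(tv) - \exp_x(tv)\| = O(t^{k+1})$. On the unit sphere, the normalization retraction in fact agrees with the exponential map to second order, so halving $t$ should divide the error by about $2^3 = 8$:

```python
import numpy as np

# Compare the normalization retraction on the sphere (assumed example) with
# the exponential map exp_x(t v) = cos(t) x + sin(t) v for a unit tangent v.
def R(x, v):
    y = x + v
    return y / np.linalg.norm(y)

def exp_map(x, v):
    t = np.linalg.norm(v)
    return np.cos(t) * x + np.sin(t) * (v / t)

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
x /= np.linalg.norm(x)
v = rng.standard_normal(3)
v -= (v @ x) * x
v /= np.linalg.norm(v)           # unit tangent vector at x

err = lambda t: np.linalg.norm(R(x, t * v) - exp_map(x, t * v))
ratio = err(0.1) / err(0.05)     # error ~ t^3, so the ratio should be near 8
print(ratio)
```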
Retractions for the same manifold can be compared by (1) their order of approximation to the exponential map and (2) their computational efficiency.
Retractions can be constructed for:
Constrained optimization (projected updates; uses projection onto the feasible set as the retraction):
Riemannian optimization (on an intrinsic manifold; the exponential map is the default retraction):
Riemannian optimization is mostly applied to matrix manifolds; see [@Absil2008] for a book-length treatment.
Escaping saddle points (in smooth optimization):