General form of optimization on manifolds.

  • Riemannian manifold: $\min_{x \in \mathcal{M}} f(x)$.
  • Regular level set: $\min_{x \in \mathbb{R}^n} f(x)$, $\text{s.t.}~\phi(x) = 0$, where $\phi: \mathbb{R}^n \mapsto \mathbb{R}^c$ has $0$ as a regular value.

General form of optimization methods: Euclidean, $x_{t+1} = x_t + v_t$, where $v_t$ is the update/step vector; Riemannian, $x_{t+1} = R(x_t, v_t)$, where $R$ is a retraction. Riemannian gradient descent on an embedded submanifold with projection as the retraction: $x_{t+1} = \Pi(x_t - c_t \nabla f(x_t))$, where $c_t > 0$ is the stepsize coefficient.
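A minimal sketch of this projected update on the unit sphere, minimizing $f(x) = x^\top A x$ (whose minimizer is an eigenvector of the smallest eigenvalue of $A$); the test matrix, stepsize, and iteration count are illustrative choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                           # symmetric test matrix (illustrative)

grad_f = lambda x: 2 * A @ x                # Euclidean gradient of f(x) = x^T A x
project = lambda y: y / np.linalg.norm(y)   # metric projection onto the sphere

x = project(rng.standard_normal(n))
c = 0.5 / np.linalg.norm(A, 2)              # fixed stepsize coefficient c_t (illustrative)
for _ in range(500):
    x = project(x - c * grad_f(x))          # x_{t+1} = Pi(x_t - c_t grad f(x_t))

print("f(x) =", x @ A @ x, "  lambda_min(A) =", np.linalg.eigvalsh(A)[0])
```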

Concepts

Retraction $R(x, v)$ around a point $x_0$ in a $C^k$ manifold $M$ (weaker than a smooth manifold), $k \ge 2$, is a $C^{k-1}$ mapping from a neighborhood $U$ of $(x_0, 0)$ in the tangent bundle to the manifold, such that (a) it maps the zero tangent vector at each point to that point, and (b) the differential of each restricted retraction evaluated at the zero tangent vector is the identity map of the tangent space: $\forall (x, 0) \in U$, (a) $R(x, 0) = x$; (b) $d(R_x)_0 = \text{Id}$, where $R_x = R|_{U \cap T_x M}$ and $d(R_x)_0 : T_0 (T_x M) \mapsto T_x M$, under the usual identification of $T_0 (T_x M)$ with $T_x M$. Retraction $R(x, v)$ on a $C^k$ manifold $M$, $k \ge 2$, is a $C^{k-1}$ mapping from an open subset of the tangent bundle to the manifold, such that it is a retraction around every point in the manifold: $R: \mathscr{D} \mapsto M$, $\mathscr{D} \subset T M$, $\{(x, 0) : x \in M\} \subset \mathscr{D}$, $\exists (U_i)_{i \in I}$: $\mathscr{D} = \cup_{i \in I} U_i$, $\forall i \in I$, $R|_{U_i}$ is a retraction around some point $x \in M$. Restricted retraction $R_x(v)$ at a point $x \in M$ is the restriction of the retraction to the intersection of its domain with the tangent space at the point: $R_x = R|_{\mathscr{D} \cap T_x M}$.
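As a quick illustration (not from the source), the two defining properties can be checked numerically for the normalization retraction $R_x(v) = (x + v)/\|x + v\|$ on the unit sphere; the finite-difference check of (b) is only approximate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.standard_normal(n); x /= np.linalg.norm(x)
v = rng.standard_normal(n); v -= (x @ v) * x          # tangent vector: x^T v = 0

retract = lambda x, v: (x + v) / np.linalg.norm(x + v)

# (a) R_x(0) = x
print(np.allclose(retract(x, np.zeros(n)), x))

# (b) d(R_x)_0 = Id: the curve t -> R_x(t v) has initial velocity v
t = 1e-6
print(np.allclose((retract(x, t * v) - x) / t, v, atol=1e-5))
```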

Second-order retraction on a $C^k$ manifold, $k \ge 3$, is a retraction on the manifold that has zero initial acceleration: $\forall (x, v) \in T M$, $D_t \gamma'(0) = 0$, where $\gamma: I \mapsto M$, $I = \{t : (x, t v) \in \mathscr{D}\}$, and $\gamma(t) = R_x(t v)$. Retractions can be generalized to higher orders in several ways. A retraction of order $i$ on a $C^k$ manifold, $1 \le i < k$, is a retraction that induces curves on the manifold whose velocities have zero initial covariant derivatives up to the $(i-1)$-th order: $\forall (x, v) \in T M$, $\forall j \in \{1, \cdots, i - 1\}$, $D_t^j \gamma'(0) = 0$. Alternatively, a retraction of order $i$ on a $C^k$ manifold, $1 \le i < k$, is a $C^{k-1}$ mapping from an open subset of the tangent bundle to the manifold that agrees with the exponential map to the $i$-th order at the zero tangent vectors: $\forall (x, v) \in \mathscr{D}$, $R_x(t v) = \exp_x(t v) + o(t^i)$, in the sense that $\lim_{t \to 0} d_g(R_x(t v), \exp_x(t v)) / t^i = 0$. The retraction as defined earlier is a first-order retraction; the second-order retraction is also consistent with this definition (see [@Absil2012, Prop 3] for a similar notion).
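A small numerical check of the order of agreement, using the Euclidean distance in the ambient space as a stand-in for $d_g$ at small $t$ (an illustrative assumption): on the unit sphere, the normalization (projective) retraction is second order, so its deviation from the exponential map should decay like $t^3$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
x = rng.standard_normal(n); x /= np.linalg.norm(x)
v = rng.standard_normal(n); v -= (x @ v) * x; v /= np.linalg.norm(v)   # unit tangent

retract = lambda t: (x + t * v) / np.linalg.norm(x + t * v)
exp_map = lambda t: np.cos(t) * x + np.sin(t) * v    # sphere geodesic for a unit-speed tangent

for t in (1e-1, 1e-2, 1e-3):
    err = np.linalg.norm(retract(t) - exp_map(t))
    print(f"t = {t:.0e}   err = {err:.2e}   err / t^2 = {err / t**2:.2e}")
# err / t^2 tends to 0 (err is about t^3 / 3), consistent with a second-order retraction.
```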

Pullback $R^* f$ or $f \circ R$ of a real-valued function on a $C^k$ manifold by a retraction is the real-valued function on the domain of the retraction defined by the pullback of the function as a 0-form, which is just the composition of the function with the retraction: $R^* f: \mathscr{D} \mapsto \mathbb{R}$, $\forall (x, v) \in \mathscr{D}$, $(R^* f)(x, v) = f(R(x, v)) = f(R_x(v))$; or simply, $R^* f = f \circ R$. Pullback $R_x^* f$ or $f \circ R_x$ at a point is the restriction of the pullback to the intersection of its domain with the tangent space at that point: $R_x^* f = (R^* f)|_{\mathscr{D} \cap T_x M}$; or simply, $R_x^* f = f \circ R_x$. For every twice-differentiable real-valued function and every second-order retraction on a $C^k$ manifold, the Hessian operator of the function at each point equals the Hessian of its pullback by the retraction at that point, evaluated at the zero tangent vector [@Absil2008, Prop 5.5.5]: $\forall x \in M$, $(\text{Hess}~f)_x = \text{Hess}(f \circ R_x)(0)$.
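The following sketch illustrates [@Absil2008, Prop 5.5.5] numerically on the unit sphere, with the (second-order) projection retraction and $f(x) = x^\top A x$; the tangent basis, the finite-difference scheme, and the closed-form Riemannian Hessian used for comparison are illustrative assumptions, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
B = rng.standard_normal((n, n)); A = (B + B.T) / 2

x = rng.standard_normal(n); x /= np.linalg.norm(n * [1] @ np.eye(n) * 0 + x)  # normalize x
Q, _ = np.linalg.qr(np.column_stack([x, rng.standard_normal((n, n - 1))]))
T = Q[:, 1:]                                   # orthonormal basis of the tangent space at x

f = lambda y: y @ A @ y
pullback = lambda u: f((x + T @ u) / np.linalg.norm(x + T @ u))   # (f o R_x)(T u)

# Finite-difference Hessian of the pullback at u = 0.
h, m = 1e-3, n - 1
H_fd = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        e_i, e_j = np.eye(m)[i], np.eye(m)[j]
        H_fd[i, j] = (pullback(h*e_i + h*e_j) - pullback(h*e_i - h*e_j)
                      - pullback(-h*e_i + h*e_j) + pullback(-h*e_i - h*e_j)) / (4*h*h)

# Riemannian Hessian of f at x in the same basis: Hess f(x)[w] = 2 (P_x(A w) - (x^T A x) w).
H_riem = 2 * (T.T @ A @ T - (x @ A @ x) * np.eye(m))

print(np.allclose(H_fd, H_riem, atol=1e-4))
```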

Retractions can be constructed for:

  • submanifold:
    • inverse of a diffeomorphism to the ambient manifold, containing an identity map on the submanifold [@Absil2008, Prop 4.1.2]: matrix decomposition;
    • projection-like retractions [@Absil2012]:
      • projective retractions (minimum distance),
      • orthographic retractions (along normal space at point of tangent);
  • quotient manifold: quotient of a retraction on the domain via horizontal lift, if well-defined [@Absil2008, Prop 4.1.3];
  • abstract smooth manifold: inverse of local coordinate charts;

Retractions on Stiefel manifolds are mostly based on matrix decompositions.

  • Sphere: normalization, $R(x, v) = \frac{x + v}{\|x + v\|}$.
  • Orthogonal group: write the tangent vector as $V = X \Omega$, where $\Omega$ is a skew-symmetric matrix;
    • (1) QR decomposition, via Gram-Schmidt orthonormalization: $R(X, X \Omega) = X \text{qf}(I + \Omega)$, where $\text{qf}(M) = Q$ such that $M = Q R$, $R \in U_+(n)$, the set of upper triangular matrices with positive diagonal entries;
    • (2) polar decomposition, via eigen-decomposition: $R(X, X \Omega) = X (I + \Omega) (I - \Omega^2)^{-1/2}$;
    • (3) Givens rotations: $R(X, X \Omega) = X~\text{Giv}(\Omega)$, where $\text{Giv}(\Omega) = \prod_{i < j} G(i, j, \Omega_{ij})$, $[G(i, j, \theta)]_{(i,j)} = \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix}$, and $[G(i, j, \theta)]_{k,l} = \delta_{kl}$ otherwise;
    • (4) Cayley transform: $R(X, X \Omega) = X (I - \Omega/2)^{-1}(I + \Omega/2)$;
    • (5) exponential map: $\exp(X, X \Omega) = X e^\Omega$, where the matrix exponential $e^\Omega = \sum_{i=0}^\infty \frac{1}{i!} \Omega^i$; algorithms for accurately evaluating this exponential map have a numerical cost at best similar to that of the polar decomposition (2).
  • General Stiefel manifold:
    • (1) QR decomposition, via the modified Gram-Schmidt algorithm: $R(X, V) = \text{qf}(X + V)$;
    • (2) polar decomposition, via eigen-decomposition: $R(X, V) = (X + V) (I_k + V^T V)^{-1/2}$, which involves an eigen-decomposition in $O(k^3)$ time and $O(n k^2)$ scalar additions and multiplications; the numerical cost is reasonable when $k$ is small.

These retractions are all defined for the entire tangent bundle.
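A minimal sketch (illustrative, not taken from the references) of the QR and polar retractions on the Stiefel manifold $\{X \in \mathbb{R}^{n \times k} : X^T X = I_k\}$; the tangent-space projection used to build a test tangent vector assumes the embedded (Euclidean) metric.

```python
import numpy as np

def qf(M):
    """Q factor of the QR decomposition, with the R factor having a positive diagonal."""
    Q, R = np.linalg.qr(M)
    s = np.sign(np.diag(R)); s[s == 0] = 1
    return Q * s                                   # flip column signs accordingly

def retract_qr(X, V):
    return qf(X + V)

def retract_polar(X, V):
    # (X + V)(I_k + V^T V)^{-1/2}, computed via a symmetric eigen-decomposition, O(k^3)
    k = X.shape[1]
    w, U = np.linalg.eigh(np.eye(k) + V.T @ V)
    return (X + V) @ U @ np.diag(w ** -0.5) @ U.T

# Example: a random point and tangent vector on St(5, 2).
rng = np.random.default_rng(4)
X = qf(rng.standard_normal((5, 2)))
W = rng.standard_normal((5, 2))
V = W - X @ (X.T @ W + W.T @ X) / 2                # project onto the tangent space at X
for R in (retract_qr, retract_polar):
    Y = R(X, V)
    print(R.__name__, np.allclose(Y.T @ Y, np.eye(2)))
```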

Retractions on Grassmann manifolds are mostly based on quotients. Let $\tilde M$ be an open, dense subset of a Euclidean space and $\pi: \tilde M \mapsto M$ a Riemannian submersion. Given a retraction $\tilde R$ on the total space, one can construct a retraction on the quotient manifold $M$ as $R(x, v) = \{\pi(\tilde R(\tilde x, \tilde v_{\tilde x})) : \tilde x \in \pi^{-1}(x)\}$, provided this is a mapping, i.e. the set on the right-hand side is a singleton for every $(x, v)$. The retraction on the total space could be $\tilde R(\tilde x, \tilde v) = \tilde x + \tilde v^H$, where $\tilde v^H$ is the projection of $\tilde v$ onto the horizontal tangent space; this gives $R(x, v) = \pi(\tilde x + \tilde v^H_{\tilde x})$ for any $\tilde x \in \pi^{-1}(x)$, or equivalently, $R(\pi(\tilde x), d \pi_{\tilde x} (\tilde v)) = \pi(\tilde x + \tilde v^H)$. In computation, we operate with $(\tilde x, \tilde v)$, and this retraction simply means $\tilde x + \tilde v^H$. By the quotient manifold theorem and the Riemannian submersion theorem, given a Lie group $G$ acting isometrically on $\tilde M$ such that the quotient map $\pi: \tilde M \mapsto \tilde M / G$ sends every point to its orbit, i.e. $\pi(\tilde x) = G \cdot \tilde x$, the set $R(x, v)$ is a singleton and thus defines a retraction on the quotient manifold. Because the general linear group $\text{GL}_k$ acting on the full-rank manifold $M^*_{n,k}$ satisfies this condition, this construction yields a retraction on the Grassmann manifold $G_{k,n} = M^*_{n,k} / \text{GL}_k$.
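A minimal sketch (illustrative) of this quotient retraction on the Grassmann manifold, operating on representatives: a point is the column space of a full-rank $n \times k$ matrix, the horizontal lift at a column-orthonormal representative $X$ satisfies $X^T V = 0$, and the retraction is simply $X + V$. The final check compares column spaces obtained from two representatives of the same point.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 2
X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal representative of a point
W = rng.standard_normal((n, k))
V = W - X @ (X.T @ W)                              # horizontal lift at X: X^T V = 0

Y = X + V          # quotient retraction, computed on representatives
# (in practice one may re-orthonormalize Y, e.g. via QR, as the stored representative)

# Well-definedness on the quotient: changing the representative X -> X G (G invertible)
# changes the horizontal lift to V G, and (X + V) G spans the same column space as X + V.
G = rng.standard_normal((k, k)) + 3 * np.eye(k)    # generic, almost surely invertible
Y2 = X @ G + V @ G
span_proj = lambda M: M @ np.linalg.inv(M.T @ M) @ M.T   # projector onto the column space
print(np.allclose(span_proj(Y), span_proj(Y2)))
```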

For retractions on fixed-rank (rank-$k$) matrix manifolds, see [@Absil2015].

Retractions on the same manifold should be compared by (1) their order of approximation to the exponential map and (2) their computational cost.

Constrained Optimization

Constrained optimization (projected updates, uses projection):

  • (quasi-)Newton methods [@Sra2011, Ch. 11];
  • noisy stochastic gradient [@GeR2015];
  • gradient or conditional gradient, plus quadratic program [@Mokhtari2018];

Riemannian Optimization

Riemannian optimization (on the intrinsic manifold; uses the exponential map by default):

  • smooth:
    • Newton, proposes and uses retraction [@Adler2002];
    • quasi-Newton:
      • BFGS: on submanifolds [@Gabay1982]; use retraction and vector transport [@Ring2012]; avoid differentiated retraction [@HuangW2018];
      • Broyden class (mixed BFGS and DFP), use retraction and vector transport [@HuangW2015];
    • trust-region [@Absil2007]:
      • for low-rank matrix completion (Grassmannian), uses retraction [@Boumal2011];
    • conjugate gradient [@Sato2016];
    • stochastic variance reduced gradient (SVRG), for $C^1$, L-g-smooth functions, applied to matrix manifolds:
      • use exponential map and parallel transport [@ZhangHY2016a];
      • use retraction and vector transport [@Sato2019];
    • gradient descent and trust-region, global rates of convergence [@Boumal2018];
  • geodesic convex:
    • projected (stochastic) (sub-)gradient, for Lipschitz/bounded subgradient/L-g-smooth, for non-positive curvature, uses projection within manifold [@ZhangHY2016b];
    • accelerated gradient descent (AGD), for $C^1$ functions [@ZhangHY2018];
    • simulated annealing, for non-negative sectional curvature [@Goyal2019];
  • nonsmooth [@Absil2019]:
    • gradient sampling, for locally Lipschitz, uses retraction [@Hosseini2017];

Riemannian optimization is mostly applied to matrix manifolds, see [@Absil2008] for a book treatment.

Escaping saddle points (in smooth optimization):

  • noisy SGD, with projected version for equality-constrained optimization [@GeR2015];
  • projected gradient or conditional gradient, on convex set [@Mokhtari2018];
  • perturbed gradient descent on submanifolds [@SunY2018], better rate than [@GeR2015];
  • perturbed Riemannian gradient descent:
    • using exponential map [@SunY2019];
    • using retraction [@Criscitiello2019];