## Polar Decomposition

(Lemma 7.3.1) For any matrix $M \in M_{m,n}$, $m \ge n$, let $M^* M = Y^* \text{diag}(\lambda) Y$ be an eigendecomposition. The nonnegative eigenvalues λ are unique, and the unitary matrix Y of eigenvectors is unique up to rotation within each eigenspace. Given λ and Y, there is an $X \in M_{m,n}$ with orthonormal columns such that $M = X \text{diag}(\sqrt{\lambda}) Y$. If $M$ has full rank, then $X$ is unique. If $M$ is real, then $Y$ and $X$ can be real.

Polar decomposition (Theorem 7.3.2): For any matrix $M \in M_{m,n}$, $m \ge n$, let $P = (M^* M)^{1/2}$ be the positive semidefinite square root of $M^* M$; then there exists a $U \in M_{m,n}$ with orthonormal columns such that $M = U P$. If $M$ has full rank, then $U$ is unique. If $M$ is real, then $P$ and $U$ can be real.
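A minimal NumPy sketch of Theorem 7.3.2 for a real full-rank matrix: build $P$ from the eigendecomposition of $M^T M$ (as in Lemma 7.3.1), then recover the unique orthonormal factor $U = M P^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))           # real, full rank almost surely

# P = (M^T M)^{1/2} via the eigendecomposition of the PSD matrix M^T M
lam, Y = np.linalg.eigh(M.T @ M)          # lam >= 0
P = Y @ np.diag(np.sqrt(lam)) @ Y.T

# Full rank => U = M P^{-1} is the unique factor with orthonormal columns
U = M @ np.linalg.inv(P)

assert np.allclose(U @ P, M)              # M = U P
assert np.allclose(U.T @ U, np.eye(3))    # orthonormal columns
```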

(Theorem 7.3.4) Let M be a square matrix with polar decomposition M = U P. M is normal iff P U = U P.

## Singular Value Decomposition

Theorem (7.3.5) Singular Value Decomposition (SVD): Every matrix can be written as a pairing of orthonormal vectors in its domain and codomain, with nonnegative weights; in other words, every matrix maps the unit n-ball to a q-dim ellipsoid, q = rank(M), where an orthonormal basis is mapped to a set of principal semi-axes and, if n > q, the zero vector; the vectors can be real if the matrix is real:

1. $\forall M \in M_{m,n}(\mathbb{C})$, $\exists V \in U(m)$, $W \in U(n)$, $\Sigma = \text{diag}\{(\sigma_i)_{i=1}^q\}_{m,n}$: $M = V \Sigma W^H$.
2. $\forall M \in M_{m,n}(\mathbb{R})$, $\exists V \in O(m)$, $W \in O(n)$, and the rest follows.

Singular values $(\sigma_i)_{i=1}^q$ are the nonnegative values on the diagonal of $\Sigma$; as an unordered tuple, they are uniquely determined. The singular values of $M$ are the eigenvalues of the polar factor $P = (M^H M)^{1/2}$. Left singular vectors $(v_i)_{i=1}^m$ are the columns of $V$ and right singular vectors $(w_i)_{i=1}^n$ are the columns of $W$; they are uniquely determined up to simultaneous unitary transforms within each repeated singular value: $(V_j Q, W_j Q)$, $\forall Q \in U(m_j)$, where $V_j$ and $W_j$ collect the singular vector pairs of the $j$-th distinct singular value, of multiplicity $m_j$.
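A quick NumPy check of the theorem and of the relation between singular values and the polar factor, for a small real matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 3))

V, s, Wh = np.linalg.svd(M)               # full SVD: V is 4x4, Wh = W^T is 3x3
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)

assert np.allclose(V @ Sigma @ Wh, M)     # M = V Sigma W^T

# singular values of M = eigenvalues of P = (M^T M)^{1/2}
assert np.allclose(np.sort(s), np.sort(np.sqrt(np.linalg.eigvalsh(M.T @ M))))
```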

SVD of square matrices:

• SVD and EVD are equivalent for positive semi-definite matrices: $\forall A \in S_{\ge 0}(n)$, (1) let $A = U \Lambda U^H$ be a unitary EVD, then $A = V \Sigma W^H$ is an SVD with $V = W = U$ and $\Sigma = \Lambda$; (2) let $A = V \Sigma W^H$ be an SVD, then $V = W$, and $A = U \Lambda U^H$ is a unitary EVD, with $U = V$ and $\Lambda = \Sigma$;
• SVD and EVD differ by a complex diagonal rotation for normal matrices: let $M = U \Lambda U^H$ be a unitary EVD, then $M = U |\Lambda| (D U^H)$ is an SVD, where $D = \text{diag}(e^{i \theta_k})_{k=1}^n$ and $\theta_k = \arg \lambda_k$ (define $\arg 0 = 0$); its singular values are its eigenvalues rotated to the nonnegative real line, and the singular vector pairs match by the same rotations: $\Sigma = \Lambda D^H$, $W = U D^H$.
• For any square matrix, the product of singular values equals the absolute determinant: $|\det A| = \prod_{i=1}^n \sigma_i = \prod_{i=1}^n |\lambda_i|$; but the singular values are not the absolute eigenvalues in general, $\sigma \ne |\lambda|$.
• For any real square matrix, its "trace norm" is no less than its trace: $\|A\|_* = \sum_{i=1}^n \sigma_i \ge \sum_{i=1}^n \lambda_i = \text{tr}(A)$. To see this, $\text{tr}(A) = \text{tr}(U \Sigma V^T) = \text{tr}(\Sigma Q) = \sum_{i=1}^n \sigma_i q_{ii} \le \sum_{i=1}^n \sigma_i$, where $Q = V^T U$ is orthogonal and thus $|q_{ii}| \le 1$.
• The "trace norm" equals the trace for positive semi-definite matrices.
• Singular value inequality of square product: $\sigma_i(A B^H) \le \min_{j \le i} \sigma_j(A) \sigma_{i-j+1}(B)$, $\forall A, B \in M_{m,n}$. [Sec 7.3, Prob. 18]
• Weyl inequalities of singular values [Sec 7.3, Prob. 15]: $\sigma_i(A + B) \le \min_{j \le i} \{\sigma_j(A) + \sigma_{i-j+1}(B)\}$.
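The determinant and trace-norm bullets above can be verified numerically; a small sketch on a random real matrix (not a proof, just a sanity check):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
s = np.linalg.svd(A, compute_uv=False)
lam = np.linalg.eigvals(A)

# |det A| = prod sigma_i = prod |lambda_i|
assert np.isclose(abs(np.linalg.det(A)), s.prod())
assert np.isclose(s.prod(), np.abs(lam).prod())

# trace norm >= trace (real A)
assert s.sum() >= np.trace(A)

# but sigma != |lambda| in general (A is not normal)
assert not np.allclose(np.sort(s), np.sort(np.abs(lam)))
```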

Properties of SVD:

• Theorem (7.3.7): The nonzero singular values of $M$ are $(\sigma_i)_{i=1}^r$ if and only if the nonzero eigenvalues of $\tilde{M} = [0, M; M^H, 0]$ are $(\sigma_i, -\sigma_i)_{i=1}^r$.
• Singular values are invariant under transposition, conjugation, and left and right unitary transforms: $\sigma(M) = \sigma(\overline{M}) = \sigma(M^T) = \sigma(U M V)$, $\forall U \in U(m), V \in U(n)$.
• Scaling changes singular values by a factor of its modulus: $\forall c \in \mathbb{C}$, $\sigma(c M) = |c| \sigma(M)$.
• Theorem (7.3.9; Interlacing property for singular values)
• Theorem (7.3.10; Analog of Courant-Fischer Theorem)
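Theorem 7.3.7 can be checked directly: the Hermitian block matrix $\tilde{M}$ (the Jordan-Wielandt matrix) has eigenvalues $\pm\sigma_i$ plus $|m - n|$ extra zeros. A small sketch for a full-rank real $4 \times 3$ matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 3))           # rank 3 almost surely
s = np.linalg.svd(M, compute_uv=False)

# tilde M = [0, M; M^T, 0] is symmetric, 7x7
Mt = np.block([[np.zeros((4, 4)), M],
               [M.T, np.zeros((3, 3))]])
ev = np.linalg.eigvalsh(Mt)               # ascending

# eigenvalues are +/- sigma_i, plus one zero since m - n = 1
expected = np.sort(np.concatenate([s, -s, [0.0]]))
assert np.allclose(ev, expected)
```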

### Reduced SVDs

Full SVD refers to the form given in the theorem: orthogonal decompositions for the domain and codomain, paired up to the rank of the matrix. However, not all information is needed in applications, and the full SVD can be reduced to save computation and storage.

Without loss of generality, consider an m-by-n matrix M of rank r, with $m \ge n \ge r$. Thin SVD refers to a computational procedure that provides an orthonormal n-frame for the codomain and an orthogonal decomposition for the domain, paired up to the rank of the matrix. $M = V \Sigma Q^T$, where $V \in V_{n, m}$, $Q \in O(n)$, and $\Sigma = \text{diag}(\sigma)$, $\sigma \in \mathbb{R}^n_{+\downarrow}$. This saves the cost of completing the larger orthogonal matrix.

Compact SVD refers to a computational procedure that provides paired orthonormal vectors for the domain and codomain, with r positive weights: $M = V \Sigma W^T$, where $V \in V_{r, m}$, $W \in V_{r, n}$, and $\Sigma = \text{diag}(\sigma)$, $\sigma \in \mathbb{R}^r_{+\downarrow}$. This saves the cost of orthogonal decompositions of the kernel and the cokernel, which have no effect on the mapping since they are associated with zero singular values.

Truncated SVD or partial SVD refers to a computational procedure that provides paired orthonormal vectors for the domain and codomain, with the k-largest weights, k < r: $M \approx M_k = V_k \Sigma_k W_k^T$, where $V_k = V (I_k; 0) \in V_{k, m}$, $W_k = W (I_k; 0) \in V_{k, n}$, and $\Sigma_k = \text{diag}(\sigma_i)_{i=1}^k$. This saves the cost of computing the smaller singular values and the paired left and right singular vectors. The outcome does not exactly represent the original matrix, but is the optimal rank-k approximation in Frobenius norm.
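In NumPy, the thin SVD corresponds to `full_matrices=False`, and truncation is just slicing; the Frobenius error of the rank-k truncation equals the norm of the discarded tail of singular values. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 4))

# thin SVD: V is a 6x4 orthonormal frame, i.e. V in V_{4,6}
V, s, Wh = np.linalg.svd(M, full_matrices=False)
assert np.allclose(V @ np.diag(s) @ Wh, M)

# truncated SVD: keep the k largest singular values
k = 2
Mk = V[:, :k] @ np.diag(s[:k]) @ Wh[:k, :]

# optimal rank-k approximation: Frobenius error is the tail norm
assert np.isclose(np.linalg.norm(M - Mk, 'fro'), np.sqrt((s[k:] ** 2).sum()))
```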

### Algorithms and Computational Complexity

Full and thin SVD: (coefficients depend on output, see [@Golub2013, Fig 8.6.1] for details.)

• full SVD $\mathcal{O}(m^2 n + n^3)$;
• thin SVD $\mathcal{O}(m n^2 + n^3)$: R-SVD $6 m n^2 + 20 n^3$;

Truncated SVD [@Halko2011, Sec 1.4, 3.3, 6]:

1. general dense matrix that fits in fast memory:
• classical algorithms: rank-revealing QR factorization + manipulate the factors, $\mathcal{O}(m n k)$;
• randomized algorithms: $\mathcal{O}(m n \log(k) + (m+n) k^2)$; (stage A, structured random matrix, $\mathcal{O}(m n \log(k) + n k^2)$; stage B, row-extraction, $\mathcal{O}((m + n) k^2)$)
2. matrix with fast matrix-vector products, e.g. sparse or structured matrices;
• Krylov subspace method, e.g. the Lanczos or Arnoldi algorithm: numerically unstable, $\mathcal{O}(k T_{\text{mult}} + (m + n) k^2)$.
• randomized algorithms: numerically stable, same complexity as above;
3. general dense matrix stored in slow memory or streamed: cost dominated by transferring the matrix from slow memory;
• standard techniques for low-rank approximation: $\mathcal{O}(k)$ passes;
• randomized algorithms: one pass, or very few (2-4) additional passes for improved accuracy;
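The two-stage randomized scheme above can be sketched in a few lines of NumPy. This minimal version uses a plain Gaussian test matrix (not the structured/subsampled-FFT variant that achieves the $\mathcal{O}(m n \log k)$ stage-A cost) and oversampling parameter p:

```python
import numpy as np

def randomized_svd(M, k, p=10, rng=None):
    """Basic two-stage randomized truncated SVD (Gaussian sketch, oversampling p)."""
    rng = rng if rng is not None else np.random.default_rng()
    m, n = M.shape
    # Stage A: sample the range of M with a random Gaussian test matrix
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(M @ Omega)            # orthonormal basis for approx. range
    # Stage B: SVD of the small matrix B = Q^T M, then lift back
    B = Q.T @ M
    Vb, s, Wh = np.linalg.svd(B, full_matrices=False)
    return (Q @ Vb)[:, :k], s[:k], Wh[:k, :]

# exact (up to roundoff) when the true rank is at most k
rng = np.random.default_rng(5)
M = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 60))  # rank 8
V, s, Wh = randomized_svd(M, k=8, rng=rng)
assert np.allclose(V @ np.diag(s) @ Wh, M)
```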

Truncated eigenvalue decomposition (EVD) of a Hermitian matrix [@Halko2011, Sec 5.3-5.5]: (general dense matrix that fits in fast memory)

• stage A: randomized algorithm to find a matrix $Q$ with orthonormal columns whose range approximates the range of $A$.
• accelerated technique: $\mathcal{O}(n^2 \log(k) + n k^2)$;
• stage B: EVD of the small matrix $Q^H A Q$
• direct method: $2 n^2 k$, for the matrix multiplication $A Q$;
• Nyström method (positive semi-definite): same cost as the direct method, much more accurate;
• row extraction: less accurate, $\mathcal{O}(n k^2)$;

Classical truncated EVD algorithms: (Krylov subspace methods)

• Arnoldi algorithm.
• Lanczos algorithm for Hermitian matrices: $\mathcal{O}(n^2 k)$, residuals linearly convergent, inherently unstable.
• Block variants:
• Implicit restart variants:
• implicitly restarted Arnoldi method (IRAM) [@Sorensen1992]: linearly convergent;
• implicitly restarted Lanczos method (IRL) [@Calvetti1994];
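In practice these Krylov methods are available through ARPACK, which SciPy wraps; `scipy.sparse.linalg.eigsh` uses the implicitly restarted Lanczos iteration for Hermitian input. A short usage sketch on a dense symmetric matrix:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(6)
A = rng.standard_normal((200, 200))
A = (A + A.T) / 2                          # real symmetric

# k eigenpairs of largest magnitude, via implicitly restarted Lanczos (ARPACK)
vals, vecs = eigsh(A, k=5, which='LM')

# compare against the dense solver
dense = np.linalg.eigvalsh(A)
top5 = dense[np.argsort(np.abs(dense))[-5:]]
assert np.allclose(np.sort(np.abs(vals)), np.sort(np.abs(top5)))
```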

Classical SVD algorithms for general dense matrix, two steps:

1. reduce the matrix to bidiagonal form: $M = P B Q^T$, where P and Q are products of Householder matrices, B is upper bidiagonal;
• dense matrix: Golub-Kahan, $4 m n^2 - 4 / 3 n^3$ flops, $\mathcal{O}(m n^2)$;
• large matrix: Golub-Kahan-Lanczos, with Lanczos iterations;
2. SVD of a bidiagonal matrix, to high relative accuracy:
• singular values only: square-root-free method; bisection method, $\mathcal{O}(k n)$;
• singular values and singular vectors: multiple relatively robust representations, $\mathcal{O}(k n)$;

Jacobi SVD algorithm for general dense matrix, for improved accuracy:

• One-sided Jacobi SVD algorithm [@Hestenes1958];
• Preconditioned Jacobi SVD algorithm;
• Rank-revealing decomposition: structured matrices;

## Applications

Effective rank or numerical rank of a matrix is the number of singular values whose magnitudes are greater than measurement error.

(7.4.1; Nearest rank-$k_1$ matrix) Given $M \in M_{m,n}$ with $\mathrm{rank}(M) = k$, then $\forall M_1 \in M_{m,n}$ with $\mathrm{rank}(M_1) = k_1 \le k$, $\min \| M - M_1 \|_2 = \| M - V \Sigma_1 W^H \|_2 = \sigma_{k_1 + 1}$, where $\Sigma_1 = \mathrm{diag}\{\sigma_1, \dots, \sigma_{k_1}, 0, \dots, 0\}_{m,n}$.
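A numerical check of the nearest rank-$k_1$ statement in the spectral norm: truncating the SVD gives error exactly $\sigma_{k_1+1}$.

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((5, 4))
V, s, Wh = np.linalg.svd(M, full_matrices=False)

k1 = 2
M1 = V[:, :k1] @ np.diag(s[:k1]) @ Wh[:k1, :]

# spectral-norm error of the best rank-k1 approximation is sigma_{k1+1}
assert np.isclose(np.linalg.norm(M - M1, 2), s[k1])   # s is 0-indexed
```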

Theorem (simultaneous singular value decomposition) [@Horn1990, p. 426, Q26]: Two matrices have the same pairs of singular vectors, $A = V \Sigma W^H$ and $B = V \Lambda W^H$ ($\sigma$ and $\lambda$ not necessarily ordered), if and only if $A B^H$ and $B^H A$ are both normal.
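As a check of one direction of this criterion, construct two real matrices sharing singular vector pairs and verify that $A B^H$ and $B^H A$ come out normal (a small illustrative sketch; the diagonal weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # shared left vectors
W, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # shared right vectors
A = V @ np.diag([3.0, 1.0, 2.0, 0.5]) @ W.T
B = V @ np.diag([1.0, 4.0, 0.5, 2.0]) @ W.T

def is_normal(X):
    return np.allclose(X @ X.conj().T, X.conj().T @ X)

# A B^H = V diag(sigma*lambda) V^T and B^H A = W diag(lambda*sigma) W^T
assert is_normal(A @ B.conj().T)
assert is_normal(B.conj().T @ A)
```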

Theorem (7.4.10): If $A \in M_{m,n}$, $B \in M_{n,m}$, and $A B$ and $B A$ are positive semi-definite, then $\mathrm{tr}(A B) = \mathrm{tr}(B A) = \sum_{i=1}^q \sigma_i(A) \sigma_{\tau(i)}(B)$ for some permutation $\tau$, where $q = \min\{m, n\}$.