Singular Value Decomposition

Polar Decomposition

(Lemma 7.3.1) For any matrix $M \in M_{m,n}$, $m \ge n$, let $M^* M = Y^* \text{diag}(\lambda) Y$ be an eigendecomposition. Here the non-negative eigenvalues λ are unique, and the unitary eigenvectors Y are unique up to rotation within eigenspaces. Given λ and Y, there is an $X \in M_{m,n}$ with orthonormal columns, such that $M = X \text{diag}(\sqrt{\lambda}) Y$. If M has full rank, then X is unique. If M is real, then Y and X can be real.

Polar decomposition (Theorem 7.3.2): For any matrix $M \in M_{m,n}$, $m \ge n$, let positive semidefmite matrix $P = (M^* M)^{1/2}$, there exists a $U \in M_{m,n}$ with orthonormal columns, such that $M = U P$. If M has full rank, then U is unique. If M is real, then P and U can be real.

(Theorem 7.3.4) Let M be a square matrix with polar decomposition M = U P. M is normal iff P U = U P.

Singular Value Decomposition

Theorem (7.3.5; Singular Value Decomposition): Every matrix can be written as a pairing of orthonormal vectors in its domain and codomain, with nonnegative weights; the vectors can be real if the matrix is real:

$\forall M \in M_{m,n}(\mathbb{C})$, $\exists V \in U(m)$, $W \in U(n)$, $\Sigma = \text{diag}\{(\sigma_i)_{i=1}^q\}_{m,n}$: $M = V \Sigma W^∗$.
$\forall M \in M_{m,n}(\mathbb{R})$, $\exists V \in O(m)$, $W \in O(n)$, and the rest follows.

Left singular vectors $(v_i)_{i=1}^m$ are the columns of $V$. Right singular vectors $(w_i)_{i=1}^n$ are the columns of $W$. Singular values $(\sigma_i)_{i=1}^q$ are the nonnegative values on the diagonal of $\Sigma$. The singular values of $M$ are eigenvalues of polar matrix $P$.

If $M$ is normal, then $M M^∗ = M^∗ M$ implies that $M M^∗$ and $M^∗ M$ have the same eigenvectors. This does not mean $V = W$ in $M = V \Sigma W^∗$, because each corresponding singular vector are different by a factor $e^{i \theta_k}$.

If $M = U \Lambda U^∗$, $\lambda_k = |\lambda_k| e^{i \theta_k}$ (define $\theta_k = 0$ if $\lambda_k = 0$), let $|\Lambda| = \mathrm{diag}\{|\lambda_1|, \dots, |\lambda_n|\}$, $D = \mathrm{diag}\{e^{i \theta_1}, \dots, e^{i \theta_k}\}$, then the SVD of $M$ is $M = U |\Lambda| (U D^∗)^∗$.

Theorem (7.3.7): Given $\tilde{M} = [0, M; M^∗, 0]$, the singular values of $M$ are $\{\sigma_1, \dots, \sigma_q\}$ iff the eigenvalues of $\tilde{M}$ are $\{\sigma_1, \dots, \sigma_q, -\sigma_1, \dots, -\sigma_q, 0, \dots, 0\}$.

$M^∗, M^T, \bar{M}$ have the same singular values with $M$. If $U$ and $V$ are unitary, then $U M V$ and $M$ have the same singular values. $\forall c \in \mathbb{C}, \Sigma(c M) = |c| \Sigma(M)$

Theorem (7.3.9; Interlacing property for singular values)

Theorem (7.3.10; Analog of Courant-Fischer Theorem)

Reduced SVDs

Full SVD refers to the form given in the theorem: orthogonal decompositions for the domain and codomain, paired up to the rank of the matrix. However, not all information is needed in applications, and the full SVD can be reduced to save computation and storage.

Without loss of generality, consider an m-by-n matrix M of rank r, with m > n > r. Thin SVD refers to a computational procedure that provides an orthonormal n-frame for the codomain and an orthogonal decomposition for the domain, paired up to the rank of the matrix. $M = V \Sigma Q^T$, where $V \in V_{n, m}$, $Q \in O(n)$, and $\Sigma = \text{diag}(\sigma)$, $\sigma \in \mathbb{R}^n_{+\downarrow}$. This saves the cost of completing the larger orthogonal matrix.

Compact SVD refers to a computational procedure that provides paired orthonormal vectors for the domain and codomain, with r positive weights: $M = V \Sigma W^T$, where $V \in V_{r, m}$, $W \in V_{r, n}$, and $\Sigma = \text{diag}(\sigma)$, $\sigma \in \mathbb{R}^r_{+\downarrow}$. This saves the cost of orthogonal decompositions of the kernel and the cokernel, which has no effect on the mapping sine they are associated with zero singular values.

Truncated SVD or partial SVD refers to a computational procedure that provides paired orthonormal vectors for the domain and codomain, with the k-largest weights, k < r: $M \approx M_k = V_k \Sigma_k W_k^T$, where $V_k = V (I_k; 0) \in V_{k, m}$, $W_k = W (I_k; 0) \in V_{k, n}$, and $\Sigma_k = \text{diag}(\sigma_i)_{i=1}^k$. This saves the cost of computing the smaller singular values and the paired left and right singular vectors. The outcome does not exactly represent the original matrix, but is the optimal rank-k approximation in Frobenius norm.

Algorithms and Computational Complexity

Full and thin SVD: (coefficients depend on output, see [@Golub2013, Fig 8.6.1] for details.)

full SVD $\mathcal{O}(m^2 n + n^3)$;
thin SVD $\mathcal{O}(m n^2 + n^3)$: R-SVD $6 m n^2 + 20 n^3$;

Truncated SVD [@Halko2011, Sec 1.4, 3.3, 6]:

general dense matrix that fits in fast memory:
- classical algorithms: rank-revealing QR factorization + manipulate the factors, $\mathcal{O}(m n k)$;
- randomized algorithms: $\mathcal{O}(m n \log(k) + (m+n) k^2)$; (stage A, structured random matrix, $\mathcal{O}(m n \log(k) + n k^2)$; stage B, row-extraction, $\mathcal{O}((m + n) k^2)$)
matrix with fast matrix-vector products, e.g. sparse or structured matrices;
- Krylov subspace method, e.g. the Lanczos or Arnoldi algorithm: numerically unstable, $\mathcal{O}(k T_{\text{mult}} + (m + n) k^2)$.
- randomized algorithms: numerically stable, same complexity as above;
general dense matrix stored in slow memory or streamed: cost dominated by transferring the matrix from slow memory;
- standard techniques for low-rank approximation: $\mathcal{O}(k)$ passes;
- randomized algorithms: one pass, or very few (2-4) additional passes for improved accuracy;

Truncated eigenvalue decomposition (EVD) of an Hermitian matrix [@Halko2011, Sec 5.3-5.5]: (general dense matrix that fits in fast memory)

stage A: randomized algorithm to find an orthonormal matrix $Q$ that approximates the range of $A$.
- accelerated technique: $\mathcal{O}(n^2 \log(k) + n k^2)$;
stage B: EVD of $A Q$
- direct method: $2 n^2 k$, for the matrix multiplication $A Q$;
- Nyström method (positive semi-definite): same cost as the direct method, much more accurate;
- row extraction: less accurate, $\mathcal{O}(n k^2)$;

Classical truncated EVD algorithms: (Krylov subspace methods)

Arnoldi algorithm.
Lanczos algorithm for Hermitian matrices: $\mathcal{O}(n^2 k)$, residuals linearly convergent, inherently unstable.
Block variants:
Implicit restart variants:
- implicitly restarted Arnoldi method (IRAM) [@Sorensen1992]: linearly convergent;
- implicitly restarted Lanczos method (IRL) [@Calvetti1994];

Classical SVD algorithms for general dense matrix, two steps:

reduce the matrix to bidiagonal form: $M = P B Q^T$, where P and Q are products of Householder matrices, B is upper diagonal;
- dense matrix: Golub-Kahan, $4 m n^2 - 4 / 3 n^3$ flops, $\mathcal{O}(m n^2)$;
- large matrix: Golub-Kahan-Lanczos, with Lanczos iterations;
SVD of a bidiagonal matrix, to high relative accuracy:
- singular values only: square-root-free method; bisection method, $\mathcal{O}(k n)$;
- singular values and singular vectors: multiple relatively robust representations, $\mathcal{O}(k n)$;

Jacobi SVD algorithm for general dense matrix, for improved accuracy;

One-sided Jacobi SVD algorithm [@Hestenes, 1958];
Preconditioned Jacobi SVD algorithm;
Rank-revealing decomposition: structured matrices;

Applications

Effective rank or numerical rank of a matrix is the number of singular values whose magnitudes are greater than measurement error.

(7.4.1; Nearest rank-$k$ matrix) Given $M \in M_{m,n}, \mathrm{rank}(M) = k$, then $\forall M_1 \in M_{m,n}, \mathrm{rank}(M) = k_1 \le k$, $\min \| M - M_1 \|_2 = \| M - V \Sigma_1 W^∗ \|_2$, where $\Sigma_1 = \mathrm{diag}\{\sigma_1, \dots, \sigma_{k_1}, 0, \dots, 0\}_{m,n}$.

Theorem (simultaneous singular value decomposition) [@Horn1990, p. 426, Q26]: Two matrices have the same pairs of singular vectors, $A = V \Sigma W^∗$ and $B = V \Lambda W^∗$, if and only if $A B^∗$ and $B^∗ A$ are both normal.

Theorem (7.4.10): If $A \in M_{m,n}, B \in M_{n,m}$, $A B$ and $B A$ are positive semi-definite, then $\mathrm{tr}(A B) = \mathrm{tr}(B A) = \sum_{i=1}^q \sigma_i(A) \sigma_{\tau(i)}(B)$.

🏷 Category=Algebra Category=Matrix Theory