Dimensionality reduction, like lossy data compression, are techniques that provide the best approximation of data sets in some criteria, or good approximations with better computational performance.
In this article, matrix $A \in M_{m,n}(\mathbb{R}), m > n$, $\mathrm{rank}(A) = r$, SVD gives $A = V \Sigma W^T = \sum_{i=1}^r \sigma_i v_i w_i^T$, with "effective/numerical rank" $k \ll r$. Let $V_r = (v_1, \dots, v_r)$, $W_r = (w_1, \dots, w_r)$, and $\Sigma_r = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$; define $V_k, W_k, \Sigma_k$ similarly, and let $A_k = V_k \Sigma_k W_k^T$.
Singular value decomposition (SVD) gives the best theoretical low-rank approximate of a data matrix in terms of Frobenius norm (RSS), with cubic time $O(m n^2)$ algorithms.
Interpret $A$ as $m$ objects in an $n$-dimensional vector space, the right singular vectors $W$ gives a rotated basis of the observation space $\mathbb{R}^n$, ordered in variance explained $\{\sigma_1^2, \dots, \sigma_r^2, 0, \dots\}$, and the coordinates are (the rows of) $V \Sigma$.
Sample space (measure space) $\mathbb{R}^m$ can be seen as the dual to the observation space $\mathbb{R}^n$: every vector corresponds to a unique measure of the objects/sample, measures of different names but identical observations on the sample are equivalent; measures that differ by a scaling factor are equivalent up to the unit and positivity; linearly independent measures can generate new measures by linear combination, the maximum number of independent measures the measure space can have equals the sample size $m$. Note that $e_i$ is the dummy measure of object $i$, and $\mathbf{1}$ is the constant measure which is uninteresting.
In the rotated basis $W$, the $n$ vectors/measures in the sample space $\mathbb{R}^m$ are orthogonal and have the largest principal vector norms $\{\sigma_1, \dots, \sigma_r, 0, \dots\}$ among all basis rotations in the observation space. Shorter vectors/metrics in the sample space can be truncated without much loss of information. Principal information $\{\sigma_i, v_i, w_i\}_{i=1}^k$, i.e. length and direction of principal sample vector and the corresponding rotated basis vector, can be used to approximately reconstruct the data with less storage.
Interpret $A$ as a potentially complete/Cartesian mapping $A: X \times Y \to \mathbb{R}$ on categories $X$ and $Y$ with $m$ and $n$ objects in a Euclidean space $\mathbb{R}^r$, such that $A_{i,j} = \langle x_i , y_j \rangle$ are inner products of these cross-category objects. SVD gives coordinates $(V_r; W_r) \Sigma_r^{1/2}$ of these $m+n$ objects in a basis of "pure concepts", ordered in the extent of mapping reconstructed ("strength of concept") $\{\sigma_1, \dots, \sigma_r\}$. The distances of within-category objects computed from these coordinates represent dissimilarity.
Principal components analysis (PCA) of a data matrix provides the directions/subspaces of maximal data variance, called principal components. Scree plot shows in decreasing order the data variances in principal components, i.e. eigenvalues of the data covariance matrix.
PCA can be done by:
Notice that singular values $\sigma_i$ of the data matrix are the principal square roots of the eigenvalues $\sigma_i^2$ of the data covariance matrix.
PCA vs LM LS (least square estimate of linear model): Both optimize RSS, but PCA uses distance in full space, LM LS uses distance in the label direction. Simple example: With Galton's symmetric joint distribution of inter-generation heights, PCA gives the diagonal and counter diagonal; while LM LS always gives a line flatter than the diagonal, crossing at the center of mass.
Table: Distributed PCA algorithms and complexities.
big $m$, small $n$ | big $m$, big $n$ | |
---|---|---|
algorithm | similar to linear regression | iterative (Krylov subspace methods, random projection) |
local storage | $O(n^2)$, outer product | $O(nk+m)$, iterated top eigenvectors + intermediate vector |
local computation | $O(n^3)$, eigen-decomposition | $O(nk+m)$, vector multiplication + concatenation |
communication | $O(nk)$, top eigenvectors | $O(nk+m)$, iterated top eigenvectors + intermediate vector |
Distributed machine learning first distributes $m$ observations, preserving closed-form algorithm (if exists) for performance; when feature set $n$ also scales, algorithm complexity needs to be restricted to be linear in both $m$ and $n$, and typically iterative algorithms are used such as stochastic gradient descent (SGD).
Let $I = \{i_1, \dots, i_p\}$ and $J = \{j_1, \dots, j_q\}$ be two sets of indices for rows and columns of $A$, and $B_I$ and $B_J$ be the matrices of basis vectors corresponding to the index sets. We approximate $A$ with $C U R$, where $C = A B_J$ and $R = B_I^T A$ consist of columns and rows of $A$ respectively, and $U = (B_I^T A B_J)^\dagger$ is the pesudoinverse of the intersection of $C$ and $R$. Now we have $A \approx C U R = A B_J (B_I^T A B_J)^\dagger B_I^T A$.
CUR matrix approximation (CUR; Column, U, Row) provides low-rank approximates of a data matrix comparable with SVD in probability, with quadratic time $O(m n)$ algorithms such as ALGORITHMCUR [@Mahoney2009]. If the data matrix $A$ is sparse, CUR offers better sparsity and interpretability than SVD, because CUR reuses rows and columns from $A$.
Theorem: $\forall \varepsilon, \delta > 0$, if $p = O(k^2 [\log(1/\delta) / \varepsilon^2]^3)$ and $q = O(k \log(1/\delta) / \varepsilon^2)$, then $P\{ \| A - CUR \|_F \le \| A - A_k \|_F + \varepsilon \|A\|_F \} \ge 1-\delta$.