**Dimensionality reduction**, like lossy data compression,
refers to techniques that provide either the best approximation of a data set under some criterion,
or a good approximation with better computational performance.

In this article, let $A \in M_{m,n}(\mathbb{R})$ with $m > n$ and $\mathrm{rank}(A) = r$. Its SVD is $A = V \Sigma W^T = \sum_{i=1}^r \sigma_i v_i w_i^T$, with "effective/numerical rank" $k \ll r$. Let $V_r = (v_1, \dots, v_r)$, $W_r = (w_1, \dots, w_r)$, and $\Sigma_r = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$; define $V_k, W_k, \Sigma_k$ similarly, and let $A_k = V_k \Sigma_k W_k^T$.

Singular value decomposition (SVD) gives the theoretically best low-rank approximation of a data matrix in terms of the Frobenius norm (RSS), with cubic-time $\mathcal{O}(m n^2)$ algorithms.
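
A minimal numpy sketch of this property (illustrative random, approximately low-rank data; all names are assumptions of the example): the Frobenius error of the rank-$k$ truncation $A_k$ equals the $\ell_2$ norm of the discarded singular values.

```python
import numpy as np

# Sketch: the rank-k truncation of the SVD is the best Frobenius-norm
# approximation, and its error equals the l2 norm of the discarded
# singular values (illustrative random data and sizes).
rng = np.random.default_rng(0)
m, n, k = 200, 50, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))          # approximately rank-k data

V, s, Wt = np.linalg.svd(A, full_matrices=False)  # A = V diag(s) W^T
A_k = (V[:, :k] * s[:k]) @ Wt[:k, :]              # rank-k truncation A_k

print(np.linalg.norm(A - A_k, "fro"))             # equals sqrt(sum_{i>k} sigma_i^2)
print(np.sqrt(np.sum(s[k:] ** 2)))
```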

Interpreting $A$ as $m$ objects in an $n$-dimensional vector space,
the right singular vectors $W$ give a rotated basis of the **observation space** $\mathbb{R}^n$,
ordered by variance explained $\{\sigma_1^2, \dots, \sigma_r^2, 0, \dots\}$,
and the coordinates of the objects in this basis are the rows of $V \Sigma$.
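
A minimal check of this reading (numpy's thin SVD on illustrative random data): projecting the rows of $A$ onto the basis $W$ recovers $V \Sigma$.

```python
import numpy as np

# Sketch: the rows of V Sigma are the coordinates of the m objects
# in the rotated basis W of the observation space (illustrative data).
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 8))
V, s, Wt = np.linalg.svd(A, full_matrices=False)   # A = V diag(s) W^T
W = Wt.T

coords = A @ W                                     # project rows of A onto basis W
assert np.allclose(coords, V * s)                  # equals V Sigma
```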

**Sample space** (**measure space**) $\mathbb{R}^m$
can be seen as the dual of the observation space $\mathbb{R}^n$:
every vector corresponds to a unique measure of the objects/sample;
measures with different names but identical observations on the sample are equivalent;
measures that differ by a scaling factor are equivalent up to unit and positivity;
linearly independent measures generate new measures by linear combination;
and the maximum number of independent measures the measure space can hold equals the sample size $m$.
Note that $e_i$ is the dummy measure of object $i$,
and $\mathbf{1}$ is the constant measure, which is uninteresting.

In the rotated basis $W$, the $n$ vectors/measures in the sample space $\mathbb{R}^m$ are orthogonal, with norms $\{\sigma_1, \dots, \sigma_r, 0, \dots\}$ that are the largest principal vector norms among all rotations of the observation-space basis. Shorter vectors/measures in the sample space can be truncated without much loss of information. The principal information $\{\sigma_i, v_i, w_i\}_{i=1}^k$, i.e. the lengths and directions of the principal sample vectors and the corresponding rotated basis vectors, can be used to approximately reconstruct the data with less storage.
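
A hedged sketch of the storage argument (illustrative sizes; the data here is exactly rank $k$, so the reconstruction is exact): storing $\{\sigma_i, v_i, w_i\}_{i=1}^k$ takes $k(m+n+1)$ numbers instead of $mn$.

```python
import numpy as np

# Sketch: the principal information {sigma_i, v_i, w_i}, i <= k, takes
# k*(m + n + 1) stored numbers instead of m*n, yet reconstructs A well
# (exactly here, since this illustrative A has rank <= k).
rng = np.random.default_rng(2)
m, n, k = 500, 100, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

V, s, Wt = np.linalg.svd(A, full_matrices=False)
A_k = (V[:, :k] * s[:k]) @ Wt[:k, :]

print(m * n, k * (m + n + 1))                       # 50000 vs 6010 stored numbers
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))  # ~ 0
```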

Interpret $A$ as a potentially complete/Cartesian mapping $A: X \times Y \to \mathbb{R}$ on categories $X$ and $Y$ with $m$ and $n$ objects in a Euclidean space $\mathbb{R}^r$, such that $A_{i,j} = \langle x_i , y_j \rangle$ are inner products of these cross-category objects. SVD gives coordinates $(V_r; W_r) \Sigma_r^{1/2}$ of these $m+n$ objects in a basis of "pure concepts", ordered by the extent of the mapping they reconstruct ("strength of concept") $\{\sigma_1, \dots, \sigma_r\}$. Distances between within-category objects computed from these coordinates represent dissimilarity.
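
A minimal sketch of this construction (illustrative random rank-$r$ mapping): the coordinates $(V_r; W_r) \Sigma_r^{1/2}$ reproduce the cross-category inner products.

```python
import numpy as np

# Sketch: coordinates (V_r; W_r) Sigma_r^{1/2} embed the m + n objects of
# both categories in R^r so that cross-category inner products reproduce A
# (illustrative random rank-r mapping).
rng = np.random.default_rng(3)
m, n, r = 30, 20, 4
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

V, s, Wt = np.linalg.svd(A, full_matrices=False)
V_r, s_r, W_r = V[:, :r], s[:r], Wt[:r].T

X = V_r * np.sqrt(s_r)          # coordinates of the m objects in category X
Y = W_r * np.sqrt(s_r)          # coordinates of the n objects in category Y
assert np.allclose(X @ Y.T, A)  # <x_i, y_j> = A_ij
```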

**Principal components** of a data matrix are the directions/subspaces of maximal data variance,
mutually orthogonal and ordered by decreasing variance.
**Principal components analysis** (PCA) refers to the procedure for
finding the principal components of a data matrix.
PCA can be done by:
**R-mode PCA**, eigendecomposition of the data covariance matrix $A^T A = W \Sigma^2 W^T$; or
**Q-mode PCA**, singular value decomposition of the data matrix $A = V \Sigma W^T$.
Notice that the singular values $\sigma_i$ of the data matrix are
the principal square roots of the eigenvalues $\sigma_i^2$ of the data covariance matrix.
A **scree plot** shows, in decreasing order, the data variances along the principal components,
i.e. the eigenvalues of the data covariance matrix.
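
A minimal numpy comparison of the two routes (column-centered random data, as assumed by the covariance reading): the eigenvalues of $A^T A$ are the squared singular values of $A$, and the eigenvectors match $W$ up to sign.

```python
import numpy as np

# Sketch: R-mode PCA (eigendecomposition of A^T A) and Q-mode PCA (SVD of A)
# agree; the eigenvalues are the squared singular values, i.e. the values
# shown in a scree plot (illustrative column-centered random data).
rng = np.random.default_rng(4)
A = rng.standard_normal((300, 6))
A -= A.mean(axis=0)                              # center the columns

eigval, eigvec = np.linalg.eigh(A.T @ A)         # R-mode
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # sort in decreasing order

V, s, Wt = np.linalg.svd(A, full_matrices=False) # Q-mode

assert np.allclose(eigval, s ** 2)               # sigma_i^2 = eigenvalues
assert np.allclose(np.abs(eigvec), np.abs(Wt.T)) # same directions up to sign
print(s ** 2)                                    # scree plot values
```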

PCA vs. LM LS (least-squares estimation of a linear model): both minimize an RSS, but PCA measures distance in the full space (perpendicular to the fitted subspace), while LM LS measures distance along the label/response direction only. Simple example: with Galton's symmetric joint distribution of inter-generation heights, PCA gives the diagonal and the counter-diagonal, while LM LS always gives a line flatter than the diagonal (regression to the mean), crossing it at the center of mass.
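
A quick numerical illustration (simulated symmetric bivariate Gaussian with correlation $\rho$; not Galton's data): the first principal direction has slope close to 1, while the least-squares slope is close to $\rho$.

```python
import numpy as np

# Sketch: for a symmetric bivariate Gaussian cloud with correlation rho,
# the first principal direction has slope ~ 1 (the diagonal); the
# least-squares regression line has slope ~ rho < 1, i.e. flatter.
rng = np.random.default_rng(5)
rho, N = 0.5, 100_000
data = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
x, y = data[:, 0], data[:, 1]

w = np.linalg.eigh(np.cov(data.T))[1][:, -1]     # top principal direction
cxy = np.cov(x, y)
print("PCA slope:", w[1] / w[0])                 # ~ 1.0
print("LS  slope:", cxy[0, 1] / cxy[0, 0])       # ~ rho
```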

Figure: Dimensionality reduction vs. regression: linear and nonlinear methods. Adapted from [@Hastie1989, Figure 1].

Table: Distributed PCA algorithms and complexities.

| | big $m$, small $n$ | big $m$, big $n$ |
|---|---|---|
| algorithm | similar to linear regression | iterative (Krylov subspace methods, random projection) |
| local storage | $\mathcal{O}(n^2)$, outer product | $\mathcal{O}(nk+m)$, iterated top eigenvectors + intermediate vector |
| local computation | $\mathcal{O}(n^3)$, eigendecomposition | $\mathcal{O}(nk+m)$, vector multiplication + concatenation |
| communication | $\mathcal{O}(nk)$, top eigenvectors | $\mathcal{O}(nk+m)$, iterated top eigenvectors + intermediate vector |

Distributed machine learning first distributes the $m$ observations, preserving a closed-form algorithm (if one exists) for performance; when the number of features $n$ also scales, the algorithm's complexity must be restricted to be linear in both $m$ and $n$, so iterative algorithms such as stochastic gradient descent (SGD) are typically used.
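
A minimal sketch of the iterative flavor (plain power iteration for the top right singular vector, using only matrix-vector products; it illustrates the per-step cost pattern, not any particular distributed implementation):

```python
import numpy as np

# Sketch: power iteration finds the top right singular vector of A using only
# matrix-vector products, i.e. O(mn) work and O(m + n) extra memory per round.
# Illustration of the iterative regime only; not a distributed implementation.
def top_right_singular_vector(A, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(A.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        z = A @ w                    # intermediate vector in R^m
        w = A.T @ z                  # back to R^n
        w /= np.linalg.norm(w)
    return w

rng = np.random.default_rng(6)
A = rng.standard_normal((1000, 50))
A[:, 0] *= 5.0                       # plant a dominant direction for a clear gap
w1 = top_right_singular_vector(A)
_, _, Wt = np.linalg.svd(A, full_matrices=False)
print(abs(w1 @ Wt[0]))               # ~ 1.0
```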

Let $I = \{i_1, \dots, i_p\}$ and $J = \{j_1, \dots, j_q\}$ be two sets of indices for rows and columns of $A$, and let $B_I = (e_{i_1}, \dots, e_{i_p})$ and $B_J = (e_{j_1}, \dots, e_{j_q})$ be the matrices of standard basis vectors corresponding to the index sets. We approximate $A$ by $C U R$, where $C = A B_J$ and $R = B_I^T A$ consist of columns and rows of $A$ respectively, and $U = (B_I^T A B_J)^\dagger$ is the pseudoinverse of their intersection. Thus $A \approx C U R = A B_J (B_I^T A B_J)^\dagger B_I^T A$.
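
A minimal sketch of this construction (the index sets below are arbitrary, chosen only to exercise the formula; the data is exactly low-rank, so the identity is exact):

```python
import numpy as np

# Sketch of the CUR construction above: B_I, B_J hold standard basis vectors
# for the chosen rows/columns (arbitrary index sets, for illustration).
rng = np.random.default_rng(7)
m, n, k = 60, 40, 3
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # exactly rank k

I = [0, 7, 21, 33, 45]                      # chosen row indices
J = [2, 5, 11, 19, 30]                      # chosen column indices
B_I = np.eye(m)[:, I]                       # B_I = (e_{i_1}, ..., e_{i_p})
B_J = np.eye(n)[:, J]                       # B_J = (e_{j_1}, ..., e_{j_q})

C = A @ B_J                                 # selected columns of A
R = B_I.T @ A                               # selected rows of A
U = np.linalg.pinv(B_I.T @ A @ B_J)         # pseudoinverse of the intersection

# Exact here because rank(A[I, J]) = rank(A) = k
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```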

CUR matrix approximation (CUR; Column, U, Row) provides low-rank approximations of a data matrix comparable to those of SVD with high probability, with quadratic-time $\mathcal{O}(m n)$ algorithms such as ALGORITHMCUR [@Mahoney2009]. If the data matrix $A$ is sparse, CUR offers better sparsity and interpretability than SVD, because CUR reuses rows and columns of $A$.

Theorem: $\forall \varepsilon, \delta > 0$, if $p = \mathcal{O}(k^2 [\log(1/\delta) / \varepsilon^2]^3)$ and $q = \mathcal{O}(k \log(1/\delta) / \varepsilon^2)$, then $P\{ \| A - CUR \|_F \le \| A - A_k \|_F + \varepsilon \|A\|_F \} \ge 1-\delta$.
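
As a rough illustration of the randomized flavor of such algorithms, the toy sketch below samples columns and rows with probability proportional to their squared Euclidean norms and forms $C U R$ as above; it omits the rescaling and leverage-score machinery of ALGORITHMCUR, so it is a simplification, not the cited method.

```python
import numpy as np

# Toy randomized CUR: sample rows and columns with probability proportional
# to their squared norms, then U = pinv of the intersection.  This omits the
# rescaling/leverage-score details of ALGORITHMCUR; illustration only.
def toy_cur(A, p, q, seed=0):
    rng = np.random.default_rng(seed)
    row_prob = (A ** 2).sum(axis=1) / (A ** 2).sum()
    col_prob = (A ** 2).sum(axis=0) / (A ** 2).sum()
    I = rng.choice(A.shape[0], size=p, replace=False, p=row_prob)
    J = rng.choice(A.shape[1], size=q, replace=False, p=col_prob)
    C, R = A[:, J], A[I, :]
    U = np.linalg.pinv(A[np.ix_(I, J)])
    return C, U, R

rng = np.random.default_rng(8)
A = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 100))   # rank 5
C, U, R = toy_cur(A, p=40, q=20, seed=1)
print(np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro"))  # ~ 0
```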

See Manifold Learning.