Dimensionality reduction techniques, like lossy data compression, provide either the best approximation of a data set under some criterion, or a good approximation with better computational performance.

In this article, the matrix $A \in M_{m,n}(\mathbb{R})$, $m < n$, has $\mathrm{rank}(A) = r$, "effective rank" $k \ll r$, and SVD $A = V \Sigma W^T$. Let $V_r = (v_1, \dots, v_r)$, $W_r = (w_1, \dots, w_r)$, and $\Sigma_r = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$.

PCA and SVD operate on two different interpretations of tabular data, but have computationally identical implementations (Q-mode PCA). Interpreting $A$ as $m$ objects in an $n$-dimensional vector space, PCA gives a rotated basis $W$ of $\mathbb{R}^n$, ordered by variance explained $\{\sigma_1^2, \dots, \sigma_r^2, 0, \dots\}$, with object coordinates given by the rows of $V \Sigma$. Alternatively, interpret $A$ as a potentially complete/Cartesian mapping $A: X \times Y \to \mathbb{R}$ on categories $X$ and $Y$, whose $m$ and $n$ objects live in a Euclidean space $\mathbb{R}^r$ such that $A_{i,j} = \langle x_i , y_j \rangle$ is the inner product of a cross-category pair. SVD gives coordinates $(V_r; W_r) \Sigma_r^{1/2}$ of these $m+n$ objects in a basis of "pure concepts", ordered by the extent of the mapping reconstructed ("strength of concept") $\{\sigma_1, \dots, \sigma_r\}$. Distances between within-category objects computed from these coordinates represent dissimilarity.
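A minimal NumPy sketch of the two coordinate systems on a small random matrix (the sizes and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.standard_normal((m, n))

# Thin SVD: A = V @ diag(s) @ W.T
V, s, Wt = np.linalg.svd(A, full_matrices=False)
W = Wt.T
r = np.linalg.matrix_rank(A)

# Interpretation 1 (PCA): rows of V @ diag(s) are the coordinates of the
# m objects in the rotated basis W; variances decline as s**2.
coords = V @ np.diag(s)
assert np.allclose(coords @ W.T, A)

# Interpretation 2 (embedding): coordinates (V_r; W_r) @ diag(sqrt(s_r))
# place the m + n cross-category objects in R^r so that inner products
# between an X-object and a Y-object reproduce A[i, j].
X = V[:, :r] * np.sqrt(s[:r])   # m objects of category X
Y = W[:, :r] * np.sqrt(s[:r])   # n objects of category Y
assert np.allclose(X @ Y.T, A)
```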

Principal Components Analysis

Principal components analysis (PCA) of a data matrix provides the directions/subspaces of maximal data variance. Principal components are the unit eigenvectors of the data covariance matrix $A^T A$ (for column-centered data; the normalizing constant does not affect the eigenvectors), in decreasing order of eigenvalue. A scree plot shows the decline of data variance across the principal components, i.e. the eigenvalues of the data covariance matrix.

PCA is theoretically always possible and essentially unique, because the covariance matrix of any data matrix is positive semi-definite and can therefore be diagonalized by an orthogonal matrix.

PCA can be computed by either R-mode PCA, eigenvalue decomposition of the data covariance matrix, or Q-mode PCA, singular value decomposition of the data matrix. The singular values of the data matrix are the nonnegative square roots of the eigenvalues of the data covariance matrix.
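A quick numerical check of this equivalence, assuming column-centered data (a sketch, not a prescribed implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))
A -= A.mean(axis=0)                      # column-center the data

# R-mode PCA: eigen-decomposition of the data covariance matrix A^T A.
evals, evecs = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]          # decreasing eigenvalues
evals, evecs = evals[order], evecs[:, order]

# Q-mode PCA: singular value decomposition of the data matrix itself.
V, s, Wt = np.linalg.svd(A, full_matrices=False)

# Singular values are the nonnegative square roots of the eigenvalues,
# and the right singular vectors match the eigenvectors up to sign.
assert np.allclose(s, np.sqrt(evals))
assert np.allclose(np.abs(Wt), np.abs(evecs.T))
```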

PCA vs LM LS (least squares estimate of a linear model): both minimize a residual sum of squares, but PCA measures residuals as orthogonal distances in the full space, while LM LS measures residuals only along the label direction. Simple example: with Galton's symmetric joint distribution of inter-generational heights, PCA gives the diagonal and the counter-diagonal, while LM LS always gives a line flatter than the diagonal, passing through the center of mass.
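A small simulation of the Galton example (the correlation value 0.5 is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
rho = 0.5                                      # parent-child correlation
cov = np.array([[1.0, rho], [rho, 1.0]])       # symmetric joint distribution
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
X -= X.mean(axis=0)

# PCA: the first principal component of a symmetric distribution is the diagonal.
_, _, Wt = np.linalg.svd(X, full_matrices=False)
pc1_slope = Wt[0, 1] / Wt[0, 0]                # approximately 1 (the diagonal)

# LM LS: regressing child on parent gives a slope of about rho < 1,
# i.e. a line flatter than the diagonal ("regression toward the mean").
ols_slope = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])

print(pc1_slope, ols_slope)                    # ~1.0 vs ~0.5
```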

Distributed PCA

Table: Distributed PCA algorithms and complexities.

|                   | big $m$, small $n$            | big $m$, big $n$                                            |
|-------------------|-------------------------------|-------------------------------------------------------------|
| algorithm         | similar to linear regression  | iterative (Krylov subspace methods, random projection)      |
| local storage     | $O(n^2)$, outer product       | $O(nk+m)$, iterated top eigenvectors + intermediate vector  |
| local computation | $O(n^3)$, eigen-decomposition | $O(nk+m)$, vector multiplication + concatenation            |
| communication     | $O(nk)$, top eigenvectors     | $O(nk+m)$, iterated top eigenvectors + intermediate vector  |

Distributed machine learning first distributes the $m$ observations, preserving a closed-form algorithm (if one exists) for performance; when the number of features $n$ also scales, algorithm complexity must be restricted to be linear in both $m$ and $n$, and iterative algorithms such as stochastic gradient descent (SGD) are typically used.
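A single-machine sketch of the big-$m$, small-$n$ pattern from the table, simulating row partitions with NumPy; the partition count and rank $k$ are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 20))          # big m, small n

# Map step: each simulated worker only needs O(n^2) local storage
# for its partial Gram/covariance matrix (a sum of outer products).
partitions = np.array_split(A, 8)
partial_grams = [part.T @ part for part in partitions]

# Reduce step: sum the n-by-n partial results, then run an O(n^3)
# eigen-decomposition on the driver; only O(nk) (the top eigenvectors)
# needs to be communicated back to the workers.
gram = sum(partial_grams)
evals, evecs = np.linalg.eigh(gram)
top_k = evecs[:, np.argsort(evals)[::-1][:5]]  # top-5 principal directions

assert np.allclose(gram, A.T @ A)
```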

Singular Value Decomposition

Singular value decomposition (SVD) gives the theoretically best low-rank approximations of a data matrix, with cubic-time algorithms, $O(m n \min\{m, n\})$.
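A numerical check of this optimality via the Eckart–Young theorem, using NumPy's SVD (the matrix sizes and rank are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 80))
V, s, Wt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = V[:, :k] @ np.diag(s[:k]) @ Wt[:k, :]    # best rank-k approximation

# Eckart-Young: the approximation errors are given by the discarded
# singular values, and no rank-k matrix can do better.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```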

Interpret SVD the PCA way

The sample space $\mathbb{R}^m$ can be seen as dual to the observation space $\mathbb{R}^n$. Given a data matrix $A \in M_{m,n}(\mathbb{R})$, $m > n$, with $\mathrm{rank}(A) = n$ (no perfect collinearity), rotate the basis of the observation space $\mathbb{R}^n$ so that the $n$ column vectors in the sample space $\mathbb{R}^m$ are orthogonal and have the largest principal vector norms. Each vector in the sample space collects the projections of the observations onto the corresponding basis vector, so the first right singular vector $w_1$ yields the projection vector $A w_1 = \sigma_1 v_1$ with the largest sum of squares, and the rest recursively optimize on the residuals. Thus, with $A = V \Sigma W^T = \sum_{i=1}^n \sigma_i v_i w_i^T$, the right singular vectors $W$ form the PCA basis, and the left singular vectors multiplied by the singular values, $V \Sigma$, have row vectors that are the object coordinates in the rotated basis.
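A short check of the two claims above (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 6))            # m > n, full column rank
V, s, Wt = np.linalg.svd(A, full_matrices=False)

# Projections of the observations onto the first right singular vector w_1
# form the sample-space vector sigma_1 * v_1 ...
assert np.allclose(A @ Wt[0], s[0] * V[:, 0])

# ... and no other direction in the observation space yields a larger
# sum of squared projections than sigma_1^2.
u = rng.standard_normal(6)
u /= np.linalg.norm(u)
assert np.sum((A @ u) ** 2) <= s[0] ** 2 + 1e-9
```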

Principal components provide the longest vectors/projections in the sample space among all basis rotations of the observation space. Shorter vectors in the sample space can be truncated without much loss in the observation space, so the (principal) information in the sample space can be stored to approximately reconstruct the data with less storage.
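For a sense of scale, a back-of-the-envelope storage count for a rank-$k$ truncation (the sizes are hypothetical):

```python
m, n, k = 10_000, 1_000, 20

# Full data matrix: m * n numbers.
full_storage = m * n

# Rank-k truncated SVD: k left singular vectors (m numbers each),
# k right singular vectors (n numbers each), and k singular values.
truncated_storage = k * (m + n + 1)

print(full_storage, truncated_storage)   # 10,000,000 vs 220,020
print(truncated_storage / full_storage)  # ~2.2% of the original storage
```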

CUR Matrix Approximation

CUR matrix approximation (CUR) provides low-rank approximations of a data matrix that are, in probability, comparable with the theoretical best, in quadratic time $O(m n)$.

CUR: Column, U, Row; the approximation $A \approx C U R$, where $C$ consists of actual columns of $A$, $R$ consists of actual rows, and $U$ is a small linking matrix.
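A minimal sketch of one randomized CUR variant, sampling columns and rows proportionally to their squared norms and using the optimal linking matrix $U = C^{+} A R^{+}$; this is an illustrative scheme with arbitrary sample sizes, not a canonical implementation:

```python
import numpy as np

def cur_sketch(A, c, r, rng):
    """Randomized CUR sketch: sample columns/rows proportional to squared norms."""
    col_p = np.sum(A ** 2, axis=0) / np.sum(A ** 2)
    row_p = np.sum(A ** 2, axis=1) / np.sum(A ** 2)
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # optimal linking matrix
    return C, U, R

rng = np.random.default_rng(0)
# Low effective rank: a rank-5 signal plus small noise.
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))
A += 0.01 * rng.standard_normal(A.shape)

C, U, R = cur_sketch(A, c=15, r=15, rng=rng)
err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
print(err)   # small relative error, in probability
```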


🏷 Category=Computation Category=Machine Learning