Dimensionality reduction... (lossy data compression)

PCA and SVD operate on two different interpretations of tabular data, but share a computationally identical implementation (Q-mode PCA):

  • Interpreting an $n \times p$ matrix as $n$ objects in a $p$-dimensional Euclidean space, PCA gives a rotated basis of $\mathbb{R}^p$ ordered by variance explained.
  • Interpreting an $m \times n$ matrix as a (potentially complete/Cartesian) mapping between two categories with $m$ and $n$ objects respectively, where each entry is the inner product of a pair of cross-category objects in a (unifying) Euclidean space in which within-category distances represent similarity, SVD gives coordinates of these $m + n$ objects in a basis of "pure concepts" ordered by the extent of the mapping reconstructed.

Principal Components Analysis

Principal components analysis (PCA) of a data matrix provides the directions/subspaces of maximal data variance. The principal components are the unit eigenvectors of $X^T X$ (proportional to the sample covariance matrix when the columns of $X$ are centered), in decreasing order of their eigenvalues. A scree plot shows the decline of data variance across the principal components, i.e. the eigenvalues of the covariance matrix.
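
A minimal NumPy sketch (not from the source) of this computation on synthetic data: the principal components are the eigenvectors of the sample covariance matrix, and the normalized eigenvalues are the scree-plot values.

```python
# Sketch only: PCA as eigendecomposition of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # synthetic, correlated features
Xc = X - X.mean(axis=0)                                   # center the columns

cov = Xc.T @ Xc / (Xc.shape[0] - 1)        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues for symmetric matrices
order = np.argsort(eigvals)[::-1]          # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())             # scree-plot values: variance explained per component
```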

PCA is always possible and essentially unique (up to signs, and up to rotation within eigenspaces of repeated eigenvalues), because the covariance matrix of any data matrix is positive semi-definite and can therefore be diagonalized by an orthogonal matrix.

PCA can be computed by: (R-mode PCA) eigendecomposition of the data covariance matrix; or (Q-mode PCA) singular value decomposition of the (centered) data matrix. The singular values of the centered data matrix are the nonnegative square roots of the eigenvalues of $X^T X$.
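
A small numerical check of this equivalence (a sketch, assuming a column-centered synthetic matrix):

```python
# Sketch only: R-mode (eigendecomposition of X^T X) and Q-mode (SVD of X) agree.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                                   # column-centered data matrix

eigvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]    # R-mode: eigenvalues of X^T X
svals = np.linalg.svd(Xc, compute_uv=False)               # Q-mode: singular values of X

print(np.allclose(svals, np.sqrt(eigvals)))               # True: sigma_i = sqrt(lambda_i)
```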

PCA vs LM LS (least-squares estimate of a linear model): both minimize a residual sum of squares, but PCA measures residuals orthogonally in the full space (total least squares), while LM LS measures residuals only along the label direction. Simple example: with Galton's symmetric joint distribution of inter-generation heights, PCA gives the diagonal and the anti-diagonal, while LM LS always gives a line flatter than the diagonal, passing through the center of mass.
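
A sketch of the Galton example on synthetic heights (the correlation $r$ is an arbitrary choice, not a figure from the source): the OLS slope comes out near $r < 1$, while the first principal axis is the diagonal.

```python
# Sketch only: on a symmetric joint distribution, OLS is flatter than the first principal axis.
import numpy as np

rng = np.random.default_rng(2)
r = 0.5                                            # assumed inter-generation correlation
cov = np.array([[1.0, r], [r, 1.0]])
parent, child = rng.multivariate_normal([0.0, 0.0], cov, size=10_000).T

C = np.cov(parent, child)                          # 2x2 sample covariance matrix
ols_slope = C[0, 1] / C[0, 0]                      # LM LS: slope of child on parent, ~r

eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, -1]                               # eigenvector of the largest eigenvalue
pca_slope = pc1[1] / pc1[0]                        # PCA: first principal axis, ~1 (the diagonal)

print(ols_slope, pca_slope)
```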

Distributed PCA

Table: Distributed PCA algorithms and complexities.

| | big $n$, small $d$ (features) | big $n$, big $d$ |
|---|---|---|
| algorithm | similar to linear regression | iterative (Krylov subspace methods, random projection) |
| local storage | $O(d^2)$, outer product | $O(dk+n)$, iterated top eigenvectors + intermediate vector |
| local computation | $O(d^3)$, eigendecomposition | $O(dk+n)$, vector multiplication + concatenation |
| communication | $O(dk)$, top eigenvectors | $O(dk+n)$, iterated top eigenvectors + intermediate vector |

Distributed machine learning first distributes the $n$ observations, preserving a closed-form algorithm (if one exists) for performance; when the feature dimension $d$ also scales, the algorithm's complexity must be restricted to be linear in both $n$ and $d$, and iterative algorithms such as stochastic gradient descent (SGD) are typically used.
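
A single-machine sketch of the big-$n$, small-$d$ column of the table above, with in-memory partitions standing in for workers (the partitioning and sizes are made up for illustration): each worker reduces its partition to $O(d^2)$ summaries, and the driver performs the $O(d^3)$ eigendecomposition.

```python
# Sketch only: big-n, small-d distributed PCA simulated on one machine.
import numpy as np

rng = np.random.default_rng(3)
d, k = 10, 3
partitions = [rng.normal(size=(1000, d)) for _ in range(4)]   # stand-ins for workers

# Map: each worker keeps only O(d^2) summaries of its partition.
local_gram = [P.T @ P for P in partitions]
local_sum = [P.sum(axis=0) for P in partitions]
n_total = sum(P.shape[0] for P in partitions)

# Reduce: assemble the global centered scatter matrix on the driver.
mean = sum(local_sum) / n_total
scatter = sum(local_gram) - n_total * np.outer(mean, mean)

# Driver: O(d^3) eigendecomposition, then broadcast the O(dk) top eigenvectors.
eigvals, eigvecs = np.linalg.eigh(scatter)
top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
print(top_k.shape)                                            # (d, k)
```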

Singular Value Decomposition

Singular value decomposition (SVD) gives the theoretically best low-rank approximations of a data matrix (Eckart–Young theorem), with cubic-time algorithms, $O(n m \min\{n,m\})$.
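
A quick numerical check of this optimal-truncation property (a sketch on synthetic data; the rank $k$ is arbitrary): the Frobenius error of the rank-$k$ truncation equals the root sum of squares of the discarded singular values.

```python
# Sketch only: rank-k SVD truncation and its Frobenius-norm error.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 30))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation

err = np.linalg.norm(X - X_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True: error = discarded singular values
```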

Interpret SVD the PCA way

The sample space $\mathbb{R}^m$ can be seen as dual to the observation space $\mathbb{R}^n$. Given a data matrix $X \in M_{m,n}$, $m > n$, with $\mathrm{rank}(X) = n$ (no perfect collinearity), rotate the basis of the observation space $\mathbb{R}^n$ so that the $n$ column vectors in the sample space $\mathbb{R}^m$ become orthogonal, with the leading vectors having the largest possible norms. Each vector in the sample space collects the projections of the observations onto the corresponding basis vector, so the first right singular vector $w_1$ yields the projections $X w_1 = \sigma_1 v_1$ with the largest sum of squares $\sigma_1^2$, and the rest recursively optimize on the residuals. Thus, with $X = V \Sigma W^* = \sum_{i=1}^n \sigma_i v_i w_i^*$, the right singular vectors $W$ form the PCA basis, and the left singular vectors scaled by the singular values, $V \Sigma$, have columns $\sigma_i v_i$ and rows equal to the object coordinates in the rotated basis.

Principal components provide the longest vectors/projections in the sample space among all basis rotations of the observation space. Shorter vectors in the sample space can be truncated without much loss in the observation space, so the (principal) information in the sample space can be stored to approximately reconstruct the data with less storage.
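
A sketch of this correspondence in NumPy notation, where U, s, Wt from np.linalg.svd play the roles of $V$, $\Sigma$, $W^*$ in the text; the synthetic data is (essentially) rank $k$, so truncation loses nothing here.

```python
# Sketch only: right singular vectors as PCA basis, V*Sigma as object coordinates.
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 500, 20, 3
X = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))    # synthetic data of rank k
Xc = X - X.mean(axis=0)

U, s, Wt = np.linalg.svd(Xc, full_matrices=False)        # text: V = U, Sigma = diag(s), W = Wt.T
scores = Xc @ Wt.T                                       # object coordinates in the rotated basis
print(np.allclose(scores, U * s))                        # True: X W = V Sigma

# Keep only the top k triples: m*k + k + k*n numbers instead of m*n.
X_k = (U[:, :k] * s[:k]) @ Wt[:k, :]
print(np.allclose(Xc, X_k))                              # True here, since rank(Xc) <= k
```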

CUR Matrix Approximation

CUR matrix approximation (CUR) provides low-rank approximations of a data matrix that are, with high probability, comparable to the theoretically best ones, in quadratic time $O(n m)$.

CUR: Column, U, Row — $C$ holds a subset of actual columns of the matrix, $R$ a subset of actual rows, and $U$ a small linking matrix.
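
A minimal CUR sketch, not the algorithm with the probabilistic guarantees: columns and rows are sampled uniformly for simplicity (the guarantees in the literature use leverage-score sampling), and $U = C^{+} X R^{+}$ is the least-squares optimal middle factor for the chosen $C$ and $R$.

```python
# Sketch only: CUR with uniformly sampled columns/rows and U = pinv(C) @ X @ pinv(R).
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 60))       # low-rank synthetic matrix

c, r = 20, 20
cols = rng.choice(X.shape[1], size=c, replace=False)
rows = rng.choice(X.shape[0], size=r, replace=False)

C = X[:, cols]                                   # actual columns of X
R = X[rows, :]                                   # actual rows of X
U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)    # optimal middle factor for given C, R

rel_err = np.linalg.norm(X - C @ U @ R, "fro") / np.linalg.norm(X, "fro")
print(rel_err)                                   # near 0 when C and R span the dominant subspaces
```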


🏷 Category=Computation Category=Machine Learning