Dimensionality Reduction

PCA and SVD works on two different interpretations of tabular data, but are computationally identical.

How PCA and SVD are related:

Interpret a \( n \times p \) matrix as \(n\) objects in a \(p\)-dimensional Euclidean space, PCA gives a rotated basis in \( \mathbb{R}^p \) ordered in variance explained.
Interpret a \( m \times n \) matrix as a (potentially complete/Cartesian) mapping from two categories with \(m\) and \(n\) objects each, which are inner products of these cross-category objects in a (unifying) \(r\)-dimensional Euclidean space (distance of within-category objects represents similarity). SVD gives coordinates of these \(m+n\) objects in a "concept" basis (directions of pure concepts) ordered in mapping reconstructed.

Interpret SVD the PCA way: m*n matrix, m>n, rank = n. m objects in n space, no perfect collinearity. Rotate in observation space, such that the n vectors in sample space are orthogonal and have largest principal vector norms. Each vector in sample space are projections of observations in corresponding basis, so the first right singular vector have the largest sum of squared projections, and the rest recursively optimize on the residuals. Conclusion: right singular vectors is the PCA basis, left singular vectors multiplies the singular values have row vectors being object coordinates in the rotated basis.

Principal Components Analysis

Assumptions: linearity, orthogonality.

Principal Components are (unit) eigenvectors of the sample covariance matrix;

Principal Components Analysis (PCA): Directions (subspaces) of maximal sample variance -> Top (dominant) eigenvectors of the sample covariance matrix, with directional variances being the associated eigenvalues.

PCA can be done by eigenvalue decomposition of a data variance-covariance matrix or singular value decomposition of a data matrix. Singular values of the data matrix are the principal square roots of eigenvalues of the data variance-covariance matrix. "Spectral decomposition": sample space (R^n) can be seen as the dual to the observation space (R^p), principal components are the longest vectors in the sample space for all basis rotations in the observation space. Non-principal components can be truncated without much loss in the observation space, and (principal) information in the sample space can be stored to approximately reconstruct the data, with less storage.

PCA at scale:

big n (observations), small d (feature set)
- complexities:
  - local storage O(d^2) [outer product],
  - local computation O(d^3) [eigen-decomposition],
  - communication O(dk) [top eigenvectors]
- algorithm: similar to closed-form linear regression
big n, big d
- complexities:
  - local storage O(dk+n) [iterated top eigenvectors + intermediate vector],
  - local computation O(dk+n) [vector multiplication + vector concatenation],
  - communication O(dk+n) [iterated top eigenvectors + intermediate vector]
- algorithm: iterative algorithm (Krylov subspace methods, random projection.)

[Distributed machine learning first distributes n observations, preserving closed-form algorithm (if exists) for performance; when feature set d also scales, algorithm complexity needs to be restricted to be linear in both n and d, and typically iterative algorithms are used such as stochastic gradient descent (SGD).]

PCA vs LM LS (least square estimate of linear model): Both optimize RSS, but PCA uses distance in full space, LM LS uses distance in the label direction. Simple example: With Galton's symmetric joint distribution of inter-generation heights, PCA gives the diagonal and counter diagonal; while LM LS always gives a line flatter than the diagonal, crossing at the center of mass.

Singular Value Decomposition

🏷 Category=Computation Category=Machine Learning