PCA and SVD work on two different interpretations of tabular data, but with a similar underlying idea.
How PCA and SVD are related:
- Interpreting an \( n \times p \) matrix as \( n \) objects in a \( p \)-dimensional Euclidean space, PCA gives a rotated basis of \( \mathbb{R}^p \) ordered by variance explained.
- Interpreting an \( m \times n \) matrix as a (potentially complete/Cartesian) mapping between two categories with \( m \) and \( n \) objects each, whose entries are inner products of these cross-category objects in a (unifying) \( r \)-dimensional Euclidean space (distance between within-category objects represents similarity),
SVD gives coordinates of these \( m+n \) objects in a "concept" basis (directions of pure concepts), ordered by how much of the mapping they reconstruct.
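A minimal sketch of the second interpretation, assuming NumPy and an illustrative made-up \( m \times n \) mapping; splitting the singular values symmetrically between the two coordinate sets is just one convention for recovering the inner products:

```python
import numpy as np

# Hypothetical 5 x 4 mapping between two categories (values are illustrative).
A = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.],
              [2., 2., 2., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the top r "concept" directions; singular values come sorted, so the
# leading directions reconstruct the mapping best in the least-squares sense.
r = 2
row_coords = U[:, :r] * np.sqrt(s[:r])      # m row-objects in concept space
col_coords = Vt.T[:, :r] * np.sqrt(s[:r])   # n column-objects in concept space

# Inner products of the m + n object coordinates approximate the mapping.
A_approx = row_coords @ col_coords.T
print(np.round(A_approx, 2))
print("relative rank-%d error:" % r, np.linalg.norm(A - A_approx) / np.linalg.norm(A))
```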
Principal Components Analysis
Assumptions: linearity, orthogonality.
Directions (subspaces) of maximal sample variance -> top (dominant) eigenvectors of the sample covariance matrix, with the associated eigenvalues being the directional variances.
Principal Components: (unit) eigenvectors of the sample covariance matrix
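A minimal NumPy sketch of this definition on synthetic data, also checking the PCA-SVD relation above via the centered data matrix (eigenvector signs may differ between the two routes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2., 0., 0.],
                                          [1., 1., 0.],
                                          [0., 0., 0.5]])   # 200 observations, 3 correlated features

Xc = X - X.mean(axis=0)              # center each feature
S = Xc.T @ Xc / (Xc.shape[0] - 1)    # sample covariance matrix (p x p)

eigvals, eigvecs = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components = unit eigenvectors; eigenvalues = directional variances.
print("directional variances:", np.round(eigvals, 3))
print("first principal component:", np.round(eigvecs[:, 0], 3))

# Same directions (up to sign) from the SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print("variances via SVD:", np.round(s**2 / (Xc.shape[0] - 1), 3))
```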
PCA at scale:
- big n (observations), small d (feature set)
- complexities:
- local storage O(d^2) [outer product],
- local computation O(d^3) [eigen-decomposition],
- communication O(dk) [top eigenvectors]
- algorithm: similar to closed-form linear regression
- big n, big d
- complexities:
- local storage O(dk+n) [iterated top eigenvectors + intermediate vector],
- local computation O(dk+n) [vector multiplication + vector concatenation],
- communication O(dk+n) [iterated top eigenvectors + intermediate vector]
- algorithm: iterative algorithms (Krylov subspace methods, random projection); see the sketch below this list
[Distributed machine learning first distributes the n observations, preserving a closed-form algorithm (if one exists) for performance; when the feature set d also scales, algorithm complexity must be restricted to be linear in both n and d, and iterative algorithms such as stochastic gradient descent (SGD) are typically used.]
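A minimal sketch of the big-n, big-d iterative approach, assuming NumPy and plain block power iteration as a stand-in for Krylov subspace methods; the \( d \times d \) covariance is never formed, only matrix-vector products with the centered data (in the big-n, small-d regime one would instead accumulate the covariance as a sum of outer products and eigendecompose it locally):

```python
import numpy as np

def top_k_pcs(X, k, n_iter=100, seed=0):
    """Top-k principal components by block power iteration on centered data.

    Only products Xc @ V and Xc.T @ (Xc @ V) are used, so the d x d covariance
    is never materialized; per iteration this costs O(ndk) computation, and in a
    row-partitioned distributed setting local storage stays around O(dk + n).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)          # centering (a distributed mean in practice)
    d = Xc.shape[1]
    V = rng.normal(size=(d, k))      # random starting basis (random projection)
    V, _ = np.linalg.qr(V)
    for _ in range(n_iter):
        W = Xc.T @ (Xc @ V)          # apply the covariance implicitly
        V, _ = np.linalg.qr(W)       # re-orthonormalize the block
    variances = np.var(Xc @ V, axis=0, ddof=1)   # directional variances
    return V, variances

# Usage on synthetic data.
X = np.random.default_rng(1).normal(size=(1000, 50))
V, variances = top_k_pcs(X, k=3)
print(np.round(variances, 3))
```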
PCA vs LM LS (least-squares estimate of a linear model):
Both minimize a residual sum of squares, but PCA measures residuals as distances in the full space, while LM LS measures residuals along the label direction.
Simple example:
With Galton's symmetric joint distribution of inter-generation heights, PCA gives the diagonal and the counter-diagonal,
while LM LS always gives a line flatter than the diagonal, passing through the center of mass.
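A minimal sketch of this example, assuming a synthetic symmetric bivariate normal in place of Galton's actual data; the first principal component lies on the diagonal, while the least-squares slope is shrunk to roughly the correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.5                                     # illustrative inter-generation correlation
cov = np.array([[1.0, rho], [rho, 1.0]])      # symmetric joint distribution
parent, child = rng.multivariate_normal([68., 68.], cov, size=5000).T

# PCA: first eigenvector of the sample covariance, expressed as a slope.
evals, evecs = np.linalg.eigh(np.cov(parent, child))
pc1 = evecs[:, np.argmax(evals)]
print("PCA slope:", round(pc1[1] / pc1[0], 3))         # ~1, the diagonal

# LM LS: regression of child on parent.
ls_slope = np.cov(parent, child)[0, 1] / np.var(parent, ddof=1)
print("LS slope:", round(ls_slope, 3))                 # ~rho < 1, flatter than the diagonal
```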
Singular Value Decomposition
🏷 Category=Computation Category=Machine Learning