Regression problems in supervised learning:

  • Learning a physical law: fitting a curve to data.
  • Pattern recognition: associating handwritten letters to a probability measure over the English alphabet.
  • Monte Carlo integration: approximating an integral by averaging function values of a random sample.
  • Learning a concept (PAC learning): approximating a target concept (set) T by minimizing the \(L^2\) error.
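The Monte Carlo integration bullet above can be sketched in a few lines: estimate \(\int_0^1 f(x)\,\textrm{d}x\) by the sample mean of f at uniform random points. A minimal sketch; the integrand, sample size, and seed are illustrative.

```python
import random

def mc_integrate(f, n=100_000, seed=0):
    """Approximate the integral of f over [0, 1] by the sample mean of f
    at n uniform random points (law of large numbers)."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

# Example: the integral of x^2 over [0, 1] is 1/3.
estimate = mc_integrate(lambda x: x * x)
```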

Methodology:

  • Probability theory;
  • linear algebra;
  • least squares (optimization algorithm);
  • reproducing kernel Hilbert space (analysis).

Problem

The error (risk) of a function f w.r.t. ρ, \(\epsilon(f)\), is the squared \(L^2\) norm of \(f_Y(x, y) = f(x) - y\) on Z: \(\epsilon(f) = \int_Z [f(x)-y]^2~\textrm{d}\rho\). The empirical error (empirical risk) of f w.r.t. a sample \(\vec{z}\), \(\epsilon_{\vec{z}}(f)\), is the mean squared error (MSE) of f on the sample \(\vec{z}\). The defect (generalization error) of f, \(L_{\vec{z}}(f)\), is the difference between the error and the empirical error: \(L_{\vec{z}}(f) = \epsilon(f) - \epsilon_{\vec{z}}(f)\).
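The empirical risk \(\epsilon_{\vec{z}}(f)\) is directly computable; a minimal sketch (the sample and the model f are illustrative):

```python
def empirical_error(f, sample):
    """Mean squared error of f over a sample of (x, y) pairs:
    the empirical risk of f w.r.t. the sample."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)

# Illustrative sample and model f(x) = x.
sample = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
err = empirical_error(lambda x: x, sample)
```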

The hypothesis space H is a compact subspace of C(X), the Banach space of continuous functions on X with the sup-norm. The target function \(f_H\) minimizes the error in H. The empirical target function \(f_{H,\vec{z}}\) minimizes the empirical error in H w.r.t. a sample \(\vec{z}\).

The error in H of a function f, \(\epsilon_H(f)\), is the normalized error \(\epsilon(f) - \epsilon(f_H)\), which is non-negative on H. The sample error (estimation error) \(\epsilon_H(f_{H,\vec{z}})\) is the error in H of the empirical target function \(f_{H,\vec{z}}\). It is non-negative and generally positive.

General formulation of a regression problem: For an unknown probability measure ρ on the joint space X × Y, estimate the regression function (conditional expectation) \(f_{\rho}(x) = E_{\rho}(Y|x)\), and determine the minimal sample size m to obtain an ε-approximation (generalization bound) with probability (confidence) 1-δ.

Theorem

Theorem A: For a single function f, an upper bound for the probability of its defect \(L_{\vec{z}}(f)\) lying beyond any small neighborhood of zero decreases exponentially in the sample size m (here \(|f(x)-y| \le M\) almost surely and \(\sigma^2\) is the variance of \(f_Y^2\)):

\[ P(|L_{\vec{z}}(f)| \ge \epsilon) \le 2\exp\left(-\frac{m \epsilon^2/2}{\sigma^2 + M^2 \epsilon/3}\right) \]
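Inverting the bound gives the smallest m with the right-hand side at most δ, i.e. \(m \ge 2(\sigma^2 + M^2\epsilon/3)\ln(2/\delta)/\epsilon^2\). A minimal sketch; the parameter values are illustrative.

```python
import math

def bernstein_sample_size(eps, delta, sigma2, M):
    """Smallest m such that 2*exp(-(m*eps**2/2)/(sigma2 + M**2*eps/3)) <= delta,
    i.e. the Theorem A bound guarantees |L_z(f)| < eps with confidence 1 - delta."""
    return math.ceil(2 * (sigma2 + M**2 * eps / 3) * math.log(2 / delta) / eps**2)

# Illustrative: eps-approximation 0.1 with confidence 0.95, sigma2 = M = 1.
m = bernstein_sample_size(eps=0.1, delta=0.05, sigma2=1.0, M=1.0)
```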

Theorem B: For a hypothesis space H, Theorem A extends to a uniform bound on the defects over all of H, specified as a function of the sample size m and the covering number of H.

Theorem C: Under the assumptions of Theorem B, a lower bound for the probability (confidence) that the sample error \(\epsilon_H(f_{H,\vec{z}})\) is smaller than an arbitrary positive number is specified, as a function of the sample size m and the covering number of H.

Theorem C': If the hypothesis space is convex, then for a given upper limit an improved (larger) confidence is specified.

Algorithm

Logistic Regression

The logistic function plays two roles in the linear model:

  1. it induces the loss function (log-loss) used to fit and evaluate the model;
  2. it gives the predictions a probabilistic interpretation.
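Both roles can be sketched for a one-feature linear model (weights, data, and names are illustrative):

```python
import math

def sigmoid(t):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(w, b, sample):
    """Average negative log-likelihood of labels y in {0, 1} under the
    Bernoulli model p(y=1|x) = sigmoid(w*x + b)."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for x, y in sample:
        p = min(max(sigmoid(w * x + b), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(sample)
```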

The receiver operating characteristic (ROC) plot of a statistical classifier presents the true positive rate (sensitivity; recall) against the false positive rate (fall-out, 1 − specificity) as the discrimination threshold varies. It is the trajectory of the cumulative distribution function (CDF) of the detection probability against the CDF of the false-alarm probability. ROC plots illustrate the performance of a binary classifier.
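The threshold sweep can be sketched as follows: sorting by descending score and lowering the threshold one point at a time traces the (FPR, TPR) curve. A minimal sketch that assumes distinct scores and 0/1 labels (tied scores would need grouping).

```python
def roc_points(scores, labels):
    """True-positive rate vs false-positive rate as the decision threshold
    sweeps over the sorted scores (labels are 0/1, scores assumed distinct)."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score; lowering the threshold admits one point at a time.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]  # (FPR, TPR)
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```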

One-hot-encoding (OHE) feature: using dummy variables to represent categorical variables.

  • Dense representation: storing the full-dimensional vector despite most components being zero.
  • Sparse representation: storing (index, value) pairs.
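The two representations can be sketched as (the vocabulary is illustrative):

```python
def one_hot_dense(category, vocabulary):
    """Full-dimensional 0/1 vector; all but one component are zero."""
    return [1 if c == category else 0 for c in vocabulary]

def one_hot_sparse(category, vocabulary):
    """(index, value) pairs; only the nonzero component is stored."""
    return [(vocabulary.index(category), 1)]

vocab = ["red", "green", "blue"]
```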

Feature hashing: an unsupervised preprocessing step that reduces feature dimension via hashing. A hash function maps categorical features roughly evenly into a fixed number of buckets, reducing the feature dimensionality.

Feature hashing is a practical alternative to one-hot-encoding: it needs no precomputed vocabulary and has a fixed, lower memory footprint. Hash functions can be applied at the partition level without communication between nodes.
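A minimal sketch of the hashing trick using a stable stdlib hash; the bucket count is illustrative, and collisions are accepted as the price of the fixed dimension.

```python
import hashlib

def hash_feature(value, num_buckets):
    """Map a categorical value to one of num_buckets indices with a stable hash.
    Collisions are possible; the bucket count trades dimension for collisions."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def hashed_vector(values, num_buckets):
    """Sparse hashed representation: bucket index -> count."""
    vec = {}
    for v in values:
        idx = hash_feature(v, num_buckets)
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

Because the mapping is a pure function of the value, each data partition can hash its own features independently, which is what makes the partition-level application possible.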

Collaborative Filtering

Collaborative filtering is about predicting (filtering) implicit attributes of a vector using the attributes of many other vectors (collaboration).

Latent factor models uncover latent features that explain observations; examples include pLSA, neural networks, and latent Dirichlet allocation. Many such models are induced by a singular value decomposition (SVD) of the observation matrix, where users (m) and items (n) are represented as vectors in a low-dimensional factor space (f ≪ m, n) such that user u's rating of item i is predicted as \(x_u' y_i\). The user vector can be interpreted as preferences over the factors; the item vector as the item's composition in terms of the factors.

Alternating least squares (ALS) method {HuKorenVolinsky-ICDM08}:

  1. Cost function
    • squared L2 distance between the preference matrix P and the factor model XY', weighted by confidence levels, regularized by the Hilbert–Schmidt (Frobenius) norms of X and Y {equation 3};
    • the regularization factor \lambda is data-dependent, determined by cross-validation.
  2. Optimization technique
    1. randomly initialize the items matrix;
    2. recompute the users matrix to minimize the cost function; [O(f^2 N + f^3 m)]
    3. recompute the items matrix to minimize the cost function; [O(f^2 N + f^3 n)]
    4. alternate between the two optimization steps (typically ~10 times).

The ALS algorithm scales linearly with the size of data (N).
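The alternating steps can be sketched for the simpler uniformly weighted case (the confidence weighting c_ui of {HuKorenVolinsky-ICDM08} is omitted); each step solves a small ridge-regression system per row. All names and the toy rating matrix are illustrative.

```python
import random

def solve(A, b):
    """Solve the small f x f system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def als(R, f=2, lam=0.1, iters=10, seed=0):
    """Alternating least squares on a dense rating matrix R (uniform weights).
    Returns factor matrices X (users x f) and Y (items x f)."""
    rng = random.Random(seed)
    m, n = len(R), len(R[0])
    X = [[rng.random() for _ in range(f)] for _ in range(m)]
    Y = [[rng.random() for _ in range(f)] for _ in range(n)]

    def update(fixed, ratings):
        # For each row r of `ratings`, solve (F'F + lam*I) w = F'r with F = fixed.
        out = []
        for r in ratings:
            A = [[sum(fixed[k][a] * fixed[k][b] for k in range(len(fixed)))
                  + (lam if a == b else 0.0) for b in range(f)] for a in range(f)]
            rhs = [sum(fixed[k][a] * r[k] for k in range(len(fixed))) for a in range(f)]
            out.append(solve(A, rhs))
        return out

    for _ in range(iters):
        X = update(Y, R)                               # recompute users matrix
        Y = update(X, [list(col) for col in zip(*R)])  # recompute items matrix
    return X, Y
```

Each user (or item) solve costs O(f^3) plus O(f^2) per observed rating, which is the O(f^2 N + f^3 m) / O(f^2 N + f^3 n) per-sweep cost noted above.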


🏷 Category=Computation Category=Machine Learning