Regression problems in supervised learning:
The task is to find a function \(f: X \to Y\) that minimizes the \(L^2\) error.
Methodology:
The error (risk) of function f(x) w.r.t. ρ, \(\epsilon(f)\), is the squared \(L^2\) norm of \(f_Y(x,y) = f(x)-y\) on Z: \(\epsilon(f) = \int_Z [f(x)-y]^2~\textrm{d}\rho\).
The empirical error (empirical risk) of f(x) w.r.t. a sample \(\vec{z}\), \(\epsilon_{\vec{z}}(f)\), is the mean square error (MSE) of f(x) on the sample: \(\epsilon_{\vec{z}}(f) = \frac{1}{m}\sum_{i=1}^{m} [f(x_i)-y_i]^2\).
The defect (generalization error) of f(x), \(L_{\vec{z}}(f)\), is the difference between the error and the empirical error of f(x): \(L_{\vec{z}}(f) = \epsilon(f) - \epsilon_{\vec{z}}(f)\).
A hypothesis space H is a compact subspace of C(X), the Banach space of continuous functions on X with finite sup-norm.
The target function \(f_H\) has the minimal error in H.
The empirical target function \(f_{H,\vec{z}}\) has the minimal empirical error in H w.r.t. a sample \(\vec{z}\).
The error in H of a function f, \(\epsilon_H(f)\), is the normalized error \(\epsilon(f) - \epsilon(f_H)\), which is non-negative for f in H.
The empirical error in H (estimation error, sample error) \(\epsilon_H(f_{H,\vec{z}})\) is the error in H of the empirical target function \(f_{H,\vec{z}}\). It is non-negative and generally positive.
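Combining these definitions (taking \(f = f_{H,\vec{z}}\) in \(\epsilon_H(f) = \epsilon(f) - \epsilon(f_H)\)) gives the standard decomposition of the error of the empirical target function into the sample error and the approximation error:
\[ \epsilon(f_{H,\vec{z}}) = \epsilon_H(f_{H,\vec{z}}) + \epsilon(f_H). \]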
General formulation of a regression problem:
For an unknown probability measure ρ on the joint space X × Y,
estimate the regression function (conditional expectation) \(f_{\rho}(x) = E_{\rho}(Y|x)\), and
determine the minimal sample size m to obtain an ε-approximation (generalization bound) with probability (confidence) 1 − δ.
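The regression function is the natural target because, for any f (a standard identity in this setting, writing \(\rho_X\) for the marginal of ρ on X), the excess error over \(f_\rho\) equals the squared \(L^2_{\rho_X}\) distance to \(f_\rho\):
\[ \epsilon(f) = \int_X [f(x) - f_\rho(x)]^2~\textrm{d}\rho_X + \epsilon(f_\rho). \]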
Theorem A:
For a single function f with \(|f(x)-y| \le M\) almost everywhere, an upper bound for the probability (confidence) of its generalization error \(L_{\vec{z}}(f)\) lying beyond any small neighborhood of zero decreases exponentially in the sample size m:
\[ P\left(|L_{\vec{z}}(f)| \ge \epsilon\right) \le 2\exp\left(-\frac{m \epsilon^2/2}{\sigma^2 + M^2 \epsilon/3}\right) \]
where \(\sigma^2\) is the variance of \(f_Y^2 = [f(x)-y]^2\).
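Setting the right-hand side equal to δ and solving for m gives the sample size that guarantees an ε-approximation of the defect with confidence 1 − δ (a direct algebraic inversion of the bound above):
\[ m \ge \frac{2\left(\sigma^2 + M^2 \epsilon/3\right)}{\epsilon^2} \ln\frac{2}{\delta}. \]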
Theorem B:
Theorem A is extended uniformly over a hypothesis space H: a confidence that the defect of every function in H lies within a small neighborhood of zero is specified, as a function of the sample size m and the covering number of H.
Theorem C:
Under the assumptions of Theorem B, a lower bound for the probability (confidence) that the sample error is smaller than an arbitrary positive number is specified, as a function of the sample size m and the covering number of H.
Theorem C': If the hypothesis space is convex, the confidence for a given bound on the sample error is improved (larger).
The logistic function \(\sigma(t) = 1/(1+e^{-t})\) is used to map real-valued scores to probabilities in (0, 1), e.g. in logistic regression for binary classification.
The receiver operating characteristic (ROC) plot of a statistical classifier presents the true positive rate (sensitivity; recall) against the false positive rate (fall-out; 1 − specificity) as the discrimination threshold varies. It is the trajectory of the cumulative distribution function (CDF) of the detection probability against the CDF of the false-alarm probability. ROC plots illustrate the performance of a binary classifier.
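A minimal sketch, assuming raw scores and 0/1 labels, of how ROC points are obtained by sweeping the threshold over the observed scores (the function and variable names here are illustrative, not from these notes):

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs by sweeping the discrimination threshold
    over the observed scores; labels are 0/1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    positives = labels.sum()
    negatives = len(labels) - positives
    points = []
    for t in np.unique(scores)[::-1]:                # descending thresholds
        predicted = scores >= t
        tpr = (predicted & (labels == 1)).sum() / positives   # sensitivity / recall
        fpr = (predicted & (labels == 0)).sum() / negatives   # fall-out
        points.append((fpr, tpr))
    return points

# Example: a classifier that mostly ranks positives above negatives.
print(roc_points([0.9, 0.8, 0.6, 0.4, 0.3], [1, 1, 0, 1, 0]))
```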
One-hot encoding (OHE) features: using dummy variables to represent categorical variables.
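A small illustration of dummy-variable encoding (the feature values are made up for the example):

```python
# One-hot encode a categorical feature by hand: one dummy (0/1) column per category.
colors = ["red", "green", "blue", "green"]           # hypothetical categorical feature
categories = sorted(set(colors))                     # fixed category -> column mapping
encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(categories)   # ['blue', 'green', 'red']
print(encoded)      # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```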
Feature hashing: an unsupervised preprocessing step that reduces feature dimensionality using hashing. A hash function maps categorical features roughly evenly to a fixed number of buckets, bounding the feature dimension.
Feature hashing is a practical alternative to one-hot encoding that scales better: the output dimension is fixed in advance and no category dictionary needs to be built or shared. The hash function can be applied at partition level without communication between nodes, as in the sketch below.
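A minimal sketch of the hashing trick under these assumptions (MD5 is used here only as a stable hash for illustration; production systems typically use a faster non-cryptographic hash such as MurmurHash):

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map categorical tokens to a fixed-length count vector via hashing."""
    vec = [0] * n_buckets
    for tok in tokens:
        # A stable hash means every partition/node maps a token to the same bucket
        # without any shared dictionary or communication.
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_buckets] += 1
    return vec

print(hash_features(["color=red", "country=US", "device=mobile"]))
```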
Collaborative filtering is about predicting (filtering) implicit attributes of a vector using attributes of many other vectors (collaborative).
Latent factor models uncover latent features that explain observations, examples include pLSA, neural networks, and Latent Dirichlet Allocation.
Many such models are induced by Singular Value Decomposition (SVD) of the observation matrix, where users (m) and items (n) are represented as vectors in a low-dimensional factor space (\(f \ll m, n\)) such that user u's rating of item i is predicted as the inner product \(x_u^\top y_i\). The user vector can be interpreted as the user's preference for each factor; the item vector can be interpreted as the item's composition of each factor.
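A short sketch of the predicted-rating computation under these assumptions (the factor matrices are random placeholders, purely for illustration):

```python
import numpy as np

# Hypothetical factor matrices: m users and n items embedded in f dimensions.
m, n, f = 4, 5, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(m, f))   # user factors; row u = x_u (preference per factor)
Y = rng.normal(size=(n, f))   # item factors; row i = y_i (composition per factor)

# Predicted rating of user u on item i is the inner product x_u' y_i.
u, i = 1, 3
print(X[u] @ Y[i])            # single prediction
print(X @ Y.T)                # full m-by-n matrix of predicted ratings
```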
Alternating least squares (ALS) method: {HuKorenVolinsky-ICDM08}
The regularization parameter \(\lambda\) is data-dependent, determined by cross-validation. The ALS algorithm scales linearly with the size of the data (N).
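A minimal ALS sketch for a fully observed rating matrix with L2 regularization \(\lambda\) (an illustration of the alternating ridge-regression updates, not the confidence-weighted implicit-feedback algorithm of {HuKorenVolinsky-ICDM08}):

```python
import numpy as np

def als(R, f=2, lam=0.1, n_iters=10, seed=0):
    """Plain alternating least squares on a fully observed rating matrix R (m x n).

    Alternately fixes the item factors Y and solves a ridge regression for each
    user factor x_u, then fixes X and solves for each item factor y_i.
    """
    m, n = R.shape
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(m, f))
    Y = rng.normal(scale=0.1, size=(n, f))
    I = np.eye(f)
    for _ in range(n_iters):
        # Fix Y, solve (Y'Y + lam*I) x_u = Y' r_u for every user u at once.
        X = np.linalg.solve(Y.T @ Y + lam * I, Y.T @ R.T).T
        # Fix X, solve (X'X + lam*I) y_i = X' r_i for every item i at once.
        Y = np.linalg.solve(X.T @ X + lam * I, X.T @ R).T
    return X, Y

R = np.array([[5., 4., 1., 1.],
              [4., 5., 1., 2.],
              [1., 1., 5., 4.],
              [1., 2., 4., 5.]])
X, Y = als(R, f=2, lam=0.1, n_iters=20)
print(np.round(X @ Y.T, 1))   # should approximate R
```

Each inner update is a closed-form least-squares solve, which is why the work per sweep grows linearly with the number of observed ratings.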