For a given outcome and a given set of predictors, the ideal regression function is the conditional mean function, which derives from their hypothetical or theoretical joint distribution. Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton. Regression analysis is the estimation of the regression function, where conditional variance and higher order information about the joint distribution are completely ignored.
Regression problems in supervised learning:
Methodology involves: probability theory; optimization algorithms (least squares); linear algebra; reproducing kernel Hilbert space.
General formulation of a regression problem:
Given a sample \(\mathbf{z}\) from an unknown probability measure \(\rho\) on \( Z = X \times Y \), estimate the conditional expectation function \(f_{\rho}(x) = \mathbf{E}_{\rho}(Y|x)\), and determine the minimal sample size \(m\) such that the estimate has generalization error less than \(\varepsilon\) with probability \(1-\delta\).
For a regression function \(f(x)\), its error (or risk) is the \(L^2\)-norm of \(f(X)-Y\): \[ \varepsilon(f) = \int (f(x) - y)^2~\textrm{d}\rho \] its empirical error (or empirical risk) with respect to a sample \(\mathbf{z}\) is the mean square error (MSE): \[ \varepsilon_{\mathbf{z}}(f) = \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2 \] its defect (or generalization error) is the difference between the error and the empirical error: \[L_\mathbf{z}(f) = \varepsilon(f) - \varepsilon_{\mathbf{z}}(f)\]
Hypothesis space \(H\) is a compact subspace of the Banach space \(C(X)\) of continuous functions on \(X\) with finite sup-norm. The target function has the minimal error in \(H\): \[ f_H = \arg\min_{f \in H} \varepsilon(f) \] Given a sample \(\mathbf{z}\), the empirical target function has the minimal empirical error in \(H\): \[ f_{H,\mathbf{z}} = \arg\min_{f \in H} \varepsilon_{\mathbf{z}}(f) \]
The error in H of a function \(f\) is the normalized error that is non-negative in \(H\): \[ \varepsilon_H(f) = \varepsilon(f) - \varepsilon(f_H)\] The empirical error in H (or estimation error) is the error in H of the empirical target function: \( \varepsilon_H(f_{H,\mathbf{z}}) \).
Theorems from {Cucker2002}, "On The Mathematical Foundations Of Learning".
Theorem A: For a single function \(f\), an upper bound for the probability (confidence) of its generalization error \(L_\mathbf{z}(f)\) lying beyond any small neighborhood of zero decreases exponentially in sample size \(m\):
\[ P(|L_\mathbf{z}(f)| \ge \varepsilon) \le 2 \exp\left(-\frac{m \varepsilon^2/2}{\sigma^2 + M^2 \varepsilon/3}\right) \]
Theorem B: For a hypothesis space \(H\), a confidence on the generalization error extended from Theorem A is specified as a function of sample size \(m\) and covering number of \(H\).
Theorem C: Inheriting assumptions from Theorem B, a lower bound for the confidence of the sample error being smaller than an arbitrary positive number is specified as a function of sample size \(m\) and covering number of \(H\).
Theorem C∗: If the hypothesis space is convex, for a given upper limit, an improved/larger confidence is specified.
Logistic function is used as
Receiver operating characteristic (ROC) plot of statistical classifier presents the true positive rate (sensitivity index, d'; recall) against the false positive rate (fall-out, 1 - specificity) as discrimination threshold varies. It is the trajectory of the cumulative distribution function (CDF) of the detection probability against the CDF of false-alarm probability. ROC plots illustrate the performance of a binary classifier.
One-hot-encoding (OHE) feature: using dummy variables to represent categorical variables.
Feature hashing: an unsupervised learning preprocessing step that reduces feature dimension by hashing principles. Hash tables map categorical features to a fixed number of buckets evenly, reducing feature dimensionality.
Feature hashing is a practical alternative to one-hot-encoding that has better performance. Hash functions can be applied at partition level without communication between nodes.
Collaborative filtering is about predicting (filtering) implicit attributes of a vector using attributes of many other vectors (collaborative).
Latent factor models uncover latent features that explain observations, examples include pLSA, neural networks, and Latent Dirichlet Allocation. Many such models are induced by Singular Value Decomposition (SVD) of the observations matrix, where users (m) and items (n) are represented as vectors in a low dimensional factor space (f << m, n) such that user u's rating on item i is predicted as \(x_u' y_i\). User vector can be interpreted as preference on each factor; item vector can be interpreted as composition of each factor.
Alternating least squares (ALS) method: {HuKorenVolinsky-ICDM08}
The ALS algorithm scales linearly with the size of data (N).