Regression

For a given outcome and a given set of predictors, the ideal regression function is the conditional mean function, which derives from their hypothetical or theoretical joint distribution. Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton. Regression analysis is the estimation of the regression function, where conditional variance and higher order information about the joint distribution are completely ignored.

Regression problems in supervised learning:

Learning a physical law: fitting a curve to data.
Pattern recognition: associating handwritten letters to a probability measure over the English alphabet.
Monte Carlo integration: approximating an integral by averaging function values of a random sample.
Learning a concept (PAC learning): approximate a target concept (set) $T$ that minimizes the $L^2$ error.

Methodology: probability theory; optimization algorithms (least squares); linear algebra; reproducing kernel Hilbert space (RKHS).

Notations:

$X$: a compact domain or a manifold in Euclidean space; $Y = \mathbb{R}^k$; $Z = X \times Y$, the joint space;
$\rho$: a Borel probability measure on $Z$. Regularity properties are included when needed;
$\vec{z}$: a sample in $Z^m$;
$f_{\rho}(x)$: the regression function of $\rho$, i.e. the conditional expectation function of $y$ on $x$. Assumed to be bounded;
$\varepsilon(f) = \int_Z [f(x)-y]^2 \mathrm{d}\rho$: the error of function $f(x)$ w.r.t. $\rho$, i.e. the $L^2$ norm of $f(x)-y$ on $Z$;
$\varepsilon_{\vec{z}}(f)$: the empirical error of $f$ w.r.t. a sample $\vec{z}$, i.e. the mean square error (MSE) of $f$ given a sample $\vec{z}$;
$L_Z(f) = \varepsilon(f) - \varepsilon_{\vec{z}}(f)$: defect function of $f$, the difference between the error and the empirical error of function $f$ w.r.t. a sample $\vec{z}$;
$C(X)$: the Banach space of continuous functions on $X$ with sup-norm;
$H$: hypothesis space, a compact subspace of $C(X)$;
$N(S,s)$: the covering number of a metric space $S$ with disks at radius $s$, i.e. the minimal number of such disks needed to cover the space. This number is finite if the metric space is compact;
$f_H$: the target function in $H$ that has the minimal error;
$f_{H,\vec{z}}$ (or $f_{\vec{z}}$): the empirical target function in $H$ that has the minimal empirical error w.r.t. a sample $\vec{z}$;
$\varepsilon_H(f)$: the error in $H$ of a function $f$, i.e. the normalized error $\varepsilon(f) - \varepsilon(f_H)$ that is non-negative in $H$;
$\varepsilon_H(f_{\vec{z}})$: the empirical error in $H$ of a function $f$ (or sample error, estimation error), i.e. the error in $H$ of the empirical target function $f_{\vec{z}}$. It is non-negative and generally positive;

Problem

General formulation of a regression problem: Given a sample $\mathbf{z}$ from an unknown probability measure $\rho$ on $Z = X \times Y$, estimate the conditional expectation function $f_{\rho}(x) = \mathbf{E}_{\rho}(Y|x)$, and determine the minimal sample size $m$ such that the estimate has generalization error less than $\varepsilon$ with probability $1-\delta$.

For a regression function $f(x)$, its error (or risk) is the $L^2$-norm of $f(X)-Y$: $\varepsilon(f) = \int (f(x) - y)^2~\textrm{d}\rho$. Its empirical error (or empirical risk) with respect to a sample $\mathbf{z}$ is the mean square error (MSE): $\varepsilon_{\mathbf{z}}(f) = \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2$. Its defect (or generalization error) is the difference between the error and the empirical error: $L_\mathbf{z}(f) = \varepsilon(f) - \varepsilon_{\mathbf{z}}(f)$.

Hypothesis space $H$ is a compact subspace of the Banach space $C(X)$ of continuous functions on $X$ with finite sup-norm. The target function has the minimal error in $H$: $f_H = \arg\min_{f \in H} \varepsilon(f)$. Given a sample $\mathbf{z}$, the empirical target function has the minimal empirical error in $H$: $f_{H,\mathbf{z}} = \arg\min_{f \in H} \varepsilon_{\mathbf{z}}(f)$.

The error in H of a function $f$ is the normalized error that is non-negative in $H$: $\varepsilon_H(f) = \varepsilon(f) - \varepsilon(f_H)$. The empirical error in H (or estimation error) is the error in H of the empirical target function: $\varepsilon_H(f_{H,\mathbf{z}})$.

Theorem

Theorems from "On The Mathematical Foundations Of Learning" [@Cucker2002].

Theorem A: For a single function $f$, an upper bound for the probability (confidence) of its generalization error $L_\mathbf{z}(f)$ lying beyond any small neighborhood of zero decreases exponentially in sample size $m$: $P(|L_\mathbf{z}(f)| \ge \varepsilon) \le 2 \exp\left(-\frac{m \varepsilon^2/2}{\sigma^2 + M^2 \varepsilon/3}\right)$.

Theorem B: For a hypothesis space $H$, a confidence on the generalization error extended from Theorem A is specified as a function of sample size $m$ and covering number of $H$.

Theorem C: Inheriting assumptions from Theorem B, a lower bound for the confidence of the sample error being smaller than an arbitrary positive number is specified as a function of sample size $m$ and covering number of $H$.

Theorem C∗: If the hypothesis space is convex, for a given upper limit, an improved/larger confidence is specified.

Algorithm

Logistic Regression

Logistic function is used as

loss function, evaluation metric of the linear model
probabilistic interpretation of predictions

Receiver operating characteristic (ROC) plot of statistical classifier presents the true positive rate (sensitivity index, d'; recall) against the false positive rate (fall-out, 1 - specificity) as discrimination threshold varies. It is the trajectory of the cumulative distribution function (CDF) of the detection probability against the CDF of false-alarm probability. ROC plots illustrate the performance of a binary classifier.

One-hot-encoding (OHE) feature: using dummy variables to represent categorical variables.

Dense representation: storing the full-dimensional vector despite most components being zero.
Sparse representation: storing (indice, value) pairs.

Feature hashing: an unsupervised learning preprocessing step that reduces feature dimension by hashing principles. Hash tables map categorical features to a fixed number of buckets evenly, reducing feature dimensionality.

Feature hashing is a practical alternative to one-hot-encoding that has better performance. Hash functions can be applied at partition level without communication between nodes.

Collaborative Filtering

Collaborative filtering is about predicting ("filtering") implicit attributes of a vector using attributes of many other vectors ("collaborative"). A technique for missing data problems.

Latent factor models uncover latent features that explain observations, e.g. pLSA, neural networks, and Latent Dirichlet Allocation. Many such models are derived from singular value decomposition of the observation matrix, where $m$ users and $n$ items are represented as vectors in a f-dimensional factor space with $f \le m, n$, such that user $i$'s rating on item $j$ is predicted as $$. User vectors can be interpreted as preference on each factor; item vectors can be interpreted as composition of each factor.

Alternating least squares (ALS) method: [@HuKorenVolinsky-ICDM08]

Cost function
- squared L2 distance between preference matrix P and factor model XY, weighted by confidence level, regularized by Hilbert–Schmidt norms of X and Y {equation 3}.
- regularization factor $\lambda$ is data-dependent, determined by cross-validation.
Optimization technique
1. randomly initiate the index/users matrix;
2. recompute the users matrix such that cost function is minimized; [$O(f^2 N + f^3 m)$]
3. recompute the index matrix such that cost function is minimized; [$O(f^2 N + f^3 n)$]
4. alternate between the two optimization process (typically ~10 times);

The ALS algorithm scales linearly with the size of data (N).

🏷 Category=Computation Category=Machine Learning