For a given outcome and a given set of predictors,
the ideal **regression function** is the conditional mean function,
which derives from their hypothetical or theoretical joint distribution.
Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton.
**Regression analysis** is the estimation of the regression function,
where the conditional variance and higher-order information about the joint distribution
are ignored.

Regression problems in supervised learning:

- Learning a physical law: fitting a curve to data.
- Pattern recognition: associating handwritten letters to a probability measure over the English alphabet.
- Monte Carlo integration: approximating an integral by averaging function values of a random sample.
- Learning a concept (PAC learning): approximating a target concept (set) $T$ by minimizing the $L^2$ error.

Methodology involves: probability theory; optimization algorithms (least squares); linear algebra; reproducing kernel Hilbert spaces.

General formulation of a regression problem: Given a sample $\mathbf{z}$ from an unknown probability measure $\rho$ on $Z = X \times Y$, estimate the conditional expectation function $f_{\rho}(x) = \mathbf{E}_{\rho}(Y|x)$, and determine the minimal sample size $m$ such that the estimate has generalization error less than $\varepsilon$ with probability $1-\delta$.

For a regression function $f(x)$,
its **error** (or **risk**) is the squared $L^2$-norm of $f(X)-Y$:
$$\varepsilon(f) = \int (f(x) - y)^2~\textrm{d}\rho$$
its **empirical error** (or **empirical risk**) with respect to a sample $\mathbf{z}$ of size $m$
is the mean squared error (MSE):
$$\varepsilon_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2$$
its **defect** (or **generalization error**)
is the difference between the error and the empirical error:
$$L_\mathbf{z}(f) = \varepsilon(f) - \varepsilon_{\mathbf{z}}(f)$$
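As a concrete sketch of the empirical error (all names here are illustrative, not from the source), it is simply the MSE of a candidate function over the sample:

```python
import numpy as np

def empirical_error(f, x, y):
    """Empirical error (MSE) of f on the sample z = (x, y)."""
    residuals = f(x) - y
    return float(np.mean(residuals ** 2))

# Illustrative sample: noise-free observations of y = 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])

f_exact = lambda t: 2.0 * t       # zero empirical error
f_biased = lambda t: 2.0 * t + 1  # constant offset of 1

empirical_error(f_exact, x, y)   # 0.0
empirical_error(f_biased, x, y)  # 1.0
```

The defect $L_\mathbf{z}(f)$ itself is not computable from data alone, since the error $\varepsilon(f)$ depends on the unknown measure $\rho$; bounding it is the subject of the theorems below.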

**Hypothesis space** $H$ is a compact subset of the Banach space $C(X)$
of continuous functions on $X$, equipped with the sup-norm.
The **target function** has the minimal error in $H$:
$$f_H = \arg\min_{f \in H} \varepsilon(f)$$
Given a sample $\mathbf{z}$, the **empirical target function** has the minimal empirical error in $H$:
$$f_{H,\mathbf{z}} = \arg\min_{f \in H} \varepsilon_{\mathbf{z}}(f)$$

The **error in H** of a function $f$ is its error relative to the best attainable in $H$, hence non-negative for $f \in H$:
$$\varepsilon_H(f) = \varepsilon(f) - \varepsilon(f_H)$$
The **empirical error in H** (or **estimation error**)
is the error in H of the empirical target function:
$\varepsilon_H(f_{H,\mathbf{z}})$.

Theorems from [@Cucker2002], "On The Mathematical Foundations Of Learning".

Theorem A:
For a single function $f$ with $|f(x) - y| \le M$ almost surely,
an upper bound for the probability of its generalization error $L_\mathbf{z}(f)$
lying beyond any small neighborhood of zero
(the complement of this probability being the **confidence**)
decreases exponentially in sample size $m$:

$$P(|L_\mathbf{z}(f)| \ge \varepsilon) \le 2 \exp\left(-\frac{m \varepsilon^2/2}{\sigma^2 + M^2 \varepsilon/3}\right)$$

where $\sigma^2$ is the variance of $(f(x) - y)^2$.
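The bound can be inverted to estimate the sample size needed for a target accuracy $\varepsilon$ and confidence $1-\delta$; a minimal sketch (function name illustrative), solving $\delta = 2\exp(-m\varepsilon^2/(2(\sigma^2 + M^2\varepsilon/3)))$ for $m$:

```python
import math

def sample_size(eps, delta, sigma2, M):
    """Smallest m for which Theorem A's bound guarantees
    P(|L_z(f)| >= eps) <= delta, given sigma2 and the bound M."""
    return math.ceil(2 * (sigma2 + M**2 * eps / 3) * math.log(2 / delta) / eps**2)

sample_size(0.1, 0.05, sigma2=1.0, M=1.0)  # 763
```

Note the $1/\varepsilon^2$ dependence: halving the tolerated error roughly quadruples the required sample size, while tightening the confidence $\delta$ costs only logarithmically.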

Theorem B: For a hypothesis space $H$, a confidence on the generalization error extended from Theorem A is specified as a function of sample size $m$ and covering number of $H$.

Theorem C: Inheriting assumptions from Theorem B, a lower bound for the confidence of the sample error being smaller than an arbitrary positive number is specified as a function of sample size $m$ and covering number of $H$.

Theorem C∗: If the hypothesis space is convex, for a given upper limit, an improved/larger confidence is specified.

The logistic function is used

- as the loss function and evaluation metric of the linear model;
- to give the model's predictions a probabilistic interpretation.
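A minimal sketch of both roles (names illustrative): the logistic function maps a linear score into $(0, 1)$, and the logistic loss scores the resulting probability against a binary label.

```python
import math

def logistic(t):
    """Logistic (sigmoid) function: maps a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(y, p):
    """Logistic loss of a predicted probability p against a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = logistic(0.0)  # 0.5: a zero score is maximally uncertain
log_loss(1, p)     # ln 2, about 0.693
```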

**Receiver operating characteristic** (ROC) plot of a statistical classifier
presents the true positive rate (sensitivity; recall)
against the false positive rate (fall-out; 1 - specificity) as the discrimination threshold varies.
It is the trajectory of the cumulative distribution function (CDF) of the detection probability
against the CDF of false-alarm probability.
ROC plots illustrate the performance of a binary classifier.
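A sketch of how ROC points are obtained (function name illustrative): sweep the threshold over the observed scores from high to low, recording the (FPR, TPR) pair at each step.

```python
def roc_points(scores, labels):
    """ROC curve of a binary scorer: (FPR, TPR) pairs as the
    discrimination threshold sweeps the observed scores, high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thresh in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thresh and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thresh and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Toy scores: higher score should mean "more likely positive"
roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# → [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

A perfect ranker reaches (0, 1) before any false positive appears, as here; a random ranker's points hug the diagonal.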

**One-hot-encoding** (OHE) feature: using dummy variables to represent categorical variables.

- Dense representation: storing the full-dimensional vector despite most components being zero.
- Sparse representation: storing only the (index, value) pairs of the nonzero components.
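A sketch contrasting the two representations (names illustrative):

```python
def one_hot_dense(category, vocabulary):
    """Dense OHE: a full-length vector with a single 1."""
    return [1 if c == category else 0 for c in vocabulary]

def one_hot_sparse(category, vocabulary):
    """Sparse OHE: only the (index, value) pair of the nonzero component."""
    return [(vocabulary.index(category), 1)]

colors = ["red", "green", "blue"]
one_hot_dense("green", colors)   # [0, 1, 0]
one_hot_sparse("green", colors)  # [(1, 1)]
```

For high-cardinality categorical variables the dense form wastes memory on zeros, which is what motivates the sparse form and, further, feature hashing below.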

**Feature hashing**: an unsupervised preprocessing step
that reduces feature dimensionality by hashing.
A hash function maps categorical features roughly evenly onto a fixed number of buckets,
bounding the feature dimensionality in advance.

Feature hashing is a practical, memory-efficient alternative to one-hot-encoding. Because hash functions are stateless, they can be applied at partition level without communication between nodes.
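A sketch of the hashing trick (names illustrative; CRC32 stands in for whatever hash a real system uses, chosen here because it is deterministic across processes):

```python
import zlib

def hash_features(tokens, n_buckets=8):
    """Hashing trick: map each categorical token to one of n_buckets
    indices with a stateless hash and accumulate counts. No shared
    vocabulary is needed, so partitions can be hashed independently."""
    vec = [0] * n_buckets
    for token in tokens:
        vec[zlib.crc32(token.encode()) % n_buckets] += 1
    return vec

hash_features(["cat", "dog", "cat"])  # an 8-dimensional count vector
```

The trade-off versus one-hot-encoding is collisions: distinct categories may share a bucket, but with enough buckets the effect on downstream model quality is typically small.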

**Collaborative filtering** is about predicting (filtering) implicit attributes of a vector
using attributes of many other vectors (collaborative).

**Latent factor models** uncover latent features that explain the observations,
e.g. pLSA, neural networks, and Latent Dirichlet Allocation.
Many such models are induced by Singular Value Decomposition (SVD) of the observation matrix,
where users ($m$) and items ($n$) are represented as vectors in a low-dimensional factor space ($f \ll m, n$)
such that user $u$'s rating of item $i$ is predicted as $x_u^\top y_i$.
A user vector can be interpreted as the user's preference for each factor;
an item vector can be interpreted as the item's composition of each factor.
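A sketch of the SVD-induced factor model on a toy matrix (all data illustrative); the rank-$f$ truncation yields the user and item factor matrices, and $x_u^\top y_i$ recovers the best rank-$f$ approximation of the ratings:

```python
import numpy as np

# Toy ratings matrix R (m = 3 users, n = 3 items), fully observed
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

f = 2  # number of latent factors; f << m, n in realistic settings
U, s, Vt = np.linalg.svd(R, full_matrices=False)
X = U[:, :f] * np.sqrt(s[:f])       # user factors x_u, shape (m, f)
Y = Vt[:f, :].T * np.sqrt(s[:f])    # item factors y_i, shape (n, f)

R_hat = X @ Y.T  # predicted rating of item i by user u: x_u^T y_i
```

Plain SVD requires a fully observed matrix; with missing entries and confidence weights, the factors are instead fit by methods such as ALS below.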

**Alternating least squares** (ALS) method: {HuKorenVolinsky-ICDM08}

- Cost function {equation 3}:
  - squared L2 distance between the preference matrix $P$ and the factor model $XY$, weighted by confidence levels and regularized by the Hilbert–Schmidt norms of $X$ and $Y$;
  - the regularization factor $\lambda$ is data-dependent, determined by cross-validation.

- Optimization technique:
  - randomly initialize the items matrix $Y$;
  - recompute the users matrix $X$ such that the cost function is minimized; $O(f^2 N + f^3 m)$
  - recompute the items matrix $Y$ such that the cost function is minimized; $O(f^2 N + f^3 n)$
  - alternate between the two optimization steps (typically ~10 times).

The ALS algorithm scales linearly with the size of the data ($N$, the number of observations).
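A minimal sketch of the alternation (names illustrative). This is the unweighted, fully observed case; the paper's version additionally weights each entry by a confidence level, which changes the per-row normal equations but not the overall structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def als(R, f=2, lam=0.1, iters=10):
    """Unweighted regularized ALS sketch: alternately solve the
    ridge-regression (least-squares) problems for the users matrix X
    and the items matrix Y, each in closed form."""
    m, n = R.shape
    X = rng.standard_normal((m, f))
    Y = rng.standard_normal((n, f))
    I = lam * np.eye(f)
    for _ in range(iters):
        # Fix Y, solve (Y^T Y + lam I) x_u = Y^T r_u for every user u
        X = np.linalg.solve(Y.T @ Y + I, Y.T @ R.T).T
        # Fix X, solve (X^T X + lam I) y_i = X^T r_i for every item i
        Y = np.linalg.solve(X.T @ X + I, X.T @ R).T
    return X, Y

R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
X, Y = als(R)
R_hat = X @ Y.T  # predicted preferences
```

Each half-step is a closed-form solve, which is why the cost per sweep is $O(f^2 N + f^3 m)$ and $O(f^2 N + f^3 n)$: forming the normal equations touches each observation once, and the $f \times f$ solves are done per row.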