For a given outcome and a given set of predictors,
the ideal **regression function** is the conditional mean function,
which derives from their hypothetical or theoretical joint distribution.
Here the term "regression" is understood as in "regression toward the mean", coined by Francis Galton.
**Regression analysis** is the estimation of the regression function,
where conditional variance and higher order information about the joint distribution
are completely ignored.

Regression problems in supervised learning:

- Learning a physical law: fitting a curve to data.
- Pattern recognition: associating handwritten letters to a probability measure over the English alphabet.
- Monte Carlo integration: approximating an integral by averaging function values of a random sample.
- Learning a concept (PAC learning): approximate a target concept (set) $T$ that minimizes the $L^2$ error.

Methodology: probability theory; optimization algorithms (least squares); linear algebra; reproducing kernel Hilbert space (RKHS).

Notations:

- $X$: a compact domain or a manifold in Euclidean space; $Y = \mathbb{R}^k$; $Z = X \times Y$, the joint space;
- $\rho$: a Borel probability measure on $Z$. Regularity properties are included when needed;
- $\vec{z}$: a sample in $Z^m$;
- $f_{\rho}(x)$: the regression function of $\rho$, i.e. the conditional expectation function of $y$ on $x$. Assumed to be bounded;
- $\varepsilon(f) = \int_Z [f(x)-y]^2 \mathrm{d}\rho$: the error of function $f(x)$ w.r.t. $\rho$, i.e. the $L^2$ norm of $f(x)-y$ on $Z$;
- $\varepsilon_{\vec{z}}(f)$: the empirical error of $f$ w.r.t. a sample $\vec{z}$, i.e. the mean square error (MSE) of $f$ given a sample $\vec{z}$;
- $L_Z(f) = \varepsilon(f) - \varepsilon_{\vec{z}}(f)$: defect function of $f$, the difference between the error and the empirical error of function $f$ w.r.t. a sample $\vec{z}$;
- $C(X)$: the Banach space of continuous functions on $X$ with sup-norm;
- $H$: hypothesis space, a compact subspace of $C(X)$;
- $N(S,s)$: the covering number of a metric space $S$ with disks at radius $s$, i.e. the minimal number of such disks needed to cover the space. This number is finite if the metric space is compact;
- $f_H$: the target function in $H$ that has the minimal error;
- $f_{H,\vec{z}}$ (or $f_{\vec{z}}$): the empirical target function in $H$ that has the minimal empirical error w.r.t. a sample $\vec{z}$;
- $\varepsilon_H(f)$: the error in $H$ of a function $f$, i.e. the normalized error $\varepsilon(f) - \varepsilon(f_H)$ that is non-negative in $H$;
- $\varepsilon_H(f_{\vec{z}})$: the empirical error in $H$ of a function $f$ (or sample error, estimation error), i.e. the error in $H$ of the empirical target function $f_{\vec{z}}$. It is non-negative and generally positive;

General formulation of a regression problem: Given a sample $\mathbf{z}$ from an unknown probability measure $\rho$ on $Z = X \times Y$, estimate the conditional expectation function $f_{\rho}(x) = \mathbf{E}_{\rho}(Y|x)$, and determine the minimal sample size $m$ such that the estimate has generalization error less than $\varepsilon$ with probability $1-\delta$.

For a regression function $f(x)$,
its **error** (or **risk**) is the $L^2$-norm of $f(X)-Y$:
$\varepsilon(f) = \int (f(x) - y)^2~\textrm{d}\rho$.
Its **empirical error** (or **empirical risk**) with respect to a sample $\mathbf{z}$
is the mean square error (MSE):
$\varepsilon_{\mathbf{z}}(f) = \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2$.
Its **defect** (or **generalization error**)
is the difference between the error and the empirical error:
$L_\mathbf{z}(f) = \varepsilon(f) - \varepsilon_{\mathbf{z}}(f)$.

**Hypothesis space** $H$ is a compact subspace of the Banach space $C(X)$
of continuous functions on $X$ with finite sup-norm.
The **target function** has the minimal error in $H$:
$f_H = \arg\min_{f \in H} \varepsilon(f)$.
Given a sample $\mathbf{z}$, the **empirical target function** has the minimal empirical error in $H$:
$f_{H,\mathbf{z}} = \arg\min_{f \in H} \varepsilon_{\mathbf{z}}(f)$.

The **error in H** of a function $f$ is the normalized error that is non-negative in $H$:
$\varepsilon_H(f) = \varepsilon(f) - \varepsilon(f_H)$.
The **empirical error in H** (or **estimation error**)
is the error in H of the empirical target function:
$\varepsilon_H(f_{H,\mathbf{z}})$.

Theorems from "On The Mathematical Foundations Of Learning" [@Cucker2002].

Theorem A:
For a single function $f$, an upper bound for the probability (**confidence**)
of its generalization error $L_\mathbf{z}(f)$ lying beyond any small neighborhood of zero
decreases exponentially in sample size $m$:
$P(|L_\mathbf{z}(f)| \ge \varepsilon) \le
2 \exp\left(-\frac{m \varepsilon^2/2}{\sigma^2 + M^2 \varepsilon/3}\right)$.

Theorem B: For a hypothesis space $H$, a confidence on the generalization error extended from Theorem A is specified as a function of sample size $m$ and covering number of $H$.

Theorem C: Inheriting assumptions from Theorem B, a lower bound for the confidence of the sample error being smaller than an arbitrary positive number is specified as a function of sample size $m$ and covering number of $H$.

Theorem C∗: If the hypothesis space is convex, for a given upper limit, an improved/larger confidence is specified.

Logistic function is used as

- loss function, evaluation metric of the linear model
- probabilistic interpretation of predictions

**Receiver operating characteristic** (ROC) plot of statistical classifier
presents the true positive rate (sensitivity index, d'; recall)
against the false positive rate (fall-out, 1 - specificity) as discrimination threshold varies.
It is the trajectory of the cumulative distribution function (CDF) of the detection probability
against the CDF of false-alarm probability.
ROC plots illustrate the performance of a binary classifier.

**One-hot-encoding** (OHE) feature: using dummy variables to represent categorical variables.

- Dense representation: storing the full-dimensional vector despite most components being zero.
- Sparse representation: storing (indice, value) pairs.

**Feature hashing**: an unsupervised learning preprocessing step
that reduces feature dimension by hashing principles.
Hash tables map categorical features to a fixed number of buckets evenly,
reducing feature dimensionality.

Feature hashing is a practical alternative to one-hot-encoding that has better performance. Hash functions can be applied at partition level without communication between nodes.

**Collaborative filtering** is about predicting ("filtering") implicit attributes of a vector
using attributes of many other vectors ("collaborative").
A technique for missing data problems.

**Latent factor models** uncover latent features that explain observations,
e.g. pLSA, neural networks, and Latent Dirichlet Allocation.
Many such models are derived from singular value decomposition of the observation matrix,
where $m$ users and $n$ items are represented as vectors in a f-dimensional factor space
with $f \le m, n$,
such that user $i$'s rating on item $j$ is predicted as $$.
User vectors can be interpreted as preference on each factor;
item vectors can be interpreted as composition of each factor.

**Alternating least squares** (ALS) method: [@HuKorenVolinsky-ICDM08]

- Cost function
- squared L2 distance between preference matrix P and factor model XY, weighted by confidence level, regularized by Hilbert–Schmidt norms of X and Y {equation 3}.
- regularization factor $\lambda$ is data-dependent, determined by cross-validation.

- Optimization technique
- randomly initiate the index/users matrix;
- recompute the users matrix such that cost function is minimized; [$\mathcal{O}(f^2 N + f^3 m)$]
- recompute the index matrix such that cost function is minimized; [$\mathcal{O}(f^2 N + f^3 n)$]
- alternate between the two optimization process (typically ~10 times);

The ALS algorithm scales linearly with the size of data (N).