**Machine learning** is a technique for computer systems to improve with experience and data.
This article addresses machine learning problems and platforms.
The general theory of machine learning is documented at Learning Theory.

**Supervised learning** builds a conditional probabilistic model
to predict an output (label) based on some inputs (features):
classification if label is categorical, regression if label is quantitative.
**Unsupervised learning** describes the relationship and structure among a set of inputs:
clustering if outcome is categorical, dimensionality reduction if outcome is continuous.
**Semi-supervised learning** uses both labeled and unlabeled data
to estimate the conditional probability distribution of label given features.

**Representation learning** discovers useful representations of data.
**Deep learning** builds a multi-layer network of representations,
where more abstract representations are computed via less abstract ones.

**Reinforcement learning** (RL) is the optimization of strategy for a given environment,
where an agent (person, firm, robot, etc.) collects data by taking actions and observing rewards.

Machine learning is often based on probabilistic models of a very large number of random variables, although the result may be interpreted as deterministic or probabilistic, depending on the application.

**Supervised probabilistic model** is a machine learning model
that estimates (and evaluates) conditional probability density functions: $p(y \mid x)$.
**Discriminative model** is a supervised probabilistic model with categorical outcome variables.

**Unsupervised probabilistic model** is a machine learning model
that estimates (and evaluates) the joint probability distribution: $p(x)$ or $p(x, y)$.

The number of parameters in a generic function space grows exponentially in the number of variables: for example, a full probability table over $n$ discrete variables with $N$ values each has on the order of $N^n$ entries. Thus, it is very inefficient to estimate a high-dimensional probability distribution in a generic parametric function space. But often there is structure in the data, and we can exploit it and model the distribution in a function space with far fewer parameters.

**Structured probabilistic model** or **graphical model** is a probabilistic model
defined as the product of functions of fewer variables, perhaps with a normalizing constant;
the factorization can be represented by a graph of the variables, with directed or undirected edges.
For example, a graph with no edges represents probabilistic models with mutually independent variables.
For a given factorization or graph, each function can come from a parametric family,
which results in a parametric family of probabilistic models that can be very complex.
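
To make the saving concrete, here is a rough parameter count for discrete variables. It is only a sketch under stated assumptions: each of the `n` variables takes `N` values, and the function names `full_joint_params`, `chain_params`, and `independent_params` are hypothetical helpers introduced for illustration.

```python
def full_joint_params(n: int, N: int) -> int:
    """Free parameters of an unrestricted probability table over n variables."""
    # N**n table entries, minus one because the entries must sum to 1.
    return N ** n - 1


def chain_params(n: int, N: int) -> int:
    """Free parameters of a chain factorization p(x_1) * prod_i p(x_i | x_{i-1})."""
    # p(x_1) has N - 1 free parameters; each conditional table has N rows
    # (one per parent value) with N - 1 free parameters per row.
    return (N - 1) + (n - 1) * N * (N - 1)


def independent_params(n: int, N: int) -> int:
    """Free parameters when all variables are mutually independent (no edges)."""
    return n * (N - 1)


for n in (2, 10, 20):
    print(n, full_joint_params(n, 2), chain_params(n, 2), independent_params(n, 2))
```

For binary variables the unrestricted table grows exponentially with `n`, while the chain and the edgeless graph grow only linearly.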

**Directed graphical model** is a probabilistic model defined as the product of
marginal distributions or conditional probability distributions, one for each variable;
the factorization can be represented by a directed acyclic graph (DAG),
where each node is conditionally dependent on its parent nodes:
$p(x) = \prod_{i=1}^n p(x_i \mid x_{\to i})$,
where $x_{\to i}$ is the set of parents of $x_i$, i.e. the variables with an edge pointing to $x_i$.
**Ancestral sampling** from a directed graphical model
is a Monte Carlo sampling method that samples the model in an ancestral order,
so that each node comes after its parent nodes.
Ancestral sampling is efficient if all the marginal and conditional distributions involved
are easy to sample from.
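
A minimal sketch of ancestral sampling, assuming a toy network of three binary variables. The network (`rain -> sprinkler`, `(rain, sprinkler) -> wet`), its probabilities, and all names below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy DAG: each node maps to the tuple of its parents.
PARENTS = {"rain": (), "sprinkler": ("rain",), "wet": ("rain", "sprinkler")}

# P(node = 1 | parent values), keyed by the tuple of parent values.
CPT = {
    "rain": {(): 0.2},
    "sprinkler": {(0,): 0.4, (1,): 0.01},
    "wet": {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99},
}


def ancestral_order(parents):
    """Order the nodes so that every node comes after all of its parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        progressed = False
        for node, pa in parents.items():
            if node not in placed and all(p in placed for p in pa):
                order.append(node)
                placed.add(node)
                progressed = True
        if not progressed:
            raise ValueError("graph has a cycle, so no ancestral order exists")
    return order


def ancestral_sample(rng):
    """Draw one joint sample by sampling each node given its sampled parents."""
    sample = {}
    for node in ancestral_order(PARENTS):
        key = tuple(sample[p] for p in PARENTS[node])
        sample[node] = int(rng.random() < CPT[node][key])
    return sample


rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(10000)]
rain_freq = sum(s["rain"] for s in samples) / len(samples)
print(ancestral_order(PARENTS), rain_freq)
```

Each draw costs one pass over the nodes, which is why the method is cheap whenever every conditional distribution is itself easy to sample.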

**Undirected graphical model** is a probabilistic model defined as the product of
non-negative functions of subsets of variables, and a normalizing constant;
the factorization can be represented by an undirected graph,
where each clique corresponds to a function: $p(x) = \frac{1}{Z} \prod_{j=1}^m f(c_j)$,
where $c_j$ is a clique and $Z$ is the normalizing constant.
**Clique** in an undirected graph is a subset of nodes whose induced subgraph is complete.
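
As an illustration, the sketch below computes the normalizing constant $Z$ by brute-force enumeration for a three-node chain of binary variables with cliques $\{x_0, x_1\}$ and $\{x_1, x_2\}$. The clique function `agreement` is a hypothetical choice that favors equal neighbors. Enumeration is only feasible for tiny models, since the number of states grows exponentially.

```python
from itertools import product

# Cliques of the chain graph x0 - x1 - x2, given as index pairs.
CLIQUES = [(0, 1), (1, 2)]


def agreement(a, b):
    """Hypothetical non-negative clique function: larger when neighbors agree."""
    return 2.0 if a == b else 1.0


def unnormalized(x):
    """Product of clique functions, i.e. the numerator of p(x)."""
    value = 1.0
    for i, j in CLIQUES:
        value *= agreement(x[i], x[j])
    return value


# Enumerate all 2**3 joint states to obtain the normalizing constant Z.
states = list(product((0, 1), repeat=3))
Z = sum(unnormalized(x) for x in states)


def prob(x):
    """Normalized probability p(x) = (1/Z) * product of clique functions."""
    return unnormalized(x) / Z


print(Z, prob((0, 0, 0)), prob((0, 1, 0)))
```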

Regression: linear regression, etc.

Classification:

- Linear classifiers:
    - Generative model: linear discriminant analysis (LDA), naive Bayes classifier;
    - Discriminative model: logistic regression (logit), support vector machines (SVM), perceptron;

- Isotonic regression;

Clustering:

- k-means clustering;
- hierarchical clustering (dendrogram);
- Gaussian mixture;
- power iteration clustering (PIC);
- latent Dirichlet allocation (LDA);
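
Of the methods listed above, k-means is the simplest to sketch. The version below is a plain-Python take on Lloyd's algorithm under stated assumptions: points are tuples of floats, initial centers are drawn at random from the data, and the function name `kmeans` and the toy data are hypothetical. A library implementation would normally be preferred in practice.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Alternate nearest-center assignment and center updates until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On these two well-separated groups the centers settle at the two group means.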

Dimensionality Reduction:

- Principal component analysis (PCA): find the (orthogonal) directions in a Euclidean space that successively explain the most sample variance (minimize the residual sum of squares);
- Singular value decomposition (SVD);
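
The PCA item above can be illustrated with a small sketch that finds only the first principal direction. Note the swap: instead of a full eigendecomposition or SVD, it runs power iteration on the sample covariance matrix, which is enough for the leading direction. The function name `first_principal_component` and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iters=200):
    """Leading direction of the sample covariance, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix with the usual n - 1 denominator.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector and that the covariance is not identically zero.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Sample variance explained along the direction v.
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance


# Toy data lying exactly on the line y = 2x.
rows = [(-2.0, -4.0), (-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

Because the toy data lie on a line, the first direction is parallel to $(1, 2)$ and carries all of the sample variance.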

Tasks:

- $(x|y)$: inverse problems, posterior distribution of parameters;
- $(y_2 | y_1)$: data infill, missing data imputation;
- $(y|x), (x)$, or $(x|y), (y)$: semi-supervised classification;

Standardization of variables is often required, e.g. in case of different units.
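
A small sketch of standardization (z-scores), assuming the population standard deviation is used as the scale. The function name `standardize` and the height data are hypothetical. It shows that the standardized values do not depend on the unit of measurement.

```python
import statistics


def standardize(column):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100.0 for h in heights_cm]

z_cm = standardize(heights_cm)
z_m = standardize(heights_m)
print(z_cm)
print(z_m)  # same values up to rounding, despite the different units
```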

Performance measure. Experience.

Capacity, generalization (overfitting and underfitting) and testing. Hyperparameters and validation.

Semi-supervised learning (SSL) is a learning paradigm where some examples have a label but others do not. It can generalize better than supervised learning models, if unlabeled data are plentiful. For classification, SSL essentially performs a constant extension of labels to well-separated support sets. By far the most common use of SSL is image classification.

Regularization techniques:

- manifold regularization [@Belkin2006]: e.g. Laplacian regularization imposes the geometry of the probability distribution through a regularization term;
- consistency regularization: produce the same output distribution when the inputs are perturbed, e.g. Π-model [@Sajjadi2016], mean teacher [@Tarvainen2017], virtual adversarial training (VAT) [@Miyato2018];
- entropy minimization: produce confident predictions on unlabeled data, e.g. EntMin [@Grandvalet2005], pseudo-labeling [@Lee2013];
- traditional regularization: weight decay, MixUp, MixMatch [@Berthelot2019];
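
The entropy minimization item can be sketched as follows. This is an illustration, not the exact procedure of the cited papers: the predicted class probabilities are made up, the helper names are hypothetical, and the confidence threshold in `pseudo_labels` is a common variant that is assumed here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_penalty(batch):
    """Mean prediction entropy over unlabeled examples; adding it to the
    loss pushes the model toward confident predictions."""
    return sum(entropy(p) for p in batch) / len(batch)


def pseudo_labels(batch, threshold=0.9):
    """Use the most probable class as a label when it is confident enough."""
    labels = []
    for probs in batch:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else None)
    return labels


# Hypothetical model outputs on three unlabeled examples (two classes).
unlabeled_predictions = [[0.95, 0.05], [0.5, 0.5], [0.1, 0.9]]
print(entropy_penalty(unlabeled_predictions))
print(pseudo_labels(unlabeled_predictions))
```

The uniform prediction contributes the largest entropy and receives no pseudo-label, while the two confident predictions do.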

With generative models: with VAE [@Kingma2014b], with GAN [@Odena2016]. Latent-feature discriminative models transform data distribution to a multivariate standard Gaussian, and generative semi-supervised models learn to separate data classes (categorical label) from intra-class variabilities (continuous latent variables).

C++/CUDA:

- xgboost: gradient boosting machine (GBM), best GPU performance;
- lightgbm: GBM by Microsoft, best CPU performance;
- Vowpal Wabbit;
- TensorFlow;
- Caffe, cuda-convnet2;

JVM (Java, Scala):

- H2O: generalized linear models, gradient boosting machine (also supports random forest), generalized low rank models, deep neural network;
- Spark: MLlib (not nearly as good);
- Deeplearning4j;

Python: scikit-learn `sklearn`; `Pylearn2`, `Theano`;

R: `glmnet`, `randomForest`, `gbm`, `e1071` (interface to libsvm), `caret`, and more;

Benchmark for GLM, RF, GBM: for the algorithms it supports, H2O is the fastest and as accurate as the alternatives on data sets of over 10M records that fit in the memory of a single machine. Benchmark for GBM.

Figure: machine learning algorithm maps: (A) scikit-learn; (B) dlib.