Machine learning is a set of techniques by which computer systems improve at a task with experience and data. This article addresses machine learning problems and platforms. The general theory of machine learning is documented at Learning Theory.
Supervised learning builds a conditional probabilistic model to predict an output (label) based on some inputs (features): classification if the label is categorical, regression if the label is quantitative. Unsupervised learning describes the relationship and structure among a set of inputs: clustering if the outcome is categorical, dimensionality reduction if the outcome is continuous. Semi-supervised learning uses both labeled and unlabeled data to estimate the conditional probability distribution of the label given the features.
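A minimal scikit-learn sketch contrasting these paradigms; the synthetic datasets and model choices here are purely illustrative, not prescribed by the text:

```python
# Illustrative sketch: supervised (classification, regression) vs. unsupervised (clustering).
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: categorical label predicted from features.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(Xc, yc)

# Regression: quantitative label predicted from features.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)

# Clustering: structure among inputs only, no labels used.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xc)
```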
Representation learning discovers useful representations of data. Deep learning builds a multi-layer network of representations, where more abstract representations are computed via less abstract ones.
Reinforcement learning (RL) is the optimization of strategy for a given environment, where an agent (person, firm, robot, etc.) collects data by taking actions and observing rewards.
Machine learning is often based on probabilistic models of a very large number of random variables, although the result may be interpreted as deterministic or probabilistic, depending on the application.
Supervised probabilistic model is a machine learning model that estimates (and evaluates) conditional probability density functions: $p(y \mid x)$. Discriminative model is a supervised probabilistic model with categorical outcome variables.
Unsupervised probabilistic model is a machine learning model that estimates (and evaluates) the joint probability distribution: $p(x)$ or $p(x, y)$.
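A small sketch of the two kinds of probabilistic model in scikit-learn; the dataset, the logistic-regression discriminative model, and the kernel-density estimator are illustrative choices, not the only options:

```python
# Discriminative model estimates p(y | x); unsupervised probabilistic model estimates p(x).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KernelDensity

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Discriminative: conditional class probabilities p(y | x); each row sums to 1.
clf = LogisticRegression().fit(X, y)
cond_probs = clf.predict_proba(X[:3])

# Unsupervised: log-density estimate of p(x) via kernel density estimation.
kde = KernelDensity(bandwidth=0.5).fit(X)
log_px = kde.score_samples(X[:3])
```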
The number of parameters in a generic function space grows exponentially in the number of variables: with $n$ variables each taking $N$ values, a full tabular representation has on the order of $N^n$ parameters. Thus, it is very inefficient to estimate a high-dimensional probability distribution in a generic parametric function space. But oftentimes there is structure in the data, and we can exploit it and model the distribution in a function space with far fewer parameters.
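For concreteness, a quick comparison of parameter counts (the chain factorization below is one illustrative choice of structure):

```python
# Parameter counts for n binary variables (N = 2); numbers are illustrative.
N, n = 2, 30

full_joint = N ** n - 1                          # generic tabular joint distribution
# Chain-structured factorization p(x1) * prod_i p(x_i | x_{i-1})
# needs only O(n * N^2) parameters.
chain_factorized = (N - 1) + (n - 1) * N * (N - 1)

print(full_joint)        # 1073741823
print(chain_factorized)  # 59
```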
Structured probabilistic model or graphical model is a probabilistic model defined as the product of functions of fewer variables, perhaps with a normalizing constant; the factorization can be represented by a graph of the variables, with directed or undirected edges. For example, a graph with no edges represents probabilistic models with mutually independent variables. For a given factorization or graph, each function can come from a parametric family, which results in a parametric family of probabilistic models that can be very complex.
Directed graphical model is a probabilistic model defined as the product of marginal distributions or conditional probability distributions, one for each variable; the factorization can be represented by a directed acyclic graph (DAG), where each node is conditionally dependent on its parent nodes: $p(x) = \prod_{i=1}^n p(x_i \mid x_{\to i})$, where $x_{\to i}$ is the subset of variables that points to $x_i$. Ancestral sampling from a directed graphical model is a Monte Carlo sampling method that samples the model in an ancestral order, so that each node comes after its parent nodes. Ancestral sampling is efficient if all the marginal and conditional distributions involved are easy to sample from.
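A minimal sketch of ancestral sampling on a three-node chain $x_1 \to x_2 \to x_3$; the Bernoulli conditional probabilities are made up for illustration:

```python
# Ancestral sampling from the DAG x1 -> x2 -> x3 with binary variables.
import numpy as np

rng = np.random.default_rng(0)

def sample_once(rng):
    # Visit nodes in an ancestral (topological) order: each node after its parents.
    x1 = rng.random() < 0.6                     # p(x1 = 1) = 0.6
    x2 = rng.random() < (0.9 if x1 else 0.2)    # p(x2 = 1 | x1)
    x3 = rng.random() < (0.7 if x2 else 0.1)    # p(x3 = 1 | x2)
    return int(x1), int(x2), int(x3)

samples = [sample_once(rng) for _ in range(5)]
```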
Undirected graphical model is a probabilistic model defined as the product of non-negative functions of subsets of variables, and a normalizing constant; the factorization can be represented by an undirected graph, where each clique corresponds to a function: $p(x) = \frac{1}{Z} \prod_{j=1}^m f_j(c_j)$, where $c_j$ is a clique and $Z$ is the normalizing constant. Clique in an undirected graph is a subset of nodes whose induced subgraph is complete.
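A toy sketch computing the normalizing constant $Z$ by brute-force enumeration for a three-node chain with pairwise cliques; the clique functions are illustrative, and enumeration is only feasible because the model is tiny:

```python
# Brute-force Z for a tiny undirected model over binary variables x1, x2, x3.
# Cliques: {x1, x2} and {x2, x3}, with non-negative clique functions f1, f2.
from itertools import product

def f1(x1, x2):
    return 2.0 if x1 == x2 else 1.0   # favors agreement between x1 and x2

def f2(x2, x3):
    return 3.0 if x2 == x3 else 1.0   # favors agreement between x2 and x3

# Z sums the unnormalized product over all joint configurations.
Z = sum(f1(x1, x2) * f2(x2, x3) for x1, x2, x3 in product([0, 1], repeat=3))

def p(x1, x2, x3):
    return f1(x1, x2) * f2(x2, x3) / Z
```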
Regression: linear regression, etc.
Classification:
Clustering:
Dimensionality Reduction:
Tasks:
Standardization of variables is often required, e.g. when variables are measured in different units or on different scales.
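A short scikit-learn sketch; the feature values are made up to show two very different scales:

```python
# Standardize features to zero mean and unit variance (useful when units/scales differ).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[180.0, 0.08],   # e.g. a length in cm and a small rate
              [165.0, 0.12],
              [172.0, 0.05]])

scaler = StandardScaler().fit(X)   # fit on training data, then reuse on test data
X_std = scaler.transform(X)        # each column now has mean 0 and unit variance
```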
Performance measure. Experience.
Capacity, generalization (overfitting and underfitting) and testing. Hyperparameters and validation.
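One way to make the hyperparameter/validation split concrete, sketched with scikit-learn; the model, grid, and synthetic data are illustrative assumptions:

```python
# Choose a hyperparameter (regularization strength C) by validation, then test once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation on the training set plays the role of a validation set.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
test_accuracy = search.score(X_test, y_test)   # generalization estimate on unseen data
```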
Semi-supervised learning (SSL) is a learning paradigm where some examples have a label but others do not. It can generalize better than purely supervised models when unlabeled data are plentiful. For classification, SSL essentially performs a constant extension of labels to well-separated support sets. By far the most common use of SSL is image classification.
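A minimal SSL sketch with scikit-learn's self-training wrapper; the synthetic data and the 10% labeling rate are illustrative (scikit-learn marks unlabeled points with the label -1):

```python
# Semi-supervised classification: unlabeled examples carry the label -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Pretend most labels are missing: keep only ~10% of them.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)    # uses both the labeled and the unlabeled examples
```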
Regularization techniques:
With generative models: VAE [@Kingma2014b], GAN [@Odena2016]. Latent-feature discriminative models transform the data distribution to a multivariate standard Gaussian, and generative semi-supervised models learn to separate data classes (categorical label) from intra-class variability (continuous latent variables).
C++/CUDA:
JVM (Java, Scala):
Python: scikit-learn (sklearn); Pylearn2, Theano;
R: glmnet, randomForest, gbm, e1071 (interface to libsvm), caret, and more;
Benchmark for GLM, RF, GBM: for the algorithms it supports, H2O is the fastest, and as accurate, on data with over 10M records that fit in the memory of a single machine. Benchmark for GBM.
Figure: machine learning algorithm maps: (A) scikit-learn; (B) dlib.