Machine learning is a set of techniques by which computer systems improve with experience and data. This article addresses machine learning problems and platforms; the general theory of machine learning is documented at Learning Theory.

Supervised learning builds a probabilistic model to predict or estimate an output (label) from some inputs (features): classification if the label is categorical, regression if it is quantitative. Unsupervised learning describes the relationships and structure among a set of inputs: dimensionality reduction if the outcome is continuous, clustering if it is categorical. Semi-supervised learning uses both labeled and unlabeled data to estimate the conditional probability distribution.
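
For illustration, a minimal scikit-learn sketch of the same features paired with a categorical vs. a quantitative label (the toy data below are made up):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]   # features (inputs)
y_class = [0, 0, 1, 1]             # categorical label -> classification
y_reg = [0.1, 1.9, 4.2, 5.8]       # quantitative label -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

print(clf.predict([[1.5]]))        # predicted class
print(reg.predict([[1.5]]))        # predicted value
```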

Representation learning discovers useful representations of data. Deep learning builds a multi-layer network of representations, where more abstract representations are computed from less abstract ones.

Reinforcement learning (RL) is the optimization of strategy for a given environment, where an agent (person, firm, robot, etc.) collects data by taking actions and observing rewards.

Probabilistic Modeling

Machine learning is often based on probabilistic models of a very large number of random variables, although the result may be interpreted as deterministic or probabilistic, depending on the application.

A supervised probabilistic model is a machine learning model that estimates (and evaluates) conditional probability density functions: $p(y \mid x)$. A discriminative model is a supervised probabilistic model with categorical outcome variables.

An unsupervised probabilistic model is a machine learning model that estimates (and evaluates) the joint probability distribution: $p(x)$ or $p(x, y)$.

The number of parameters in a generic function space grows exponentially in the number of variables: $N^n$ for $n$ variables that each take $N$ values. It is therefore very inefficient to estimate a high-dimensional probability distribution in a generic parametric function space. But oftentimes there is structure in the data, and we can exploit it to model the distribution in a function space with far fewer parameters.
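
As a back-of-the-envelope example (assuming binary variables, so $N = 2$): a generic joint distribution over $n$ binary variables has $2^n - 1$ free parameters, while a fully factorized (independent) model has only $n$.

```python
# Parameter counts for n binary variables (N = 2).
n = 20
generic_joint = 2**n - 1   # one probability per configuration, minus the normalization constraint
independent = n            # one parameter per variable if all variables are independent
print(generic_joint, independent)   # 1048575 vs 20
```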

Structured Probabilistic Model

A structured probabilistic model, or graphical model, is a probabilistic model defined as the product of functions of fewer variables, possibly with a normalizing constant; the factorization can be represented by a graph over the variables, with directed or undirected edges. For example, a graph with no edges represents probabilistic models with mutually independent variables. For a given factorization or graph, each function can come from a parametric family, which results in a parametric family of probabilistic models that can be very complex.

A directed graphical model is a probabilistic model defined as the product of marginal or conditional probability distributions, one for each variable; the factorization can be represented by a directed acyclic graph (DAG), where each node is conditionally dependent on its parent nodes: $p(x) = \prod_{i=1}^n p(x_i \mid x_{\to i})$, where $x_{\to i}$ is the subset of variables that point to $x_i$. Ancestral sampling from a directed graphical model is a Monte Carlo sampling method that samples the variables in an ancestral order, so that each node comes after its parent nodes. Ancestral sampling is efficient if all the marginal and conditional distributions involved are easy to sample from.
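
A minimal sketch of ancestral sampling for a toy chain $x_1 \to x_2 \to x_3$ of binary variables (the probabilities are made up for illustration):

```python
import random

def sample_chain():
    # Ancestral order: each node is sampled after its parent.
    x1 = 1 if random.random() < 0.6 else 0        # p(x1 = 1) = 0.6
    p2 = 0.9 if x1 == 1 else 0.2                  # p(x2 = 1 | x1)
    x2 = 1 if random.random() < p2 else 0
    p3 = 0.7 if x2 == 1 else 0.1                  # p(x3 = 1 | x2)
    x3 = 1 if random.random() < p3 else 0
    return x1, x2, x3

print([sample_chain() for _ in range(5)])
```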

An undirected graphical model is a probabilistic model defined as the product of non-negative functions of subsets of variables and a normalizing constant; the factorization can be represented by an undirected graph, where each clique corresponds to a function: $p(x) = \frac{1}{Z} \prod_{j=1}^m f_j(c_j)$, where $c_j$ is a clique and $Z$ is the normalizing constant. A clique in an undirected graph is a subset of nodes whose induced subgraph is complete.
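
A minimal sketch of an undirected model over three binary variables with two pairwise cliques, $\{x_1, x_2\}$ and $\{x_2, x_3\}$ (the factor values are made up); the normalizing constant $Z$ is computed by brute-force enumeration, which is only feasible for very small models:

```python
from itertools import product

def f(a, b):
    # Pairwise clique function favoring agreement between its arguments.
    return 2.0 if a == b else 1.0

def unnormalized(x1, x2, x3):
    return f(x1, x2) * f(x2, x3)

# Normalizing constant: sum of the unnormalized measure over all configurations.
Z = sum(unnormalized(*cfg) for cfg in product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

print(Z, p(1, 1, 1))
```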

Problems

Regression: linear regression, etc.
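
A minimal ordinary-least-squares sketch with NumPy (toy data made up for illustration):

```python
import numpy as np

# Toy data: y is roughly 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.shape)

# Design matrix with an intercept column; solve the least-squares problem.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # approximately [1.0, 2.0]
```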

Classification:

  • Linear classifiers (see the sketch after this list):
    • Generative model: linear discriminant analysis (LDA), naive Bayes classifier;
    • Discriminative model: logistic regression (logit), support vector machines (SVM), perceptron;
  • Isotonic regression;
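
A minimal scikit-learn sketch contrasting a generative linear classifier (naive Bayes) with a discriminative one (logistic regression) on synthetic data; the dataset and settings are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Generative model: fits p(x | y) and p(y), classifies via Bayes' rule.
gen = GaussianNB().fit(X, y)
# Discriminative model: fits p(y | x) directly.
disc = LogisticRegression().fit(X, y)

print(gen.score(X, y), disc.score(X, y))   # training accuracy of each classifier
```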

Clustering:

  • k-means clustering (see the sketch after this list);
  • hierarchical clustering (dendrogram);
  • Gaussian mixture;
  • power iteration clustering (PIC);
  • latent Dirichlet allocation (LDA);
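
A minimal k-means sketch with scikit-learn on two synthetic blobs (data and settings are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)          # roughly (0, 0) and (5, 5)
print(km.labels_[:3], km.labels_[-3:])
```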

Dimensionality Reduction:

  • Principal component analysis (PCA): find the (orthogonal) directions in a Euclidean space that successively explain the most sample variance (equivalently, minimize the residual sum of squares); see the sketch after this list;
  • Singular value decomposition (SVD);
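
A minimal sketch relating the two items above: the principal directions can be read off the SVD of the centered data matrix (toy data made up for illustration):

```python
import numpy as np

# Toy data: 100 samples, 3 features with very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.2])

# PCA via SVD: center the data; the right singular vectors are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s**2 / (len(X) - 1)

print(Vt)                   # principal directions (one per row)
print(explained_variance)   # sample variance explained by each direction
```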

Standardization of variables is often required, e.g. when variables are measured in different units.
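
A minimal standardization (z-score) sketch with NumPy, assuming columns measured in different units:

```python
import numpy as np

# Toy feature matrix: column 1 in meters, column 2 in grams.
X = np.array([[1.70, 65000.0],
              [1.80, 80000.0],
              [1.60, 55000.0]])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))   # ~0 and ~1 per column
```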

Figure: typical machine learning pipeline.

Programming Tools

C++/CUDA:

JVM (Java, Scala):

  • H2O: generalized linear models, gradient boosting machine (also supports random forest), generalized lower rank models, deep neural network;
  • Spark: MLlib (not nearly as good as H2O for these algorithms; see Benchmarks below);
  • Deeplearning4j;

Python: scikit-learn (sklearn); Pylearn2, Theano;

R: glmnet, randomForest, gbm, e1071 (interface to libsvm), caret, and more;

Benchmarks

Benchmark for GLM, RF, GBM: for the algorithms it supports, H2O is the fastest and as accurate on data of over 10M records that fit in the memory of a single machine.

Benchmark for GBM.

Figure: machine learning algorithm maps: (A) scikit-learn; (B) dlib.


🏷 Category=Computation Category=Machine Learning