This article addresses machine learning problems and platforms. The general theory of machine learning is documented at Learning Theory.

Figure: machine learning algorithm maps: (A) scikit-learn; (B) dlib.

Problems

Learning problems can be roughly categorized as either supervised or unsupervised. Supervised learning builds a statistical model to predict or estimate an output (label) from some inputs (features): classification if the label is categorical, regression if the label is quantitative. Unsupervised learning describes the relationships and structure among a set of inputs, as in dimensionality reduction and clustering. Among other areas of machine learning, reinforcement learning is concerned with maximizing the reward of a given agent (person, business, etc.).

Regression

linear regression
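
As a minimal sketch of linear regression, the following fits a line by ordinary least squares with NumPy. The synthetic data (a true slope of 2 and intercept of 1) and all variable names are assumptions made for illustration only.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2*x + 1 plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Design matrix with a column of ones so the model includes an intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find beta minimizing ||X @ beta - y||^2.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The fitted coefficients should land close to the values used to generate the data; with real data, the same call applies once the features are arranged as columns of the design matrix.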

Classification

  • Linear classifiers:
    • Generative model: linear discriminant analysis (LDA), naive Bayes classifier;
    • Discriminative model: Logistic regression (logit), support vector machines (SVM), perceptron;
  • Isotonic regression (a regression method; in classification it is typically used to calibrate predicted probabilities);
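
The generative and discriminative linear classifiers listed above can be compared with a short scikit-learn sketch. The two-class synthetic data set, the split ratio, and the variable names are assumptions chosen for illustration; both models use library defaults.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumed for illustration): two well-separated blobs.
X, y = make_blobs(
    n_samples=400,
    centers=[[-3.0, -3.0], [3.0, 3.0]],
    cluster_std=1.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Generative: LDA models each class as a Gaussian with a shared covariance matrix.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Discriminative: logistic regression models P(label | features) directly.
logit = LogisticRegression().fit(X_train, y_train)

lda_acc = lda.score(X_test, y_test)
logit_acc = logit.score(X_test, y_test)
print(f"LDA accuracy: {lda_acc:.3f}")
print(f"Logistic regression accuracy: {logit_acc:.3f}")
```

Both models produce a linear decision boundary; they differ in what is modeled (the class-conditional distribution of the features versus the conditional probability of the label).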

Clustering

  • k-means clustering;
  • hierarchical clustering (dendrogram);
  • Gaussian mixture;
  • power iteration clustering (PIC);
  • latent Dirichlet allocation (LDA);

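Features measured in different units should be standardized before clustering: distance-based methods such as k-means are otherwise dominated by whichever feature has the largest numeric scale.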
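
The effect of standardization on k-means can be sketched with scikit-learn as follows. The data are hypothetical: one small-scale feature separates two groups, while a second feature is pure noise on a much larger scale. Agreement with the true groups is measured by the adjusted Rand index, a metric chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100  # points per group

# Hypothetical data: column 0 separates the two groups on a small scale;
# column 1 is noise unrelated to the groups, on a much larger scale.
group = np.repeat([0, 1], n)
informative = np.where(group == 0, 0.0, 1.0) + rng.normal(scale=0.1, size=2 * n)
noise = rng.normal(scale=1000.0, size=2 * n)
X = np.column_stack([informative, noise])

# k-means on the raw features versus k-means after standardization.
raw = KMeans(n_clusters=2, n_init=10, random_state=0)
scaled = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

ari_raw = adjusted_rand_score(group, raw.fit_predict(X))
ari_scaled = adjusted_rand_score(group, scaled.fit_predict(X))
print(f"adjusted Rand index, raw features:          {ari_raw:.2f}")
print(f"adjusted Rand index, standardized features: {ari_scaled:.2f}")
```

On the raw features the clusters follow the large-scale noise column, so agreement with the true groups is poor; after standardization the informative column carries comparable weight and the groups are recovered.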

Dimensionality Reduction

  • Principal component analysis (PCA): find the (orthogonal) directions in a Euclidean space that successively explain the most sample variance (minimize the residual sum of squares);
  • Singular value decomposition (SVD);
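
The link between PCA and SVD can be sketched in NumPy: the right singular vectors of the centered data matrix are the principal directions, and the squared singular values give the variance explained by each. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 points in 3-D, stretched mostly along the first axis.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# PCA operates on centered data.
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(s) @ Vt. The rows of Vt are orthonormal
# principal directions, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Sample variance explained by each component, and its share of the total.
explained_variance = s**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the data onto the first two principal components.
scores = Xc @ Vt[:2].T

print("explained variance ratio:", np.round(explained_ratio, 3))
print("scores shape:", scores.shape)
```

Keeping only the leading components discards the directions with the least sample variance, which is the sense in which the projection minimizes the residual sum of squares.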

Figure: typical ML pipeline.

Programming Tools

C++/CUDA:

JVM (Java, Scala):

  • H2O: generalized linear models, gradient boosting machine (also supports random forest), generalized lower rank models, deep neural network;
  • Spark: MLlib (not nearly as good as H2O);
  • Deeplearning4j;

Python: scikit-learn (imported as sklearn), Pylearn2, Theano;

R: glmnet, randomForest, gbm, e1071 (interface to libsvm), caret, and more;

Benchmarks

Benchmark for GLM, RF, GBM: for the algorithms it supports, H2O is the fastest, and as accurate as the alternatives, on data sets of over 10M records that fit in the memory of a single machine. See also: Benchmark for GBM.

