Note: Contents in bold are included in Coursera Machine Learning lectures. A few topics are not identified: regularized regression, neural networks, and anomaly detection.

  • Feature extraction and transformation
  • Basic statistics: summary statistics, correlations, hypothesis testing
  • Anomaly detection: k-NN (k-Nearest Neighbors)
  • Neural networks: perceptron, convolutional neural network
  • Optimization: stochastic gradient descent, limited-memory BFGS (L-BFGS, Broyden–Fletcher–Goldfarb–Shanno)

Figure: scikit-learn machine learning algorithm map. dlib has an alternative map.

Problems

Learning problems can be roughly categorized as either supervised or unsupervised. Supervised learning builds a statistical model to predict or estimate an output (label) from some inputs: classification if the label is categorical, regression if the label is quantitative. Unsupervised learning describes the relationships and structure among a set of inputs, for example through dimensionality reduction or clustering.

Other areas of machine learning: reinforcement learning is concerned with maximizing the cumulative reward of a given agent (person, business, etc.).

Regression

  • linear regression;
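
As a minimal sketch of linear regression with a single feature, the snippet below fits a line by ordinary least squares using only the Python standard library. The function name `fit_line` and the toy data are illustrative assumptions, not part of any library listed on this page.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sample covariance of (x, y) divided by sample variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 1 + 2x (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
```

The sketch assumes the x values are not all identical (otherwise the denominator is zero). For several features or regularization, a package such as glmnet or scikit-learn would be the usual choice.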

Classification

  • Linear classifiers:
    • Generative model: linear discriminant analysis (LDA), naive Bayes classifier;
    • Discriminative model: logistic regression (logit), support vector machines (SVM), perceptron;
  • Isotonic regression (strictly a regression method, often used to calibrate classifier scores);
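
To make the idea of a linear discriminative classifier concrete, here is a rough sketch of the classic perceptron update rule in plain Python. The function names, the learning rate, the epoch limit, and the toy points are all assumptions chosen for illustration; labels are encoded as -1 and +1.

```python
def train_perceptron(points, labels, epochs=20, lr=1.0):
    """Learn weights w and bias b so that sign(w . x + b) matches the labels."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable toy data.
points = [(2.0, 1.0), (3.0, 2.0), (2.5, 3.0), (-1.0, -1.0), (-2.0, 0.0), (-1.5, -2.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(points, labels)
predictions = [predict(w, b, x) for x in points]
```

The perceptron only settles on a separating hyperplane when the classes are linearly separable; on non-separable data this sketch simply stops after `epochs` passes.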

Clustering

  • k-means clustering;
  • hierarchical clustering (dendrogram);
  • Gaussian mixture;
  • power iteration clustering (PIC);
  • latent Dirichlet allocation (LDA);

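Standardization is required when features are measured in different units; otherwise distance-based methods such as k-means are dominated by the features with the largest numeric scale.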
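
A small sketch of z-score standardization, assuming each row is one observation and each column one feature. The function name `standardize` and the height/income values are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]  # assumes no column is constant
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


# Hypothetical data: height in metres, income in dollars.
rows = [[1.60, 30000.0], [1.70, 50000.0], [1.80, 70000.0]]
scaled = standardize(rows)
```

Before scaling, Euclidean distances between rows are determined almost entirely by the income column; after scaling, both features contribute on a comparable footing.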

Dimensionality Reduction

  • singular value decomposition (SVD);
  • principal component analysis (PCA): find the direction (or orthogonal directions) in a Euclidean space that explain the most sample variance (minimize the residual sum of squares);
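
The PCA description above can be sketched without any numerical library: center the data, form the sample covariance matrix, and find its leading eigenvector by power iteration. Power iteration is a swapped-in technique chosen here for brevity (libraries typically use an SVD), and the function name, iteration count, and toy data are assumptions.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return a unit vector along the direction of largest sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting guess; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))  # assumes nonzero variance
        v = [x / norm for x in w]
    return v


# Hypothetical points scattered close to the line y = x.
rows = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
pc1 = first_principal_component(rows)
```

For these points the returned direction lies close to the diagonal, which is the line that minimizes the residual sum of squares of the perpendicular projections.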

Figure: Typical ML pipeline.

Programming Tools

Machine Learning:

  • R: glmnet, randomForest, gbm, e1071 (interface to libsvm), caret, and more.
  • Python: scikit-learn (imported as sklearn)
  • H2O: GLM (generalized linear models), GBM (gradient boosting machine; also supports random forest), GLRM (generalized low rank models), deep neural network.
  • xgboost: Gradient boosting machine.
  • Vowpal Wabbit
  • Spark: MLlib

H2O scales best (fastest, with no loss of accuracy) for the algorithms it supports on data sets of more than ~10M records, as long as the data fits in the memory of a single machine. (Benchmark for GLM, RF, GBM)

Deep Learning:

  • Python: Pylearn2, Theano
  • Java: Deeplearning4j
  • C++/CUDA: Caffe, cuda-convnet2
  • TensorFlow
