Data Mining (DM) is the ad hoc application of Machine Learning (ML) algorithms to extract knowledge or patterns from apparently unstructured data. To apply ML algorithms to DM, one has to abstract the problem in one's domain into a set of features.
Figure: MMDS Course Overview [@Leskovec2014]
Example data mining workflow in Apache Spark:
Distributed file systems (DFS) and MapReduce: tools for creating parallel algorithms.
Search engine technologies: PageRank, link-spam detection, hubs-and-authorities.
Graph mining: social network graphs.
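As an illustration of the link-analysis topics listed above, here is a minimal sketch of PageRank by power iteration. It assumes a small graph held in memory as a dense adjacency matrix, which is not how a distributed implementation would store it. The damping factor of 0.85, the function name `pagerank`, and the three-page toy graph are assumptions made for this example.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank; adjacency[i, j] = 1 means page i links to page j."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1, keepdims=True)
    # Row-normalise out-links; pages without out-links jump uniformly at random.
    row_stochastic = np.where(out_degree > 0, A / np.maximum(out_degree, 1.0), 1.0 / n)
    M = row_stochastic.T  # column-stochastic: M[j, i] = P(move from i to j)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Follow a link with probability `damping`, otherwise teleport uniformly.
        new_rank = damping * (M @ rank) + (1.0 - damping) / n
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
links = [[0, 1, 1],
         [0, 0, 1],
         [1, 0, 0]]
ranks = pagerank(links)
print(ranks)
```

Page 2 receives links from both other pages, so it ends up with the largest score; page 0 inherits most of page 2's score through its single out-link.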
An item is an elementary object; a basket is a set of items, and any set of items is called an itemset. The frequent itemsets problem is to find the itemsets that appear in at least a given threshold number of baskets (the support threshold).
Algorithms: association rules, the market-basket model, the A-Priori algorithm and its improvements, the FP-growth (frequent-pattern growth) algorithm.
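The level-wise idea behind A-Priori can be sketched in a few lines: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent itemsets of size k-1. The sketch below is an in-memory illustration, not a tuned or distributed implementation. The function name `apriori`, the parameter `min_support` (an absolute basket count), and the sample baskets are assumptions for this example.

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, min_support):
    """Return {itemset (frozenset): support count} for every frequent itemset."""
    baskets = [frozenset(b) for b in baskets]
    # Pass 1: count individual items and keep the frequent ones.
    item_counts = Counter(item for b in baskets for item in b)
    current = {frozenset([i]): c for i, c in item_counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Candidates: size-k unions of frequent (k-1)-itemsets whose every
        # (k-1)-subset is itself frequent (monotonicity pruning).
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Pass k: count only the surviving candidates.
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical baskets for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a support threshold of 3, the pair {beer, diapers} is frequent, while the triple {bread, milk, diapers} is counted as a candidate but appears in only two baskets and is dropped.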
Recommendation systems make recommendations based on previously collected data. One common technique is collaborative filtering, for which alternating least squares (ALS) is a widely used matrix-factorization algorithm.
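To make the ALS idea concrete, the following is a minimal NumPy sketch, assuming explicit ratings in a small dense matrix where 0 marks a missing rating. It alternates between solving a ridge regression for each user with the item factors fixed, and for each item with the user factors fixed. The names `als` and `regularized_loss`, the rank, the regularization weight, and the ratings matrix are all assumptions chosen for illustration; this is not the Spark MLlib implementation.

```python
import numpy as np

def regularized_loss(ratings, mask, U, V, reg):
    """Squared error on observed entries plus an L2 penalty on both factors."""
    err = (ratings - U @ V.T)[mask]
    return float(err @ err + reg * (np.sum(U * U) + np.sum(V * V)))

def als(ratings, mask, rank=2, reg=0.1, n_iter=20, seed=0):
    """Factor ratings ~ U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    penalty = reg * np.eye(rank)
    for _ in range(n_iter):
        # Fix V: each user's factors solve a small ridge regression.
        for u in range(n_users):
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + penalty, Vu.T @ ratings[u, mask[u]])
        # Fix U: each item's factors solve a small ridge regression.
        for i in range(n_items):
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + penalty, Ui.T @ ratings[mask[:, i], i])
    return U, V

# Hypothetical user-by-item ratings; 0 means "not rated".
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
mask = ratings > 0

U, V = als(ratings, mask)
predicted = U @ V.T  # includes predictions for the unrated cells
print(np.round(predicted, 2))
print(regularized_loss(ratings, mask, U, V, reg=0.1))
```

Each half-step minimizes the regularized loss exactly over one factor matrix, so the loss cannot increase from one iteration to the next.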
Finding similar documents: minhashing and locality-sensitive hashing.
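A minimal sketch of minhashing follows: each document becomes a set of character shingles, and the fraction of positions where two minhash signatures agree estimates the Jaccard similarity of the sets. The sketch assumes 3-character shingles, 256 random linear hash functions, and CRC32 as a stable way to map shingles to integers; these choices and all function names are assumptions for this example. The banding step of locality-sensitive hashing is not shown.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, larger than any 32-bit CRC value

def shingles(text, k=3):
    """Set of all k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def make_hash_params(n_hashes, seed=42):
    """Random (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n_hashes)]

def minhash_signature(items, hash_params):
    """For each hash function, keep the minimum hash value over the set."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Hypothetical documents for illustration.
doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over a lazy dog")

params = make_hash_params(256)
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)

exact = jaccard(doc_a, doc_b)
estimate = estimate_jaccard(sig_a, sig_b)
print(f"exact Jaccard: {exact:.3f}, minhash estimate: {estimate:.3f}")
```

The signatures have a fixed length regardless of document size, which is what makes comparing many document pairs affordable.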
Singular-value decomposition (SVD), latent semantic indexing.
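As a small illustration, a truncated SVD keeps only the k largest singular values of a matrix. Applied to a term-document matrix, this is the core step of latent semantic indexing. The sketch below uses NumPy's `np.linalg.svd`; the helper name `truncated_svd`, the choice k = 2, and the tiny term-document matrix are assumptions for this example.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Return the k leading singular triplets (U_k, s_k, Vt_k) of matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical term-document matrix: rows are terms, columns are documents.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k          # best rank-2 approximation of A
doc_coords = (np.diag(s_k) @ Vt_k).T     # each document as a point in 2-D concept space

print(np.round(s_k, 3))
print(np.round(doc_coords, 3))
print("approximation error:", np.linalg.norm(A - A_k))
```

The Frobenius-norm error of the rank-k approximation equals the square root of the sum of the squared singular values that were dropped.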
Perceptrons.
Support-vector machines (SVM): gradient descent, stochastic gradient descent, limited-memory BFGS (L-BFGS).
Nearest neighbor.
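Of the learning methods listed above, the perceptron is the simplest to sketch: whenever a training point is misclassified, the weight vector is nudged toward the correct side. The sketch assumes labels in {-1, +1} and linearly separable data. The function names, the learning rate, and the toy AND-style dataset are assumptions for this example.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Perceptron learning rule; y must contain -1 and +1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # A point is misclassified (or on the boundary) when yi * score <= 0.
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: training has converged
            break
    return w, b

def predict(X, w, b):
    """Classify each row of X as +1 or -1."""
    return np.where(X @ w + b > 0, 1, -1)

# Hypothetical toy data: logical AND with labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
print("predictions:", predict(X, w, b))
```

On data that is not linearly separable the loop simply stops after `epochs` passes, which is one reason margin-based methods such as SVMs are often preferred.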
Software Platforms:
Figure: Marketing workflow on KNIME Analytics Platform