Data Mining (DM) is the ad hoc application of Machine Learning (ML) algorithms to extract knowledge or patterns from apparently unstructured data. To use ML algorithms for DM, one has to abstract the domain problem into a set of features.
Figure: MMDS Course Overview [@Leskovec2014]
Example data mining workflow in Apache Spark:
Distributed file systems (DFS) and MapReduce: tools for creating parallel algorithms.
Search engine technologies: PageRank, link-spam detection, hubs-and-authorities.
Graph mining: social network graphs.
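As an illustration of the PageRank idea listed above, here is a minimal power-iteration sketch in plain Python. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the three-page toy graph are assumptions made for this example; they are not taken from the course material or from any particular library.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives the same "teleport" share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical toy web graph: a -> b, a -> c, b -> c, c -> a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

On this toy graph, page `c` collects links from both other pages and ends up with the highest rank, followed by `a` and then `b`.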
An item is an elementary object; a basket is a set of items, aka an itemset. The frequent itemsets problem is to find itemsets that appear in many baskets.
Algorithms: association rules, market-baskets, A-Priori Algorithm and its improvements, FP-growth (frequent pattern) algorithm.
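The A-Priori idea can be sketched in a few lines: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent itemsets of size k-1. The code below is a minimal in-memory sketch under that assumption; the function name `apriori`, the `min_support` parameter (an absolute basket count), and the grocery baskets are hypothetical, and the memory-saving refinements mentioned above are not included.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: count} for all itemsets found in at least min_support baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count individual items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets whose
        # (k-1)-subsets are all frequent (the monotonicity pruning step).
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Pass k: count how many baskets contain each candidate.
        counts = {c: sum(1 for basket in baskets if c <= basket) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

With a support threshold of 3, the pair {diapers, beer} is frequent, while the triple {bread, milk, diapers} occurs in only two baskets and is dropped.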
Recommendation systems make recommendations based on previously collected data. One common technique is collaborative filtering, for example by matrix factorization fitted with alternating least squares (ALS).
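To make the "alternating" part of ALS concrete, the sketch below fits a deliberately simplified rank-1 model, where each rating is approximated by the product of one user factor and one item factor. With one side fixed, each factor on the other side has a closed-form regularized least-squares solution. The rank-1 restriction, the names `als_rank1` and `squared_error`, the regularization value, and the synthetic ratings are all assumptions for illustration; production systems use higher ranks and solve a small linear system per user and item.

```python
def als_rank1(ratings, num_users, num_items, reg=0.1, iterations=20):
    """Rank-1 ALS on observed ratings.

    ratings: dict {(user, item): value} holding only the observed entries.
    Returns (u, v) such that ratings[(i, j)] is approximated by u[i] * v[j].
    """
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors; each user factor has a closed-form solution.
        for i in range(num_users):
            num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
            den = reg + sum(v[j] ** 2 for (ui, j) in ratings if ui == i)
            u[i] = num / den
        # Fix the user factors; solve for each item factor the same way.
        for j in range(num_items):
            num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
            den = reg + sum(u[i] ** 2 for (i, ij) in ratings if ij == j)
            v[j] = num / den
    return u, v


def squared_error(ratings, u, v):
    """Sum of squared errors over the observed ratings."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())


# Synthetic, exactly rank-1 ratings with two entries held out as "unobserved".
true_u = [1.0, 2.0, 3.0]
true_v = [1.0, 2.0, 1.5]
held_out = {(0, 1), (2, 2)}
ratings = {
    (i, j): true_u[i] * true_v[j]
    for i in range(3)
    for j in range(3)
    if (i, j) not in held_out
}
u, v = als_rank1(ratings, num_users=3, num_items=3, reg=0.01, iterations=50)
predicted = u[2] * v[2]  # estimate for the held-out entry (2, 2)
```

The held-out entry plays the role of a recommendation: the model never sees it, yet the fitted factors give an estimate for it.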
Finding Similar Documents: minhashing and locality-sensitive hashing.
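The minhashing step can be sketched as follows: each document becomes a set of character shingles, and its signature keeps, for each of several random hash functions, the minimum hash value over that set. The fraction of signature positions on which two documents agree estimates their Jaccard similarity. The shingle length, the number of hash functions, the linear hash family, and the sample sentences are assumptions chosen for this example; the locality-sensitive hashing (banding) step is not shown.

```python
import random
import zlib

PRIME = 4294967311  # a prime just above 2**32, the range of crc32 values


def shingles(text, k=5):
    """Set of k-character shingles of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def make_hash_params(num_hashes, seed=0):
    """Random (a, b) pairs defining hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the set."""
    codes = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in codes) for a, b in params]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)


# Hypothetical documents: two near-duplicates and one unrelated sentence.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
doc_c = "data mining extracts patterns from large data sets"

params = make_hash_params(num_hashes=128)
sig_a, sig_b, sig_c = (minhash_signature(shingles(d), params) for d in (doc_a, doc_b, doc_c))

similar = estimate_jaccard(sig_a, sig_b)
dissimilar = estimate_jaccard(sig_a, sig_c)
```

Signatures have a fixed length regardless of document size, which is what makes comparing many document pairs affordable.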
Singular-value decomposition (SVD); CUR matrix approximation (CUR: column, U, row), e.g. ALGORITHMCUR [@Mahoney2009]; latent semantic indexing.
Although the SVD gives the theoretically best low-rank approximation of any $n \times m$ data matrix, computing it takes cubic time, $O(nm\min\{n,m\})$. CUR takes quadratic time, $O(nm)$, to achieve an approximation that is comparable with high probability.
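The sense in which the SVD is "best" can be stated precisely. The notation below ($A$, $\sigma_i$, $u_i$, $v_i$, $A_k$, $c$) is introduced here for illustration. Writing the SVD of the data matrix as $A = \sum_i \sigma_i u_i v_i^T$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$, and keeping only the $k$ largest singular values to form $A_k$, the Eckart-Young theorem gives:

$$
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \qquad
\|A - A_k\|_F = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F = \sqrt{\sum_{i > k} \sigma_i^2}
$$

Guarantees for CUR are typically stated relative to this optimum, in the form $\|A - CUR\|_F \le c \, \|A - A_k\|_F$ holding with high probability for a modest constant $c$. The exact constant and probability depend on the sampling scheme; see [@Mahoney2009] for the precise statement.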
Perceptrons.
Support-vector machines (SVM): gradient descent, stochastic gradient descent, limited-memory BFGS (L-BFGS).
Nearest neighbor.
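Of the classifiers listed above, the perceptron is the simplest to write down: it keeps a weight vector and a bias, and whenever a training point is misclassified it nudges both toward that point's label. The sketch below assumes labels in {-1, +1} and linearly separable data; the function names, the learning rate, the epoch limit, and the six training points are hypothetical.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Train a perceptron; labels must be -1 or +1. Returns (weights, bias)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:
                # Misclassified (or on the boundary): move the boundary toward x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical linearly separable training data.
samples = [(2, 1), (3, 4), (4, 3), (-1, -2), (-3, -1), (-2, -4)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

On data that is not linearly separable the update loop never reaches zero mistakes, which is why the sketch caps the number of epochs.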
Software Platforms:
Figure: Marketing workflow on KNIME Analytics Platform