Data mining (DM) is the ad hoc application of machine learning (ML) algorithms to extract knowledge or patterns from apparently unstructured data. To use ML algorithms for DM, one must first abstract the domain problem into a set of features.

Figure: MMDS Course Overview {Leskovec2014}

Example data mining workflow in Apache Spark:

  1. HDFS read: load Tweets as raw JSON
  2. SQL
    1. prepare: convert to a SQL table with an automatically detected schema
    2. query: SQL querying (summary statistics)
  3. Streaming ML
    1. feature extraction: convert tweet content to feature vector
    2. train: machine learning (k-means cluster training)
    3. apply: save the model to file, then use it to cluster the stream
  4. HDFS write

Handling Out-of-Core Data

Distributed file systems (DFS) and MapReduce: tools for creating parallel algorithms.
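
As a rough illustration of the MapReduce programming model, the sketch below runs the map, shuffle, and reduce phases of a word count in a single process. The function names and the sample documents are hypothetical; a real framework would distribute each phase across machines and read its input from a DFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big ideas", "data mining"])
```

Because each map call sees one document and each reduce call sees one key, both phases can in principle run in parallel without shared state.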

Graph Data

Search engine technologies: PageRank, link-spam detection, hubs-and-authorities.
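
A minimal sketch of PageRank by power iteration, assuming a small in-memory graph. The damping factor `beta`, the iteration count, the treatment of dead ends (their rank is spread evenly over all nodes), and the three-node example graph are all assumptions made for illustration.

```python
def pagerank(links, beta=0.85, iterations=50):
    """links maps each node to the list of nodes it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among its out-links.
                share = beta * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: spread its rank evenly over all nodes.
                share = beta * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this example graph, node "c" is linked from both other nodes and ends up with the highest rank, while "b" receives only half of "a"'s rank and ends up lowest.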

Graph mining: social network graphs.

Stream Data

  1. Stream processing: dealing with data that arrives so fast it must be processed immediately or lost.
  2. Web advertising
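
One way to picture stream processing is that each element is seen exactly once and only a small summary is kept. The sketch below maintains a running mean and variance with Welford's update, a technique chosen here for illustration and not named above. The `readings` generator is a hypothetical stand-in for a real stream.

```python
class RunningStats:
    """Constant-memory summary of a stream, updated one element at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the elements seen so far."""
        return self._m2 / self.count if self.count else 0.0


def readings():
    """Hypothetical stream; a generator, so elements cannot be revisited."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for value in readings():
    stats.update(value)
```

The summary never stores the elements themselves, so memory use stays constant no matter how long the stream runs.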

Discovering Frequent Itemsets

An item is an elementary object; a basket is a set of items, that is, an itemset. The frequent-itemsets problem is to find the itemsets that appear in many baskets.

Topics and algorithms: association rules, the market-basket model, the A-Priori algorithm and its improvements, and the FP-growth (frequent-pattern growth) algorithm.
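
A minimal sketch of the two-pass A-Priori idea, restricted to frequent pairs. The function name, the support threshold, and the sample baskets are hypothetical. The key step is that the second pass counts only pairs whose items were both frequent in the first pass.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return pairs of items that occur together in at least min_support baskets."""
    # Pass 1: count individual items and keep the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= min_support}

    # Pass 2: count only candidate pairs built from frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
pairs = frequent_pairs(baskets, min_support=3)
```

With these baskets every item is frequent on its own, but only the pair ("bread", "milk") appears together in three baskets.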

Clustering

Similarity search: minhashing and locality-sensitive hashing.
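
The sketch below illustrates minhashing only (not the banding step of locality-sensitive hashing). It estimates the Jaccard similarity of two shingle sets from the fraction of positions where their minhash signatures agree. The seeded SHA-1 hash family, the shingle length, and the signature size of 200 are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, shingle):
    # One hash function per seed; SHA-1 keeps results stable across runs.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=200):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    return len(a & b) / len(a | b)


a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
exact = jaccard(a, b)
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

The estimate approaches the exact Jaccard similarity as the number of hash functions grows, while each signature stays much smaller than the set it summarizes.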

Dimensionality Reduction

Singular-value decomposition (SVD), latent semantic indexing.

Machine Learning

Perceptrons.
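
A minimal sketch of the classic perceptron update rule, assuming labels of +1 and -1 and linearly separable data. The function names, the learning rate, and the toy dataset (the logical AND function) are hypothetical.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Learn weights w and bias b; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:
                # Misclassified (or on the boundary): move toward the example.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Logical AND with +1/-1 labels, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
```

On data that is not linearly separable this loop never reaches zero mistakes and simply stops after `epochs` passes.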

Support-vector machines (SVM), trained with optimization methods such as gradient descent, stochastic gradient descent, and limited-memory BFGS (L-BFGS).

Nearest neighbor.
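
A minimal sketch of k-nearest-neighbor classification by brute-force search, assuming Euclidean distance and majority voting. The function name, the value of `k`, and the training points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; returns the majority label."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.1, 5.0))
```

Sorting the whole training set costs time proportional to its size for every query, which is why large-scale settings tend to rely on approximate methods such as locality-sensitive hashing.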

Recommendation systems: making recommendations based on previously collected data.
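
One common way to recommend from previously collected data is user-based collaborative filtering, sketched below under simple assumptions: ratings are held in nested dictionaries, users are compared by cosine similarity, and unseen items are scored by similarity-weighted ratings. The function names, users, and ratings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse rating dictionaries."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


ratings = {
    "ann": {"alien": 5, "blade runner": 4},
    "bob": {"alien": 5, "blade runner": 5, "dune": 5},
    "cat": {"amelie": 5, "notting hill": 4, "alien": 1},
}
suggestions = recommend(ratings, "ann")
```

Here "ann" rates films much like "bob" does, so the item that only "bob" has rated outranks the items rated by the less similar "cat".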

Resources

Software Platforms:

  • Orange: component-based data mining and machine learning software suite, visual programming and Python bindings.
  • RapidMiner: a leading open-source system for knowledge discovery and data mining, with a proprietary professional edition.
  • KNIME Analytics Platform: extensible open-source data mining platform implementing the data pipelining paradigm (based on Eclipse).

Figure: Marketing workflow on KNIME Analytics Platform