Distance Measures

LS families of Hash functions S-curves

  1. Jaccard distance of sets = 1 - Jaccard similarity
  2. Cosine distance of vectors = angle between the vectors
    • conveniently substituted with the cosine value
  3. Edit distance of strings = number of inserts and deletes to change one string into the other.
    • equivalent definition: |x|+|y|-2*|LCS(x,y)|, where LCS is the longest common subsequence.
  4. Hamming distance of bit vectors with the same length = the number of bits where they differ.

Theory

Pipeline: Shingling -> Minhashing -> Locality-Sensitive Hashing

Shingling

k-shingle: a sequence of k characters that appears in the document.

  • May hash long shingles to tokens.

Minhashing

When documents are represented as shingle sets, an intuitive definition of document similarity is Jaccard similarity: Sim(C1,C2) == |C1vC2| / |C1^C2|.

If represent shingles -> dummy variables [billions of Booleans], then shingle sets -> Boolean Matrix (Sparse), and Jaccard similarity -> Column similarity: Sim(C1,C2) == # of (1,1) / [# of (1,1) (1,0) and (0,1)]

Minhash function: minhash value h(C) of a Boolean vector C is the first row with value 1, after a specific permutation of rows.

  • Property: P{ h(C1)=h(C2) } = Sim(C1, C2)
  • Signature: the vector of hash values under independent hash functions. [hundreds of integers]

Implementation without actual permutation: iterate through all rows of a Boolean vector for hash values, record the minimum hash value ("row number after permutation") where row value is 1.

Locality-Sensitive Hashing (LSH)

Comparing every pair of document signatures for millions of documents is overwhelming. LSH generates a short list of pairs that actually needs similarity check. [small false positives, negligible false negatives.]

Candidate pairs: a pair of signature vectors where the fraction of agreeing rows exceeds a similarity threshold t.

Locality-sensitive hashing:

  1. divide signature rows into b bands of r rows;
  2. hash each band to a hash table with as many buckets as possible (so that bucket sharing means identical values in band);
  3. if a pair of vectors falls in the same bucket in any band, it is seen as a candidate pair.

Property: Probability of signature vectors with similarity s hashed into the same bucket at least in one band: P(s) = 1-(1-s^r)^b.

For acceptable false positives and negatives, similarity threshold can be chosen as t ~ (1/b)^{1/r}.

Applications

  • entity resolution (deduplication)
    • matching customer records
  • fingerprint matching
  • similar news articles

🏷 Category=Computation Category=Data Mining