LS families of Hash functions S-curves
|x|+|y|-2*|LCS(x,y)|
, where LCS is the longest common subsequence.Pipeline: Shingling -> Minhashing -> Locality-Sensitive Hashing
k-shingle: a sequence of k characters that appears in the document.
When documents are represented as shingle sets, an intuitive definition of document similarity is Jaccard similarity: Sim(C1,C2) == |C1vC2| / |C1^C2|.
If represent shingles -> dummy variables [billions of Booleans], then shingle sets -> Boolean Matrix (Sparse), and Jaccard similarity -> Column similarity: Sim(C1,C2) == # of (1,1) / [# of (1,1) (1,0) and (0,1)]
Minhash function: minhash value h(C) of a Boolean vector C is the first row with value 1, after a specific permutation of rows.
Implementation without actual permutation: iterate through all rows of a Boolean vector for hash values, record the minimum hash value ("row number after permutation") where row value is 1.
Comparing every pair of document signatures for millions of documents is overwhelming. LSH generates a short list of pairs that actually needs similarity check. [small false positives, negligible false negatives.]
Candidate pairs: a pair of signature vectors where the fraction of agreeing rows exceeds a similarity threshold t
.
Locality-sensitive hashing:
b
bands of r
rows;Property: Probability of signature vectors with similarity s
hashed into the same bucket at least in one band: P(s) = 1-(1-s^r)^b
.
For acceptable false positives and negatives, similarity threshold can be chosen as t ~ (1/b)^{1/r}
.