Locality Sensitive Hashing
Locality Sensitive Hashing#
Minhashing a characteristic matrix (M) of document shingles effectively addresses the component of large-scale document similarity tasks that impacted by a high dimensionality of unique tokens used to characterize a document. This is done by compressing documents into much smaller signature matrices (S)–the manner by which this is done depends on how the two main stages (i.e., n-permutations and sampling of M) are done. Common choices of permuation are: (1) a true permutation of all k rows in a characteristic matrix, (2) a hashing of k row values into a new “permutation”–each resulting to tradeoffs between the quality similarity approximations, and computational complexity. The same tradeoffs are present in choosing a sampling step: (1) taking all k rows of of the characterisitic matrix, (2) taking only m of k rows of the characteristic matrix, or (3) dividing k into groups of m rows and applying the permutation step in each group.
Despite gains argued by Minhashing, it may still be a challenge to find the pairs with the greatest similarity efficiently. Suppose we have
Note: The above approximation pertains to many-to-many similarity search problems. A one-to-many version at best is
To address the approximate quadratic
In the next few sections, we shall:
Discuss the theory of locality-sensitive functions
Illustrate the use of such functions in the context of documents Jaccard similarity.