Locality Sensitive Hashing
Locality Sensitive Hashing#
Minhashing a characteristic matrix (M) of document shingles effectively addresses the component of large-scale document similarity tasks that impacted by a high dimensionality of unique tokens used to characterize a document. This is done by compressing documents into much smaller signature matrices (S)–the manner by which this is done depends on how the two main stages (i.e., n-permutations and sampling of M) are done. Common choices of permuation are: (1) a true permutation of all k rows in a characteristic matrix, (2) a hashing of k row values into a new “permutation”–each resulting to tradeoffs between the quality similarity approximations, and computational complexity. The same tradeoffs are present in choosing a sampling step: (1) taking all k rows of of the characterisitic matrix, (2) taking only m of k rows of the characteristic matrix, or (3) dividing k into groups of m rows and applying the permutation step in each group.
Despite gains argued by Minhashing, it may still be a challenge to find the pairs with the greatest similarity efficiently. Suppose we have \(n\) documents and want to find all similar documents in such a corpus. We take all pairwise combinations of \(n\) documents–resulting to \(n \choose 2\) interactions which approximates to a \(O(n^2)\) complexity for very high values of \(n\). See below:
Note: The above approximation pertains to many-to-many similarity search problems. A one-to-many version at best is \(O(n)\) in time complexity.
To address the approximate quadratic \(O(n^2)\) complexity–or in the case of a one-to-many comparison a linear \(O(n)\) complexity, we can instead focus on documents (i.e. sets, items) that are most likely similar–known as candidate pairs. This is what locality-sensitive hashing (a.k.a. nearest-neighbor search) proposes–a bucketing together of likely similar objects to allow for a localized (i.e., selective) comparison.
In the next few sections, we shall:
Discuss the theory of locality-sensitive functions
Illustrate the use of such functions in the context of documents Jaccard similarity.