Theory of LSF#

While this chapter focuses heavily on document similarity–that is, documents expressed as a space of sets of k-shingles therefore making Jaccard distance the appropriate metric, the concepts can be adapted to a different space and/or a different metric for distance (e.g., Euclidian space and distance). The choice of space and, hence, distance measure are dependent on the similarity taks being solved. The core requirements for a family of functions to be considered locality-sensitive are:

  1. The functions must be statistically independent–that is, the probability for some response to be given by two or more functions \(P(A \cap B)\) can be estimated to be the product of the probabilities of each of the functions yielding that response, \(P(A)P(B)\).

  2. The functions must–with high probability–make similar pairs to be candidate pairs as opposed to dissimilar pairs.

  3. The functions must be efficient in two aspects:

    • candidate pairs are identified faster than an exhaustive comparison of all pairs.

    • functions can be combined (i.e., according to som OR- and/or AND-construction) to reduce false positives and false negatives.

Furthermore, quoting directly from [RU11], if we let “\(d_1 < d_2\) be two distances according to some distance measure d. A family \(F\) of functions is said to be \((d_1, d_2, p_1, p_2)\)-sensitive if for every \(f\) in \(F\):

  1. If \(d(x, y) ≤ d_1\), then the probability that \(f(x) = f(y)\) is at least \(p_1\).

  2. If \(d(x, y) ≥ d_2\), then the probability that \(f(x) = f(y)\) is at most \(p_2\).

Now, we recall that the minhash family of functions estimate the Jaccard similarity–which is the probability that two sets are similar, \(h(S_1) = h(S_2)\), given their items. Relating it to the above definition, we can view such similarity as to be indicative of two sets being close in distance on a space of sets. In such a space, the probability that two sets are similar–that is, their Jaccard similarity, \(\text{SIM}(S_1, S_2)\) can be transformed into the Jaccard distance measure by \(1 - \text{SIM}(S_1, S_2)\). Hence, the minhash family of functions is said to be \((d_1, d_2, 1 - d_1, 1 - d_2)\)-sensitive where \(d\) is the Jaccard distance.