{ "cells": [ { "cell_type": "markdown", "id": "c0cdff9b", "metadata": {}, "source": [ "# Theory of LSF" ] }, { "cell_type": "markdown", "id": "bfc85e26", "metadata": {}, "source": [ "While this chapter focuses heavily on document similarity--that is, documents expressed as a space of sets of k-shingles therefore making Jaccard distance the appropriate metric, the concepts can be adapted to a different space and/or a different metric for distance (e.g., Euclidian space and distance). The choice of space and, hence, distance measure are dependent on the similarity taks being solved. The core requirements for a family of functions to be considered locality-sensitive are:\n", "\n", "1. The functions must be statistically independent--that is, the probability for some response to be given by two or more functions $P(A \\cap B)$ can be estimated to be the product of the probabilities of each of the functions yielding that response, $P(A)P(B)$.\n", "2. The functions must--with high probability--make similar pairs to be candidate pairs as opposed to dissimilar pairs.\n", "3. The functions must be efficient in two aspects:\n", " * candidate pairs are identified faster than an *exhaustive* comparison of all pairs.\n", " * functions can be combined (i.e., according to som OR- and/or AND-construction) to reduce false positives and false negatives.\n", " \n", "Furthermore, quoting directly from {cite:ps}`rajaraman2011mining`, if we let \"$d_1 < d_2$ be two distances according to some distance measure d. A family $F$ of functions is said to be $(d_1, d_2, p_1, p_2)$-*sensitive* if for every $f$ in $F$:\n", "\n", "1. If $d(x, y) ≤ d_1$, then the probability that $f(x) = f(y)$ is at least $p_1$.\n", "2. If $d(x, y) ≥ d_2$, then the probability that $f(x) = f(y)$ is at most $p_2$.\n", "\n", "\n", "Now, we recall that the minhash family of functions estimate the Jaccard similarity--which is the probability that two sets are similar, $h(S_1) = h(S_2)$, given their items. Relating it to the above definition, we can view such similarity as to be indicative of two sets being close in distance on a space of sets. In such a space, the probability that two sets are similar--that is, their Jaccard similarity, $\\text{SIM}(S_1, S_2)$ can be transformed into the Jaccard distance measure by $1 - \\text{SIM}(S_1, S_2)$. Hence, the minhash family of functions is said to be $(d_1, d_2, 1 - d_1, 1 - d_2)$-*sensitive* where $d$ is the Jaccard distance." ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }