{ "cells": [ { "cell_type": "markdown", "id": "3b57fccd", "metadata": {}, "source": [ "# Locality Sensitive Hashing" ] }, { "cell_type": "markdown", "id": "600ef551", "metadata": {}, "source": [ "Minhashing a characteristic matrix (M) of document shingles effectively addresses the component of large-scale document similarity tasks that impacted by a high dimensionality of unique tokens used to characterize a document. This is done by compressing documents into much smaller signature matrices (S)--the manner by which this is done depends on how the two main stages (i.e., n-permutations and sampling of M) are done. Common choices of permuation are: (1) a true permutation of all k rows in a characteristic matrix, (2) a hashing of k row values into a new \"permutation\"--each resulting to tradeoffs between the quality similarity approximations, and computational complexity. The same tradeoffs are present in choosing a sampling step: (1) taking all k rows of of the characterisitic matrix, (2) taking only m of k rows of the characteristic matrix, or (3) dividing k into groups of m rows and applying the permutation step in each group.\n", "\n", "Despite gains argued by Minhashing, it may still be a challenge to find the pairs with the greatest similarity efficiently. Suppose we have $n$ documents and want to find all similar documents in such a corpus. We take all pairwise combinations of $n$ documents--resulting to $n \\choose 2$ interactions which approximates to a $O(n^2)$ complexity for very high values of $n$. See below:\n", "\n", "$$\n", "{n \\choose 2} = \\frac{n!}{(n-2!)2!} = \\frac{n(n-1)(n-2)!}{(n-2)!2!} = \\frac{n(n-1)}{2} \\approx n^2\n", "$$\n", "\n", "*Note: The above approximation pertains to many-to-many similarity search problems. A one-to-many version at best is $O(n)$ in time complexity.*\n", "\n", "To address the approximate quadratic $O(n^2)$ complexity--or in the case of a one-to-many comparison a linear $O(n)$ complexity, we can instead focus on documents (i.e. sets, items) that are most likely similar--known as *candidate pairs*. This is what locality-sensitive hashing (a.k.a. nearest-neighbor search) proposes--a bucketing together of likely similar objects to allow for a localized (i.e., selective) comparison.\n", "\n", "In the next few sections, we shall:\n", "1. Discuss the theory of locality-sensitive functions\n", "3. Illustrate the use of such functions in the context of documents Jaccard similarity." ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }