alis.similarity.minhash_lsh.LSH

class alis.similarity.minhash_lsh.LSH(signature)

Locality-sensitive hashing (LSH) over MinHash signatures for a many-to-many document similarity task.

Attributes
signature : 2-D np.array

Document MinHash signatures with shape n (samples) by m (signature size).

bands : int

Number of bands.

r : int

Number of rows per band, derived from bands.

hash_functions : list, default=None

A list of hash functions whose length equals the number of bands. If None, Python's built-in hash function is applied.

band_dict : dict

Dictionary with band labels as keys and (set/document index, signature band) tuples as values.

band_buckets : dict

Dictionary with hash buckets as keys and lists of similar document indices as values.
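
To illustrate how the banding attributes above relate to one another, here is a minimal, standard-library sketch of the generic banding technique. It is not the alis implementation: the helper names `band_and_bucket` and `candidate_pairs` are hypothetical, the rule `r = m // bands` is an assumption about how the rows per band are derived, and plain dictionaries stand in for whatever containers the class actually uses.

```python
from collections import defaultdict
from itertools import combinations


def band_and_bucket(signature, bands):
    """Hypothetical helper (not part of alis): split each signature into
    `bands` contiguous slices and bucket documents by the hash of each slice.

    `signature` is any row-indexable 2-D container (a list of lists here;
    a 2-D NumPy array iterates the same way).
    """
    m = len(signature[0])  # signature size
    if m % bands != 0:
        raise ValueError("signature size must be divisible by the number of bands")
    r = m // bands  # assumed derivation of rows per band

    band_buckets = {}
    for band in range(bands):
        start, stop = band * r, (band + 1) * r
        buckets = defaultdict(list)
        for doc_idx, row in enumerate(signature):
            # Documents whose slice for this band is identical share a bucket.
            key = hash(tuple(row[start:stop]))
            buckets[key].append(doc_idx)
        band_buckets[band] = dict(buckets)
    return r, band_buckets


def candidate_pairs(band_buckets):
    """Collect every pair of documents that share a bucket in at least one band."""
    pairs = set()
    for buckets in band_buckets.values():
        for docs in buckets.values():
            pairs.update(combinations(docs, 2))
    return pairs


# Three toy signatures of size 4, split into 2 bands of 2 rows each.
toy_signature = [
    [1, 2, 3, 4],
    [1, 2, 9, 9],  # matches document 0 in the first band only
    [7, 7, 7, 7],
]
rows_per_band, toy_buckets = band_and_bucket(toy_signature, bands=2)
toy_pairs = candidate_pairs(toy_buckets)
```

In this toy input, documents 0 and 1 become a candidate pair because their first band is identical, even though their second bands differ.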

Methods

__init__(signature)

Initialize the class.

get_buckets([hash_functions, compute])

Implements the map-reduce step of the traditional banding technique.

make_bands(bands)

Takes the desired number of bands and returns a dictionary with band labels as keys and a dask.bag of (set/document index, signature band) tuples as values.

plot_thresh([return_thresh, display_thresh, ax])

Plots the threshold curve according to the number of bands.
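
For context on what a threshold plot typically conveys, the sketch below computes the standard S-curve from the LSH literature: with b bands of r rows each, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, and the curve rises most steeply near (1/b)^(1/r). Whether plot_thresh draws exactly this curve is an assumption; the function names `candidate_probability` and `approx_threshold` are hypothetical and not part of alis.

```python
def candidate_probability(s, bands, r):
    """Probability that two documents with similarity `s` share at least one
    bucket, given `bands` bands of `r` rows each (standard LSH S-curve)."""
    return 1.0 - (1.0 - s ** r) ** bands


def approx_threshold(bands, r):
    """Common approximation of the similarity at which the S-curve is steepest."""
    return (1.0 / bands) ** (1.0 / r)


# Example configuration: a signature of size 100 split into 20 bands of 5 rows.
example_bands, example_r = 20, 5
threshold = approx_threshold(example_bands, example_r)

# Sample the curve on a coarse grid of similarities from 0.0 to 1.0.
curve = [
    (s / 10, candidate_probability(s / 10, example_bands, example_r))
    for s in range(11)
]
```

Increasing the number of bands while holding the signature size fixed lowers the threshold, so more pairs become candidates at the cost of more false positives.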