alis.similarity.minhash_lsh.LSH.get_buckets

LSH.get_buckets(hash_functions=None, compute=False)

This method implements the map-reduce step of the traditional banding technique. The signature slice of each band is hashed using the corresponding function in hash_functions (map). The document indices are then grouped by their hash values (reduce).

Parameters
hash_functions : list, default=None

A list of hash functions whose length equals the number of bands. If None, Python's built-in hash function is applied.

Returns
band_buckets : dict

A dictionary with hash buckets as keys and, as values, lists of the indices of documents that hashed to that bucket (candidate similar documents).
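
Examples

The banding step described above can be illustrated with a minimal pure-Python sketch. This is not the alis implementation: the helper name band_buckets_sketch, its signatures and n_bands arguments, and the choice to nest the result by band index are all assumptions made for illustration, and the compute flag is not modelled.

```python
from collections import defaultdict


def band_buckets_sketch(signatures, n_bands, hash_functions=None):
    """Group document indices by the hash of each band's signature slice.

    Hypothetical helper for illustration only; not part of alis.
    `signatures` is a list of equal-length MinHash signatures, one per document.
    """
    sig_len = len(signatures[0])
    if sig_len % n_bands != 0:
        raise ValueError("signature length must be divisible by n_bands")
    rows = sig_len // n_bands  # rows per band

    # Mirror the documented default: fall back to Python's built-in hash.
    if hash_functions is None:
        hash_functions = [hash] * n_bands
    if len(hash_functions) != n_bands:
        raise ValueError("need exactly one hash function per band")

    band_buckets = {}
    for band in range(n_bands):
        buckets = defaultdict(list)
        for doc_idx, sig in enumerate(signatures):
            # Map: hash this band's slice of the signature.
            band_slice = tuple(sig[band * rows:(band + 1) * rows])
            key = hash_functions[band](band_slice)
            # Reduce: collect document indices that share a hash value.
            buckets[key].append(doc_idx)
        # Assumption: results are nested by band index.
        band_buckets[band] = dict(buckets)
    return band_buckets


# Three toy signatures of length 4, split into 2 bands of 2 rows each.
signatures = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 3, 4]]
result = band_buckets_sketch(signatures, n_bands=2)

# Buckets holding more than one document yield candidate pairs.
candidates = [docs for b in result.values() for docs in b.values() if len(docs) > 1]
```

Documents that agree on every row of at least one band land in the same bucket for that band, so they become candidates for a full similarity check. Passing one hash function per band, as hash_functions allows, keeps the bands independent of one another.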