alis.similarity.minhash_lsh.LSH.get_buckets
alis.similarity.minhash_lsh.LSH.get_buckets#
- LSH.get_buckets(hash_functions=None, compute=False)[source]#
This method implementes the map-reduce step of the traditional banding technique. Specifically, signature slices of each band are hashed using hash_functions (map). The document indices are then grouped according to their hash values.
- Parameters
- hash_functionslist, default=None
a list of hash functions with size equivalent to the number of bands. If None, the native python hash function is applied.
- Returns
- band_buckets - dict
a dictionary with hash bucket as keys and a list of similar document indices as values