alis.similarity.minhash_lsh.LSH
- class alis.similarity.minhash_lsh.LSH(signature)
The LSH class for a many-to-many document similarity task.
- Attributes
- signature : 2-D np.array
document minhash signatures with dimension n (samples) by m (signature size)
- bands : int
number of bands
- r : int
number of rows per band derived from bands
- hash_functions : list, default=None
a list of hash functions with size equivalent to the number of bands. If None, the native Python hash function is applied.
- band_dict : dict
dictionary with band labels as keys and (set/doc index, signature band) tuples as values
- band_buckets : dict
a dictionary with hash bucket as keys and a list of similar document indices as values
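To make the relationship between these attributes concrete, the following is a minimal pure-Python sketch of the banding technique, not the library's implementation. The function names `make_bands_sketch` and `get_buckets_sketch` are hypothetical, plain lists stand in for the 2-D np.array, and keying buckets by a (band label, hash value) pair is an assumption made for illustration.

```python
from collections import defaultdict

def make_bands_sketch(signature, bands):
    """Split each row of an n-by-m signature matrix into `bands` equal bands.

    Hypothetical helper: returns {band label: [(doc index, band tuple), ...]}.
    """
    m = len(signature[0])
    if m % bands != 0:
        raise ValueError("signature size must be divisible by the number of bands")
    r = m // bands  # rows per band, derived from the number of bands
    band_dict = {}
    for b in range(bands):
        band_dict[b] = [
            (doc_idx, tuple(row[b * r:(b + 1) * r]))
            for doc_idx, row in enumerate(signature)
        ]
    return band_dict

def get_buckets_sketch(band_dict, hash_functions=None):
    """Group document indices whose bands hash to the same value within a band.

    Assumption: buckets are keyed by (band label, hash value) so that
    collisions across different bands are never merged.
    """
    band_buckets = defaultdict(list)
    for band_label, pairs in band_dict.items():
        # Fall back to Python's built-in hash when no functions are supplied.
        hash_fn = hash if hash_functions is None else hash_functions[band_label]
        for doc_idx, band in pairs:
            band_buckets[(band_label, hash_fn(band))].append(doc_idx)
    return dict(band_buckets)

# Three toy signatures of size 4; documents 0 and 1 agree on the first band.
signature = [
    [1, 2, 3, 4],
    [1, 2, 9, 9],
    [7, 7, 7, 7],
]
band_dict = make_bands_sketch(signature, bands=2)
band_buckets = get_buckets_sketch(band_dict)

# Buckets holding more than one document are the candidate similar groups.
candidates = [docs for docs in band_buckets.values() if len(docs) > 1]
```

Documents that share a bucket in at least one band are treated as candidate pairs; the sketch keeps everything in memory, whereas the class described here is documented as working with dask.bag objects.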
Methods
- __init__(signature)
Initialize the class.
- get_buckets([hash_functions, compute])
Implements the map-reduce step of the traditional banding technique.
- make_bands(bands)
Takes the desired number of bands and returns a dictionary with band labels as keys and a dask.bag of (set/document index, signature band) tuples as values.
- plot_thresh([return_thresh, display_thresh, ax])
Plots the similarity threshold curve for the given number of bands.
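As background for the threshold plot, the standard LSH banding analysis says that two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, where b is the number of bands and r the rows per band, and the curve rises most steeply near (1/b)^(1/r). The sketch below computes both quantities; the function names are hypothetical, and whether plot_thresh uses exactly these formulas is an assumption.

```python
def candidate_probability(s, bands, r):
    """Probability that a pair with similarity s shares at least one band bucket."""
    return 1.0 - (1.0 - s ** r) ** bands

def approx_threshold(bands, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / r)

# Example values chosen for illustration: 20 bands of 5 rows (signature size 100).
bands, r = 20, 5
thresh = approx_threshold(bands, r)

# Sample the S-curve at similarities 0.0, 0.1, ..., 1.0.
curve = [(s / 10, candidate_probability(s / 10, bands, r)) for s in range(11)]

for s, p in curve:
    print(f"s={s:.1f}  P(candidate)={p:.4f}")
print(f"approximate threshold: {thresh:.4f}")
```

Increasing the number of bands for a fixed signature size lowers the threshold, so more pairs become candidates at the cost of more false positives.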