Minhash LSH

The traditional LSH approach to a hypothetical many-to-many document similarity task, where the objective is to bucket similar documents together. The implementation is provided by the LSH class, which uses dask.bag functionality and methods to parallelize the banding technique, specifically its map (hash function) and reduce (bucketing) tasks.

Note: importing the module automatically initializes a dask client.

BY: Mike Dorosan, 2022

LSH(signature)

The LSH class for a many-to-many document similarity task.