alis.feature_extraction.MinhashLSH
- class alis.feature_extraction.MinhashLSH(shingle_size, num_shingle_bucket, num_hash, hash_size=None, stop_words=None, seed=1337)
Base class for extracting the minhash signature of each document in a dask bag of text data.
- Attributes
- shingle_size : int
Shingle size to use for hashed word shingle extraction.
- num_shingle_bucket : int
Number of buckets used for hashed word shingles. This is equal to 2**n - 1.
- num_hash : int
Number of randomized hash functions to use in minhash signature extraction.
- hash_size : int, default=None
Range of the hash function. If not specified, this defaults to 2**32.
- seed : int, default=1337
Random seed to use during random number generation.
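Example (illustrative only). The sketch below shows one way hashed word shingles could be produced from a single document. It is not the alis implementation: the helper name `hashed_word_shingles`, the whitespace tokenizer, the use of `zlib.crc32`, and the treatment of `num_shingle_bucket` as a modulus are all assumptions made for illustration.

```python
import zlib


def hashed_word_shingles(text, shingle_size, num_shingle_bucket, stop_words=None):
    """Return the set of bucket ids for every word shingle in `text`.

    Hypothetical helper: tokenization and hashing are assumptions,
    not the behavior of alis.feature_extraction.MinhashLSH.
    """
    stop_words = set(stop_words or [])
    # Lowercase, split on whitespace, and drop stop words.
    words = [w for w in text.lower().split() if w not in stop_words]
    buckets = set()
    # Slide a window of `shingle_size` consecutive words over the document.
    for i in range(len(words) - shingle_size + 1):
        shingle = " ".join(words[i:i + shingle_size])
        # crc32 is deterministic across runs, unlike the built-in hash().
        buckets.add(zlib.crc32(shingle.encode("utf-8")) % num_shingle_bucket)
    return buckets


# Bucket count of the form 2**n - 1, as described for num_shingle_bucket.
ids = hashed_word_shingles("the quick brown fox jumps", 2, 2**16 - 1)
```

A document with fewer words than `shingle_size` yields an empty set in this sketch.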
Methods
- __init__(shingle_size, num_shingle_bucket, ...)
Initialize the Minhash LSH signature extractor.
- transform(db_text)
Return a dask bag containing the minhash signatures of the documents in the given dask bag of text.
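Example (illustrative only). The sketch below shows how a minhash signature can be computed for one document from a set of integer shingle ids, using `num_hash` seeded random hash functions with range `hash_size`. It is a pure-Python sketch, not the alis implementation: the function names, the linear hash family `((a*x + b) % PRIME) % hash_size`, and the choice of `PRIME` are assumptions.

```python
import random

# Mersenne prime larger than any 32-bit value (an assumption for this sketch).
PRIME = (1 << 61) - 1


def make_hash_params(num_hash, seed=1337):
    """Draw (a, b) coefficients for `num_hash` random linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hash)]


def minhash_signature(shingle_ids, params, hash_size=2**32):
    """Return one minimum hash value per hash function.

    An empty input yields `hash_size` in every position as a sentinel.
    """
    return [
        min((((a * x + b) % PRIME) % hash_size for x in shingle_ids),
            default=hash_size)
        for a, b in params
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


params = make_hash_params(num_hash=64, seed=1337)
sig_1 = minhash_signature({3, 17, 42, 99}, params)
sig_2 = minhash_signature({3, 17, 42, 100}, params)
similarity = estimate_jaccard(sig_1, sig_2)
```

The fraction of matching positions approximates the Jaccard similarity of the two shingle sets, and the estimate tightens as `num_hash` grows. Fixing `seed` keeps the hash functions, and therefore the signatures, reproducible across runs.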