alis.feature_extraction.MinhashLSH

class alis.feature_extraction.MinhashLSH(shingle_size, num_shingle_bucket, num_hash, hash_size=None, stop_words=None, seed=1337)

Base class for extracting minhash signatures from a dask bag of text data.

Attributes
shingle_size : int

Shingle size to use for hashed word shingle extraction.

num_shingle_bucket : int

The number defining the bucket size for hashed word shingles. This is equal to 2**n - 1 for some integer n.

num_hash : int

Number of randomized hash functions to use in minhash signature extraction.

hash_size : int, default=None

Range of the hash function. If not specified, this defaults to 2**32.

seed : int, default=1337

Random seed to use during random number generation.
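To make the roles of shingle_size, num_shingle_bucket and stop_words more concrete, the following is a minimal plain-Python sketch of hashed word shingle extraction. It is not the library's implementation: the helper name hashed_word_shingles, the whitespace tokenization, the use of zlib.crc32 as the hash, and the use of num_shingle_bucket as a modulus are all assumptions made for illustration.

```python
import zlib


def hashed_word_shingles(text, shingle_size, num_shingle_bucket, stop_words=None):
    """Return the set of bucket ids of the word shingles in `text`.

    Hypothetical helper: the real class may tokenize and hash differently.
    """
    stop_words = set(stop_words or [])
    # Assumed tokenization: lowercase, split on whitespace, drop stop words.
    words = [w for w in text.lower().split() if w not in stop_words]

    buckets = set()
    # Slide a window of `shingle_size` consecutive words over the document.
    for i in range(len(words) - shingle_size + 1):
        shingle = " ".join(words[i:i + shingle_size])
        # crc32 is a stand-in hash; the result is folded into the bucket range.
        buckets.add(zlib.crc32(shingle.encode("utf-8")) % num_shingle_bucket)
    return buckets


buckets = hashed_word_shingles(
    "the quick brown fox jumps",
    shingle_size=2,
    num_shingle_bucket=2**16 - 1,
)
print(sorted(buckets))
```

A document with fewer words than shingle_size produces an empty set in this sketch; how the library treats such documents is not stated here.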

Methods

__init__(shingle_size, num_shingle_bucket, ...)

Initialize the Minhash LSH signature extractor

transform(db_text)

Return a dask bag containing the minhash signatures of documents in the given dask bag of text
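As a rough illustration of what a signature computed by transform could look like, the sketch below builds minhash signatures for a plain Python list of documents instead of a dask bag. Everything in it is an assumption rather than the library's code: the function names, the crc32 shingle hash, the hash family (a*x + b) mod p mod hash_size, and the handling of empty documents.

```python
import random
import zlib

PRIME = 2**61 - 1  # assumed modulus: a Mersenne prime larger than any bucket id


def make_hash_params(num_hash, seed=1337):
    """Draw `num_hash` random (a, b) pairs, reproducibly from `seed`."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hash)]


def minhash_signature(bucket_ids, params, hash_size=2**32):
    """For each hash function, keep the minimum hash value over all shingles."""
    if not bucket_ids:
        # Assumed convention: an empty document gets an out-of-range sentinel.
        return [hash_size] * len(params)
    return [
        min(((a * x + b) % PRIME) % hash_size for x in bucket_ids)
        for a, b in params
    ]


def transform_sketch(docs, shingle_size, num_shingle_bucket, num_hash,
                     hash_size=2**32, seed=1337):
    """Hypothetical list-based stand-in for transform: one signature per document."""
    params = make_hash_params(num_hash, seed)
    signatures = []
    for doc in docs:
        words = doc.lower().split()
        # Hash each word shingle into a bucket id (crc32 is a stand-in hash).
        ids = {
            zlib.crc32(" ".join(words[i:i + shingle_size]).encode("utf-8"))
            % num_shingle_bucket
            for i in range(len(words) - shingle_size + 1)
        }
        signatures.append(minhash_signature(ids, params, hash_size))
    return signatures


signatures = transform_sketch(
    ["a b c d", "a b c d", "x y z w"],
    shingle_size=2, num_shingle_bucket=2**16 - 1, num_hash=8,
)
print(signatures[0] == signatures[1])
```

In this sketch, identical documents always receive identical signatures, and fixing seed makes the result reproducible across runs.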