alis.feature_extraction.MinhashLSH.__init__

MinhashLSH.__init__(shingle_size, num_shingle_bucket, num_hash, hash_size=None, stop_words=None, seed=1337)

Initialize the MinHash LSH signature extractor.

Parameters
shingle_size : int

Number of words per shingle used for hashed word shingle extraction.

num_shingle_bucket : int

Number defining the bucket size for hashed word shingles. The value is expected to be of the form 2**n - 1 for some integer n.

num_hash : int

Number of randomized hash functions to use in minhash signature extraction.

hash_size : int, default=None

Range of the hash function. If not specified, this defaults to 2**32.

stop_words : iterable of str, default=None

Stop words to use. If not specified, the English stop words defined by sklearn are used.

seed : int, default=1337

Random seed to use during random number generation.
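
To make the roles of these parameters more concrete, here is a minimal pure-Python sketch of a generic MinHash signature pipeline. It is not the alis implementation: the helper names (`word_shingles`, `hashed_shingles`, `minhash_signature`), the use of CRC32 to hash shingles, the reading of `num_shingle_bucket` as a modulus, and the linear hash family `(a * x + b) % hash_size` are all assumptions chosen for illustration.

```python
import random
import zlib


def word_shingles(text, shingle_size, stop_words=()):
    """Return overlapping word shingles after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [
        " ".join(tokens[i:i + shingle_size])
        for i in range(len(tokens) - shingle_size + 1)
    ]


def hashed_shingles(text, shingle_size, num_shingle_bucket, stop_words=()):
    """Map each shingle to a bucket id in [0, num_shingle_bucket).

    Assumption: CRC32 stands in for whatever shingle hash the library uses.
    """
    return {
        zlib.crc32(s.encode("utf-8")) % num_shingle_bucket
        for s in word_shingles(text, shingle_size, stop_words)
    }


def minhash_signature(bucket_ids, num_hash, hash_size=2**32, seed=1337):
    """Compute one minimum per random hash function.

    Assumption: each hash function is (a * x + b) % hash_size with
    coefficients drawn from a generator seeded with `seed`.
    """
    rng = random.Random(seed)
    coeffs = [
        (rng.randrange(1, hash_size), rng.randrange(0, hash_size))
        for _ in range(num_hash)
    ]
    if not bucket_ids:
        # No shingles (text shorter than shingle_size): use a sentinel value.
        return [hash_size] * num_hash
    return [min((a * x + b) % hash_size for x in bucket_ids) for a, b in coeffs]


doc = "the quick brown fox jumps over the lazy dog"
buckets = hashed_shingles(
    doc, shingle_size=2, num_shingle_bucket=2**16 - 1, stop_words={"the"}
)
signature = minhash_signature(buckets, num_hash=8, seed=1337)
```

In this sketch, `num_hash` fixes the signature length, `hash_size` bounds each signature entry, and reusing the same `seed` reproduces the same hash functions, so signatures from separate runs stay comparable.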