alis.feature_extraction.hashed_word_shingles

alis.feature_extraction.hashed_word_shingles(text, k, n, stop_words=None)

Return the list of word k-shingles extracted from the given text using the given stop words, with each shingle hashed into a bucket in the range 0 to 2**n - 1.

We define a shingle to be a stop word followed by the next k-1 words, regardless of whether those following words are themselves stop words.
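
The shingle definition above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper name `word_shingles_sketch`, the lowercase whitespace tokenization, the tiny stop-word set, and the choice to drop shingles with fewer than k words at the end of the text are all assumptions made for the example.

```python
def word_shingles_sketch(text, k, stop_words):
    """Hypothetical helper: collect k-word shingles that start at a stop word."""
    # Assumption: tokens are lowercased, whitespace-separated words.
    words = text.lower().split()
    shingles = []
    for i, word in enumerate(words):
        # A shingle starts at a stop word and spans k words in total.
        # Assumption: a stop word with fewer than k-1 words after it is skipped.
        if word in stop_words and i + k <= len(words):
            shingles.append(" ".join(words[i:i + k]))
    return shingles


stop_words = {"the", "a", "on"}  # illustrative set, not sklearn's list
result = word_shingles_sketch("The cat sat on a mat today", 3, stop_words)
print(result)  # ['the cat sat', 'on a mat', 'a mat today']
```

Note that "on a mat" and "a mat today" overlap: "a" is a stop word inside the shingle that starts at "on", and it also starts a shingle of its own.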

Parameters
text : str

String of text whose word shingles are to be extracted

k : int

Shingle size

n : int

Exponent defining the hash range: each shingle is hashed into a bucket from 0 to 2**n - 1

stop_words : iterable of str, default=None

Stop words to be used. If None, the English stop words defined by sklearn are used

Returns
shingles : iterable of int

A list containing the extracted word shingles in hashed representation.
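
The hashing step can be sketched as below. The hash function this library uses is not stated here, so CRC32 from the standard library stands in as an assumption, and `hash_to_bucket` is a hypothetical helper name. The sketch only shows how a shingle string can be mapped to an integer in the range 0 to 2**n - 1.

```python
import zlib


def hash_to_bucket(shingle, n):
    """Hypothetical helper: map a shingle string to a bucket in 0 .. 2**n - 1."""
    # Assumption: CRC32 stands in for whatever hash the library actually uses.
    # CRC32 yields 32 bits, so this sketch only covers the full range for n <= 32.
    digest = zlib.crc32(shingle.encode("utf-8"))
    # Keep the lowest n bits, which is equivalent to digest % 2**n.
    return digest & ((1 << n) - 1)


shingles = ["the cat sat", "on a mat", "a mat today"]
buckets = [hash_to_bucket(s, 8) for s in shingles]
print(buckets)  # three integers, each between 0 and 255
```

A smaller n gives fewer buckets and therefore more collisions between distinct shingles; a larger n reduces collisions at the cost of a wider integer range.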