Filtering Streams
Contents
Filtering Streams#
Another common use case is filtering data in a stream. In this case we have a set \(S\) of data that we want to filter for, but the size of \(S\) is too large to fit in memory. Additionally, it is assumed that the elements in \(S\) are only a small fraction of the universal set, such that it will be a big benefit just to be able to quickly verify that a value isn’t part of \(S\). In that case we can use a Bloom filter.
Bloom Filter#
A Bloom filter is a probabilistic data structure that is used for testing whether a value is a member of a set [Blo70]. Instead of getting absolute True or False values when testing for membership, the Bloom filter instead returns True or False with the following meaning:
True - Value is “possibly” in set.
False - Value is definitely not in set.
In other words the Bloom filter may return a false positive, but never a false negative. In return for this ambiguity the Bloom filter uses far less space than a hash table or a hash set, and is easier to keep in memory for fast lookups. There is a tradeoff between the space needed by the Bloom filter and the probability of getting a false positve, the larger the space utilized the less chance for getting false positives.
#Setup bloom filter and filtering functions
size = 100000
bloomFilter = np.zeros(size, dtype=int)
#Filter for 10% of words
filterSet = np.random.choice(words, int(len(words) / 10))
for key in filterSet:
#Hash twice using different methods, then set the index for both to 1.
index1 = hashInt(key, buckets=size)
index2 = hashInt(key, buckets=size, hashType='md5')
bloomFilter[index1] = 1
bloomFilter[index2] = 1
def inBloomFilter(val):
#Hash twice, then verify that all values of the hash indexes are 1
index1 = hashInt(val, buckets=size)
index2 = hashInt(val, buckets=size, hashType='md5')
if bloomFilter[index1] == 1 and bloomFilter[index2] == 1:
return True
else:
return False
word_list = []
n = 0
while n < 1000000:
word = np.random.choice(words)
word_list.append(word)
n += 1
word_iter = iter(word_list)
n = 0
bloomYes = 0
actualYes = 0
for word in word_iter:
if inBloomFilter(word):
bloomYes += 1
if word in filterSet:
actualYes += 1
n += 1
if n % 100000 == 0:
print('Iterations', n, 'Maybe in Filter Set:', bloomYes, 'Really in Filter Set:', actualYes)
Iterations 100000 Maybe in Filter Set: 21455 Really in Filter Set: 9572
Iterations 200000 Maybe in Filter Set: 42892 Really in Filter Set: 19080
Iterations 300000 Maybe in Filter Set: 64396 Really in Filter Set: 28566
Iterations 400000 Maybe in Filter Set: 85674 Really in Filter Set: 37980
Iterations 500000 Maybe in Filter Set: 106912 Really in Filter Set: 47438
Iterations 600000 Maybe in Filter Set: 128308 Really in Filter Set: 56984
Iterations 700000 Maybe in Filter Set: 149718 Really in Filter Set: 66486
Iterations 800000 Maybe in Filter Set: 171132 Really in Filter Set: 76124
Iterations 900000 Maybe in Filter Set: 192420 Really in Filter Set: 85549
Iterations 1000000 Maybe in Filter Set: 213840 Really in Filter Set: 95106
Probability of False Positives#
Let \(n\) be the length of the bit-array, \(m\) be the size of set \(S\), and \(k\) be the number of hash functions we’re using. The probability of getting a false positive is \((1-e^\frac{-km}{n})^k\). For our example above, with \(n=100000\), \(m=23588\), and \(k=2\), the probabity of a false positive is \(14.14\%\)
Exercises#
What is the probablity of a false positive for a bloom filter with bit-array length 50000, filter set size of 3000, and using 5 hash functions?