{ "cells": [ { "cell_type": "markdown", "id": "160766ca", "metadata": {}, "source": [ "# Counting Distinct Elements in a Stream\n", "\n", "Another common problem is counting the number of distinct elements that we have seen in the stream so far. Again the assumption here is that the universal set of all elements is too large to keep in memory, so we'll need to find another way to count how many distinct values we've seen. If we're okay with simply getting an estimate instead of the actual value, we can use the Flajolet-Martin (FM) algorithm.\n", "\n", "## Flajolet-Martin Algorithm\n", "\n", "In the FM algorithm we hash the elements of a strea into a bit string. A bit string is a sequence of zeros and ones, such as 1011000111010100100. A bit string of length $N$ can hold $2^N$ possible combinations. For the FM algorithm to work, we need $N$ to be large enough such that the bit string will have more possible combinations than there are elements in the universal set. This basically means that there should be no possible collisions when we has the elements into a bit string. \n", "\n", "The idea behind the FM algorithm is that the more distinct elements we see, the higher the likelihood that one of their hash values will be \"unusual\". The specific \"unusualness\" we will exploit here is that the bit string ends in many consecutive 0s.\n", "\n", "For example, the bit string 1011000111010100100 ends with 2 consecutive zeros. We call this value of 2 the *tail length* of the bit string. Now let $R$ be the maximum tail length of that we have seen of any hashed bit string of the stream. The estimate of the number of distinct elements using FM is simply $2^R$." ] }, { "cell_type": "code", "execution_count": 19, "id": "fcaf16f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "65536\n" ] } ], "source": [ "#Library function for non-streams\n", "def flajoletMartin(iterator):\n", " max_tail_length = 0\n", " \n", " for val in iterator:\n", " bit_string = bin(hash(hashlib.md5(val.encode('utf-8')).hexdigest()))\n", " \n", " i = len(bit_string) - 1\n", " tail_length = 0\n", " while i >= 0:\n", " if bit_string[i] == '0':\n", " tail_length += 1\n", " else:\n", " #neatly handles the '0b' prefix of the binary string too. \n", " #Just break when we see \"b\"\n", " break\n", " \n", " i -= 1\n", " \n", " if tail_length > max_tail_length:\n", " max_tail_length = tail_length\n", " \n", " return (2**max_tail_length)\n", "\n", "testList = []\n", "n=0\n", "while n < 100000:\n", " n += 1\n", " testList.append(np.random.choice(words))\n", "\n", "print(flajoletMartin(iter(testList)))" ] }, { "cell_type": "markdown", "id": "65d22af7", "metadata": {}, "source": [ "## Using Multiple Hash Functions\n", "\n", "We can improve the estimate further by using multiple hash functions. With multiple hash functions we first split hash functions into groups, get the maximum tail length for each hash function, then get the average for each group. Lasly, we get the median over all the averages and that will be our estimate.\n", "\n", "This way we can get estimates that aren't just powers of 2. If the correct count is between two large powers of 2, for example 7000, it will be impossible to get a good estimate using just one hash function." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }