Moments#

Computing the “moments” of a stream involves getting the distribution of the different elements in a stream, and is a generalization of the previous problem of counting distinct elements. Let \(m_i\) be the number of occurences of element \(i\) in a stream. The kth-order moment of a stream is the sum over all \(i\) of \((m_i)^k\), or \(\sum_i(m_i)^k\).

The 0th order moment of a stream is the sum of \((m_i)^0\), or the sum of 1 for each element \(i\) found in the stream, or simply the count of distinct elements in the stream. We can use the Flajolet-Martin algorithm detailed above for this.

The 1st moment is the sum of \(m_i\)’s, or simply the length of the stream.

The 2nd moment is the sum \((m_i)^2\). This measures how uneven the distribution of the elements of the stream is, and is also called the surprise number. Again, if the we have the capacity to store \(m_i\) for each element of the stream then this is easy to calculate. However if the number of elements if too large we need to look for alternatives such as approximation. One way to estimate the 2nd moment of the stream is to use the Alon-Matias-Szegedy (AMS) algorithm.

Alon-Matias-Szegedy Algorithm for Second Moments#

In the AMS algoritm we calculate a number of variables \(X\) as we go through the stream. The more \(X\)’s we calculate, the more accurate our estimate will be. Each element \(X\) contains:

  1. \(X.element\) - an element of the universal set found in the stream.

  2. \(X.value\) - the number of times we have seen \(X.element\) in the stream since we have started tracking it.

The estimate of the 2nd moment is \(n \times (2 \times X.value − 1)\), where \(n\) is the length of the stream. We improve the final estimate by averaging the estimate of all \(X\)’s. One thing to note is we do not start tracking all the \(X\) variables at the start of the stream. Instead, we start tracking each \(X\) variable at random points as we go through the stream.

As an example, suppose we have a stream of length \(n=15\) and having values a,b,c,b,d,a,c,d,a,b,d,c,a,a,b. \(m_a=5\), \(m_b=4\), \(m_c=3\), and \(m_d=3\). The surprise number for this stream is therefore \(52 + 42 + 32 + 32 = 59\). Let’s calculate 3 variables \(X_1\), \(X_2\), and \(X_3\), starting at 3rd, 8th, and 13th positions respectively at “random”.

  • At position 3 we find c, so we set \(X_1.element\) to c and \(X_1.value\) to 1. We’ll encounter two more c as we go through the stream so the final value of \(X_1.value\) is 3.

  • At position 8 we find d, so \(X_2.element=d\) and \(X_2.value=1\). At the end of the stream $X_2.value=2, since we only count starting from position 8.

  • At position 13 is a, so \(X_3.element=a\). The final value of \(X_3.value\) is 2.

We calculate the estimate as \(n \times (2 \times X.value − 1)\), so that’s \(15×(2×3−1)=75\) for \(X_1\) and \(15×(2×2−1)=45\) for both \(X_2\) and \(X_3\). The average of the estimates is \((75+45+45) \div 3 = 55\), which is close to the actual value of \(59\).

Higher Order Elements#

We can estimate higher-order elements similar to how we estimate 2nd-order elements. The only difference is instead of \(n \times (2 \times X.value − 1)\) for the estimate, the formula is now \(n \times (X.value^k−(X.value−1)^k)\).

Below we provide an alonMatiasSzegedy() function that takes as argument the moment value to approximate for, the number of variables to track, and the probability of an element in an array becoming a variable.

def alonMatiasSzegedy(iterator, k_moment=2, num_variables=3, var_prob=0.1):
    variables = {}
    n = 0
  
    for val in iterator:
        n += 1
        
        if val not in variables:
            if len(variables) < num_variables and random.random() < var_prob:
                variables[val] = 1
        else:
            variables[val] += 1


    amsMoment = 0
    for w in variables:
        amsMoment +=   n * (variables[w]**k_moment - ((variables[w] - 1)**k_moment))

    amsMoment /= len(variables)

    return amsMoment

testList = []
n=0
while n < 100000:
    n += 1
    testList.append(np.random.choice(words))

test_iterator = iter(testList)

print(alonMatiasSzegedy(test_iterator))
233333.33333333334

Exercise#

  1. Using the example above, with \(X_1.value=3\), \(X_2.value=2\), and \(X_3.value=2\), what is the 5th moment?