Sampling Data in a Stream#

We start with the general problem of sampling data. We want to select a subset of the stream that makes calculations easier while still being statistically representative of the stream as a whole. One example might be a search engine wanting to answer the question “What fraction of the typical user’s queries were repeated over the past month?”. Assume also that, due to space constraints, we wish to store only \(\frac{1}{10}\)th of the stream elements.

An easy but incorrect approach would be to generate a random number from 0 to 9 for each incoming element, and only keep the element when the random number is 0. However, this gives the wrong answer because it also drastically reduces each user’s fraction of repeated queries. Suppose a user has \(s\) queries that were issued exactly once, \(d\) queries that were issued exactly twice, and no queries issued more than twice. The correct fraction of repeated queries for this user is \(\frac{d}{s+d}\). Using the sampling method just described, however, the fraction we expect to measure is \(\frac{d}{10s + 19d}\).
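To see where \(\frac{d}{10s + 19d}\) comes from, consider the expected composition of the sample under \(\frac{1}{10}\) sampling (an expected-value sketch, not an exact count for any single run). A singleton query survives with probability \(\frac{1}{10}\); a twice-issued query has both copies survive with probability \(\frac{1}{100}\) and exactly one copy survive with probability \(2 \cdot \frac{1}{10} \cdot \frac{9}{10} = \frac{18}{100}\). Only the queries whose two copies both survive look repeated in the sample, so the fraction we measure is approximately

\[
\frac{\frac{d}{100}}{\frac{s}{10} + \frac{18d}{100} + \frac{d}{100}} = \frac{d}{10s + 19d}.
\]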

The right way would be to consider only \(\frac{1}{10}\)th of all users, but keep every query from those randomly selected users. Assume further that the list of possible users is so large that it isn’t feasible to store which users should have their queries tracked and which shouldn’t. We can instead use a hash function that maps each user to one of 10 buckets. If the user hashes to the first bucket, then this is a user whose queries we want to track. Note that we do not actually need to save the users into buckets; we only need to know at runtime, via the fast hash function, which bucket a user falls into. This saves us from having to keep track of all users in the universal set.
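The hashInt helper used in the code below is defined elsewhere in the notebook; as an assumption about what it does, one possible implementation that stably maps a user id to one of numBuckets buckets could look like this:

import hashlib

def hashInt(value, numBuckets):
    # Hash the value's string form and reduce it to a bucket index.
    # md5 is chosen only for its uniform spread, not for security.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % numBuckets

Because the hash of a given user id is deterministic, either all of that user’s queries are kept or none are, which is exactly the property we need.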

import numpy as np

# Simulate a stream of 1,000,000 (userId, word) query pairs.
# userIds (the possible users) and words (a word list) are assumed to be defined earlier.
n = 0
data_list = []

while n < 1000000:
    userId = np.random.choice(userIds)
    word = np.random.choice(words)
    
    data_list.append((userId, word))
    n += 1
    
data_iterator = iter(data_list)
samples = {}  # samples[userId][word] -> number of times the sampled user issued that query
n = 0

for val in data_iterator:
    userId = val[0]
    word = val[1]
    
    # Hash to 10 buckets; process only if the user hashes to the first bucket
    if hashInt(userId, 10) == 0:
        if userId not in samples: 
            samples[userId] = {}
        
        if word not in samples[userId]:
            samples[userId][word] = 0
        
        samples[userId][word] += 1
        
    # Print the fraction of repeated queries every 100000 elements
    n += 1
    if n % 100000 == 0:
        fractionRepeats = 0
        for userId in samples:
            repeats = 0
            
            for word in samples[userId]:
                if samples[userId][word] > 1:
                    repeats += 1
    
            fractionRepeats += repeats / len(samples[userId])

        fractionRepeats /= len(samples)
        print("Iterations:", n, 'Average Repeat Per User:', fractionRepeats)
Iterations: 100000 Average Repeat Per User: 8.48824378236143e-05
Iterations: 200000 Average Repeat Per User: 0.0004026243221023727
Iterations: 300000 Average Repeat Per User: 0.0003966167804915212
Iterations: 400000 Average Repeat Per User: 0.0004724339706863245
Iterations: 500000 Average Repeat Per User: 0.00086515906940257
Iterations: 600000 Average Repeat Per User: 0.0010044780481139246
Iterations: 700000 Average Repeat Per User: 0.0012795931598440701
Iterations: 800000 Average Repeat Per User: 0.0015721791194151934
Iterations: 900000 Average Repeat Per User: 0.0018303686818008796
Iterations: 1000000 Average Repeat Per User: 0.002072822397857278

Fixed Size Samples#

For infinite or very large streams, even our set of samples will eventually become too large for our working storage. In such cases we may want to fix a maximum number of samples to keep and randomly replace older samples with newer ones over time. One technique for doing this is Reservoir Sampling.

Let \(s\) be the maximum number of samples we want to keep, and \(n\) be the number of stream elements we have seen so far.

  1. While \(n \le s\), keep every arriving element.

  2. A new element arrives when \(n > s\). With probability \(\frac{s}{n}\), keep the new element; otherwise discard it.

  3. If we keep the new element, replace one of the \(s\) stored samples, chosen uniformly at random, with it.

After \(n\) elements, the sample contains each element seen so far with probability \(\frac{s}{n}\). For our user-query example above, it may still be better to think of replacing whole users rather than their individual queries, so that each sampled user’s query history stays complete.
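A short inductive argument (a sketch, following the three steps above) shows why the \(\frac{s}{n}\) guarantee holds. Suppose every element seen so far is in the reservoir with probability \(\frac{s}{n}\) just before element \(n+1\) arrives. The new element is kept with probability \(\frac{s}{n+1}\), and an element already in the reservoir survives unless the new element is kept and happens to evict it:

\[
\frac{s}{n}\left(1 - \frac{s}{n+1}\cdot\frac{1}{s}\right) = \frac{s}{n}\cdot\frac{n}{n+1} = \frac{s}{n+1}.
\]

The base case is immediate, since the first \(s\) elements are all kept, so the invariant holds after every step.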

We implement the fixed-size sampling algorithm detailed above with \(s=20\), streaming through our words list.

import random

samples = []        # the reservoir of kept elements
maxSamples = 20     # s: maximum number of samples to keep
samplesTracked = 0  # n: number of stream elements seen so far

for w in words:
    samplesTracked += 1
    
    if len(samples) < maxSamples:
        # The first s elements always enter the reservoir.
        samples.append(w)
    else:
        # Keep the new element with probability s/n, evicting a random old sample.
        if random.random() < (maxSamples / samplesTracked):
            randIndex = random.randint(0, maxSamples-1)
            samples[randIndex] = w

print("Samples:", len(samples))
print(samples)
Samples: 20
['wrencher', 'wonderfulness', 'Mantuan', 'dimorphic', 'disturbance', 'Sparassis', 'unstow', 'peastone', 'downline', 'manneristically', 'spelterman', 'darned', 'calathiform', 'oblongly', 'excavationist', 'oneirocrit', 'subloreal', 'unspiritedly', 'binomially', 'thioarsenic']
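As a quick sanity check on the \(\frac{s}{n}\) guarantee, we can wrap the loop above in a small function and run it many times over a synthetic stream; each element should then land in the reservoir roughly \(\frac{20}{1000} = 2\%\) of the time. The function name and parameters below are illustrative, not part of the original code.

import random
from collections import Counter

def reservoir_sample(stream, s):
    # Plain reservoir sampling, identical in spirit to the loop above.
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if len(reservoir) < s:
            reservoir.append(x)
        elif random.random() < s / n:
            reservoir[random.randint(0, s - 1)] = x
    return reservoir

# Over many runs, count how often each element of a 1,000-element stream
# ends up in a size-20 reservoir; the smallest and largest observed
# frequencies should both hover around 0.02.
counts = Counter()
runs = 5000
for _ in range(runs):
    counts.update(reservoir_sample(range(1000), 20))

print(min(counts.values()) / runs, max(counts.values()) / runs)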