Most Common Recent Elements#

Previous counting problems have assumed a sliding window that holds the tail-end of a stream for processing. However sometimes we want to consider all the elements in the stream from the start while also distinguising between really old elements and data that is more recent. For example, we want to find out what movies are popular currently. While “currently” is a relatively ambiguous term, we do know that we want to discount movies that were popular decades ago, like The Godfather, and emphasize more recent movies like Avengers: Endgame. Imagine that we have a stream of movies being watched coming in according to increasing timestamp. A simple way to get the get most common movies would be to calculate the count for each movie, possibly even with some of the techniques above (i.e., counting ones). However, this will only give us an approximation and won’t satisfy our preference to discount older movies. One way to do this is through the technique of Decaying Windows.

Decaying Windows#

In decaying windows, we exponentially decay the count for each movie by a constant \(c\). \(c\) will have a really small value, like \(10^{-6}\) or \(10^{-9}\). The value for each element is decayed by multiplying the current count by \((1-c)\) at every time step. Given a stream \(a_1, a_2,...,a_t\), the exponentially decaying window of this stream is

(1)#\[\begin{equation} \sum_{i=0}^{t-1}a_{t-i}(1-c)^i \end{equation}\]

When a new element \(a_{t+1}\) is added into the stream all we need to do is:

  1. Multiply the current sum by \(1-c\).

  2. Add \(a_{t+1}\).

In our movie example, we imagine a separate stream for each movie with a 1 each time that movie is watched, and a 0 each time another movie is watched. The decaying sum of the 1’s measures the current popularity of the movie. To save space, we can completely disregard unpopular movies by dropping movies whose count falls below a certain threshold, such as 0.5. When new data arrives at the stream we do the following:

  1. For each movie whose score we are currently tracking, multiply the score by \((1-c)\).

  2. If the data is for movie \(M\), add 1 to \(M\)’s score if we are currently tracking it, or start tracking \(M\) and initialize it with a score of 1 if we are not.

  3. Drop any movies we are tracking if the score is less than our threshold of 0.5.

Due to the exponential decay, the sum of all scores is \(\frac{1}{c}\). There cannot be more than \(\frac{2}{c}\) movies with score of 0.5 or more, otherwise the sum of scores would be greater than \(\frac{1}{c}\). Therefore \(\frac{2}{c}\) is the maximum number of scores we would have to track.

Let’s provide an example using the MovieLens 100k dataset, a list of movie ratings from September 20, 1997 to April 22, 1998. First, let’s map movies to their release timestamp so we can easily sort them by when they were released.

movieStream = {}
movieNames = {}

for line in open('ml-100k/u.item', encoding = "ISO-8859-1"):
    tokens = line.strip().split('|')
    movieNames[tokens[0]] = tokens[1]
    #print(tokens[0], tokens[1])
    
for line in open('ml-100k/u.data'):
    tokens = line.strip().split()
    movie = tokens[1]
    ts = tokens[3]
    
    #handle duplicate timestamps
    while ts in movieStream:
        ts += 'a'
    
    movieStream[ts] = movie

#Sort by timestamp and make it an iterator
movie_list = []
for key in sorted(movieStream.keys()):
    movie_list.append(movieStream[key])
    
movie_iter = iter(movie_list)

Next we count movie rating with decaying windows. Play with the variable \(c\) below by setting it to 0 to see which movies would be most popular by simple raw counts.

scores = {}

c = 0.0001

n = 0

for movie in movie_iter:
    
    remove = set()
    
    for key in scores:
        scores[key] *= (1-c)
        if scores[key] < 0.5 and key != movie:
            remove.add(key)

    
    for rem in remove:
        del scores[rem]
    
    if movie in scores:
        scores[movie] += 1
    else:
        scores[movie] = 1
        
    n += 1
    if n == 100000:
        break

sortedScores = {k: v for k, v in sorted(scores.items(), reverse=True, key=lambda item: item[1])}
count = 0
for movie in sortedScores:
    print(movieNames[movie], sortedScores[movie])
    count += 1
    if count == 10:
        break
Titanic (1997) 67.28333046382443
Contact (1997) 55.42960303711523
Air Force One (1997) 54.49994366275277
Star Wars (1977) 50.654564738716964
Full Monty, The (1997) 49.25542831929427
English Patient, The (1996) 46.814044430751856
Good Will Hunting (1997) 45.61329962426032
Liar Liar (1997) 44.01656947009622
Scream (1996) 43.58306822429063
Fargo (1996) 40.69582727523209

We find that Titanic is the most popular movie by our method, even though by raw count it wouldn’t even hit the top 10. This is because Titanic was released late 1997 so the reviews for it would be more recent.

Exercises#

  1. What would the most popolar movie be if we weren’t using decaying windows?