
Best-Ever Algorithm for Huge Streams of Data

by Kasper Green Larsen

Finding heavy hitters in a data stream, also known as elephants or frequent items, is one of the core and most practically important problems in the study of streaming algorithms.

The most basic form of the problem is simple: given a long stream of elements, the goal is to report all frequent items in the stream using an algorithm with very low memory consumption, much smaller than the stream length. In practice, heavy hitters algorithms have been used to find popular destination addresses and heavy bandwidth users by AT&T, to find popular search query terms at Google, and to answer so-called "iceberg queries" in databases (to name a few of many applications). In the theoretical study of streaming algorithms, finding heavy hitters has played a key role as a subroutine in solving many other problems such as entropy estimation, sampling, duplicate detection and norm estimation.
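To make the basic problem concrete, the sketch below shows the classic Misra-Gries summary in Python; this is a generic textbook baseline for illustration only, not the ExpanderSketch discussed in the talk. With only k − 1 counters it is guaranteed to retain every item that occurs in more than a 1/k fraction of the stream, at the cost of underestimated counts. The example stream and the choice k = 3 are purely illustrative.

```python
from collections import Counter
from collections.abc import Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Counter:
    """Misra-Gries summary: keeps at most k - 1 counters and guarantees
    that every item occurring more than n/k times (n = stream length)
    survives in the summary. Reported counts are underestimates."""
    counters: Counter = Counter()
    for item in stream:
        if item in counters or len(counters) < k - 1:
            counters[item] += 1
        else:
            # No free counter: decrement all counters and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Illustrative usage: n = 12, k = 3, so any item seen more than
# n/k = 4 times is guaranteed to appear in the summary.
stream = ["a", "b", "a", "c", "a", "a", "b", "a", "d", "a", "b", "a"]
print(misra_gries(stream, k=3))  # "a" (7 occurrences) is guaranteed to survive
```

A second pass over the stream (or a small exact count of the surviving candidates) can then filter out any false positives, which is how such summaries are typically used in practice.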

In this talk, I will present the high-level ideas of a new heavy hitters algorithm, the ExpanderSketch, developed together with Jelani Nelson (Harvard), Huy Nguyen (Northeastern) and Mikkel Thorup (Copenhagen). The ExpanderSketch simultaneously dominates all previous algorithms in terms of asymptotic space, update time, query time and error probability. While our algorithm might not be practical due to large constants, we also developed a simple practical algorithm that outperforms the previous state-of-the-art algorithms by at least an order of magnitude.

Our new algorithm was recently featured by Quanta Magazine and wired.com under the catchy title "Best-Ever Algorithm for Huge Streams of Data".