ALCOMFT-TR-03-18
|

|
Marianne Durand and Philippe Flajolet
Loglog Counting of Large Cardinalities
INRIA.
Work packages 1 and 5.
July 2003.
Abstract: Using an auxiliary memory smaller than the size of this abstract,
the LogLog algorithm
makes it possible to estimate in a single pass and within a few percents
the number of different words in the whole of Shakespeare's works.
In general the LogLog algorithm makes use of m ``small bytes'' of
auxiliary memory in order to estimate in a single pass
the number of distinct elements (the ``cardinality'')
in a file, and it does so with an accuracy that is of the order of 1/\sqrt{m}.
The ``small bytes'' to be used in order to count cardinalities till
Nmax comprise about loglog Nmax bits,
so that cardinalities well in the range of billions can be determined
using one or two kilobytes of memory only. The basic version of
the LogLog algorithm is
validated by a complete analysis. An optimized version,
super-LogLog, is also
engineered and tested on real-life data. The algorithm parallelizes
optimally.
Postscript file: ALCOMFT-TR-03-18.ps.gz (221 kb).
System maintainer Gerth Stølting Brodal <gerth@cs.au.dk>