ALCOMFT-TR-01-74
|

|
Stefan Burkhardt and Juha Kärkkäinen
Better Filtering with Gapped q-Grams
MPI.
Work packages 1, 4 and 5.
May 2001.
Abstract: The q-gram filter is a popular filtering method for approximate
string matching. It compares substrings of length q (the
q-grams) in the pattern and the text to identify the text areas
that might contain a match. A
generalization of the method is to use gapped q-grams,
subsets of q characters in some fixed non-contiguous shape,
instead of contiguous substrings. Although mentioned a few times in
the literature, this generalization has never been studied in any
depth. In this paper, we report the first results from a study on
gapped q-grams. We show that gapped q-grams can provide orders
of magnitude
faster and/or more efficient filtering than contiguous q-grams.
The performance, however, depends on the shape of the q-grams.
The best shapes are rare and often possess no apparent
regularity. We show how to recognize good shapes and demonstrate
with experiments their advantage over both contiguous and average
shapes. We concentrate here
on the k mismatches problem, but also outline an approach for
extending the results to the more common k differences problem.
Postscript file: ALCOMFT-TR-01-74.ps.gz (102 kb).
System maintainer Gerth Stølting Brodal <gerth@cs.au.dk>