ALCOMFT-TR-02-43

ALCOM-FT
 

Juha K\"arkk\"ainen
Computing the Threshold for q-Gram Filters
MPI. Work packages 1 and 4. May 2002.
Abstract: A popular and much studied class of filters for approximate string matching is based on finding common q-grams, substrings of length q, between the pattern and the text. A variation of the basic idea uses gapped q-grams and has been recently shown to provide significant improvements in practice. A major difficulty with gapped q-gram filters is the computation of the so-called threshold which defines the filter criterium. We describe the first general method for computing the threshold for q-gram filters. The method is based on a carefully chosen precise statement of the problem which is then transformed into a constrained shortest path problem. In its generic form the method leaves certain parts open but is applicable to a large variety of q-gram filters and may be extensible even to other classes of filters. We also give a full algorithm for a specific subclass. For this subclass, the algorithm has been implemented and used succesfully in an experimental comparison.
Postscript file: ALCOMFT-TR-02-43.ps.gz (100 kb).

System maintainer Gerth Stølting Brodal <gerth@cs.au.dk>