ALCOMFT-TR-02-43
|

|
Juha K\"arkk\"ainen
Computing the Threshold for q-Gram Filters
MPI.
Work packages 1 and 4.
May 2002.
Abstract: A popular and much studied class of filters for approximate string
matching is based on finding common q-grams, substrings of length
q, between the pattern and the text. A variation of the basic idea uses
gapped q-grams and has been recently shown to provide
significant improvements in practice. A major difficulty with gapped
q-gram filters is the computation of the so-called threshold
which defines the filter criterium. We describe the first general
method for computing the threshold for q-gram filters. The method is
based on a carefully chosen precise statement of the problem which is
then transformed into a constrained shortest path problem.
In its generic form the method leaves certain parts open
but is applicable to a large variety of q-gram filters and may be
extensible even to other classes of filters. We also give a full
algorithm for a specific subclass. For this subclass, the algorithm
has been implemented and used succesfully in an experimental
comparison.
Postscript file: ALCOMFT-TR-02-43.ps.gz (100 kb).
System maintainer Gerth Stølting Brodal <gerth@cs.au.dk>