ALCOMFT-TR-01-43
Philippe Flajolet, Yves Guivarc'h, Wojciech Szpankowski and Brigitte Vallée
Hidden Pattern Statistics
INRIA.
Work packages 1 and 4.
April 2001.
Abstract: We consider the sequence comparison problem, also known as
the "hidden pattern" problem, where
one searches for a given subsequence
in a text (rather than for a string, understood as
a sequence of consecutive symbols).
A characteristic parameter is the number of occurrences of a given pattern w
of length m as a subsequence in a random text of length n generated
by a memoryless source. To define valid occurrences, the spacings between
letters of the pattern may be either constrained or unconstrained.
We determine the mean and the
variance of the number of occurrences, and establish a Gaussian limit law.
These results are obtained via combinatorics on words, formal
language techniques, and methods of analytic combinatorics based on
generating functions and convergence of moments.
The motivation for studying this problem comes from the search for a
reliable
threshold for intrusion detection, from textual data processing
applications, and from molecular biology.
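
As an illustration of the parameter studied in the abstract, the following minimal Python sketch counts the occurrences of a pattern w as a subsequence of a text, with optional bounds on the spacings between consecutive pattern letters. It is not taken from the report: the function names `count_occurrences` and `exact_mean`, the `max_gaps` argument, and the convention that a gap bound d means "the next pattern letter lies at most d positions after the previous one" are assumptions made for this example. For the unconstrained case, the sketch also checks the elementary identity (from linearity of expectation) that the mean number of occurrences under a memoryless source is binomial(n, m) times the product of the letter probabilities of w.

```python
from itertools import product
from math import comb, isclose, prod


def count_occurrences(text, pattern, max_gaps=None):
    """Count occurrences of `pattern` as a subsequence of `text`.

    max_gaps (hypothetical parameter) is a list of length len(pattern) - 1;
    max_gaps[j] bounds the distance between the text positions matched by
    pattern[j] and pattern[j + 1]. A value of None leaves that gap
    unconstrained; max_gaps=None leaves every gap unconstrained.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 1  # the empty pattern occurs exactly once
    if max_gaps is None:
        max_gaps = [None] * (m - 1)
    # ways[i] = number of occurrences of the current pattern prefix
    # whose last letter is matched at text position i.
    ways = [1 if c == pattern[0] else 0 for c in text]
    for j in range(1, m):
        gap = max_gaps[j - 1]
        new_ways = [0] * n
        for i in range(n):
            if text[i] != pattern[j]:
                continue
            lo = 0 if gap is None else max(0, i - gap)
            new_ways[i] = sum(ways[lo:i])
        ways = new_ways
    return sum(ways)


def exact_mean(pattern, n, probs):
    """Exact mean number of unconstrained occurrences in a random text of
    length n from a memoryless source, by enumerating all texts.
    Exponential in n, so only suitable for very small examples."""
    total = 0.0
    for letters in product(sorted(probs), repeat=n):
        weight = prod(probs[c] for c in letters)
        total += weight * count_occurrences(letters, pattern)
    return total


if __name__ == "__main__":
    probs = {"a": 0.3, "b": 0.7}  # illustrative source probabilities
    print(count_occurrences("abab", "ab"))
    print(count_occurrences("abab", "ab", max_gaps=[1]))
    # Brute-force mean versus binomial(n, m) * P(w) for w = "ab", n = 4.
    print(exact_mean("ab", 4, probs), comb(4, 2) * probs["a"] * probs["b"])
```

The dynamic program runs in time O(m n^2) as written, which is enough for small experiments; a prefix-sum array would bring this down to O(m n).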
Postscript file: ALCOMFT-TR-01-43.ps.gz (81 kb).