ALCOMFT-TR-01-158

ALCOM-FT
 

Mireille Régnier, Alexandre Lifanov and Vsevolod Makeev
Three variations on word counting
INRIA. Work package 4. June 2001.
Abstract: We address the problem of assessing the statistical significance of pattern occurrence frequencies in biopolymer sequences. To this end, we investigate the distribution of pattern occurrences in random sequences under Bernoulli and Markov models.

We demonstrate how the known explicit formulae for expectation and variance are modified under the different pattern counting schemes customary in computational biology, and we provide dedicated fast procedures. We discuss several approximations, notably the widely used approximation of the variance by the expectation. Finally, we discuss the sensitivity of statistical evaluation to the counting scheme or the probability model. Special attention is paid to changes in Z-scores that may introduce false positives. We provide new criteria to estimate a priori the validity of the various approximations or the sensitivity, as well as simple optimized procedures.
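As an illustration of the quantities discussed above, here is a minimal sketch for the simplest setting only: a single word, overlapping occurrences counted, and a Bernoulli (independent letters) model. It uses the textbook second-moment computation over pairs of positions and is not taken from the report; the function names and the `probs` dictionary are assumptions made for this example.

```python
from math import prod

def word_prob(word, probs):
    """Probability of `word` under a Bernoulli model with letter probabilities `probs`."""
    return prod(probs[c] for c in word)

def overlap_count_moments(word, n, probs):
    """Mean and variance of the number of (possibly overlapping) occurrences
    of `word` in a random text of length n with independent letters."""
    m = len(word)
    positions = n - m + 1          # number of possible start positions
    if positions <= 0:
        return 0.0, 0.0
    p = word_prob(word, probs)
    mean = positions * p
    # Diagonal terms: each indicator has variance p(1 - p).
    var = positions * p * (1.0 - p)
    # Covariances between starts at distance d < m; farther starts are independent.
    for d in range(1, min(m, positions)):
        if word[d:] == word[:m - d]:
            # The word can overlap itself at shift d: the joint event is
            # the word followed by its last d letters.
            joint = p * word_prob(word[m - d:], probs)
        else:
            joint = 0.0
        var += 2.0 * (positions - d) * (joint - p * p)
    return mean, var

def z_score(observed, mean, var):
    """Standardized deviation of an observed count from its expectation."""
    return (observed - mean) / var ** 0.5
```

Comparing the returned variance with the returned mean shows directly how far the "variance equals expectation" approximation is off for a given word: the overlap terms are positive for self-overlapping words such as `AA` and absent for words such as `AC`, so a Z-score computed with the approximation shifts accordingly.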

We consider three applications. First, several searching algorithms for regulatory sites make use of statistics on consensus words in a single-stranded DNA text. Second, we propose a new counting scheme to count consensus words on both strands of double-stranded DNA. Our third application is the counting of profiles, especially PROSITE regular expressions.
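To make the two-strand setting concrete, the following sketch counts a word on both strands by also counting its reverse complement on the given strand. This is a naive baseline written for illustration, not the counting scheme proposed in the report; the function names and the choice to count reverse-complement palindromes only once are assumptions.

```python
# Translation table for the complement of an upper-case DNA word.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a DNA word over the alphabet ACGT."""
    return word.translate(_COMPLEMENT)[::-1]

def count_overlapping(text, word):
    """Number of occurrences of `word` in `text`, overlaps included."""
    return sum(
        1
        for i in range(len(text) - len(word) + 1)
        if text.startswith(word, i)
    )

def count_both_strands(text, word):
    """Occurrences of `word` on either strand, read from the given strand.

    An occurrence of the reverse complement on this strand is an occurrence
    of the word on the opposite strand. A word equal to its own reverse
    complement is counted once per position (an assumption of this sketch).
    """
    rc = reverse_complement(word)
    if rc == word:
        return count_overlapping(text, word)
    return count_overlapping(text, word) + count_overlapping(text, rc)
```

For example, `count_both_strands("AAATTT", "AAA")` adds the one occurrence of `AAA` to the one occurrence of its reverse complement `TTT`.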

Postscript file: ALCOMFT-TR-01-158.ps.gz (103 kb).
