SuffixTree

This module wraps a C implementation of a suffix tree.

The python bindings are mainly intended for experimentation with suffix tree based algorithms. Consequently, they are not particularly space or time efficient, but instead easily extensible.

This article is translated to Serbo-Croatian by Anja Skrba from Webhostinggeeks.com, into Estonian by Karolin Lohmus, and into Latvian by Catherine Desroches.

Usage

To build a simple suffix tree — a suffix tree of a single string — import the SuffixTree class and instantiate it with the string:

from suffix_tree import SuffixTree
stree = SuffixTree('mississippi') 

You can now traverse the tree through a number of iterators: leaves, innerNodes, preOrderNodes, and postOrderNodes, that traverses all leaves, all inner nodes, all nodes in pre- and post-order, respectively:

>>> for l in stree.leaves:
...     print l.pathLabel
...
mississippi$
ississippi$
issippi$
ippi$
i$
ssissippi$
ssippi$
sissippi$
sippi$
ppi$
pi$
$

For more low-level access, the root node is available as the stree.root attribute.

The string parameter to the suffix tree is available through the string property::

>>> stree.string[2:4]
'ss'

A generalised suffix tree — a suffix tree representing a set of strings — can be constructed using the GeneralisedSuffixTree class:

from suffix_tree import GeneralisedSuffixTree
stree = GeneralisedSuffixTree(['mississippi','sippissi']) 

In addition to the functionality of the simple suffix tree, the generalised suffix tree has a function for extracting all shared substrings of the set of strings, or all shared substrings of a given minimal length

for shared in st.sharedSubstrings():
    print '-'*70
    for seq,start,stop in shared:
        print seq, '['+str(start)+':'+str(stop)+']',
        print sequences[seq][start:stop],
        print sequences[seq][:start]+'|'+sequences[seq][start:stop]+\
              '|'+sequences[seq][stop:]
print '='*70

for shared in st.sharedSubstrings(2):
    print '-'*70
    for seq,start,stop in shared:
        print seq, '['+str(start)+':'+str(stop)+']',
        print sequences[seq][start:stop],
        print sequences[seq][:start]+'|'+sequences[seq][start:stop]+\
              '|'+sequences[seq][stop:]
print '='*70 

The nodes have attributes for the path label (pathLabel) — the label on the path from the root to the node — for the edge label (edgeLabel) — the label on the edge from the parent to the node — for accessing the first and last child of the node (firstChild and lastChild), for accessing the left and right sibling (prev and next) and a few more. See the on-line help for more details.

Each node has an associated dictionary, making it easy to annotate the nodes for algorithmic prototyping, as illustrated below in a piece of code taken from the implementation of the generalised suffix tree:

for n in self.postOrderNodes:
    if n.isLeaf:
        seq,idx = self._translateIndex(n.index)
        n.pathIndices = [(seq, idx)]
        n.sequences = [seq]
    else:
        pathIndices = [] ; sequences = []
        c = n.firstChild
        while c is not None:
            pathIndices += c.pathIndices
            sequences += c.sequences
            c = c.next

        seqsInSubtree = {}
        for s in sequences:
            seqsInSubtree[s] = 1

        n.pathIndices = pathIndices
        n.sequences = [s for s in seqsInSubtree]

Here, the nodes are annotated with pairs of sequence times index for the individual sequences in the generalised suffix tree, and with the set of sequences represented in the nodes' sub-trees.

The generalised tree is build using a concatenation of all the input strings. The path labels are therefore not exactly the labels you'd expect, but using the annotation the expected labels can be extracted.

A final word of warning: The python bindings really need the cyclic garbage collector to be able to clean up suffix trees. The tree needs to reference the nodes, so they wont be deleted with their annotation, and the nodes need to reference the tree so it wont disappear while the script references the node.

Unfortunately, I have not been able to make that work. Since we really need to clean up the trees when we prototype, I have chosen the following workaround: the tree reference the nodes, but the nodes do not reference the tree. It is therefore the script-writers responsibility to make sure that the tree is not garbage collected while nodes in the tree are referenced. The alternative would result in trees never being garbage collected. I hope to fix this problem soon.

Installation

To install the python binding for suffix trees, download the source code and unpack it (using tar xzf suffix_tree-a.b.c.tar.gz, where a.b.c is the version number). Then run the setup script:

python setup.py install 

This will build the C library and the python bindings, and will install the module.

See python setup.py --help for more details.

An older version (versions 1.0 and 1.1), also downloadable from here, was based on a a suffix tree implementation by Dotan Tsadoc and Shlomo Yona. If you use this version, be aware that the API is slightly changed.

Contact

Thomas Mailund, <mailund@birc.au.dk>, Bioinformatics Research Center, University of Aarhus.

Time-stamp: "2006-01-26 21:55:19 mailund"