2014.09.29 | Katrine Østerlund Rasmussen

 Date Wed 08 Oct Time 14:15 — 15:00 Location Building 5335, Nygaard-395

Title

Computing Distance Statistics in Trees: A Key Problem in Phylogeny Research

Abstract

Many case studies in Ecology examine the interactions between groups of species that live in the same region. A very common problem that appears in these studies is the following: given a group of species R, we want to measure how closely related these species are. The standard way to do this is by using a phylogenetic tree; this is the tree that represents the evolution history of species. Each tree node corresponds to a different kind of species; the leaves represent species that exist today, while internal nodes represent ancestor species of the past. The tree edges indicate how species have evolved from each other, and the (positive) weights of the edges represent some kind of distance between populations e.g. amount of time since a speciation event.

Therefore, the problem that we want to solve can be formulated as follows: given a tree T and a set R of its leaf nodes (that represent the species that we want to examine) we want to measure what is the "distance" in T between the nodes in R. There are several ways to calculate some distance value between these nodes; we can consider the cost of the min-weight Steiner tree between these nodes, the average path cost etc. . By calculating one of these measures we get a value for the set R, say d(R). However, this value by itself is not enough to show if the species in R are closely related or not; to find that out we need to have a point of reference. Thus, we would like to compute the mean value of d(.) among all possible sets of leaves in T that consist of the same number of elements as R. Comparing this mean with d(R) would then solve the problem. Also, other than the mean, ecologists are interested to compute higher order statistics for d(.) among all subsets of leafs that have the same size as R.

In this talk, we present several efficient algorithms for computing statistics of distance measures in a tree. We present results for a variety of measures that are used in biological case studies, but we also describe a large collection of interesting open problems that are related with this topic. The presented results are part of a collaboration project between MADALGO and the Ecoinformatics group at AU.