MADALGO Theory Seminar: Jesper Sindahl Nielsen, MADALGO

2014.11.24 | Katrine Østerlund Rasmussen

Date Wed 03 Dec
Time 14:15 15:00
Location Nygaard-327


Top-k Term-Proximity in Succinct Space


Let D = {T_1 , T_2 , . . . , T_D } be a collection of D string documents of n characters in total, that are drawn from an alphabet set Σ = [σ]. The top-k document retrieval problem is to preprocess D into a data structure that,
given a query (P[1..p], k), can return the k documents of D most relevant to the pattern P . The relevance is captured using a predefined ranking function, which depends on the set of occurrences of P in T_d . For example, it can be the term frequency (i.e., the number of occurrences of P in T_d ), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of P in T_d ), or a pattern-independent importance score of T_d such as PageRank. Linear space and optimal query time solutions already exist for this problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space efficient data structures for term proximity based retrieval have been evasive. In
this talk we present the first sub-linear space data structure for the term-proximity relevance function, which uses only o(n) bits on top of any compressed suffix array of D and solves queries in time O((p + k) polylog n).