Maximum Coverage Representative Skyline

ABSTRACT

Malene S. Søholm
Thesis defended 15-10-2015

The skyline query is a preference-based query used in database systems for finding interesting data points. The query returns the set of data points which are not dominated by any other data point in the dataset. However, the skyline may become very large especially for high-dimensional data, possibly overwhelming the user with the size of the returned result set.

Several approaches have tried to solve this issue by returning only k skyline points according to some criteria, e.g. ranking and representative skylines. However, the approaches are either too focused on the individual properties of the skyline points, disregarding the diversity of the skyline, they are not scale-invariant, leading to inconsistent results, or they retrieve the entire skyline even for a small value of k.

In this thesis, a novel representative skyline is introduced, namely the maximum coverage representative skyline, which returns the set of k skyline points collectively dominating the largest area of the data space. A (1−1/e)-approximate greedy algorithm is proposed for retrieving the maximum coverage representative skyline, which is both scale-invariant, diverse, and does not require computing the entire skyline for small k. The algorithm is evaluated on real and synthetic datasets both with regards to efficiency and representativeness, showing that the maximum coverage representative skyline is a promising alternative to the current approaches.