Current projects

RELAX - Relaxed Semantics Across the Data Analytics Stack

RELAX (Relaxed Semantics Across the Data Analytics Stack) is a Doctoral Network (DN) project that has received funding from the European Union’s HORIZON-MSCA-2021-DN-01 call under Marie Skłodowska-Curie grant agreement 101072456.

The project has a duration of four years starting 1 March 2023. Twelve Doctoral Candidates (DC) perform research in the project, together with their academic and industrial supervisors. In the Data-Intensive Systems research group, we host two doctoral projects, starting 1 October 2023 and 1 November 2023, respectively.

Main objectives

To develop the principles of robust data analytic algorithms in the face of uncertain, inaccurate and/or biased data. The targets are

(i) to understand the interplay between imperfections of the data and imperfections of the computation in order to tune one within the boundaries allowed by the other;
(ii) to design methodologies and algorithms to ensure robust decision-making, in particular by ensuring qualitative properties such as uncertainty, reproducibility, and explainability.

To develop new algorithms for indexing and summarisation building upon the data attributes explored in WP1. The targets are

(i) to develop algorithms and indexing structures that explore application-tailored trade-offs between accuracy, speed and performance;
(ii) to investigate algorithms for compression, summarisation and approximation based on controlling the precision and quality of data.

To develop new algorithms and new coordination and synchronisation models that support asynchronous and incremental AI, ML and large-scale data processing, giving better performance and better freshness without loss of accuracy compared to contemporary barrier-based synchronous approaches, which suffer from limited scalability.

To develop and organize a bespoke network-wide joint training programme.

To develop and implement the communication strategy, including social media and web presence, traditional media announcements, identifying and taking part in outreach activities, exploitation of results, managing open source software and open data.

Find more information on the official project home page at https://blogs.qub.ac.uk/relax-dn/

Future Cropping - Data Analysis to Support Precision Farming

The overall aim of Future Cropping is to utilize the potential benefits of precision farming and data communication by integrating large amounts of data from agro- and environmental technologies with area and climate data. Based on these data, a decision tool will be developed, as well as new technologies to support real-time decision making in the field.

Farmers already have large amounts of data available, e.g. yield monitoring data, GPS registrations, and data on soil types and climate. Machinery equipped with sensors is expected to harvest even more data during daily farm management. These data can be of great value if you know how to use them.

Future Cropping will develop a decision tool that collects and combines data from different sources. The tool will support real-time decision making adapted to each specific part of a field and thereby optimize crop management.

The aim of Future Cropping is to increase crop yield and quality without increasing the environmental impact.

The project is supported by the INNO+ program of the Innovation Fund Denmark. It runs for five years, and the overall budget amounts to 99,990 m DKK.

Find more information on the official project home page at https://futurecropping.dk/en/about-future-cropping/

Effective, efficient and robust clustering models for molecular dynamics simulations

Understanding proteins at the molecular level requires insight into the patterns their moving elements show. Clustering is a way to extract these patterns from simulation data in computational chemistry, but existing clustering methods are limited in that they reduce input data to handle the complexity of the problem, require long and expensive simulation trajectories, and are sensitive to parameter choices. In this project we address these shortcomings and create clustering concepts and algorithms that allow full-detail tracking of molecular movement and reliably determine patterns that help to understand and parse information from the large amount of data available from very long molecular dynamics simulations. The new concepts will also make it possible to maximize the knowledge retained from relatively short trajectories, and will assist the domain expert in determining the most suitable parameters. The project thereby contributes to the data mining research field in computer science with new concepts and algorithms in moving cluster detection, and to the computational biology/chemistry fields with new findings on the behavior of proteins at much reduced computational cost.

This project is funded by Villum Fonden in the Villum Synergy program.

FounData

The FounData (Creating tools and methods for Data Foundations) project is funded by the AUFF NOVA scheme. It funds PhD candidate Simon Enni in our group, and a postdoctoral scholar in Susanne Bødker's group, Computer Mediated Activity.

In this digital age, more and more data reflect different aspects of our lives. We witness an increased interest in using this data from public and private sources to foster innovation, decision making and commercial growth. However, these promising ideas have not fully materialized in practice, and we claim that a core reason is our lack of understanding of the available data, and of data analysis products. Existing methods require substantial expertise in data analysis, which severely limits practical use. As a result, communication between data producers, analysts, and data consumers tends to be difficult and biased. For example, public infrastructure hearings rely on standard data products that many stakeholders experience as challenging to relate to, to interpret in the intended manner, and to criticize on an informed basis. FounData creates new methods that provide an understanding of the benefits and limitations of data. Our interdisciplinary effort bridges technical solutions in computer science for analysis of data and for designing interfaces and computer supported collaborative work, thereby creating a novel research direction at the heart of data analysis and utilization.

Data Science on the Desktop

Data science develops methods for analyzing data collected from applications as diverse as business decision making, genome analysis, or movie recommendations. The promise is that this will allow us to understand the (digital) world around us to a much better degree. Current solutions, however, are still designed for single-core computers. Modern laptops and desktop computers give us access to several cores, and the graphics card can even be used for non-graphical computations. In this project, we exploit the enormous potential for fast and scalable data science by creating solutions that make use of the compute power available in standard computers. We develop general strategies that allow us to make existing solutions much faster by using more of the available hardware resources. We build and evaluate prototypes to demonstrate dramatically faster data science results on standard computers.

The project is funded by the Independent Research Fund Denmark.

Data Leak Prevention

Leakage of sensitive information from unstructured text documents is a costly problem for both government and industrial institutions. Traditional approaches for data leak prevention are commonly based on the hypothesis that sensitive information is reflected in the presence of distinct sensitive words. However, for complex sensitive information, this hypothesis may not hold. We detect complex sensitive information in text documents by learning the semantic and syntactic structure of text documents. Our approach is based on natural language processing methods for paraphrase detection, and uses recursive neural networks to assign sensitivity scores to the semantic components of the sentence structure. We focus on interactive detection of sensitive information where users evaluate real documents, alter documents or prepare free text, and subject them to sensitive information detection. This allows adapting the approach to a particular sensitive information type and application domain, thereby providing effective support of document redaction prior to publication.

Analyzing Big Data on Modern Hardware

Data mining supports the automatic analysis of Big Data, the large volumes of data from diverse sources that are continuously being generated in e-commerce, monitoring applications, social networks and many other applications. Existing data mining models and algorithms, however, focus largely on the computational model of single-threaded processing. This is in sharp contrast to the actual computational model in modern hardware, where multi-core CPUs and graphics cards (GPUs) that support general purpose computation are now standard. In this project, we exploit the enormous potential for fast analysis of Big Data by creating algorithms for active clustering, anytime learning or parallel data mining. Part of this project is supported by a Villum Foundation postdoc block stipend.

Parallel algorithms for skyline and skycube computation on multicore CPUs and GPUs

Multi-criteria decision making can be a challenging task when a number of criteria are presented on different scales. A prototypical example is the choice of a hotel, where criteria could be e.g. price and distance to the beach. Depending on user preferences, the optimal choice may vary greatly. Skyline queries, as they are known in database research (the Pareto front in operations research), identify those data items that are optimal with respect to any (unknown) user preference function. Such skylines are useful in greatly reducing the decision space by removing data items that are clearly not competitive. Retrieving the skyline, however, is computationally costly, and can be prohibitive for large data sets and for data sets with many attributes (high-dimensional data).
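To make the hotel example concrete, here is a minimal sketch of a skyline computation. It is a naive quadratic baseline for illustration only, not the project's algorithm. The hotel values, the names `dominates` and `skyline`, and the convention that lower values are better on every criterion are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (price, distance to beach in metres).
Hotel = Tuple[int, int]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every criterion and
    strictly better on at least one (assumption: lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Sequence[float]]) -> List[Sequence[float]]:
    """Keep every point that no other point dominates.

    Each point is compared against all others, so the cost grows
    quadratically with the number of points.
    """
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up example data.
hotels: List[Hotel] = [
    (100, 500),
    (80, 800),
    (120, 300),
    (110, 600),  # dominated by (100, 500): more expensive and farther away
    (90, 900),   # dominated by (80, 800)
]

print(skyline(hotels))
```

The three hotels that remain each represent a different trade-off between price and distance. The pairwise comparison in `skyline` is what makes the computation expensive on large or high-dimensional data.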

Multicore CPUs and cheap co-processors such as GPUs create opportunities for vastly accelerating database queries. However, given the differences in their threading models, expected granularities of parallelism, and memory subsystems, effectively utilising all cores with all co-processors for an intensive query is very difficult. We propose algorithmic solutions that provide an order of magnitude improvement on multicore CPUs and GPUs.
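One common way to expose parallelism in skyline computation is to partition the data, compute a local skyline per partition, and then merge the local results. The sketch below illustrates only that general structure; it is not the algorithm proposed in this project. The names `partitioned_skyline` and `workers` and the random test data are assumptions for the example. In standard CPython, threads do not speed up CPU-bound code like this, so the thread pool here stands in for the native threads or GPU kernels a real implementation would use.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a is no worse than b everywhere and strictly better somewhere
    (assumption: lower is better on every dimension)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive sequential skyline: keep the points that nothing dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points: Sequence[Point], workers: int = 4) -> List[Point]:
    """Partition, compute local skylines concurrently, then merge.

    A point on the global skyline is never dominated, so it survives in
    its partition; the final pass removes local survivors that are
    dominated by points from other partitions.
    """
    chunks = [points[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_skylines = list(pool.map(skyline, chunks))
    candidates = [p for part in local_skylines for p in part]
    return skyline(candidates)


# Made-up three-dimensional test data with a fixed seed.
rng = random.Random(0)
data = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
result = partitioned_skyline(data, workers=4)
print(len(result), "skyline points out of", len(data))
```

The merge step is sequential in this sketch. How to balance work across partitions and devices with different threading models and memory subsystems is the difficulty the paragraph above describes.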

eData - Scalable and Effective Anomaly Detection and Correction of Automatically Collected Data in e-Science

The shift towards e-Science and the use of computational techniques in the sciences brings with it the need for effective analyses of very large collections of often complex scientific data.

Much of today’s data handling is part of automated processing, ranging from measuring relevant variables, to data collection, to integration, to analysis, and finally to decision making. The term e-Science (electronic or enhanced science) is used to denote such data-intensive and computationally intensive work in collaborative science. Analytical results and decision making are based on various processes, each of which might introduce errors.

This project aims to provide a foundation for e-Science by contributing novel techniques that enable scientists to detect and correct anomalies in source data in an on-line, interactive, lineage-preserving, and semi-automatic manner. This contrasts with traditional algorithms, which operate in a batch manner on static data and require data mining expertise for their use. We propose a new paradigm that allows domain experts to tap into the full potential of data mining. To this end, we invent scalable algorithms that build on insights from, most notably, the area of subspace clustering. These algorithms offer effective foundations for anomaly detection and correction that render subsequent analyses robust in the context of imperfect source data.

eData is a project funded by the Danish Council for Independent Research - Technology and Production Sciences. More information can be found on the project webpage.

WallViz

The WallViz project uses highly interactive, wall-sized visualizations to help decision makers handle massive collections of data and make better decisions based on them.

We collaborate with researchers from the University of Copenhagen (DIKU) who are experts in visualization and user-centered design. The project is coordinated by project manager Kasper Hornbæk from the Human-Centered Computing group.

The focus of our group is on data management and data mining issues in the context of interactive exploration and decision making. We propose solutions that provide entry points for the analysis efficiently and incrementally, selecting points expected to be of high interest to the user. We group data to provide an overview of large collections of data, and identify anomalies or outliers for further inspection.

The project is supported by the Strategic Research Council. More information can be found on the project webpage.

Previous projects

GEOCROWD

GEOCROWD is an initial training network that aims to advance the state-of-the-art in managing large amounts of semantically rich, user-generated geospatial data in a web setting. The project offers full-time support for a dozen young, initial-stage scientists. More information can be found on the project webpage.

REDUCTION

REDUCTION: Reducing Environmental Footprint based on Multi-Modal Fleet management Systems for Eco-Routing and Driver Behaviour Adaptation
REDUCTION is a collaborative research project funded by the European Commission under the 7th Framework Programme, Call 7 (2011-2014). REDUCTION aims at combining vehicular and ICT technologies for collecting and analyzing historic and real-time data about driving behaviour, routing information, and the associated carbon emission measurements.

Cloud infrastructure

The cloud infrastructure project develops new data management infrastructure for cloud computing environments. More information can be found on the project webpage.

MOVE

MOVE is an Action of the COST Programme (European Cooperation in Science and Technology) for knowledge extraction from massive amounts of data about moving objects. More information can be found on the project webpage.

TimeCenter

TimeCenter is an international center for the support of temporal database applications on traditional and emerging DBMS technologies. More information can be found on the project webpage.
