Projects

Current projects

Data Science on the Desktop

Data science develops methods for analyzing data collected from applications as diverse as business decision making, genome analysis, and movie recommendations. The promise is that this will allow us to understand the (digital) world around us much better. Current solutions, however, are still designed for single-core computers. Modern laptops and desktop computers give us access to several cores, and can even use the graphics card for non-graphical computations. In this project, we exploit the enormous potential for fast and scalable data science by creating solutions that make use of the compute power available in standard computers. We develop general strategies that make existing solutions much faster by using more of the available hardware resources. We build and evaluate prototypes to demonstrate dramatically faster data science results on standard computers.

The project is funded by the Independent Research Fund Denmark.

FounData

The FounData (Creating tools and methods for Data Foundations) project is funded by the AUFF NOVA scheme. It funds PhD candidate Simon Enni in our group, and a postdoctoral scholar in Susanne Bødker's group Computer Mediated Activity.

In this digital age, more and more data reflect different aspects of our lives. We witness an increased interest in using this data from public and private sources to foster innovation, decision making and commercial growth. However, these promising ideas have not fully materialized in practice, and we claim that a core reason is our lack of understanding of the available data, and of data analysis products. Existing methods require substantial expertise in data analysis, which severely limits practical use. As a result, communication between data producers, analysts, and data consumers tends to be difficult and biased. For example, public infrastructure hearings rely on standard data products that many stakeholders experience as challenging to relate to, to interpret in the intended manner, and to criticize on an informed basis. FounData creates new methods that provide an understanding of the benefits and limitations of data. Our interdisciplinary effort bridges technical solutions in computer science for analysis of data and for designing interfaces and computer supported collaborative work, thereby creating a novel research direction at the heart of data analysis and utilization.

DocSea

The Industrial PhD project DocSea (Automatic document ranking and search techniques for a Discovery Engine) researches novel search, ranking and recommendation strategies for scientific document corpora. We address document indexing and retrieval in an environment characterized by highly specialized terminology under rapid development, and with complex semantic meanings and connections. This eases scientific research in R&D and gives the search engines of partner publishers a competitive edge.

The Industrial PhD candidate Manuel R. Ciosici is enrolled in our graduate school GSST and employed at Unsilo A/S.

Data Leak Prevention

Leakage of sensitive information from unstructured text documents is a costly problem for both government and industrial institutions. Traditional approaches for data leak prevention are commonly based on the hypothesis that sensitive information is reflected in the presence of distinct sensitive words. However, for complex sensitive information, this hypothesis may not hold. We detect complex sensitive information in text documents by learning the semantic and syntactic structure of text documents. Our approach is based on natural language processing methods for paraphrase detection, and uses recursive neural networks to assign sensitivity scores to the semantic components of the sentence structure. We focus on interactive detection of sensitive information where users evaluate real documents, alter documents or prepare free text, and subject them to sensitive information detection. This allows adapting the approach to a particular sensitive information type and application domain, thereby providing effective support of document redaction prior to publication.

Analyzing Big Data on Modern Hardware

Data mining supports the automatic analysis of Big Data, the large volumes of data from diverse sources that are continuously being generated in e-commerce, monitoring applications, social networks and many other applications. Existing data mining models and algorithms, however, focus largely on the computational model of single-threaded processing. This is in sharp contrast to the actual computational model in modern hardware, where multi-core CPUs and graphics cards (GPUs) that support general purpose computation are now standard. In this project, we exploit the enormous potential for fast analysis of Big Data by creating algorithms for active clustering, anytime learning or parallel data mining. Part of this project is supported by a Villum Foundation postdoc block stipend.

Future Cropping - Data Analysis to Support Precision Farming

The overall aim of Future Cropping is to utilize the potential benefits of precision farming and data communication by integrating large amounts of data from agro- and environmental technologies with area and climate data. Based on these data, a decision tool will be developed, as well as new technologies to support real-time decision making in the field.

Farmers already have large amounts of data available, e.g. yield monitoring data, GPS registrations, and data on soil types and climate. Machinery equipped with sensors is expected to harvest even more data during daily farm management. These data may be of great value if you know how to use them.

Future Cropping will develop a decision tool that collects and combines data from different sources. The tool will support real-time decision making adapted to each specific part of a field and thereby optimize crop management.

The aim of Future Cropping is to increase crop yield and quality without increasing the environmental impact.

The project is supported by the INNO+ program of the Innovation Fund Denmark. It runs for five years, and the overall budget amounts to 99.99 million DKK.

Find more information on the official project home page at https://futurecropping.dk/en/about-future-cropping/

Parallel algorithms for skyline and skycube computation on multicore CPUs and GPUs

Multi-criteria decision making can be a challenging task when a number of criteria are presented on different scales. A prototypical example is the choice of a hotel, where criteria could be, e.g., price and distance to the beach. Depending on user preferences, the optimal choice may vary greatly. Skyline queries, as they are known in database research (or the Pareto front in operations research), identify those data items that are not dominated by any other item, and that therefore include the optimal choice with respect to any (unknown) monotone user preference function. Such skylines are useful in greatly reducing the decision space by removing data items that are clearly not competitive. Retrieving the skyline, however, is computationally costly, and can be prohibitive for large data sets and for data sets with many attributes (high-dimensional data).
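To make the notion of a skyline concrete, the following minimal sketch computes it for the hotel example. It is a naive quadratic-time illustration, not the algorithm developed in this project. The hotel data and the function names dominates and skyline are hypothetical, and lower values are assumed to be better for every criterion.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on every criterion and
    strictly better on at least one (assumption: lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels: (price per night, distance to the beach in km).
hotels = [(50, 3.0), (80, 1.0), (120, 0.2), (90, 2.0), (60, 3.5)]
result = skyline(hotels)
# result == [(50, 3.0), (80, 1.0), (120, 0.2)]
```

The hotels (90, 2.0) and (60, 3.5) drop out because another hotel is both cheaper and closer to the beach, so no user who prefers low price and short distance would choose them.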

Multicore CPUs and cheap co-processors such as GPUs create opportunities for vastly accelerating database queries. However, given the differences in their threading models, expected granularities of parallelism, and memory subsystems, effectively utilising all cores with all co-processors for an intensive query is very difficult. We propose algorithmic solutions that provide an order of magnitude improvement on multicore CPUs and GPUs.
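One generic way to expose parallelism in skyline computation is to partition the data, compute a local skyline per partition, and then merge the local results. The sketch below illustrates only this partition-and-merge structure and should not be read as the algorithms proposed in this project. All names (dominates, skyline, partitioned_skyline) and the synthetic data are hypothetical, lower values are assumed to be better, and Python threads are used for brevity even though they do not give a real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points: Sequence[Point], parts: int = 4) -> List[Point]:
    """Compute local skylines per partition, then merge them.

    This is correct because a point of the global skyline is never dominated
    inside its own partition, so it survives the local step, and every
    non-skyline point is dominated by some global skyline point in the merge.
    """
    chunks = [list(points[i::parts]) for i in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        local_skylines = list(pool.map(skyline, chunks))
    candidates = [p for local in local_skylines for p in local]
    return skyline(candidates)


# Deterministic synthetic 3-dimensional data (hypothetical).
data = [((i * 37) % 101, (i * 53) % 103, (i * 71) % 107) for i in range(200)]
parallel_result = partitioned_skyline(data)
sequential_result = skyline(data)
```

Both calls return the same set of points. The final merge step remains sequential here, which is one reason why utilising all cores well takes more care than this sketch suggests.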

eData - Scalable and Effective Anomaly Detection and Correction of Automatically Collected Data in e-Science

The shift towards e-Science and the use of computational techniques in the sciences brings with it the need for effective analyses of very large collections of often complex scientific data.

Much of today’s data handling is part of automated processing, ranging from measuring relevant variables, to data collection, to integration, to analysis, and finally to decision making. The term e-science (electronic or enhanced science) is used to denote such data intensive and computationally intensive work in collaborative science. Analytical results and decision making are based on various processes, each of which might introduce errors.

This project aims to provide a foundation for e-Science by contributing novel techniques that enable scientists to detect and correct anomalies in source data, in an on-line, interactive, lineage-preserving, and semi-automatic manner. This contrasts with traditional algorithms that operate in a batch manner on static data and require data mining expertise for their use. We propose a new paradigm that allows domain experts to tap into the full potential of data mining. To this end, we invent scalable algorithms that build on insights from, most notably, the area of subspace clustering. These algorithms offer effective foundations for anomaly detection and correction that render subsequent analyses robust in the context of imperfect source data.

eData is a project funded by the Danish Council for Independent Research - Technology and Production Sciences. More information can be found on the project webpage.

WallViz

The WallViz project uses highly interactive, wall-sized visualizations to help decision makers handle massive collections of data and make better decisions based on them.

We collaborate with researchers from the University of Copenhagen (DIKU) who are experts in visualization and user-centered design. The project is coordinated by project manager Kasper Hornbæk from the Human-Centered Computing group.

The focus of our group is on data management and data mining issues in the context of interactive exploration and decision making. We propose solutions for decision making that provide entry points for the analysis efficiently and incrementally, selecting points expected to be of high interest to the user. We group data to provide an overview of large collections of data, and identify anomalies or outliers for further inspection.

The project is supported by the Strategic Research Council. More information can be found on the project webpage.

Previous projects

GEOCROWD

GEOCROWD is an initial training network that aims to advance the state-of-the-art in managing large amounts of semantically rich, user-generated geospatial data in a web setting. The project offers full-time support for a dozen young, initial-stage scientists. More information can be found on the project webpage.

REDUCTION

REDUCTION: Reducing Environmental Footprint based on Multi-Modal Fleet management Systems for Eco-Routing and Driver Behaviour Adaptation
REDUCTION is a collaborative research project funded by the European Commission under the 7th Framework Programme, Call 7 (2011-2014). REDUCTION aims at combining vehicular and ICT technologies for collecting and analyzing historic and real-time data about driving behaviour, routing information, and the associated carbon emissions measurements. More information can be found on the project webpage.

Cloud infrastructure

The cloud infrastructure project develops new data management infrastructure for cloud computing environments. More information can be found on the project webpage.

MOVE

MOVE is an Action of the COST Programme (European Cooperation in Science and Technology) for knowledge extraction from massive amounts of data about moving objects. More information can be found on the project webpage.

Streamspin

StreamSpin invents data management technology for web sites offering mobile services. More information can be found on the project webpage.

TimeCenter

TimeCenter is an international center for the support of temporal database applications on traditional and emerging DBMS technologies. More information can be found on the project webpage.