eData

Scalable and Effective Anomaly Detection and Correction of Automatically Collected Data in e-Science

The shift towards e-Science and the use of computational techniques in the sciences brings with it the need for effective analyses of very large collections of often complex scientific data.

Much of today’s data handling is automated, from measuring relevant variables through data collection, integration, and analysis to decision making. The term e-Science (electronic or enhanced science) denotes such data-intensive and computationally intensive work in collaborative science. Analytical results and decisions rest on a chain of processes, each of which might introduce errors.

This project aims to provide a foundation for e-Science by contributing novel techniques that enable scientists to detect and correct anomalies in source data in an on-line, interactive, lineage-preserving, and semi-automatic manner. This contrasts with traditional algorithms, which operate in batch mode on static data and require data mining expertise to use. We propose a new paradigm that allows domain experts to tap into the full potential of data mining. To this end, we invent scalable algorithms that build on insights from, most notably, the area of subspace clustering. These algorithms offer effective foundations for anomaly detection and correction, making subsequent analyses robust to imperfect source data.

Challenges arise from

The complexity of today's scientific workflows
The variety of potential error sources and the resulting diversity of outlier profiles
The large number of dimensions in typical data collections
Representation of outlier properties to domain experts for

We focus on developing novel outlier detection approaches that

Identify relevant subspace projections in high-dimensional data
Scale to large data collections
Integrate seamlessly into the e-Science lifecycle
Handle dynamic data incrementally and efficiently
Are applicable with little or no data mining expertise
Allow for feedback and updating by domain experts

This project is supported by the Danish Council for Independent Research | Technology and Production Sciences from 2011 to 2014.

People involved:

Assoc. Prof. Ira Assent
Prof. Christian S. Jensen

Revised 01.09.2025

Sofia Hedegaard Rasmussen