eData

Scalable and Effective Anomaly Detection and Correction of Automatically Collected Data in e-Science

The shift towards e-Science and the use of computational techniques in the sciences brings with it the need for effective analyses of very large collections of often complex scientific data.

Much of today’s data handling is part of automated processing, ranging from measuring relevant variables, to data collection, to integration, to analysis, and finally to decision making. The term e-science (electronic or enhanced science) is used to denote such data intensive and computationally intensive work in collaborative science. Analytical results and decision making are based on various processes, each of which might introduce errors.

This project aims to provide a foundation for e-Science by contributing novel techniques that enable scientists to detect and correct anomalies in source data, in an on-line, interactive, lineage-preserving, and semi-automatic manner.This contrasts traditional algorithms that operate in a batch manner on static data and require data mining expertise for their use. We propose a new paradigm that allows domain experts to tap into the full potential of data mining by inventing scalable algorithms that build on insights from most notably the area of subspace clustering to offer effective foundations for anomaly detection and correction that render subsequent analyses robust in the context of imperfect source data.

Challenges arise from