Scalable and Effective Anomaly Detection and Correction of Automatically Collected Data in e-Science

The shift towards e-Science and the use of computational techniques in the sciences brings with it the need for effective analyses of very large collections of often complex scientific data.

Much of today’s data handling is automated, from measuring relevant variables, through data collection and integration, to analysis and finally decision making. The term e-Science (electronic or enhanced science) denotes such data-intensive and computationally intensive work in collaborative science. Analytical results and decisions rest on a chain of processes, each of which might introduce errors.

This project aims to provide a foundation for e-Science by contributing novel techniques that enable scientists to detect and correct anomalies in source data in an on-line, interactive, lineage-preserving, and semi-automatic manner. This contrasts with traditional algorithms, which operate in batch mode on static data and require data mining expertise to use. We propose a new paradigm that allows domain experts to tap into the full potential of data mining. To this end, we develop scalable algorithms that build on insights from, most notably, the area of subspace clustering. These algorithms offer effective foundations for anomaly detection and correction, making subsequent analyses robust to imperfect source data.

Challenges arise from

  • The complexity of today's scientific workflows
  • Varieties of potential error sources and thereby diverse outlier profiles
  • Large number of dimensions in typical data collections
  • Representation of outlier properties to domain experts for

We focus on developing novel outlier detection approaches that

  • Identify relevant subspace projections in high dimensional data
  • Scale to large data collections
  • Integrate seamlessly into the e-Science lifecycle
  • Handle dynamic data incrementally and efficiently
  • Are applicable with little or no data mining expertise
  • Allow for feedback and updating by domain experts
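
To illustrate why subspace projections matter for outlier detection, the following minimal Python sketch scores each point by its k-nearest-neighbour distance in every two-dimensional projection and reports the projection in which the point looks most anomalous. This is not the project's algorithm: the function names, the exhaustive search over projections, the median-based normalisation, and the toy data are all assumptions made for illustration. An exhaustive search like this would not scale to many dimensions, which is exactly the kind of limitation the goals above address.

```python
from itertools import combinations
import math


def knn_distance(points, idx, dims, k):
    """Distance from points[idx] to its k-th nearest neighbour,
    measured only in the dimensions listed in `dims`."""
    target = points[idx]
    dists = sorted(
        math.sqrt(sum((target[d] - other[d]) ** 2 for d in dims))
        for j, other in enumerate(points)
        if j != idx
    )
    return dists[k - 1]


def subspace_outlier_scores(points, subspace_size=2, k=2):
    """Return, per point, (score, dims): the highest normalised k-NN
    distance over all projections of the given size, and the projection
    where it occurs. Hypothetical helper, not a published method."""
    n_dims = len(points[0])
    results = [(0.0, None)] * len(points)
    for dims in combinations(range(n_dims), subspace_size):
        raw = [knn_distance(points, i, dims, k) for i in range(len(points))]
        # Normalise by the median so scores are comparable across projections.
        median = sorted(raw)[len(raw) // 2] or 1e-12
        for i, r in enumerate(raw):
            score = r / median
            if score > results[i][0]:
                results[i] = (score, dims)
    return results


# Toy data (assumed): dimensions 0 and 1 are perfectly correlated,
# dimension 2 alternates between 0 and 1.
points = [(float(i), float(i), float((i + 1) % 2)) for i in range(1, 10)]
# The last point has ordinary values in every single dimension,
# but breaks the correlation between dimensions 0 and 1.
points.append((2.0, 8.0, 0.0))

scores = subspace_outlier_scores(points)
top = max(range(len(points)), key=lambda i: scores[i][0])
print("most anomalous point:", points[top], "in projection", scores[top][1])
```

The point appended last stays within the normal range of each individual dimension, so a per-dimension check would miss it. It stands out only in the projection onto dimensions 0 and 1. Reporting that projection alongside the score is one simple way to describe an outlier to a domain expert.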

This project is supported by the Danish Council for Independent Research | Technology and Production Sciences from 2011 to 2014.

People involved:

  • Assoc. Prof. Ira Assent
  • Prof. Christian S. Jensen