Assurance and traceability of data and operations
Managing and integrating large-scale data, as well as performing complex analytics, requires assurances about the quality of the data and of the processing steps performed. We are investigating several complementary approaches to providing such assurances:
- Assurances on the sources and transformations of data: Provenance
- Description and validation of data formats: Schema
- Models for changing the data and the queries: Transactions/Lifecycle models
From a conceptual point of view, tracing and evaluating information diffusion has much in common with (data) provenance. In collaboration with the University of Ghent, we developed a model of information diffusion based on the W3C PROV standard, providing an ontology for provenance at different levels of granularity. Using this model, we can combine explicit, fine-grained provenance from diffusion paths with content-oriented, coarse-grained provenance from message similarity, and thereby obtain a complementary and more precise view of the diffusion process.
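The combination of the two provenance sources can be illustrated with a minimal sketch. It is not the model developed in the collaboration: the `Derivation` record, the `combine_provenance` function, the Jaccard token similarity, and the threshold of 0.5 are all assumptions chosen for illustration. Only the idea of a "was derived from" relation is borrowed from PROV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """A PROV-style 'was derived from' edge between two messages (hypothetical record)."""
    derived: str
    source: str
    evidence: str      # "explicit" (diffusion path) or "similarity" (content)
    confidence: float


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used here as a stand-in for message similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def combine_provenance(messages, explicit_links, threshold=0.5):
    """Merge explicit and similarity-based provenance.

    messages:       dict of message id -> (timestamp, text)
    explicit_links: dict of derived id -> source id (e.g. a recorded forward)
    """
    # Fine-grained provenance: explicit links are taken as certain.
    edges = [Derivation(d, s, "explicit", 1.0) for d, s in explicit_links.items()]

    # Coarse-grained provenance: for messages without an explicit link,
    # pick the most similar earlier message, if it is similar enough.
    ordered = sorted(messages, key=lambda m: messages[m][0])
    for i, mid in enumerate(ordered):
        if mid in explicit_links:
            continue
        best = None
        for earlier in ordered[:i]:
            score = jaccard(messages[mid][1], messages[earlier][1])
            if score >= threshold and (best is None or score > best[1]):
                best = (earlier, score)
        if best is not None:
            edges.append(Derivation(mid, best[0], "similarity", best[1]))
    return edges
```

Keeping the evidence type and a confidence value on every edge lets a consumer of the combined graph distinguish recorded diffusion steps from inferred ones.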
From a technical point of view, computing provenance can be challenging, in particular for data stream systems. We developed an approach based on operator instrumentation that combines high accuracy with moderate runtime overhead.
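The general idea of operator instrumentation can be sketched as follows, under simplifying assumptions: each stream element carries the set of source positions it depends on, and each operator is wrapped so that it propagates or merges these sets. The names `Annotated`, `source`, and the `instrumented_*` operators are hypothetical, and the sketch does not reproduce the approach mentioned above or its overhead characteristics.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet


@dataclass(frozen=True)
class Annotated:
    """A stream element together with the ids of the source elements it depends on."""
    value: Any
    lineage: FrozenSet[int]


def source(values):
    """Assign each raw input element its position as provenance id."""
    for i, v in enumerate(values):
        yield Annotated(v, frozenset({i}))


def instrumented_filter(pred, stream):
    """Filter: surviving elements keep their lineage unchanged."""
    for t in stream:
        if pred(t.value):
            yield t


def instrumented_map(fn, stream):
    """Map: the output depends on exactly the input element."""
    for t in stream:
        yield Annotated(fn(t.value), t.lineage)


def instrumented_window_sum(size, stream):
    """Tumbling count window: the output depends on all elements in the window."""
    buf = []
    for t in stream:
        buf.append(t)
        if len(buf) == size:
            lineage = frozenset().union(*(x.lineage for x in buf))
            yield Annotated(sum(x.value for x in buf), lineage)
            buf = []  # incomplete trailing windows are not emitted
```

Chaining the wrapped operators, for example `instrumented_window_sum(2, instrumented_map(f, instrumented_filter(p, source(data))))`, yields results whose `lineage` field names the contributing inputs.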
Describing data with schema information enables validation of input data, optimization of queries and data storage, and support for formulating queries. While schemas are a central part of the relational model, other data models do not offer a similar level of data description. We developed a schema for data streams and a proposal for RDF constraints; in both cases, we showed how validity can be checked efficiently and how the descriptions can be used for optimizations.
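As an illustration of the two uses of a schema named above, validation and optimization, consider a deliberately small sketch for stream items. The `Field` description, the `validate` function, and the `filter_is_satisfiable` check are hypothetical and far simpler than the stream schema or the RDF constraint proposal; they only assume that a schema fixes a type and an optional value range per field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical field description: a type plus an optional value range."""
    type: type
    min: Optional[float] = None
    max: Optional[float] = None


def validate(schema, item):
    """Return a list of violations for one stream item (empty list means valid)."""
    errors = []
    for name, field in schema.items():
        if name not in item:
            errors.append(f"missing field: {name}")
            continue
        v = item[name]
        if not isinstance(v, field.type):
            errors.append(
                f"{name}: expected {field.type.__name__}, got {type(v).__name__}"
            )
            continue
        if field.min is not None and v < field.min:
            errors.append(f"{name}: {v} below minimum {field.min}")
        if field.max is not None and v > field.max:
            errors.append(f"{name}: {v} above maximum {field.max}")
    for name in item:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


def filter_is_satisfiable(schema, name, lower_bound):
    """Static check for a filter `name > lower_bound`.

    If the schema caps the field at or below the bound, the filter can never
    match a valid item, so the query could be rejected before it runs.
    """
    field = schema[name]
    return field.max is None or field.max > lower_bound
```

Because each item is checked in a single pass over the declared fields, validation cost grows linearly with the number of fields, and the range information doubles as input for simple static query checks.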