MS.7. Provenance Management for Data Warehousing

Business intelligence applications require increasingly large numbers of loosely coupled data sources (both corporate and on the web) to be loaded within and across data warehouses. To trust conclusions drawn from the data in these warehouses, detailed information about the data's origin is required. This kind of information is often referred to as data provenance or data lineage. The provenance of a data item can include information about the processes and data sources that led to its creation, as well as the timeliness of the data, or annotations by curation experts attesting to the trustworthiness and accuracy of the data.
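As a rough illustration of what such provenance information might look like in practice, the following minimal sketch records, for each data item, the process and sources that produced it, a load timestamp, and curator annotations, and then traces an item back to its base sources. All names here (ProvenanceRecord, base_sources, the item identifiers) are hypothetical and chosen only for this example; they do not correspond to any existing provenance model or system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record for one data item."""
    item_id: str                 # identifier of the derived data item
    process: str                 # process that created the item
    sources: List[str]           # identifiers of the items it was derived from
    loaded_at: datetime          # timeliness information
    annotations: List[str] = field(default_factory=list)  # curator notes


def base_sources(item_id: str, records: Dict[str, ProvenanceRecord]) -> Set[str]:
    """Return the items without recorded sources that item_id transitively depends on."""
    seen: Set[str] = set()
    roots: Set[str] = set()
    stack = [item_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared or cyclic dependencies
        seen.add(current)
        record = records.get(current)
        if record is None or not record.sources:
            roots.add(current)  # no recorded origin: treat as a base source
        else:
            stack.extend(record.sources)
    return roots


now = datetime.now(timezone.utc)
records = {
    "sales_summary": ProvenanceRecord(
        "sales_summary", "aggregate", ["sales_eu", "web_prices"], now,
        annotations=["checked by curator"],
    ),
    "sales_eu": ProvenanceRecord("sales_eu", "etl_load", ["crm_db"], now),
}

print(sorted(base_sources("sales_summary", records)))
```

In this sketch, the summary item is traced back to one corporate source and one web source, which is the kind of question a decision maker might ask before trusting a derived figure.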

Although many approaches for modeling, managing, and tracking provenance information have recently been proposed, many issues remain open. For example, while many proposals focus on how to represent the provenance of a data item, they do not provide a provenance query language with which that provenance can be retrieved (e.g., when a decision maker wants to check the quality of the data items on which certain business intelligence conclusions are based). In addition, existing proposals focus on a single (often relational) data model, and hence cannot cope with the integration of heterogeneous data sources, where one wants to combine data represented in multiple models (e.g., corporate data in relational data warehouses with semi-structured data on the web).

This topic deals with the design of suitable provenance models for managing and automatically tracking provenance information in next-generation data warehouses, as well as the design of query languages for querying the provenance information stored in such data warehouses.

Main Advisor at Université Libre de Bruxelles (ULB)
Co-advisor at Poznan University of Technology (PUT)