II.9. Integrating evolving data sources by means of semi-automatic reparation of ETL workflows

A data warehouse (DW) architecture has been developed in order to: (1) provide a framework for the integration of multiple heterogeneous, distributed, and autonomous external data sources (EDSs) spread across a company and (2) provide means for advanced data analysis, called On-Line Analytical Processing (OLAP). The DW architecture is typically composed of four layers, i.e., (1) an external data source layer that represents the production systems being integrated, (2) an Extraction-Transformation-Loading (ETL) layer responsible, among other things, for extracting data from EDSs, (3) a repository layer (a data warehouse) that stores the integrated and summarized data, and (4) an analytical layer.

An inherent feature of EDSs is that their contents and structures (schemas) evolve over time. As reported in numerous publications, the structures of data sources change frequently. For example, the Wikipedia schema changed on average every 9-10 days during the last 4 years.

Structural changes must be propagated to the DW architecture. They are difficult to handle and manage since they have an impact on multiple layers of the DW architecture. First, structural changes have an impact on the ETL layer, which must be redesigned and redeployed, i.e., repaired. Second, they have an impact on the data warehouse schema, which must be modified in order to follow changes in EDSs. DW schema changes result, in turn, in changes that have to be made to analytical applications. For these reasons, developing a technology for handling structural changes of EDSs and managing the evolution of the DW architecture is of high practical importance.
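
As a minimal illustration of why a structural change forces an ETL repair, the following Python sketch compares two versions of a source schema and lists the ETL steps that read columns which no longer exist. All names here (EtlStep, diff_schema, impacted_steps, the customer table and its columns) are hypothetical and serve only as an example; they are not taken from any existing prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EtlStep:
    """A hypothetical ETL step and the source columns it reads ("table.column")."""
    name: str
    reads: frozenset


def diff_schema(old, new):
    """Return (dropped, added) column sets between two schema versions.

    Each schema is a dict mapping a table name to a list of column names.
    """
    old_cols = {f"{t}.{c}" for t, cols in old.items() for c in cols}
    new_cols = {f"{t}.{c}" for t, cols in new.items() for c in cols}
    return old_cols - new_cols, new_cols - old_cols


def impacted_steps(steps, dropped):
    """Names of the steps that read at least one dropped column."""
    return [s.name for s in steps if s.reads & dropped]


# Assumed example: the column "name" is replaced by "full_name" in the source.
old_schema = {"customer": ["id", "name", "phone"]}
new_schema = {"customer": ["id", "full_name", "phone"]}
steps = [
    EtlStep("extract_customer", frozenset({"customer.id", "customer.name"})),
    EtlStep("extract_phone", frozenset({"customer.id", "customer.phone"})),
]

dropped, added = diff_schema(old_schema, new_schema)
print(impacted_steps(steps, dropped))  # ['extract_customer']
```

In this sketch only the step that reads the dropped column is flagged. Deciding how to repair it (for instance, recognizing that the column was renamed rather than removed) is the harder part of the problem.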

Research and technological developments in the area of handling structural changes of EDSs in the DW architecture have mainly focused on managing changes in a DW schema, whereas handling and incorporating structural changes in the ETL layer has so far received little attention from the research community.

At Poznan University of Technology, the problem of handling the evolution of EDSs in the ETL layer is currently being researched. The approach developed there for repairing an ETL workflow in response to structural changes in EDSs applies the Case Based Reasoning technique. This, however, is just one of a few possible approaches. Other approaches may apply process mining algorithms, machine learning techniques, and ontologies to repair an ETL workflow.

The goal of this project is to develop a framework for the automatic or semi-automatic repair of ETL workflows in response to structural changes in EDSs. In detail, the goal is divided into the following sub-goals:

  • applying process mining algorithms to the repair of an ETL workflow,
  • applying other machine learning techniques to the repair of an ETL workflow,
  • augmenting the two aforementioned techniques with ontologies,
  • extending the existing E-ETL prototype with the developed techniques,
  • creating proof of concept scenarios,
  • assessing the proposed techniques with respect to accuracy and performance measures.
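
To make the idea of case-based repair more concrete, the retrieval step of Case Based Reasoning can be sketched as follows: previously solved cases pair a set of structural changes with the repair actions that were applied, and a new problem is matched to the most similar stored case. This is only a sketch under stated assumptions. The case base, the change labels, the repair actions, and the use of Jaccard similarity are all chosen here for illustration and do not describe the actual E-ETL implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical case base: (observed structural changes, repair actions applied).
case_base = [
    ({"rename_column"}, ["update column reference in extraction query"]),
    ({"drop_column"}, ["remove column from extraction query",
                       "remove dependent mapping"]),
    ({"drop_table", "drop_column"}, ["disable extraction step"]),
]


def retrieve(changes, cases):
    """Return the stored case whose change set is most similar to `changes`."""
    return max(cases, key=lambda case: jaccard(changes, case[0]))


# A new problem: a column was dropped from an external data source.
problem = {"drop_column"}
matched_changes, repair = retrieve(problem, case_base)
print(repair)
```

In a semi-automatic setting, the retrieved repair would be proposed to a designer for confirmation rather than applied blindly, and a confirmed repair could be stored as a new case.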

Main Advisor at Poznan University of Technology (PUT)
Co-advisor at Université Libre de Bruxelles (ULB)