II.1. Optimization of user-defined ETL tasks

An ETL process has to finish in a limited time window, otherwise, the data warehouse remains unavailable for its users. Due to the huge volumes of data manipulated by typical data warehouses, the optimisation of ETL execution is a crucial problem. Some techniques used to reduce the execution time of an ETL process are the following: batch load via a direct path, dropping indexes and integrity constraints in the DW before loading data, sorting and aggregating data in operating system files, separating inserts from updates and deletes, parallel loading, and applying advanced ELT architectures. Commercial ETL development tools do not offer any support for ETL optimisation and few research solutions to this problem have been proposed so far. They mostly focus on restructuring ETL processes in order to minimise the volume of processed data and on heuristics for searching the space of possible ETL executions, in order to find the most efficient one. These solutions are still not mature enough to be applied in practice.

Moreover, the optimization of user-defined ETL tasks is much more difficult to handle and has not been addressed yet. Such tasks may be implemented in multiple programming languages, mostly the procedural ones. Assessing execution costs of such tasks is a non-trivial problem. At the first approach, the characteristics of the tasks should be black-box tested, however, more advanced cost execution cost estimations are necessary. For this reason, there is an evident need for further investigation of this problem.

The aim of this topic is to: (1) propose a method for assessing execution performance of various user-defined ETL tasks, (2) develop a method for defining cost functions for user-defined ETL tasks, (3) develop a prototype ETL optimizer, (4) experimentally evaluate the proposed solution.

Main Advisor at Poznan University of Technology (PUT)
Co-advisor at Technische Universität Dresden (TUD)