ID.2. Discovering Data Transformations from Schema Transformations

Designing a classical information system only requires designing its data schema. This, however, is not enough for data warehouse systems, which also require a means of orchestrating the data flow from the available sources to the target data warehouse (this data flow is known as the ETL process). Accordingly, data warehouse design consists of two main tasks: the design of the multidimensional schema and the design of the ETL process that loads it. Considerable effort has been devoted to semi-automatically deriving the multidimensional schema from the data sources. Surprisingly, the design of the ETL process has traditionally been assumed to be carried out manually, and thus most current work does not relate these two tasks, even though they are clearly tied: the data warehouse schema is obtained by transforming the source schemas into a specific, appropriate target schema, whereas the ETL process applies the corresponding transformations to the data.

The goal of this topic is to exploit the schema transformations used in deriving the data warehouse schema in order to discover the ETL constructs. Semantics play an important role here, and for this reason the interpretation of the schema should drive the design of the ETL process, as is done in the data exchange area, which relies purely on expensive reasoning algorithms. In data exchange, schema expressivity is limited in order to make the process feasible. However, this is not enough for a typical data warehouse environment with large amounts of data. Therefore, we must limit the use of these (expensive) reasoning tasks and complement them with traditional database techniques relating schema and data.
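To make the intended link between schema transformations and ETL constructs more concrete, the following is a minimal illustrative sketch in Python. It is not a method proposed by this topic: the transformation types (`Rename`, `Project`, `Aggregate`), the function names and the sample rows are all hypothetical, and no reasoning over schema semantics is involved. It only shows that each step taken while deriving a target schema can be paired with a data-level operation, so that the sequence of schema steps yields an ETL flow.

```python
from dataclasses import dataclass


# Hypothetical schema-level transformations (illustrative names only).
@dataclass(frozen=True)
class Rename:
    old: str
    new: str


@dataclass(frozen=True)
class Project:
    keep: tuple


@dataclass(frozen=True)
class Aggregate:
    group_by: tuple
    measure: str


def to_etl_step(t):
    """Map one schema transformation to a function from rows to rows."""
    if isinstance(t, Rename):
        return lambda rows: [
            {(t.new if k == t.old else k): v for k, v in r.items()} for r in rows
        ]
    if isinstance(t, Project):
        return lambda rows: [{k: r[k] for k in t.keep} for r in rows]
    if isinstance(t, Aggregate):
        def aggregate(rows):
            # Sum the measure per combination of grouping attributes.
            totals = {}
            for r in rows:
                key = tuple(r[g] for g in t.group_by)
                totals[key] = totals.get(key, 0) + r[t.measure]
            return [
                dict(zip(t.group_by, key), **{t.measure: total})
                for key, total in totals.items()
            ]
        return aggregate
    raise TypeError(f"unsupported schema transformation: {t!r}")


def derive_etl(transformations):
    """Compose the data-level steps in the order of the schema steps."""
    steps = [to_etl_step(t) for t in transformations]

    def run(rows):
        for step in steps:
            rows = step(rows)
        return rows

    return run


# Assumed sample source data and schema derivation steps.
source_rows = [
    {"city": "Barcelona", "prod": "A", "amt": 10, "note": "x"},
    {"city": "Barcelona", "prod": "B", "amt": 5, "note": "y"},
    {"city": "Brussels", "prod": "A", "amt": 7, "note": "z"},
]
schema_steps = [
    Rename("amt", "sales"),
    Project(("city", "sales")),
    Aggregate(("city",), "sales"),
]

etl = derive_etl(schema_steps)
fact_rows = etl(source_rows)
print(fact_rows)
```

In this toy setting the mapping is one-to-one and purely syntactic. The research question is how far such a correspondence can be obtained when the schema transformations carry richer semantics, without resorting to expensive reasoning over the data.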

Main Advisor at Universitat Politècnica de Catalunya (UPC)
Co-advisor at Université Libre de Bruxelles (ULB)