II.3. Real‐Time ETL for Streaming Data

Traditionally, the Extract‐Transform‐Load (ETL) process for populating data warehouses has been performed in batches. Recently, interest has grown in loading data in real time (or, more precisely, near real time) to avoid or greatly reduce the delay between the time an event happens and the time it is represented in the data warehouse. To achieve this, some users have ended up with two ETL flows: a quick‐and‐dirty flow for real‐time loading and a slower, more detailed flow for traditional batch loading, in which consistency and similar guarantees are ensured. Keeping two flows, however, makes maintenance harder. The widespread use of sensors also produces massive amounts of streaming data, which novel BI solutions should support.

In this topic, the vision is to develop a framework that enables real‐time ETL processing for streaming data coming from thousands or millions of sensors. To achieve this, new real‐time operators must be defined and implemented. The ETL framework should integrate with the model‐based BI system from Topic BDA.4 and include support for planning and optimizing entire ETL flows so that very large amounts of streaming data are handled efficiently. This includes intelligently deciding when to perform the (possibly expensive) updates of the underlying models and when to keep the stream data elsewhere. The ETL framework and the underlying storage engine will therefore interact closely.
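To make the decision between updating a model and keeping stream data elsewhere more concrete, the following is a minimal sketch and not the framework envisioned in this topic. It assumes a hypothetical per‐sensor model that represents readings by a single constant value within a fixed tolerance. The names SensorModel, ingest, run_stream, tolerance and max_pending are illustrative assumptions and do not come from Topic BDA.4 or from any existing system. Readings the model already covers take a cheap path, other readings are kept in a side buffer, and the (possibly expensive) model update is deferred until the buffer is full.

```python
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class SensorModel:
    """Hypothetical per-sensor model: one constant value plus an error tolerance."""
    value: float
    tolerance: float
    pending: list = field(default_factory=list)  # raw readings kept outside the model
    refits: int = 0  # number of expensive model updates performed so far

    def ingest(self, reading: float, max_pending: int) -> str:
        # Cheap path: the current model already represents this reading.
        if abs(reading - self.value) <= self.tolerance:
            return "covered"
        # Otherwise keep the raw reading in the side buffer for now.
        self.pending.append(reading)
        if len(self.pending) < max_pending:
            return "buffered"
        # Expensive path (assumed): refit the model on the buffered readings.
        self.value = fmean(self.pending)
        self.pending.clear()
        self.refits += 1
        return "refit"


def run_stream(readings, model, max_pending=3):
    """Feed readings to the model one by one and record each decision."""
    return [model.ingest(r, max_pending) for r in readings]
```

With a model created as SensorModel(value=20.0, tolerance=0.5), a stream that jumps from about 20 to about 25 is first buffered and then triggers a single refit once three uncovered readings have accumulated. A real framework would presumably base this decision on cost estimates from the storage engine rather than on a fixed buffer size.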

Main Advisor at Aalborg Universitet (AAU)
Co‐advisor at Poznan University of Technology (PUT)