II.2. Light-weight Data Integration for Data Lake Architectures

With the rise of Big Data and the new ability to generate data at large scale, data quality and data integration are more important than ever. Besides the continued growth in data volume and its increasing heterogeneity, data management itself is changing: new principles such as data lakes are emerging that allow users to ingest, transform, and analyze data in a flexible and agile manner.

These data lake systems have two conflicting goals. On the one hand, they should impose as few limitations as possible on what can be published, allowing many formats, from different users and domains, and with different degrees of schema, i.e., they should be “free-for-all”. On the other hand, they should be optimized for data reusability, i.e., make it as easy as possible for a user to retrieve datasets fitting a specific analytical scenario and allow processing of the data with minimal effort.

To reconcile these two conflicting goals, the data in a data lake has to be integrated for its own sake, without a specific use case in mind, as the possible reuse scenarios are not known beforehand. Furthermore, classical data integration is usually considered as a one-to-one integration between two well-defined schemata, i.e., sets of relations that describe the same domain. In a data lake, by contrast, there is a large number of mostly unrelated datasets that usually have very few corresponding attributes, if any. Still, subsets of them describe the same domains and could be reused together if they were properly consolidated. So while there is potential for reuse and recombination, making all the different datasets conform to a global schema is infeasible. Nevertheless, we argue that some integration tasks can and should be tackled a priori to improve the usefulness of the data lake as a whole.

We therefore want to investigate existing and develop new light-weight data integration techniques that can be applied to new datasets that are to be stored in a data lake. These techniques should prevent the growth of heterogeneity in the data lake over time and foster commonalities between datasets without an explicit target schema. To evaluate the concepts developed in this topic, we simulate a data lake using our Dresden Web Table Corpus, which consists of 125M tables.

Main Advisor at Technische Universität Dresden (TUD)
Co‐advisor at Université Libre de Bruxelles (ULB)