II.5. Data Integration and ETL for Semantic Data

In recent years, more and more semantic data has become freely available on the Web: websites are annotated with RDF markup, data collections are offered for download, and even interfaces for structured queries over such data can be used free of charge. One reason for this success is that publishing semantic data requires little effort and does not depend on a sophisticated, predefined schema. Instead, publishers can combine standard ontologies with self-defined extensions.

The high flexibility of the data format is an advantage of the Semantic Web paradigm, but it becomes a disadvantage during the ETL process, where the schema plays an important role. In addition, the schema of semantic data is often not known in advance but is encoded as part of the data set itself. Furthermore, many sources have been generated automatically, either by converting other data formats into RDF or by information extraction techniques, and hence contain errors. Thus, on top of the heterogeneities that ETL for traditional data has to deal with, semantic data poses additional challenges, especially regarding cleansing and duplicate detection.
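To make these two challenges concrete, the following minimal Python sketch recovers a schema from the data itself and flags duplicate candidates. It is an illustration under stated assumptions, not a proposed solution: the triples, the prefixed names, and the helper functions infer_schema, normalize, and duplicate_candidates are all hypothetical, and duplicates are detected only by comparing normalized rdfs:label values within the same class.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:p1", RDF_TYPE, "foaf:Person"),
    ("ex:p1", RDFS_LABEL, "Ada Lovelace"),
    ("ex:p1", "foaf:mbox", "ada@example.org"),
    ("ex:p2", RDF_TYPE, "foaf:Person"),
    ("ex:p2", RDFS_LABEL, "  ada  LOVELACE "),
    ("ex:c1", RDF_TYPE, "ex:City"),
    ("ex:c1", RDFS_LABEL, "Aalborg"),
]


def subject_types(triples):
    """Map each subject to the set of classes it is declared to be."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
    return types


def infer_schema(triples):
    """Recover a class -> predicates mapping from the data itself."""
    types = subject_types(triples)
    schema = defaultdict(set)
    for s, p, _ in triples:
        if p != RDF_TYPE:
            for cls in types.get(s, ()):
                schema[cls].add(p)
    return {cls: sorted(preds) for cls, preds in schema.items()}


def normalize(label):
    """Assumed cleansing rule: lowercase and collapse whitespace."""
    return " ".join(label.lower().split())


def duplicate_candidates(triples):
    """Group subjects of the same class that share a normalized label."""
    types = subject_types(triples)
    groups = defaultdict(set)
    for s, p, o in triples:
        if p == RDFS_LABEL:
            for cls in types.get(s, ()):
                groups[(cls, normalize(o))].add(s)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


if __name__ == "__main__":
    print(infer_schema(TRIPLES))
    print(duplicate_candidates(TRIPLES))
```

On the sample data, the sketch reports that foaf:Person instances use foaf:mbox and rdfs:label, and it proposes ex:p1 and ex:p2 as one duplicate pair. Exact label matching is a deliberately naive choice; real sources would need similarity measures and blocking to remain scalable.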

The aim of this topic is to develop an approach that enables the ETL process for semantic data despite the above-mentioned problems by (1) developing scalable data integration techniques that can handle multiple semantic data sources, (2) implementing an appropriate environment to facilitate the ETL process, and (3) evaluating the proposed solutions.

Main Advisor at Aalborg Universitet (AAU)
Co‐advisor at Université Libre de Bruxelles (ULB)