LSP.4. Analyse and Explore the Web of Data

The Web consists of a huge number of documents, together with large amounts of HTML tables containing relational-style data and schema-like descriptions such as labelled and typed columns. Although these web tables are a valuable source of easily accessible structured information, there is an evident lack of support for integrating, reusing, and analysing this data in an ad-hoc manner. This topic addresses the problem on both the data consumption side and the data pre-processing side.

On the data consumption side, a database system is needed that combines structured and unstructured query processing and enables seamless SQL queries over both an RDBMS and the Web of Data. We therefore would like to extend the open-source RDBMS Postgres to process SQL queries with unknown attributes that are matched against a set of web table candidates. The number of web tables matching a missing attribute can be huge, leading to a heavy increase in the size of intermediate results in the query plan. This gives rise to new optimization problems that should be tackled within this topic.

On the data pre-processing side, we have to consider that most web tables are arbitrarily structured, which prevents their reuse in analytical applications. We therefore envision developing a framework for extracting first-normal-form relations from partially structured documents such as spreadsheets and HTML tables. This framework should implement a pipeline of abstract extraction phases, each one cleaning or removing layout artefacts and denormalizations. Since the web tables are to be extracted from the publicly available Common Crawl corpus, which is 81 TB in size, the extraction framework must be able to process data at a very large scale.
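To illustrate the kind of phase pipeline envisioned for the extraction framework, the following is a minimal sketch in Python. All phase names, the grid representation, and the toy input are hypothetical illustrations chosen for this example, not part of the actual framework design.

```python
# Hypothetical sketch: a pipeline of extraction phases, each taking and
# returning a grid of cells (a list of rows), ending in flat 1NF tuples.

def drop_empty_rows(grid):
    """Remove purely decorative blank rows (a layout artefact)."""
    return [row for row in grid if any(cell.strip() for cell in row)]

def fill_down_merged_cells(grid):
    """Undo a common denormalization: cells left blank because the value
    above was visually 'merged' across several rows."""
    result = []
    for row in grid:
        if result:
            row = [cell if cell.strip() else prev
                   for cell, prev in zip(row, result[-1])]
        result.append(row)
    return result

def to_first_normal_form(grid):
    """Interpret the first surviving row as the header; emit flat tuples."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]

def extract(grid, phases):
    """Run the grid through the extraction phases in order."""
    for phase in phases:
        grid = phase(grid)
    return grid

# A toy spreadsheet-like grid with a blank spacer row and merged-cell gaps.
raw = [
    ["country", "city", "population"],
    ["", "", ""],
    ["Germany", "Dresden", "556000"],
    ["", "Leipzig", "601000"],
    ["Poland", "Poznan", "546000"],
]

relation = extract(raw, [drop_empty_rows, fill_down_merged_cells,
                         to_first_normal_form])
```

Each phase is an independent, composable transformation, which matches the idea of abstract extraction phases that can be reordered or extended; a large-scale implementation over the Common Crawl would apply the same phase functions within a distributed processing framework.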

Main Advisor at Technische Universität Dresden (TUD)
Co-advisor at Poznan University of Technology (PUT)