II.6. Web Table Normalization

Web tables have become an important resource for applications such as entity augmentation (paper). A major challenge in automatically identifying and reusing these tables is that many of them are not in relational form. Instead, most tables published on the Web are intended for human readers and contain metadata and structural information that is available only implicitly, through positioning and styling. Furthermore, these tables are often intermingled with formatting, layout, and textual metadata, i.e., they are embedded in partially structured documents. Transforming them into a true relational form is therefore a precondition for most Web table applications. In this topic, we aim to transform partially structured documents, e.g., spreadsheets or HTML pages, into first normal form relations. To this end, we have to analyze and define a complete set of the denormalizations and irregularities that typically appear in spreadsheets and Web tables. Given this set of irregularities, a machine-learning-based approach has to be developed that is able to normalize a partially structured document into one or more relational Web tables. To evaluate the quality of the Web table normalization, we can rely on a large real-world dataset, our Dresden Web Table Corpus, consisting of 112M tables.
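
To make the notion of an irregularity concrete, the following minimal sketch handles just one of them: group labels that a human-oriented table expresses only through positioning (a row whose first cell alone is filled), which a first normal form relation has to carry as an explicit column. It is a hand-written rule for illustration only, not the machine-learning-based approach this topic aims to develop. The function name `normalize_group_rows`, the column name `category`, the detection heuristic, and the sample values are all hypothetical.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def is_empty(cell: Optional[str]) -> bool:
    """A cell counts as empty if it is None or only whitespace."""
    return cell is None or cell.strip() == ""


def normalize_group_rows(
    header: List[str], rows: List[Row], group_name: str = "category"
) -> List[Dict[str, Optional[str]]]:
    """Turn implicit group-label rows into an explicit column.

    Assumed heuristic: a row in which only the first cell is filled
    is a group label that applies to all following data rows.
    """
    records: List[Dict[str, Optional[str]]] = []
    current_group: Optional[str] = None
    for row in rows:
        filled = [cell for cell in row if not is_empty(cell)]
        if not filled:
            continue  # purely visual spacer row, carries no data
        if len(filled) == 1 and not is_empty(row[0]):
            current_group = row[0]  # remember the label, emit nothing
            continue
        record: Dict[str, Optional[str]] = {group_name: current_group}
        record.update(zip(header, row))
        records.append(record)
    return records


# Made-up sample: a price list grouped by label rows, as it might
# appear in a spreadsheet or an HTML table.
header = ["item", "price"]
rows = [
    ["Fruit", None],
    ["apple", "1.20"],
    ["pear", "1.50"],
    [None, None],
    ["Vegetables", None],
    ["carrot", "0.80"],
]
relation = normalize_group_rows(header, rows)
```

Each resulting record carries its group explicitly, so the relation no longer depends on row order or layout. A learned approach would have to replace the fixed heuristic, which fails, for instance, on data rows that happen to have a single filled cell.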

Main Advisor at Technische Universität Dresden (TUD)
Co-advisor at Universitat Politècnica de Catalunya (UPC)