MS.5. Modeling and Querying Unstructured Data Warehouses

Corporate data nowadays consists of two basic kinds: structured and unstructured. A key characteristic of structured data is that it is repetitive: airline reservations, bank transactions, and telephone calls are all common forms of structured information. In contrast, unstructured data is not repetitive. Examples are emails, corporate contracts, human resource files, medical records, financial reports, and corporate memos, not to mention more complex kinds of data such as images, geographic data, and web data. It is estimated that 80% or more of the data in a corporation is in the form of unstructured text.

Typical dimensional modeling methodologies address only the structured component of the corporate data environment; as a result, most of the data in a company is not considered at all in this approach. It is clear that a Business Intelligence (BI) system must address both kinds of data in a complementary fashion. Although some solutions have recently been proposed for building unstructured data warehouses, many issues remain to be addressed.

This new paradigm needs to be investigated because, in the near future and with the explosion of the Web, the proportion of structured data in a company will be insignificant compared with other kinds of data. This topic deals with modeling and methodological issues for building unstructured data warehouses, as well as the design of query languages for such data warehouses.

Main Advisor at Université Libre de Bruxelles (ULB)
Co-advisor at Technische Universität Dresden (TUD)