LSP.19. Business Analytics on Text Documents

Valuable knowledge can be acquired by analyzing collections of text data. Such data may take the form of standard text documents (multiple pages long) or short texts from various fora. Searching and ranking sets of documents, both large and small, is a challenge. One of the requirements from the business we cooperate with is the ability to analyze documents in a way that is similar to OLAP. Three main problems appear here. First, how to annotate the documents so that OLAP-like queries can be run on them. Second, how to organize the documents in cube-like structures in order to provide OLAP-like functionality. Third, how to store and search the documents efficiently.
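
To make the OLAP analogy concrete, the following minimal Python sketch shows what a roll-up over annotated documents could look like. It is only an illustration under stated assumptions: the dimension names (topic, source, year), the sample documents, and the function roll_up are hypothetical and do not represent the data model to be developed in the project.

```python
from collections import defaultdict

# Hypothetical annotated documents. The dimension names "topic",
# "source", and "year" are assumptions made for this illustration.
DOCS = [
    {"id": 1, "topic": "pricing", "source": "forum", "year": 2015},
    {"id": 2, "topic": "delivery", "source": "report", "year": 2015},
    {"id": 3, "topic": "pricing", "source": "report", "year": 2016},
]


def roll_up(docs, dims):
    """Group document ids by the chosen dimensions (one cube cell per key)."""
    cube = defaultdict(list)
    for doc in docs:
        key = tuple(doc[dim] for dim in dims)
        cube[key].append(doc["id"])
    return dict(cube)


# Fine-grained cells: one cell per (topic, year) combination.
by_topic_year = roll_up(DOCS, ("topic", "year"))

# Roll-up: drop the "year" dimension and aggregate by topic only.
by_topic = roll_up(DOCS, ("topic",))
```

Here each cell holds a list of document ids; in a document warehouse the cell content could instead be a ranked or summarized set of documents, which is exactly where the open problems listed above arise.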

Although some indexing techniques (e.g., inverted files) and document search engines (e.g., Lucene, Elasticsearch) have been developed, the three aforementioned problems have not been fully solved yet.
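
For reference, the idea behind an inverted file can be sketched in a few lines of Python. This is a simplified illustration, not the way Lucene or Elasticsearch implement their indexes: the tokenizer, the function names build_inverted_index and search_all, and the sample texts are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection, keyed by document id.
docs = {
    1: "Late delivery of the order",
    2: "Order price too high",
    3: "Delivery was fast",
}
index = build_inverted_index(docs)
matches = search_all(index, "order delivery")
```

Such an index answers keyword lookups efficiently, but it carries no notion of dimensions or hierarchies, which is one reason why it does not by itself provide OLAP-like functionality.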

The aim of this project is to: (1) develop data models, data structures, and techniques to organize text documents in cube-like structures, (2) develop mechanisms for querying, ranking, and summarizing (if possible) documents, (3) develop novel indexing techniques suitable for the developed data models, (4) implement a prototype document data warehouse, and (5) evaluate the performance of the data warehouse.

Main Advisor at Poznan University of Technology (PUT)
Co-advisor at Université Libre de Bruxelles