MS.13. Declarative Languages for Information Extraction and Text Analytics

Information Extraction (IE) is the task of automatically extracting structured information from text. While IE originally found application primarily in the military domain, it is now pervasive in many computational challenges (especially those associated with Big Data), including social media analysis, customer relationship management, machine data analysis, health care analysis, and indexing for semantic search. In addition, IE often constitutes a first step in text analytics and business intelligence settings. While there is a vast body of research on approaches to IE, existing solutions mostly target restricted settings where users are highly trained computational linguists, where workloads cover only a small number of very well-defined tasks and data sets, and where extraction throughput is far less important than the accuracy of results. In contrast, the ubiquity, volume, and diversity of textual data (corporate documents, emails, blogs, tweets, etc.), coupled with the growing range of IE applications described above, give rise to an acute need for IE solutions that are expressive, programmer-friendly for non-linguists, scalable, and efficiently executable.

These observations have recently motivated the design of declarative, SQL-like languages for expressing information extraction programs, borrowing ideas from the database research community. This approach has already proven effective in practice and has been commercialized, for example, in IBM SystemT. While the declarative specification of IE programs constitutes a fundamental paradigm shift in the way that IE programs are specified and executed, it currently suffers from two major shortcomings: restricted language features and limited evaluation strategies. The goal of this research project is to overcome these limitations by:

  1. Investigating the extension of these languages with expressive operators such as recursion, aggregation, and OLAP-style multi-dimensional analytical operators;
  2. Researching how the declarative approach can be combined with statistical approaches to Information Extraction; and
  3. Developing efficient evaluation strategies for this approach that are capable of analyzing documents in a high-throughput streaming fashion.

Main Advisor at Université Libre de Bruxelles (ULB)
Co‐advisor at Aalborg Universitet (AAU)