MS.8. Design of Big Data Query Languages

Modern Business Intelligence (BI) applications must analyze increasingly large volumes of data. To cope with this data avalanche, sophisticated BI applications are tapping into the power of parallel and distributed computing offered, for example, by multicore processors and cloud computing.

Unfortunately, parallel and distributed programming remains challenging today, even for the best programmers. Much effort is therefore being put into creating new, expressive, *declarative* query and programming languages that naturally yield parallel and distributed executions. Notable examples from industry include Yahoo's Pig Latin and Apache Hive (both targeting MapReduce), and Microsoft's DryadLINQ (targeting Dryad). All of these systems were inspired by SQL to some extent.

Another route towards distributed programming has been taken by researchers at UC Berkeley, who have demonstrated that Datalog, a logic-based database query language intensely studied in the 1980s, can serve as the rootstock of an arguably simpler family of languages for programming seriously parallel and distributed software. Datalog-style implementations of distributed systems are reported to be orders of magnitude more compact than popular imperative implementations of comparable systems. The effectiveness of Datalog-style languages in the data warehousing and business intelligence context remains to be investigated, however.

This topic deals with the study and design of Datalog-like languages for the parallel and distributed analysis of business intelligence data. A particular point of investigation is the design of suitable business intelligence primitives in this context.

Main Advisor at Université Libre de Bruxelles (ULB)
Co-advisor at Aalborg Universitet (AAU)