ID.12. Framework for generating test data for a query workload

Database vendors and developers put substantial effort into ensuring acceptable database (DB) performance. To this end, various DB performance optimization methods have been developed and are used in practice. An important step in these methods is gathering statistics on DBMS performance (e.g., data cache usage) and query performance (e.g., execution plans, CPU and I/O usage). These statistics are gathered while a query workload runs in a production system.
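As a minimal illustration of the query-level statistics involved, the sketch below inspects an execution plan using Python's standard-library sqlite3 module as a stand-in for a production DBMS; the table, index, and query are hypothetical, and a production system would expose much richer plan and resource statistics:

    # Minimal illustration (not the project's tooling) of inspecting a
    # query's execution plan; sqlite3 stands in for a production DBMS.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
                 " customer_id INTEGER, total REAL)")
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # EXPLAIN QUERY PLAN reports the access path chosen for a workload
    # query, e.g., an index lookup versus a full table scan.
    query = "SELECT SUM(total) FROM orders WHERE customer_id = ?"
    for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
        print(row)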

Vendors of database management systems (DBMSs) often provide technical support for their products, which includes database performance tuning. Unfortunately, such performance tuning typically cannot be done in a production system, as it often requires a "try and test" approach: various alternative design techniques (e.g., indexes, partitions) are created, and the workload is run to test DB performance under each of these alternative designs. The performance tests are therefore frequently done in a test environment owned by the DBMS vendor. This environment must be as similar as possible to the production one, including hardware, software, and data. A problem the vendors face is running the workload on the data set used in the production system: in practice, such data sets are frequently unavailable due to data privacy policies.

For this reason, DB vendors must generate synthetic data sets. It is difficult, however, to generate a data set that is as "close" as possible to the original (production) one. In order to generate adequate test data sets, one has to know various statistical metadata about the original data set.
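As an illustration of such metadata, the sketch below computes, for a single non-empty numeric column, the kind of per-column statistics (row count, null fraction, cardinality, value range, equi-width histogram) that a description of the source data set could include; the function name and the exact statistics are assumptions, not the metadata set this project will propose:

    # A sketch (illustrative names, numeric columns only) of per-column
    # statistics a generator might need to mimic a source data set.
    from collections import Counter

    def column_stats(values, num_buckets=10):
        """Summarize one column: size, nulls, cardinality, range, histogram."""
        non_null = [v for v in values if v is not None]
        stats = {
            "row_count": len(values),
            "null_fraction": 1 - len(non_null) / len(values) if values else 0.0,
            "distinct_count": len(set(non_null)),
            "min": min(non_null),
            "max": max(non_null),
        }
        # Equi-width histogram: bucket counts let a generator reproduce skew.
        lo, hi = stats["min"], stats["max"]
        width = (hi - lo) / num_buckets or 1
        hist = Counter(min(int((v - lo) / width), num_buckets - 1)
                       for v in non_null)
        stats["histogram"] = [hist.get(b, 0) for b in range(num_buckets)]
        return stats

    print(column_stats([1, 5, 5, 7, None, 12, 12, 12, 30, 42]))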

The overall aim of this research topic is to develop a method for generating adequate data sets for a given query workload. In particular, the PhD student will:

  • propose a set of metadata for describing a source data set, sufficient for generating adequate synthetic data,
  • develop a method for understanding the meaning of a query workload and for identifying, from the queries, patterns of dependencies between data and multiple different data distributions,
  • develop a method for generating multiple versions of data for testing various aspects of query performance (e.g., filtering, joins, aggregations),
  • design and implement a data generator (a minimal sketch follows this list).
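As a starting point for the last item, here is a minimal sketch of a histogram-driven generator; the interface is hypothetical (it is not the IBM/PUT prototype) and it consumes statistics shaped like those produced by the earlier sketch:

    # A sketch of a histogram-driven generator (hypothetical interface):
    # it draws synthetic values whose distribution approximates the
    # per-column statistics gathered from the source data set.
    import random

    def generate_column(stats, n, seed=0):
        """Yield n synthetic numeric values approximating a column's stats."""
        rng = random.Random(seed)
        lo, hi = stats["min"], stats["max"]
        buckets = stats["histogram"]
        width = (hi - lo) / len(buckets) or 1
        for _ in range(n):
            if rng.random() < stats["null_fraction"]:
                yield None
                continue
            # Pick a bucket with probability proportional to its source
            # count, then draw uniformly within it, reproducing the skew.
            b = rng.choices(range(len(buckets)), weights=buckets)[0]
            yield lo + width * (b + rng.random())

    # Statistics shaped like the output of the earlier column_stats sketch.
    stats = {"min": 1, "max": 42, "null_fraction": 0.1,
             "histogram": [3, 1, 3, 0, 0, 0, 0, 1, 0, 1]}
    print(list(generate_column(stats, 5)))

Reproducing per-column histograms mainly preserves filtering selectivity; join and aggregation performance also depend on key overlap and inter-column dependencies, which is precisely what the methods listed above are meant to capture.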

The starting point for this topic is a data generation approach developed for IBM Poland at PUT within a master's thesis. For this reason, there is a potential opportunity to cooperate with IBM Poland in the course of this PhD project.

Main Advisor at Poznan University of Technology (PUT)
Co-advisor at Universitat Politècnica de Catalunya (UPC)