II.7. Test data generation

Obtaining the right data for evaluating whether an extract-transform-load (ETL) process design meets different quality standards is challenging. First, real data might be out of reach due to privacy constraints, while producing synthetic data is a labor-intensive task that needs to take various combinations of process parameters into account. Additionally, a single dataset usually does not represent the evolution of data throughout the complete process lifespan and therefore misses many possible test cases. To facilitate this demanding task, in this topic we propose to build an automatic data generator. Starting from a given ETL process model, it should extract the semantics of data transformations, analyze the constraints they imply over data, and automatically generate testing datasets. At the same time, it should consider different dataset and transformation characteristics (e.g., size, distribution, selectivity) in order to cover a variety of test scenarios. Especially relevant and challenging in this case are finding the optimal number of tuples to generate and dealing with aggregation tests.
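
As a rough illustration of the idea, the following minimal sketch generates test tuples for a single filter operation so that a requested fraction of them satisfies the filter predicate. It is not the proposed generator: the names FilterOp, generate_rows and apply_filter, the "column > threshold" predicate form, the uniform value distribution and the spread parameter are all assumptions made for this example only.

```python
import random
from dataclasses import dataclass


@dataclass
class FilterOp:
    """Hypothetical model of one ETL filter: keep rows where column > threshold."""
    column: str
    threshold: int


def generate_rows(op, size, selectivity, seed=0, spread=100):
    """Generate `size` rows so that round(size * selectivity) of them pass `op`.

    Values are drawn uniformly (an assumption) within `spread` of the threshold.
    """
    rng = random.Random(seed)
    n_pass = round(size * selectivity)
    rows = []
    for i in range(size):
        if i < n_pass:
            # Strictly above the threshold: the row satisfies the predicate.
            value = rng.randint(op.threshold + 1, op.threshold + spread)
        else:
            # At or below the threshold: the row is filtered out.
            value = rng.randint(op.threshold - spread, op.threshold)
        rows.append({"id": i, op.column: value})
    rng.shuffle(rows)  # avoid having all passing rows grouped at the front
    return rows


def apply_filter(op, rows):
    """Reference implementation of the filter, used to check the generated data."""
    return [r for r in rows if r[op.column] > op.threshold]


op = FilterOp(column="amount", threshold=50)
rows = generate_rows(op, size=1000, selectivity=0.25)
print(len(apply_filter(op, rows)))  # 250
```

Varying size, selectivity and seed yields different datasets for the same operation. A real generator would additionally have to combine the constraints of all operations in the process model, which this sketch does not attempt.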

Main Advisor at Universitat Politècnica de Catalunya (UPC)
Co‐advisor at Poznan University of Technology (PUT)