Apprentissage statistique pour l'estimation de sélectivité en bases de données relationnelles

by Mr Max Halford (Institut de Mathématiques de Toulouse)

Amphithéâtre Laurent Schwartz, bâtiment 1R3 (Institut de Mathématiques de Toulouse)

Amphithéâtre Laurent Schwartz, bâtiment 1R3

Institut de Mathématiques de Toulouse

118 route de Narbonne 31062 Toulouse Cedex 9

Databases, and in particular relational databases, are a common paradigm for storing and querying data. Data is stored in relations which are linked with each other. Users may interrogate the database by issuing queries via a declarative query language. One of the challenges for a relational database management system (RDBMS) is to evaluate each query as quickly possible. However, as it turns out, there are many ways in which a query can be executed. The RDBMS delegates the choice of the query execution plan to a query optimiser. In turn, the query optimiser tasks a cost model with the responsibility of estimating the cost of each execution plan considered by the optimiser. The cost estimation therefore has a large influence on the quality of the execution plans, which in turns affects the running time that the user perceives. The most important input in the cost model is the selectivity, which is the amount of data that is expected to flow in and out of each operator in a plan. From a statistical point of view, this is a case of density estimation, where the goal is to determine the amount of data that verifies certain conditions. It is common practice to use density estimation models that make strong simplifying assumptions, in large part for efficiency reasons. For instance, it is usually assumed that attributes are independent with each other. Alas, in practice, these assumptions are more often that not unverified, which leads to large estimation errors. The goal of this PhD is thus to propose models that offer a better compromise between selectivity estimation accuracy and computational cost. First of all, we propose a methodology based on Bayesian networks, where we constrain each network to have a tree topology, therefore allowing a linear computational complexity. Secondly, we train online machine learning models to correct an existing selectivity estimation model, whilst adapting to concept drift. Our experimental results, which are based on the TPC-DS and JOB benchmarks, are very encouraging with respect to both of our lines of work.