Séminaire de Statistique et Optimisation

Pattern enumeration in tree data: subtrees and beyond

by Romain Azaïs (INRIA Lyon)

Salle K. Johnson, 1er étage (1R3)

Salle K. Johnson, 1er étage


Kernel methods are particularly adapted to the analysis of complex combinatorial data (such as sequences, trees or graphs) which do not admit in general a Euclidean structure. Convolution kernels consist in defining the similarity between 2 data via the number of occurrences of certain substructures they share. In order to accurately compare data, one may want to choose a rich family of admissible substructures, but the difficulty then appears in the computation of the kernel: the richer the substructures involved, the more complex the computation of the kernel. In some cases, the kernel can be evaluated efficiently without having to enumerate the substructures in common. However, this can strongly constrain the parametrization of the weight function. Through the example of the subtree kernel, we state, both theoretically and numerically, the importance of the design of the weight function in supervised classification problems. We develop a new algorithm for computing the subtree kernel, based on the minimal enumeration of substructures by exact compression techniques of trees. This method allows to extract the important features and to learn the weight function on the data. We establish the interest of this algorithm both in terms of complexity, prediction capability and data visualization, on 8 different real databases. Finally, we show how more sophisticated enumeration techniques can be used to extend the kernel to higher orders and solve frequent pattern mining problems.