Séminaire de Statistique et Optimisation

Non-asymptotic control of a kernel 2-sample test.

by Perrine Lacroix (ENS Lyon)

Europe/Paris
Salle K. Johnson, 1st floor (1R3)

Description

We are interested in statistical tests of the hypothesis H₀: {P = Q} against the alternative H₁: {P ≠ Q}. Our data are multivariate, high-dimensional and exhibit strong dependencies between variables. We propose a two-sample test based on kernel methods: the data are first transformed via a well-chosen feature map, so that they live in a reproducing kernel Hilbert space (RKHS). Our kernel test statistic is the analogue of Hotelling's $T^2$ test for finite-dimensional multivariate data: it equals the difference of the kernel mean embeddings (the quantity underlying the maximum mean discrepancy, MMD) renormalized by a well-chosen covariance operator.
Classically, such non-parametric tests are calibrated either asymptotically or via test aggregation techniques. Here, we propose to calibrate the test at a given fixed sample size by obtaining non-asymptotic bounds on our test statistic. For this, the covariance operator, approximated by its empirical estimator, must be regularized. Unlike the approaches of Harchaoui et al. (2007) or Hagrass et al. (2023), which use $L_2$ regularization, we propose spectral truncation. This method keeps only a number $T$ of eigenfunctions, unknown a priori, to reconstruct the covariance operator, and has the additional advantage of enabling data visualization.
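As a rough illustration of the two regularizations contrasted above, the statistics can be sketched as follows. The notation is assumed here for illustration and is not taken from the talk: $n_1$ and $n_2$ are the sample sizes, $n = n_1 + n_2$, $\phi$ is the feature map, $\widehat{\mu}_1$ and $\widehat{\mu}_2$ are the empirical mean embeddings, $\gamma > 0$ is a ridge parameter, and centering and normalization constants, which vary between papers, are omitted.

$$\widehat{\Sigma}_W = \frac{1}{n} \sum_{j=1}^{2} \sum_{i=1}^{n_j} \big(\phi(x_{j,i}) - \widehat{\mu}_j\big) \otimes \big(\phi(x_{j,i}) - \widehat{\mu}_j\big) = \sum_{t \geq 1} \widehat{\lambda}_t \, \widehat{f}_t \otimes \widehat{f}_t$$

$$\text{($L_2$ regularization)} \quad \frac{n_1 n_2}{n} \left\langle \big(\widehat{\Sigma}_W + \gamma I\big)^{-1} (\widehat{\mu}_1 - \widehat{\mu}_2), \; \widehat{\mu}_1 - \widehat{\mu}_2 \right\rangle$$

$$\text{(spectral truncation)} \quad \frac{n_1 n_2}{n} \sum_{t=1}^{T} \widehat{\lambda}_t^{-1} \left\langle \widehat{f}_t, \; \widehat{\mu}_1 - \widehat{\mu}_2 \right\rangle^2$$

In this form, the truncated statistic is a sum of $T$ squared projections of the mean difference onto the leading eigenfunctions, which is what makes it possible to visualize the data along these directions.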
For a fixed $T$, the test statistic, called the truncated kernel Fisher Discriminant Ratio ($KFDA_T$), yields a test whose asymptotic calibration is known (Ozier-Lafontaine et al., 2023). In this talk, I will present how to bound, theoretically and non-asymptotically, the p-value of the test associated with the $KFDA_T$. This bound is a first step toward a good calibration of the hyperparameter $T$.
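To make the truncated statistic concrete, here is a minimal NumPy sketch that computes it from a kernel matrix. It is an illustration under stated assumptions, not the speaker's implementation: the function names `gaussian_kernel` and `truncated_kfda`, the Gaussian kernel with unit bandwidth, and the normalization by $n_1 n_2 / n$ are all choices made here.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between rows of A and rows of B (assumed choice)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def truncated_kfda(X, Y, T, kernel=gaussian_kernel):
    """Sketch of a truncated kernel Fisher discriminant statistic.

    Computes (n1 * n2 / n) * sum_{t <= T} <f_t, mu_1 - mu_2>^2 / lambda_t,
    where (lambda_t, f_t) are the leading eigenpairs of the empirical
    within-group covariance operator, using only the kernel matrix.
    """
    n1, n2 = len(X), len(Y)
    n = n1 + n2
    K = kernel(np.vstack([X, Y]), np.vstack([X, Y]))

    # Within-group centering matrix: I - (1/n_j) 1 1^T on each diagonal block.
    P = np.eye(n)
    P[:n1, :n1] -= 1.0 / n1
    P[n1:, n1:] -= 1.0 / n2

    # Weights such that Phi^T omega = mu_1 - mu_2 (difference of mean embeddings).
    omega = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])

    # P K P / n shares its nonzero eigenvalues with the covariance operator.
    lam, U = np.linalg.eigh(P @ K @ P / n)
    top = np.argsort(lam)[::-1][:T]
    lam, U = lam[top], U[:, top]
    if lam[-1] <= 1e-12 * lam[0]:
        raise ValueError("T exceeds the numerical rank of the covariance estimate")

    # With f_t = Phi^T P u_t / sqrt(n * lambda_t), one gets
    # <f_t, mu_1 - mu_2> = u_t^T P K omega / sqrt(n * lambda_t).
    proj = U.T @ (P @ (K @ omega))
    return (n1 * n2 / n) * np.sum(proj**2 / (n * lam**2))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Y = rng.normal(size=(40, 3)) + 0.5
print(truncated_kfda(X, Y, T=4))
```

With a linear kernel and $T$ equal to the data dimension, this sketch reduces to a Hotelling-type quadratic form in the difference of sample means, which gives a simple sanity check. Calibrating the resulting value, asymptotically or with the non-asymptotic bounds discussed in the talk, is a separate step not covered here.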
In applications, this statistical question is essential in the field of genomics, where the two groups are composed of single-cell RNA-seq data. The goal is to detect distinct or similar biological behavior between the groups.

Joint work with Bertrand Michel (Université de Nantes), Franck Picard (ENS de Lyon) and Vincent Rivoirard (Paris-Dauphine).