Convolutional neural networks (CNNs) perform well at predicting phenotypes from unannotated biological sequences. To do so, they optimize filters that can be interpreted as sequence motifs. Such motifs appear to be relevant variants for genome-wide association studies (GWAS), which aim to identify correlations between genetic variants and a trait. In particular, they are better suited than standard variants to GWAS applied to metagenomes or to organisms with accessory genomes.
To our knowledge, no existing framework performs statistical inference on the trained filters of a CNN. Although standard data-splitting strategies exist for GWAS, using them to test the association between the motifs and the phenotype has two costs on small datasets: motif optimization performs worse, and the tests lose statistical power, since each step sees only part of the data.
In the present work, we first develop a stable step-wise procedure to select a small number of sequence motifs associated with a trait, and we draw a formal link between our procedure and CNNs for biological sequences.
We then take advantage of recent advances in post-selection inference to produce a well-calibrated testing procedure for the association between the selected motifs and the trait, while accounting for our selection procedure.
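To make concrete the sense in which a convolutional filter can be read as a sequence motif, the following minimal sketch scores a DNA sequence against a single filter and keeps the best match over all positions. It is an illustration only, not the selection procedure developed in this work: the names `one_hot`, `filter_activation`, and `motif_filter` are hypothetical, the alphabet is assumed to be `ACGT`, and the filter is set by hand rather than learned.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide alphabet for this illustration


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    idx = [ALPHABET.index(c) for c in seq]
    return np.eye(len(ALPHABET))[idx]


def filter_activation(seq, filt):
    """Slide the filter along the sequence and return the best window score.

    Each window score is the sum of the element-wise product between the
    filter and the one-hot window, i.e. a 1D convolution followed by
    max pooling over positions.
    """
    x = one_hot(seq)
    k = filt.shape[0]
    scores = [np.sum(x[i:i + k] * filt) for i in range(len(seq) - k + 1)]
    return max(scores)


# Hypothetical hand-set filter: a one-hot filter that matches "TATA" exactly.
motif_filter = one_hot("TATA")

print(filter_activation("GGCTATAAGC", motif_filter))  # sequence containing TATA
print(filter_activation("GGCCGCGAGC", motif_filter))  # sequence without TATA
```

Under these assumptions, the activation is highest when the sequence contains an exact occurrence of the motif, which is what allows a per-sequence activation to play the role of a variant in an association test.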