Rencontres Statistiques Lyonnaises

Count matrix factorization for dimension reduction and data visualization

by Ghislain Durif (LBBE, Université Lyon 1)

Séminaire 2 (Bâtiment Braconnier)


Description
The statistical analysis of Next-Generation Sequencing (NGS) data has raised many computational challenges regarding modeling and inference. High-throughput technologies now make it possible to monitor the expression of thousands of genes across a growing number of individuals, such as hundreds of individual cells. Despite the increasing number of observations, genomic data remain characterized by their high dimensionality. Analyzing such data requires dimension reduction approaches, in particular for data exploration. In this context, we will focus on unsupervised compression methods, i.e. methods that represent the data in a lower-dimensional space. We will consider the framework of matrix factorization for count data. We propose a flexible model-based approach that accounts for over-dispersion as well as zero-inflation, both of which are characteristic of single-cell data. Our matrix factorization method relies on a Gamma-Poisson hierarchical model, for which we derive an estimation procedure based on variational inference. Within this scheme, we consider variable selection based on a spike-and-slab model suitable for count data. The value of our procedure for data reconstruction, visualization and clustering is illustrated with simulation experiments and with preliminary results from an ongoing analysis of single-cell data.
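
To make the data-generating side of the abstract concrete, here is a minimal simulation sketch of a Gamma-Poisson factor model with zero-inflation. It is not the speaker's implementation and does not perform variational inference or spike-and-slab selection. The function name `simulate_counts`, the Gamma shape and scale values, the matrix sizes, and the single global dropout probability are all illustrative assumptions.

```python
import numpy as np


def simulate_counts(n_cells=200, n_genes=50, n_factors=3, dropout=0.3, seed=0):
    """Simulate zero-inflated counts from a Gamma-Poisson factor model.

    All parameter values are hypothetical and chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Assumed Gamma priors on the non-negative factor matrices U (cells) and V (genes).
    U = rng.gamma(shape=2.0, scale=1.0, size=(n_cells, n_factors))
    V = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_factors))
    # Poisson rate for each (cell, gene) entry is the low-rank product U V^T.
    rate = U @ V.T
    counts = rng.poisson(rate)
    # Zero-inflation: each entry is independently set to zero with probability `dropout`.
    keep = rng.random((n_cells, n_genes)) >= dropout
    return counts * keep, rate


X, rate = simulate_counts()
# Over-dispersion shows up as an empirical variance larger than the mean,
# and zero-inflation as an excess fraction of zero entries.
print("mean:", X.mean(), "variance:", X.var(), "zero fraction:", np.mean(X == 0))
```

Mixing Poisson counts over Gamma-distributed rates is what produces the over-dispersion, while the dropout mask adds zeros beyond what the Poisson layer alone would generate.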