# Mixture models: Theory and applications

Europe/Paris
Amphi Risler (Thursday), Amphi Tisserand (Friday) (Paris)

AgroParisTech 16 rue Claude Bernard F-75231 Paris Cedex 05
Description

This workshop is organized by the members of

• Gilles Celeux (INRIA Saclay-Ile-de-France)
• Sandrine Laguerre (LISBP / INRA Toulouse)
• Béatrice Laurent-Bonneau (IMT/INSA Toulouse)
• Clément Marteau (ICJ/Lyon 1)
• Marie-Laure Martin-Magniette (MIA-Paris/ IPS2)
• Cathy Maugis-Rabusseau (IMT/INSA Toulouse)
• Andréa Rau (INRA Jouy-en-Josas)

Participants
• Agathe Guilloux
• Andrea Rau
• Antoine Godichon-Baggioni
• Antoine Houdard
• Arnak Dalalyan
• Béatrice Laurent-Bonneau
• Brendan Murphy
• Cathy Maugis-Rabusseau
• Christine Keribin
• Christophe Biernacki
• Clément Marteau
• Didier Chauveau
• Dominique Picard
• Emmanuelle Mauger
• Fabien Laporte
• Fanny Villers
• Gildas Mazo
• Gilles Celeux
• Jean-Michel Marin
• Jean-Patrick Baudry
• Jonas Kahn
• Kaniav Kamary
• Kim-Anh Le Cao
• Marie Courbariaux
• Marie-Laure Martin-Magniette
• Mohamed Sedki
• Nathalie Akakpo
• Pascal Germain
• Philippe Heinrich
• Pierre Barbillon
• Pierre Latouche
• Sandrine Laguerre
• Serge Iovleff
• Tabea Rebafka
• Thursday, 21 June
• 09:15 09:30
Welcome 15m
• 09:30 12:30
Theory around mixtures
• 09:30
Optimal Kullback-Leibler Aggregation in Mixture Density Estimation by Maximum Likelihood 30m
We study the maximum likelihood estimator of the density of n independent observations, under the assumption that it is well approximated by a mixture with a large number of components. The main focus is on statistical properties with respect to the Kullback-Leibler loss. We establish risk bounds taking the form of sharp oracle inequalities both in deviation and in expectation. A simple consequence of these bounds is that the maximum likelihood estimator attains the optimal rate $((\log K)/n)^{1/2}$, up to a possible logarithmic correction, in the problem of convex aggregation when the number K of components is larger than $n^{1/2}$. More importantly, under the additional assumption that the Gram matrix of the components satisfies the compatibility condition, the obtained oracle inequalities yield the optimal rate in the sparsity scenario. That is, if the weight vector is (nearly) D-sparse, we get the rate $(D\log K)/n$. As a natural complement to our oracle inequalities, we introduce the notion of nearly-D-sparse aggregation and establish matching lower bounds for this type of aggregation.
Speaker: Prof. Arnak Dalalyan (ENSAE / CREST)
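To get a feel for the two rates quoted in this abstract, the following sketch simply evaluates $((\log K)/n)^{1/2}$ and $(D\log K)/n$ numerically. The values of n, K and D are arbitrary assumptions chosen for illustration (they do not come from the talk), and constants and logarithmic corrections are ignored.

```python
import math

def convex_aggregation_rate(n, K):
    """Rate ((log K) / n)^(1/2) quoted for convex aggregation."""
    return math.sqrt(math.log(K) / n)

def sparse_rate(n, K, D):
    """Rate (D log K) / n quoted for (nearly) D-sparse weight vectors."""
    return D * math.log(K) / n

# Illustrative values only (assumptions); here K is larger than sqrt(n).
n, K, D = 10_000, 1_000, 5
slow = convex_aggregation_rate(n, K)
fast = sparse_rate(n, K, D)
print(f"convex aggregation: {slow:.4f}, D-sparse: {fast:.4f}")
```

With these values the sparse rate is roughly an order of magnitude smaller, which is the gain the sparsity scenario is meant to capture.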
• 10:00
Weakly informative reparameterisations for location-scale mixtures 30m
While mixtures of Gaussian distributions have been studied for more than a century, the construction of a reference Bayesian analysis of those models remains unsolved, with a general prohibition of improper priors due to the ill-posed nature of such statistical objects. This difficulty is usually bypassed by an empirical Bayes resolution. By creating a new parameterisation centred on the mean and possibly the variance of the mixture distribution itself, we manage to develop here a weakly informative prior for a wide class of mixtures with an arbitrary number of components. We demonstrate that some posterior distributions associated with this prior and a minimal sample size are proper. We provide MCMC implementations that exhibit the expected exchangeability. We only study here the univariate case, the extension to multivariate location-scale mixtures being currently under study. An R package called Ultimixt is associated with this paper.
Speaker: Dr Kaniav Kamary (Université Paris-Dauphine / CEREMADE / INRIA, Saclay)
• 10:30
Coffee Break 30m
• 11:00
Estimation rates for the parameters of a finite mixture 30m
A finite statistical mixture is a distribution of the form $\sum_i \pi_i f(\cdot, \theta_i)$; that is, each data point is generated as follows: $i$ is chosen with probability $\pi_i$, and the data point is then drawn from the distribution $f(\cdot, \theta_i)$. Mixtures are therefore well suited to modelling heterogeneous populations, or to building complex distributions from relatively simple ones. Estimating the parameters $\pi_i$ and $\theta_i$ of the mixture is harder than in smooth parametric settings. We will show that the minimax estimation rate for a mixture with at most $m$ components is $n^{-1/(4m-2)}$, thereby correcting the erroneous rate of $n^{-1/4}$ that was previously accepted. Part of the confusion probably stems from the fact that the pointwise estimation rates are different: they are of order $n^{-1/2}$, but they are not uniform over the parameter space. We will elaborate on this difference, which is perhaps not very common.
Speaker: Dr Jonas Kahn (CNRS/IMT)
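The two-step generative description of a finite mixture given in this abstract (pick a component $i$ with probability $\pi_i$, then draw from $f(\cdot,\theta_i)$) can be sketched as follows. This is a minimal illustration, assuming univariate Gaussian components for $f$; the weights, means and standard deviations are arbitrary, and the helper names are hypothetical. The second function only evaluates the quoted rate $n^{-1/(4m-2)}$, ignoring constants.

```python
import random

def sample_mixture(weights, means, sds, n, seed=0):
    """Draw n (label, value) pairs from a univariate Gaussian mixture.

    Gaussian components are an assumption of this sketch; the abstract
    covers a general family f(., theta).
    """
    rng = random.Random(seed)
    # Step 1: choose the component index i with probability pi_i.
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    # Step 2: draw the observation from the chosen component.
    return [(i, rng.gauss(means[i], sds[i])) for i in labels]

def minimax_rate(n, m):
    """Rate n^(-1/(4m-2)) quoted for mixtures with at most m components."""
    return n ** (-1.0 / (4 * m - 2))

draws = sample_mixture([0.3, 0.7], [-2.0, 1.0], [1.0, 0.5], n=1000)
share_first = sum(1 for i, _ in draws if i == 0) / len(draws)
print(f"empirical weight of component 0: {share_first:.3f}")
for m in (1, 2, 3):
    print(m, minimax_rate(10**6, m))
```

For $m=1$ the formula reduces to the usual parametric rate $n^{-1/2}$, and it degrades quickly as the maximal number of components $m$ grows.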
• 11:30
Estimation in a contamination model by an L2 method 30m
In this theoretical work, we study estimation in a translation contamination model. We observe an i.i.d. sample whose distribution has a density on $\mathbb{R}^d$ $$f^\star = (1-\lambda^\star) \phi + \lambda^\star \phi(\cdot-\mu^\star)$$ and wish to study a method for estimating the contamination probability $\lambda^\star$ and its effect $\mu^\star$. We propose an estimation criterion based on an $\mathbb{L}^2$ minimization and obtain optimal results for the parameters $(\lambda^\star,\mu^\star)$. To do so, we use a clever and new refinement of the Cauchy-Schwarz inequality for points on an $\mathbb{L}^2$ sphere. Finally, we relate our results to estimation problems in Wasserstein distance. This is joint work with Jonas Kahn (IMT), Clément Marteau (ICJ) and Cathy Maugis-Rabusseau (IMT/INSA).
• 12:00
Multidimensional two components Gaussian mixture detection 30m
We consider a d-dimensional i.i.d. sample from a distribution with unknown density f and address the problem of detecting a two-component mixture. Our aim is to test whether f is the density of a standard Gaussian random d-vector ($f=\phi_d$) against the alternative that f is a two-component mixture: $f=(1-\varepsilon)\phi_{d}+\varepsilon \phi_{d}(\cdot-\mu)$, where $(\varepsilon,\mu)$ are unknown parameters. Optimal separation conditions on $\varepsilon$, $\mu$, $n$ and the dimension d are established, allowing the two hypotheses to be separated with prescribed errors. Several testing procedures are proposed and two alternative subsets are considered. This is joint work with C. Marteau (ICJ) and Cathy Maugis-Rabusseau (IMT/INSA).
Speaker: Prof. Béatrice Laurent (IMT/INSA)
• 12:30 14:00
Lunch 1h 30m
• 14:00 17:30
Mixture modelling and applications
• 14:00
Variable selection for latent class analysis with application to low back pain diagnosis 1h
The identification of the most relevant clinical criteria related to low back pain disorders may aid the evaluation of the nature of pain suffered in a way that usefully informs patient assessment and treatment. Data concerning low back pain can be of categorical nature, in the form of a check-list in which each item denotes presence or absence of a clinical condition. Latent class analysis is a model-based clustering method for multivariate categorical responses, which can be applied to such data for a preliminary diagnosis of the type of pain. In this work, we propose a variable selection method for latent class analysis applied to the selection of the most useful variables in detecting the group structure in the data. The method is based on the comparison of two different models and allows the discarding of those variables with no group information and those variables carrying the same information as the already selected ones. We consider a swap-stepwise algorithm where at each step the models are compared through an approximation to their Bayes factor. The method is applied to the selection of the clinical criteria most useful for the clustering of patients in different classes. It is shown to perform a parsimonious variable selection and to give a clustering performance comparable to the expert-based classification of patients into three classes of pain.
Speaker: Prof. Brendan Murphy (University College Dublin)
• 15:00
The stochastic topic block model 30m
Due to the significant increase of communications between individuals via social media (Facebook, Twitter, LinkedIn) or electronic formats (email, web, e-publication) in the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking into account information on the contents. This talk will introduce the stochastic topic block model (STBM), a probabilistic model for networks with textual edges. We will address here the problem of discovering meaningful clusters of vertices that are coherent from both the network interactions and the text contents. A classification variational expectation-maximization (C-VEM) algorithm will be proposed to perform inference. Finally, we will rely on the methodology to study the Enron political and financial scandals.
Speaker: Dr Pierre Latouche (Université Paris 1)
• 15:30
Coffee Break 30m
• 16:00
How to use Gaussian mixture models on patches for solving image inverse problems 30m
Most patch-based methods used in image processing involve Gaussian models or Gaussian mixture models. All these methods can be seen through the same statistical framework. The most challenging part is parameter estimation in the high-dimensional patch space. After a brief introduction on image restoration, I will present the High-Dimensional Mixture model we introduced for image denoising [HDMI], which overcomes the curse of dimensionality by estimating intrinsic dimensions for each group of the mixture model. Finally, I will present some image restoration results obtained with this method. Reference: [HDMI] Antoine Houdard, Charles Bouveyron, Julie Delon. High-Dimensional Mixture Models For Unsupervised Image Denoising (HDMI). 2018. Web page: houdard.wp.imt.fr.
Speaker: Mr Antoine Houdard (Télécom ParisTech / MAP5)
• 16:30
Issues, Challenges and Models for Document Clustering 30m
In recent years, document clustering (or text clustering) techniques have been receiving more and more attention as a fundamental and efficient tool for organizing and summarizing huge volumes of text documents. In this talk, I provide a detailed survey of the problem of document clustering. I discuss a number of recent advances in this area, in both the clustering and co-clustering contexts. I review different approaches: spectral, nonnegative matrix factorization, mixture model and latent block model.
• Friday, 22 June
• 09:30 12:30
Biological and medical applications
• 09:30
The Latent Block Model: a useful model for high dimensional data 30m
The Latent Block Model (LBM) simultaneously clusters the rows and the columns of a data array. Typically the LBM is expected to be useful for analyzing huge data sets with many observations and many variables. But it encounters several numerical issues with big data sets: maximum likelihood is jeopardized by spurious maxima, and selecting a proper model is challenging since many models are in competition. In this talk, we analyze these issues. In particular, we make use of Bayesian inference to avoid spurious solutions and propose an efficient way to scan the model set. Moreover, we advocate the exact Integrated Completed Likelihood (ICL) criterion to select a proper and consistent LBM. The methods and algorithms will be illustrated with pharmacovigilance data involving large arrays of data.
Speaker: Dr Christine Keribin (Université Paris Sud)
• 10:00
C-mix: a high dimensional mixture model for censored durations, with applications to genetic data 30m
We introduce a supervised learning mixture model for censored durations (C-mix) to simultaneously detect subgroups of patients with different prognoses and order them based on their risk. Our method is applicable in a high-dimensional setting where datasets contain a large number of biomedical covariates. To address this difficulty, we penalize the negative log-likelihood by the Elastic-Net, which leads to a sparse parameterization of the model and automatically pinpoints the relevant covariates for the survival prediction. Inference is achieved using an efficient Quasi-Newton Expectation Maximization (QNEM) algorithm. The statistical performance of the method is illustrated on three publicly available genetic cancer datasets with high-dimensional covariates.
Speaker: Prof. Agathe Guilloux (Université d'Évry Val d'Essonne)
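As a reminder of what penalizing a negative log-likelihood by the Elastic-Net means, here is a minimal sketch of the generic penalty, a weighted sum of an L1 term and a squared L2 term. The parameterization (`strength`, `mix`) and the function names are assumptions made for illustration; the exact form used in C-mix may differ.

```python
def elastic_net_penalty(beta, strength, mix):
    """Generic elastic-net penalty (hypothetical parameterization):

    strength * ((1 - mix) * ||beta||_1 + (mix / 2) * ||beta||_2^2)
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return strength * ((1.0 - mix) * l1 + 0.5 * mix * l2_squared)

def penalized_objective(neg_log_lik, beta, strength, mix):
    """Objective to minimize: negative log-likelihood plus the penalty."""
    return neg_log_lik + elastic_net_penalty(beta, strength, mix)

# Illustrative coefficient vector: zero entries are covariates left out.
beta = [0.0, 1.5, -2.0, 0.0]
lasso_only = elastic_net_penalty(beta, strength=0.1, mix=0.0)
ridge_only = elastic_net_penalty(beta, strength=0.1, mix=1.0)
print(lasso_only, ridge_only)
```

The L1 part is what drives coefficients exactly to zero and hence yields the sparse parameterization mentioned in the abstract; the L2 part stabilizes the fit when covariates are correlated.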
• 10:30
Coffee break 30m
• 11:00
Model-based clustering for cytometry 30m
High-dimensional flow and mass cytometry make it possible to measure the expression of several proteins on tens of thousands of immune cells of a patient. A common task is to predict patients' disease status. This can be done based on characteristics of the cell clusters of each patient, hence the need for clustering methods. Some constraints make this problem challenging. The clusters of cells need to be interpretable as biologically meaningful profiles. Also, interesting groups of cells are typically rare populations. We propose a procedure relying on model-based clustering and merging of clusters.
Speaker: Dr Jean-Patrick Baudry (LSTA)
• 11:30
Clustering of co-expressed genes 1h
Complex studies of transcriptome dynamics are now routinely carried out using RNA sequencing (RNA-seq). A common goal in such studies is to identify groups of co-expressed genes that share similar expression profiles across several treatment conditions, time points, or tissues. These co-expression analyses can in fact serve a double purpose: (1) as an exploratory tool to visualize cluster-specific profile trajectories; and (2) as a hypothesis-generating tool for poorly annotated genes, as co-expression clusters may correspond to genes involved in similar biological processes or that are candidates for co-regulation. Although a large number of clustering algorithms have been proposed in the past to identify groups of co-expressed genes from microarray data, the question of whether and how such methods may be applied to RNA-seq data has only recently been addressed. During the MixStatSeq project, we have proposed several methods to solve this gene clustering problem. After a first procedure based on a Poisson mixture model (Rau et al., 2015) on the raw counts of sequenced reads for each gene, the problem was reformulated as the clustering of normalized expression profiles, which represent compositional data. Data transformations in conjunction with Gaussian mixture models were considered as an effective strategy to identify RNA-seq co-expression clusters in Rau and Maugis-Rabusseau (2017). Some related strategies were investigated in Godichon-Baggioni et al. (2018) using K-means. All of these procedures are implemented in the R/Bioconductor package coseq.
Speaker: Dr Cathy Maugis-Rabusseau (IMT / INSA Toulouse)
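The idea of treating normalized expression profiles as compositional data can be sketched as follows: each gene's counts across samples are divided by their sum, so every profile is a vector of proportions summing to one. This is a simplified illustration, not the coseq implementation; it assumes the counts have already been corrected for library size, and the arcsine square-root transform shown is one common choice for proportions, used here as an assumption rather than as the transformation presented in the talk.

```python
import math

def expression_profile(counts):
    """Turn one gene's counts across samples into proportions summing to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("a gene with zero total count has no profile")
    return [c / total for c in counts]

def arcsine_transform(profile):
    """Arcsine square-root transform, mapping [0, 1] to [0, pi/2]."""
    return [math.asin(math.sqrt(p)) for p in profile]

# Hypothetical counts for one gene in three samples (already normalized).
gene_counts = [120, 30, 50]
profile = expression_profile(gene_counts)
transformed = arcsine_transform(profile)
print(profile)
print(transformed)
```

Because the proportions are constrained to sum to one, a transformation of this kind is typically applied before fitting a Gaussian mixture model or running K-means on the profiles.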
• 12:30 14:00
Lunch 1h 30m
• 14:00 17:00
Software
• 14:00
MASSICCC: A SaaS Platform for Clustering Mixed Data 45m
The "Big Data" paradigm involves large and complex data sets where the clustering task plays a central role for data exploration. For this purpose, model-based clustering has demonstrated many theoretical and practical successes in a wide variety of fields. In this context, user-friendly software is essential for speeding up the diffusion of such academic advances into the applied world. MASSICCC (massive clustering in cloud computing) is a user-friendly SaaS platform which hosts three software packages specialized in different clustering tasks and written in C++. This platform makes it possible to manipulate complex data with very light computing tools (such as a smartphone), and also includes some dynamic graphical outputs. It also offers the possibility to export the results into an R data format for more expert tasks. The three embedded software packages are Mixmod, MixtComp and Blockcluster. Mixmod (Lebret et al. 2015) is dedicated to clustering of continuous data, categorical data, and mixtures of the two. MixtComp (Biernacki 2015) adds the possibility to cluster totally mixed data (continuous, categorical, count, ordinal, rank, functional), potentially including missing or partially missing (like interval) data. Blockcluster (Bhatia et al. 2017) is dedicated to co-clustering of large data sets composed of different kinds of data such as continuous, categorical and count data. In this talk, we will focus on both the Mixmod and MixtComp software. MASSICCC is freely available at https://massiccc.lille.inria.fr References: P. Bhatia, S. Iovleff & G. Govaert (2017). Blockcluster: An R Package for Model-Based Co-Clustering. Journal of Statistical Software, 76:9. C. Biernacki (2015). Model-based clustering with mixed/missing data using the new software MixtComp. 8th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics 2015), University of London, UK, 12-14 December. R. Lebret, S. Iovleff, F. Langrognet, C. Biernacki, G. Celeux & G. Govaert (2015). Rmixmod: The R Package of the Model-Based Unsupervised, Supervised and Semi-Supervised Classification Mixmod Library. Journal of Statistical Software, 67:6.
Speaker: Prof. Christophe Biernacki (Université Lille 1/INRIA)
• 14:45
"Blockcluster" and "simerge" : Two R packages for Latent Block Models and Latent Block Models with co-variables implemented in C++ 30m
The basic idea of the Latent Block Model (LBM) consists in permuting individuals (rows) and variables (columns) in order to reveal a correspondence structure between individuals and variables. The R package "blockcluster" implements generative LBMs for binary, contingency, continuous and categorical data sets. To estimate the parameters, it implements the BEM and BCEM algorithms. The R package "simerge" is a work in progress and makes it possible to estimate an LBM when additional information is available. It implements the BEM algorithm. Both packages use a C++ implementation and benefit from advanced C++ structures implemented by the STK++ library and the rtkore package (the port of STK++ to R). In this talk we will outline the theory of LBM (with and without co-variables) and present some showcase examples. In a second part we will focus on implementation and explain how the packages take advantage of C++ for large tables.
Speaker: Dr Serge Iovleff (CNRS / Laboratoire Paul Painlevé)
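The permutation idea behind the LBM can be illustrated with a toy example: once row and column cluster labels are known, reordering the rows and columns by label makes the block structure visible. The sketch below is only an illustration in plain Python; the labels are given by hand (hypothetical values), whereas in practice they would be estimated, for instance by the BEM or BCEM algorithms of blockcluster. The function name is also hypothetical.

```python
def reorder_by_clusters(matrix, row_labels, col_labels):
    """Permute rows and columns so that members of a cluster are adjacent."""
    # sorted() is stable, so the original order is kept within each cluster.
    row_order = sorted(range(len(row_labels)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(col_labels)), key=lambda j: col_labels[j])
    return [[matrix[i][j] for j in col_order] for i in row_order]

# Toy binary table whose block structure is hidden by the ordering.
data = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
row_labels = [0, 1, 0, 1]  # hypothetical row clusters
col_labels = [0, 1, 0, 1]  # hypothetical column clusters

blocks = reorder_by_clusters(data, row_labels, col_labels)
for row in blocks:
    print(row)
```

After reordering, the table splits into four homogeneous blocks, which is the structure an LBM describes with one parameter per (row cluster, column cluster) pair.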
• 15:15
Coffee Break 30m
• 15:45
R packages for dimension reduction in clustering 30m
Clustering methods are no exception when it comes to grouping high-dimensional data. Failures caused by high dimensionality have prompted the statistical community to develop procedures for selecting the variables that carry the discriminative information. Many of these techniques are available as R packages. This presentation attempts a review, through examples, of R packages devoted to model selection in clustering, primarily with mixture models and, time permitting, with methods based on variable transformation.
Speaker: Dr Mohamed Sedki (Université Paris-Sud)
• 16:15
Co-expression analyses of RNA-seq data in practice with the R/Bioconductor package coseq 45m
In this talk, I will present some of the features of the R/Bioconductor package coseq, which provides a straightforward wrapper to identify groups of co-expressed genes from RNA sequencing data using Poisson mixture models (Rau et al., 2015), Gaussian mixture models (Rau et al., 2017), or the K-means algorithm (Godichon-Baggioni et al., 2018) in conjunction with appropriately chosen data transformations. In particular, I will focus on our efforts to facilitate use of coseq within standard RNA-seq analysis pipelines. I will also highlight some successful recent biological applications of coseq at INRA in a variety of organisms, including the chicken (Endale Ahanda et al., 2014), tomato (Sauvage et al., 2017), and a parasite of the honeybee (Mondet et al., 2018). Finally, I will briefly discuss some of our recent efforts to integratively make use of multiple data views (i.e., biological levels of molecular information) to identify biologically relevant and interpretable clusters from multi-omics data.
Speaker: Dr Andréa Rau (INRA Jouy-en-Josas)