Biostatistics Seminar

Alice Simon (BPH): Leveraging Concepts from Random Forests to Cluster High-Dimensional Data with an Application in Immunology; Annesh Pal (BPH): Adaptive DPMM via Collapsed Variational Inference for High-Dimensional Transcriptomic Clustering

by Alice Simon (BPH), Annesh Pal (BPH)

Europe/Paris
Module 1.1 (ISPED)

Description

Speakers: Alice Simon and Annesh Pal (BPH)


Title (Alice): Leveraging Concepts from Random Forests to Cluster High-Dimensional Data with an Application in Immunology

Abstract: 

Clustering refers to a set of analytical methods aimed at identifying natural group structures within a dataset. By grouping observations according to intrinsic similarities, it serves as a key tool for exploring and segmenting large-scale databases.
In this context, we introduce RFCLUST, a novel clustering method inspired by the principles of Random Forests, traditionally used in supervised learning. RFCLUST constructs an ensemble of trees, each generated through a monothetic divisive hierarchical clustering process. At each node, data are split into two subgroups by optimizing an inertia-based criterion. The resulting set of trees produces a matrix that encodes similarities between each pair of observations. This matrix can subsequently be used by standard partitioning methods, such as Ward’s agglomerative hierarchical clustering, to obtain the final clusters.
RFCLUST retains the key mechanisms of Random Forests, such as bootstrap resampling of observations and random selection of variables at each tree node, ensuring robustness and diversity in partitioning. We investigate the influence of model parameters, particularly mtry (the size of the random subset of variables considered at each node) and the number of leaves per tree, and explore different similarity measures between pairs of observations. The method's performance is evaluated through numerical simulations using conventional clustering metrics such as the Adjusted Rand Index, and the method is then applied to real transcriptomic data from a study of the human immune system.
Ultimately, this approach provides a new clustering framework capable of handling high-dimensional data, while leveraging its forest-based construction to quantify both the uncertainty around the resulting partition and the relative importance of the input variables. By identifying the variables that drive cluster formation, RFCLUST yields interpretable clustering results, which are particularly valuable in immunology and other medical applications.
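As a rough illustration of two ingredients mentioned in this abstract, the following Python sketch builds a pairwise similarity matrix from an ensemble of tree partitions and computes the Adjusted Rand Index between two partitions. It is not the RFCLUST implementation: the function names, the input layout (one list of leaf labels per tree), and the choice of similarity (the fraction of trees in which two observations share a leaf) are assumptions made for this example, and the abstract notes that other similarity measures are explored.

```python
from collections import Counter
from math import comb


def co_clustering_similarity(leaf_labels):
    """Fraction of trees in which each pair of observations shares a leaf.

    leaf_labels[t][i] is the leaf of observation i in tree t
    (hypothetical input layout, assumed for this sketch).
    """
    n_trees = len(leaf_labels)
    n_obs = len(leaf_labels[0])
    counts = [[0] * n_obs for _ in range(n_obs)]
    for labels in leaf_labels:
        for i in range(n_obs):
            for j in range(n_obs):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Normalize co-occurrence counts by the number of trees.
    return [[c / n_trees for c in row] for row in counts]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same observations."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: treated as perfect agreement
    # Pairs of observations grouped together in both, in A only, in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Two toy trees, three observations (made-up labels for illustration).
    sim = co_clustering_similarity([[0, 0, 1], [0, 1, 1]])
    print(sim)
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

One minus this similarity can serve as a dissimilarity for a standard partitioning method such as Ward's agglomerative hierarchical clustering, which is omitted here.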
 

Title (Annesh): Adaptive DPMM via Collapsed Variational Inference for High-Dimensional Transcriptomic Clustering

Abstract: 

Dirichlet process mixture models (DPMMs) provide a flexible framework for clustering complex biological data without pre-specifying the number of clusters. Although Markov chain Monte Carlo (MCMC) methods have bridged the gap between theory and application for such models, they scale poorly to high-dimensional data (e.g. omics data) and suffer from slow and difficult convergence. Variational Inference (VI) is a faster alternative, but existing implementations rely on overly simplifying assumptions such as mean-field independence and known covariance. We propose a novel methodology that performs scalable inference of DPMMs for clustering using collapsed VI, ensuring coherent and data-driven estimation of the number of clusters in a fully Bayesian scheme. To accommodate high-dimensional data, we introduce cluster-specific adaptive covariance modeling with sparsity-inducing priors, improving flexibility while mitigating variance–covariance coupling and overparameterization. The resulting algorithm scales efficiently with dimension and sample size, substantially reducing computation time compared with a state-of-the-art MCMC slice sampler. Realistic simulation studies under Gaussian and negative-binomial settings demonstrate accurate recovery of cluster structure and robust performance across varying signal-to-noise regimes, supported by sensitivity analyses. Application to a publicly available leukemia transcriptomic dataset comprising 72 samples and expression measurements for 2,194 genes successfully recovers established subtypes while revealing the biological plasticity of one mixed-lineage leukemia outlier sample.
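To illustrate why a Dirichlet process mixture does not require the number of clusters to be fixed in advance, the following Python sketch samples a partition from the Chinese restaurant process, the partition prior induced by a Dirichlet process. It shows the prior only and is not the collapsed variational inference scheme presented in the talk; the function name, the concentration parameter value, and the seed are assumptions chosen for this example.

```python
import random


def sample_crp_partition(n_obs, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process.

    alpha is the concentration parameter (must be positive); larger values
    tend to produce more clusters. Hypothetical helper for illustration.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(n_obs):
        # An existing cluster is chosen with weight equal to its size,
        # a new cluster with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    # 72 observations, an arbitrary illustrative sample size.
    labels, sizes = sample_crp_partition(n_obs=72, alpha=1.0, seed=1)
    print(len(sizes), "clusters with sizes", sizes)
```

The number of clusters is an outcome of the sampling rather than an input, which is the property that a DPMM posterior inherits.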
 

 

Calendar subscription link for the complete seminar series:
https://indico.math.cnrs.fr/category/711/events.ics

Program of the Biostatistics seminars:
https://indico.math.cnrs.fr/category/711/

Subscribe to the seminar mailing list:
https://diff.u-bordeaux.fr/sympa/subscribe/seminaire.biostat.bph

Past e-seminars on our YouTube channel (mostly in French): https://www.youtube.com/channel/UCURp-hEQL7k23UzGfqgEurA/videos

 

Biostatistics seminar series of the Department of Public Health at the University of Bordeaux and the Bordeaux Population Health research center (UMR 1219)

 

Organized by

Denis Rustand