BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CERN//INDICO//EN
BEGIN:VEVENT
SUMMARY:Alice Simon and Annesh Pal from BPH - Leveraging Concepts from Ran
 dom Forests to Cluster High-Dimensional Data with an Application in Immuno
 logy and Adaptive DPMM via Collapsed Variational Inference for High-Dimens
 ional Transcriptomic Clustering\, respectively.
DTSTART:20260618T110000Z
DTEND:20260618T120000Z
DTSTAMP:20260524T181100Z
UID:indico-event-16559@indico.math.cnrs.fr
CONTACT:denis.rustand@u-bordeaux.fr
DESCRIPTION:Speakers: Alice Simon (BPH)\, Annesh Pal (BPH)\n\nSpeaker: Ali
 ce Simon and Annesh Pal from BPH\nTitle (Alice): Leveraging Concepts from 
 Random Forests to Cluster High-Dimensional Data with an Application in Imm
 unology\nAbstract: \n\nClustering refers to a set of analytical methods a
 imed at identifying natural group structures within a dataset. By grouping
  observations according to intrinsic similarities\, it serves as a key too
 l for exploring and segmenting large-scale databases.\nIn this context\, w
 e introduce RFCLUST\, a novel clustering method inspired by the principles
  of Random Forests\, traditionally used in supervised learning. RFCLUST co
 nstructs an ensemble of trees\, each generated through a monothetic divisi
 ve hierarchical clustering process. At each node\, data are split into two
  subgroups by optimizing an inertia-based criterion. The resulting set of 
 trees produces a matrix that encodes similarities between each pair of obs
 ervations. This matrix can subsequently be used by standard partitioning m
 ethods\, such as Ward’s agglomerative hierarchical clustering\, to obtai
 n the final clusters.\nRFCLUST retains the key mechanisms of Random Forest
 s\, such as bootstrap resampling of observations and random selection of v
 ariables at each tree node\, ensuring robustness and diversity in partitio
 ning. We investigate the influence of model parameters\, particularly mtry
  (the size of the random subset of variables considered at each node) as w
 ell as the number of leaves per tree\, and explore different similarity me
 asures between pairs of observations. The method’s performance is evalua
 ted through numerical simulations using conventional clustering metrics su
 ch as the Adjusted Rand Index\, and is applied to analyze real transcripto
 mic data investigating the human immune system.\nUltimately\, this app
 roach provides a new clustering framework capable of handling high-dimensi
 onal data\, while leveraging its forest-based construction to quantify bot
 h the uncertainty around the resulting partition and the relative importan
 ce of the input variables. By identifying the variables that drive cluster
  formation\, RFCLUST yields interpretable clustering results\, which are p
 articularly valuable in immunology and other medical applications.\n
 \nTitle (Annesh): Adaptive DPMM via Collapsed Variational Inference for Hi
 gh-Dimensional Transcriptomic Clustering\nAbstract: \n\nDirichlet process
  mixture models (DPMM) provide a flexible framework for clustering complex
  biological data without pre-specifying the number of clusters. Although M
 arkov Chain Monte Carlo (MCMC) methods have bridged the gap between theory
  and application for such models\, they scale poorly to high-dimensional d
 ata (e.g. omics data) and suffer from slow and difficult convergence. Vari
 ational Inference (VI) is a faster alternative\, but existing implementati
 ons rely on overly simplifying assumptions such as mean-field independenc
 e and known covariance. We propose a novel methodology that performs sc
 alable inference of DPMM for clustering using collapsed VI\, ensuring cohe
 rent and data-driven estimation of the number of clusters in a fully Bayes
 ian scheme. To accommodate high-dimensional data\, we introduce cluster-sp
 ecific adaptive covariance modeling with sparsity-inducing priors\, improv
 ing flexibility while mitigating variance–covariance coupling and overpa
 rameterization. The resulting algorithm scales efficiently with dimension 
 and sample size\, substantially reducing computation time compared to a st
 ate-of-the-art MCMC slice sampler. Realistic simulation studies under Gauss
 ian and negative-binomial settings demonstrate accurate recovery of cluste
 r structure and robust performance across varying signal-to-noise regimes\
 , supported by sensitivity analyses. Application to a publicly available l
 eukemia transcriptomic dataset comprising 72 samples and 2\,194 gene expre
 ssions successfully recovers established subtypes while revealing the biol
 ogical plasticity of one mixed-lineage leukemia outlier sample.\n\nCalenda
 r subscription link for the complete seminar series: https://ind
 ico.math.cnrs.fr/category/711/events.ics\nProgram of the Biostatistics sem
 inars: https://indico.math.cnrs.fr/category/711/\nSubscribe to the semin
 ar mailing list: https://diff.u-bordeaux.fr/sympa/subscribe/seminaire.bi
 ostat.
 bph\nFormer e-seminars on our YouTube channel (mostly in French): https://
 www.youtube.com/channel/UCURp-hEQL7k23UzGfqgEurA/videos\n\nBiostatistics s
 eminar series of the Department of Public Health of the University of Bord
 eaux and the Bordeaux Population Health UMR 1219 research center\n\nhttps:
 //indico.math.cnrs.fr/event/16559/
LOCATION:Module 1.1 (ISPED)
URL:https://indico.math.cnrs.fr/event/16559/
END:VEVENT
END:VCALENDAR
