Séminaire de Statistique et Optimisation

Data-Driven Scientific Discovery Through Interpretable Machine Learning

by Tarek Zikry (University of North Carolina at Chapel Hill)

Europe/Paris
Salle E. Picard (1R2-129)

Description

Unsupervised learning is increasingly used to mine the heterogeneity present in biomedical datasets and to make discoveries in critical domains across health and science. However, standardized methodologies for ensuring that these results are reliable and interpretable are lacking. Here, I will present work across several domains, including neuroscience and cancer, in which we leverage data-driven methodological development in statistics and computational biology for robust scientific discovery. In cancer, we characterize single-cell proteomic resistance to CDK4/6 inhibition, a phenomenon we call fractional resistance. We hypothesize that fractional resistance arises from cell-to-cell differences in core cell cycle regulators that allow a subset of cells to escape CDK4/6 inhibitor therapy. We used multiplex, single-cell imaging to identify fractionally resistant cells in both cultured and primary breast tumor samples resected from patients. In additional work, we posit that spherical manifold approximations represent these single-cell populations, suggesting that significant changes in latent lower-dimensional manifold structures correspond to distinct cell cycle behaviors, and we establish an empirical hypothesis-testing framework to quantify differences in these spheres across conditions. In neuroscience, I will present a low-rank quilting approach for imputing missing brain region dynamics in live neural recordings. Finally, I will present a structured workflow for applying unsupervised learning broadly in data science, with guidance on data preprocessing, feature engineering, exploratory analysis, dimension reduction, validation, and iterative communication with domain experts to ensure meaningful insights. By integrating best practices in statistical analysis with real-world applications, we demonstrate how a generalizable workflow for unsupervised learning can facilitate robust data-driven discovery.