Séminaire des doctorants

Optimal transport for automatic alignment of non-targeted metabolomic data

by Marie Breeur (IARC-WHO)

Description

Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) allows the measurement of a wide range of metabolites in a biospecimen. The features whose intensities are measured in untargeted metabolomics studies are defined only by their mass-to-charge ratio (m/z) and retention time (RT), and are therefore not immediately identifiable. Furthermore, m/z and RT values measured under different conditions are subject to variation, so features common to two different studies cannot be directly identified. This limitation hampers the external validation of results and, more generally, the comparison of results across studies. It also prevents the pooling or meta-analysis of untargeted metabolomics data, thus limiting the statistical power of untargeted metabolomics studies.

Here we develop an unsupervised method to automatically match features from two untargeted LC-MS datasets by combining information on their m/z, RT and signal intensities. Our approach primarily pairs features with compatible signal intensities by using the Gromov-Wasserstein distance, an extension of optimal transport designed to couple two sets by taking advantage of the structure among the features within each dataset. An additional constraint restricts this coupling to pairs of features sharing similar m/z. Finally, the deviation of the RTs between the two studies is estimated so that only pairs with compatible RTs are retained in the final matching.
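
The three ingredients described above can be illustrated with a minimal Python sketch. It is not the method presented in the talk: the function names (gw_cost, mz_compatible, filter_by_rt), the tolerances and the toy data are all hypothetical, the within-dataset distance matrices D1 and D2 are assumed to be given (for instance derived from signal intensities), and the RT deviation is modelled here as a constant shift estimated by a median, which is an assumption rather than the estimator used by the authors. The sketch only evaluates the Gromov-Wasserstein objective for a given coupling; it does not solve the optimisation problem.

```python
from statistics import median


def gw_cost(D1, D2, T):
    """Gromov-Wasserstein objective (square loss) of a given coupling T.

    D1[i][k], D2[j][l]: feature-to-feature distances within each dataset.
    T[i][j]: mass sent from feature i (dataset 1) to feature j (dataset 2).
    """
    n, m = len(D1), len(D2)
    return sum(
        (D1[i][k] - D2[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )


def mz_compatible(mz1, mz2, tol=0.01):
    """Mask that is True where two features differ in m/z by at most tol.

    tol is a hypothetical tolerance chosen for this toy example.
    """
    return [[abs(a - b) <= tol for b in mz2] for a in mz1]


def filter_by_rt(pairs, rt1, rt2, max_dev=0.2):
    """Estimate a constant RT shift (median) and drop pairs far from it.

    Assumption: a constant shift; a real RT deviation may be non-linear.
    """
    drift = median(rt2[j] - rt1[i] for i, j in pairs)
    kept = [(i, j) for i, j in pairs
            if abs((rt2[j] - rt1[i]) - drift) <= max_dev]
    return drift, kept


# Part 1: toy structure. Dataset 2 holds the same features in another order.
D1 = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
D2 = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
w = 1 / 3
T_good = [[0, 0, w], [w, 0, 0], [0, w, 0]]  # structure-preserving coupling
T_bad = [[w, 0, 0], [0, w, 0], [0, 0, w]]   # naive index-to-index coupling
cost_good = gw_cost(D1, D2, T_good)
cost_bad = gw_cost(D1, D2, T_bad)

# Part 2: toy m/z and RT values. Feature 3 of dataset 2 shares its m/z with
# feature 0 of dataset 1 but elutes much later.
mz1, rt1 = [100.000, 200.005, 300.010], [1.0, 2.0, 3.0]
mz2, rt2 = [200.000, 300.012, 100.002, 100.004], [2.3, 3.3, 1.3, 5.0]

mask = mz_compatible(mz1, mz2)
# For brevity, candidate pairs are read off the m/z mask; in the approach
# described above they would come from the m/z-constrained coupling.
pairs = [(i, j) for i, row in enumerate(mask) for j, ok in enumerate(row) if ok]
drift, kept = filter_by_rt(pairs, rt1, rt2)
```

In this toy example the structure-preserving coupling has a lower Gromov-Wasserstein cost than the naive one, and the RT filter discards the candidate pair that is compatible in m/z but not in retention time.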

We performed an extensive simulation study to evaluate the empirical performance of our method and compare it with another recent approach that uses the same type of information (m/z, RT and signal intensities). Our method outperformed its competitor in terms of sensitivity and specificity under most of the scenarios we considered. When applied to real untargeted metabolomics data acquired in sub-studies nested within a large European cohort, for which a small subset of features had previously been matched manually by an expert biochemist, our approach again performed well. Unlike other existing methods, our approach requires only a few parameters to be set, and it is implemented in an open-source programming language to facilitate its use and possible future developments.

Such work could have multiple applications in metabolomics, from the comparison of acquisition protocols to the pooling or meta-analysis of data from different studies. This would allow a better use of the increasingly available non-targeted metabolomic data, for example in cancer epidemiology.