9e Journée Statistique et Informatique pour la Science des Données à Paris-Saclay

Centre de Conférences Marilyn et James Simons (Le Bois-Marie)

35, route de Chartres CS 40001 91893 Bures-sur-Yvette Cedex

The aim of this workshop is to bring together mathematicians and computer scientists around some talks on recent results from statistics, machine learning, and more generally data science research. Various topics in machine learning, optimization, deep learning, optimal transport, inverse problems, statistics, and problems of scientific reproducibility will be presented. This workshop is particularly intended for doctoral and post-doctoral researchers.

Registration is free and open until March 27, 2024.

Invited speakers:
Stephan Clémençon (LTCI/Télécom Paris/Institut Polytechnique de Paris)
Luca Ganassali (LMO/Université Paris-Saclay)
Marine Le Morvan (SODA/INRIA Saclay)
Erwan Le Pennec (CMAP/École polytechnique)
Gilles Stoltz (CNRS/LMO/Université Paris-Saclay)
Maria Vakalopoulou (MICS/CentraleSupélec)

Evgenii Chzhen (Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay)
Florence Tupin (LTCI/Télécom Paris)

Cécile Gourgues
    • 9:30 AM 10:00 AM
      Welcome coffee 30m
    • 10:00 AM 10:50 AM
      Reinforcement Learning, an Introduction and Some Results 50m

      Reinforcement Learning is the "art" of learning how to act in an environment that is only observed through interactions.
      In this talk, I will provide an introduction to this topic starting from the underlying probabilistic model, the Markov Decision Process, and describe how to learn a good policy (how to pick the actions) both when this model is known and when it is unknown. I will stress the impact of the (required) parametrization of the solution, as well as the importance of understanding the inner engine (stochastic approximation).
      I will illustrate the variety of questions by briefly describing three of them:
      - How can Reinforcement Learning be applied to detect an issue faster during an ultrasound exam?
      - How can an MDP be solved faster using better approximations?
      - How can RL be made more robust while controlling its sample complexity?
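
As a minimal sketch of the known-model case mentioned in the abstract, value iteration computes an optimal value function and a greedy policy for a small Markov Decision Process. The two-state MDP, its transition table `P`, the rewards, and the discount factor `GAMMA` are all hypothetical and are not taken from the talk.

```python
# Value iteration on a tiny, hypothetical 2-state, 2-action MDP.
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed value)


def q_value(V, s, a, gamma=GAMMA):
    """Expected discounted return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10, gamma=GAMMA):
    """Apply the Bellman optimality operator until successive iterates differ by < tol."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a, gamma) for a in P[s]) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


def greedy_policy(V, gamma=GAMMA):
    """Pick, in each state, the action with the largest Q-value under V."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a, gamma)) for s in P}


V = value_iteration()
policy = greedy_policy(V)
print(V, policy)
```

When the model is unknown, the table `P` is not available and the same quantities have to be estimated from interactions, which is where stochastic approximation enters.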

      Speaker: Erwan Le Pennec (CMAP, École polytechnique, Institut Polytechnique de Paris)
    • 10:50 AM 11:20 AM
      Coffee break 30m
    • 11:20 AM 12:10 PM
      Learning with Missing Values: Theoretical Insights and Application to Health Databases 50m

      Missing values are ubiquitous in many fields such as health, business or social sciences. To date, much of the literature on missing values has focused on imputation as well as inference with incomplete data. In contrast, supervised learning in the presence of missing values has received little attention. In this talk I will explain the challenges posed by missing values in regression and classification tasks. In practice, a common solution consists in imputing the missing values prior to learning. I will show how different baseline methods for handling missing values compare on several large health databases with naturally occurring missing values. We will then examine the theoretical foundations of Impute-then-Regress approaches. Finally, I will present a neural network architecture for learning with missing values that goes beyond the two-stage Impute-then-Regress approaches.
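
The two-stage Impute-then-Regress idea can be sketched as follows: fill each missing entry with the mean of the observed values, then fit an ordinary least-squares line on the completed data. The toy data and the helper names `mean_impute` and `fit_line` are illustrative assumptions, and mean imputation is used here only for simplicity.

```python
def mean_impute(values):
    """Replace None entries by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: one feature with a missing entry, fully observed target.
x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

x_imputed = mean_impute(x)                 # stage 1: impute
intercept, slope = fit_line(x_imputed, y)  # stage 2: regress
print(x_imputed, intercept, slope)
```

In this toy case the imputed value happens to match the value that would make the relationship exactly linear; on real data the imputation error propagates into the regression.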

      Speaker: Marine Le Morvan (INRIA, Saclay)
    • 12:10 PM 1:00 PM
      Unsupervised Alignment of Graphs and Embeddings: Fundamental Limits and Computational Methods 50m

      Aligning two (weighted or unweighted) graphs, or matching two clouds of high-dimensional embeddings, are fundamental problems in machine learning with applications across diverse domains ranging from natural language processing to computational biology. In this presentation I will introduce the graph alignment problem, which can be viewed as an average-case and noisy version of the graph isomorphism problem. I will talk about the main challenges when the graphs are sparse, give some insights on the fundamental limits, and present efficient algorithms for this task. Then, switching focus to aligning clouds of embeddings, I will delve into the Procrustes-Wasserstein problem. We will emphasize differences from the previous graph-to-graph case. Statistical and computational results will be presented to shed light on these emerging questions.
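
As a rough illustration of the Procrustes part of the problem only, the sketch below recovers the rotation aligning two 2D point clouds when the point-to-point correspondence is already known. In the Procrustes-Wasserstein setting the correspondence is unknown and has to be estimated as well, which this example does not attempt. The function names and the data are hypothetical.

```python
import math


def rotate(point, theta):
    """Rotate a 2D point counter-clockwise by theta radians."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def best_rotation_angle(source, target):
    """Angle of the rotation R maximizing sum_i <target_i, R source_i>.

    For matched 2D points this is atan2(sum of cross products, sum of dot
    products), which also minimizes the sum of squared distances.
    """
    dot = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in zip(source, target))
    cross = sum(x1 * y2 - x2 * y1 for (x1, x2), (y1, y2) in zip(source, target))
    return math.atan2(cross, dot)


# Hypothetical cloud and a rotated copy of it (no noise, known matching).
source = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
true_theta = 0.7
target = [rotate(p, true_theta) for p in source]

estimated_theta = best_rotation_angle(source, target)
print(estimated_theta)
```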

      Speaker: Luca Ganassali (Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay)
    • 1:00 PM 2:30 PM
      Buffet lunch 1h 30m
    • 2:30 PM 3:20 PM
      Weak Signals: Machine-Learning Meets Extreme Value Theory 50m

      The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, the rank transformation offers a convenient and robust way of standardizing data in order to build an empirical version of the angular measure based on the most extreme observations. However, the study of the sampling distribution of the resulting empirical angular measure is challenging. It is the purpose of this talk to explain how to establish finite-sample bounds for the maximal deviations between the empirical and true angular measures, uniformly over classes of Borel sets of controlled combinatorial complexity. The bounds are valid with high probability and, up to logarithmic factors, scale as the square root of the effective sample size. The bounds are next applied to provide performance guarantees for two statistical learning procedures tailored to extreme regions of the input space and built upon the empirical angular measure: binary classification in extreme regions through empirical risk minimization and unsupervised anomaly detection through minimum-volume sets of the sphere.
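
The rank-based construction described above can be sketched as follows, under stated assumptions: each margin is standardized to an approximately unit-Pareto scale via `1 / (1 - rank / (n + 1))`, the `k` observations with the largest sup-norm are kept, and the empirical angular measure of a region is the fraction of their angles falling in it. The exact standardization, the choice of norm, and all function names are assumptions made for illustration and may differ from the estimator studied in the talk.

```python
def pareto_rank_transform(column):
    """Map each value to 1 / (1 - rank / (n + 1)); assumes no ties."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [1.0 / (1.0 - r / (n + 1)) for r in ranks]


def empirical_angles(data, k):
    """Angles (sup-norm normalized) of the k rank-transformed points of largest norm."""
    n, d = len(data), len(data[0])
    cols = [pareto_rank_transform([row[j] for row in data]) for j in range(d)]
    points = [[cols[j][i] for j in range(d)] for i in range(n)]
    largest = sorted(points, key=max, reverse=True)[:k]
    return [[c / max(v) for c in v] for v in largest]


def angular_mass(angles, region):
    """Empirical angular measure of a region, given as a predicate on angles."""
    return sum(1 for t in angles if region(t)) / len(angles)


def both_large(t):
    """Region of the sphere where all components are simultaneously large."""
    return min(t) > 0.5


# Hypothetical data: components extreme together vs. never extreme together.
comonotone = [(i, i * i) for i in range(1, 7)]
countermonotone = [(i, -i) for i in range(1, 7)]

mass_dependent = angular_mass(empirical_angles(comonotone, k=2), both_large)
mass_opposed = angular_mass(empirical_angles(countermonotone, k=2), both_large)
print(mass_dependent, mass_opposed)
```

Because only ranks are used, the result is unchanged by any increasing transformation of a margin, which is the robustness property the abstract refers to.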

      Speaker: Stephan Clémençon (LTCI, Télécom Paris, Institut Polytechnique de Paris)
    • 3:20 PM 4:10 PM
      Contextual Stochastic Bandits with Budget Constraints and Fairness Application 50m

      We review the setting and fundamental results of contextual stochastic bandits, where at each round some vector-valued context $x_t$ is observed and $K$ actions are available, each action $a$ providing a stochastic reward with expectation given by some (partially unknown) function of $x_t$ and $a$. The aim is to maximize the cumulative rewards obtained, or equivalently, to minimize the regret. This requires maintaining a good balance between the estimation (a.k.a. exploration) of the function and the exploitation of the estimates built. The literature also considers additional budget constraints (leading to so-called contextual bandits with knapsacks): actions now provide rewards but also costs. It has also shown that costs may model fairness constraints. We will review these two lines of work and briefly describe our own contribution in this respect, related to a more direct strategy, able to handle $\sqrt{T}$ cost constraints over $T$ rounds, which is exactly what is needed for fairness applications. The recent results discussed at the end of the talk will be based on the joint work by Evgenii Chzhen, Christophe Giraud, Zhen Li, and Gilles Stoltz, "Small total-cost constraints in contextual bandits with knapsacks, with application to fairness", NeurIPS, 2023.
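
To make the equivalence between maximizing cumulative reward and minimizing regret concrete, the sketch below computes the (expected) regret of a sequence of actions in the unconstrained setting, as the gap to the best action for each context. The reward function `expected_reward`, the two-action setup, and the contexts are hypothetical; budget and fairness constraints are not modeled.

```python
def expected_reward(x, a):
    """Hypothetical expected reward: action 0 pays x, action 1 pays 1 - x."""
    return x if a == 0 else 1.0 - x


def cumulative_regret(contexts, actions, n_actions=2):
    """Sum over rounds of (best expected reward for x_t) - (expected reward of a_t)."""
    regret = 0.0
    for x, a in zip(contexts, actions):
        best = max(expected_reward(x, b) for b in range(n_actions))
        regret += best - expected_reward(x, a)
    return regret


contexts = [0.9, 0.2, 0.6, 0.1]

# A policy ignoring the context vs. one that knows the reward function.
always_zero = [0 for _ in contexts]
oracle = [0 if x >= 0.5 else 1 for x in contexts]

regret_always_zero = cumulative_regret(contexts, always_zero)
regret_oracle = cumulative_regret(contexts, oracle)
print(regret_always_zero, regret_oracle)
```

A learning strategy does not know `expected_reward` and has to estimate it from observed rewards, which is the exploration and exploitation trade-off described above.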

      Speaker: Gilles Stoltz (CNRS, LMO, Univ. Paris-Saclay)
    • 4:10 PM 4:40 PM
      Coffee break 30m
    • 4:40 PM 5:30 PM
      Deep Learning in Medical Imaging: The Era of Foundation Models 50m

      Deep learning methods play a very important role in medical imaging and have gained a lot of attention in recent years. Currently, the community is working towards the development of large deep learning models that capture complex relations in the data and can address different tasks in a holistic way. In this talk, we will discuss recent foundation models in medical imaging, focusing on the opportunities and challenges of such algorithms as well as recent ways to tailor them to medical data.

      Speaker: Maria Vakalopoulou (CentraleSupélec, Université Paris-Saclay)