Jun 17 – 21, 2024
ENSEEIHT
Europe/Paris timezone

Invited speakers

Keynote speakers

Jim Dai (Cornell University) - Slides

Inpatient Overflow Management with Proximal Policy Optimization

Abstract: Overflowing patients to non-primary wards can effectively alleviate congestion in hospitals, but undesired overflow also leads to issues such as mismatched service quality, so congestion must be traded off against undesired overflow. This overflow management problem is modeled as a discrete-time Markov Decision Process with large state and action spaces. To overcome the curse of dimensionality, we decompose the action at each time into a sequence of atomic actions and use an actor-critic algorithm, Proximal Policy Optimization (PPO), to update the policy. Moreover, we tailor the design of the neural network that represents the policy to account for the daily periodic pattern of the system flows. Under hospital settings of different scales, the PPO policies consistently and significantly outperform commonly used state-of-the-art policies.
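
As a rough illustration only (not code from the paper), the minimal Python sketch below shows the two ingredients named in the abstract: a sinusoidal time-of-day encoding so a policy network can capture the daily periodic pattern of the system flows, and the decomposition of one time step's overflow decision into a sequence of atomic patient-to-ward assignments. All names (atomic_policy, apply_assignment, ...) are hypothetical placeholders.

    import numpy as np

    def time_of_day_features(t, period=24):
        # Encode the hour of day on the unit circle so a policy network can
        # represent the daily periodic pattern of arrivals and discharges.
        angle = 2.0 * np.pi * (t % period) / period
        return np.array([np.sin(angle), np.cos(angle)])

    def sequential_overflow_decision(waiting_patients, atomic_policy, state, apply_assignment):
        # Decompose one time step's joint overflow decision into atomic actions:
        # for each waiting patient, the policy picks a ward (primary, overflow,
        # or "keep waiting"), and the state is updated before the next pick.
        assignments = []
        for patient in waiting_patients:
            ward = atomic_policy(state, patient)            # one atomic action
            assignments.append((patient, ward))
            state = apply_assignment(state, patient, ward)  # caller-supplied transition
        return assignments, state

One plausible wiring (not necessarily the one used in the paper) is to let the PPO actor play the role of atomic_policy, fed with the current occupancy state concatenated with the time-of-day features.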

Vincent François-Lavet (VU Amsterdam) - Slides

Generalization in deep reinforcement learning and the role of representation learning for sequential decision-making tasks

Abstract: The concept of generalization in the context of deep reinforcement learning will be explained along with the different elements that can be used to improve generalization. The key role of representation learning for improved generalization and interpretability will then be shown. Finally, the role of representation learning for exploration and transfer learning will also be discussed.

Bruno Gaujal (INRIA) - Slides

The sliding regret

Abstract: Optimistic reinforcement learning algorithms in Markov decision processes essentially rely on two ingredients to guarantee regret efficiency. The first is the choice of well-tuned confidence bounds and the second is the design of a pertinent rule to end episodes. While many efforts have been dedicated to improving the tightness of confidence bounds, the management of episodes has remained essentially unaltered since the introduction of the doubling trick (DT) in UCRL2 (Auer et al., 2009). In this talk, I will present two solutions to move beyond DT. The first is the performance test (PT), which ends an episode as soon as the performance of the current policy becomes obviously sub-optimal. The second is the vanishing multiplicative (VM) rule, which is as simple as DT to implement and replaces the doubling criterion by a weaker one. Both solutions keep the regret of the algorithm unaltered and induce a drastic reduction of the local regret taken from the start of exploration episodes (i.e., the start of episodes where a sub-optimal policy is used). More specifically, classical algorithms such as UCRL2, KL-UCRL or UCRL2B, patched with our new rules, get an immediate benefit: their regret upper bound remains the same (up to a small negligible additive term) while their asymptotic local regret at exploration times decreases from Omega(T) to O(log T). I will also comment on numerical experiments confirming our asymptotic findings. The regret of algorithms under VM or PT becomes slightly better and significantly smoother, while the local regret at exploration times becomes sub-linear, even over finite times.
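
To make the role of the episode-ending rule concrete, here is a minimal Python sketch. The first test is the standard UCRL2 doubling criterion; the second only illustrates where a vanishing multiplicative rule would weaken it, and the factor alpha_k is a placeholder for exposition, not the precise criterion defined in the talk.

    def doubling_trick_stop(in_episode_counts, start_counts):
        # UCRL2's doubling trick (DT): end the current episode as soon as some
        # state-action pair has been visited within the episode as many times
        # as it had been visited before the episode started.
        return any(nu >= max(1, N) for nu, N in zip(in_episode_counts, start_counts))

    def vanishing_multiplicative_stop(in_episode_counts, start_counts, episode_index):
        # Illustrative stand-in for the vanishing multiplicative (VM) rule: same
        # structure as DT, but the doubling threshold is relaxed by a factor that
        # vanishes with the episode index (alpha_k below is only a placeholder).
        alpha_k = 1.0 / (episode_index + 1)
        return any(nu >= max(1, alpha_k * N) for nu, N in zip(in_episode_counts, start_counts))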

Christina Lee Yu (Cornell University) - Slides

Exploiting Structure In Reinforcement Learning

Abstract: While reinforcement learning has achieved impressive success in applications such as game-playing and robotics, there is work yet to be done to make RL truly practical for optimizing policies for real-world systems. In particular, many systems exhibit clear structure that RL algorithms currently don't know how to exploit efficiently. As a result, domain-specific heuristics often dominate RL both in system performance and resource consumption. In this talk we will discuss types of structures that may arise in real-world systems, and approaches to incorporate such structure in the design of RL algorithms. Examples of structured MDPs include models that exhibit linearity with respect to a low dimensional representation, models that exhibit smoothness in the parameters or the trajectories, and models whose dynamics can be decomposed into exogenous versus endogenous components.
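
As one concrete (and deliberately toy) example of the last kind of structure, the Python sketch below decomposes a scalar system into an exogenous process that evolves independently of the action and an endogenous, controlled part driven by it. The linear dynamics and quadratic cost are illustrative assumptions, not taken from the talk.

    import numpy as np

    rng = np.random.default_rng(0)

    def exogenous_step(exo_state):
        # Exogenous component: evolves on its own, unaffected by the action
        # (e.g. arrivals, demand, or weather in a real-world system).
        return exo_state + rng.normal(scale=0.1)

    def endogenous_step(endo_state, action, exo_state):
        # Endogenous component: the part of the dynamics the agent controls,
        # possibly driven by the exogenous input (linear purely for illustration).
        return 0.9 * endo_state + 0.5 * action + 0.2 * exo_state

    def step(state, action):
        exo, endo = state
        exo_next = exogenous_step(exo)
        endo_next = endogenous_step(endo, action, exo_next)
        reward = -endo_next ** 2   # illustrative cost on the controlled part
        return (exo_next, endo_next), reward

An algorithm aware of this split can, for example, estimate the exogenous part from logged data regardless of the policy and concentrate exploration on the endogenous part.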

Sean Meyn (University of Florida) - Slides

The Projected Bellman Equation in Reinforcement Learning

Abstract: A topic of discussion throughout the 2020 Simons program on reinforcement learning: is the Q-learning algorithm convergent outside of the tabular setting? It is now known that stability can be assured using a matrix gain algorithm, but this requires assumptions, which raises the next question: does a solution to the projected Bellman equation exist? This is the most basic requirement for convergence of any algorithm.

The question was resolved in very recent work. A solution does exist, subject to two assumptions: the function class is linear, and (far more crucially) the input used for training is a form of epsilon-greedy policy with sufficiently small epsilon. Moreover, under these conditions it is shown that the Q-learning algorithm is stable, in the sense of bounded parameter estimates. Convergence remains one of many open topics for research.

In short, sufficient optimism is not only valuable for algorithmic efficiency, but is a means to algorithmic stability.
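
For readers who want the object being discussed in front of them: with a linear function class Q_theta(x, u) = theta^T phi(x, u), the projected Bellman equation asks for theta such that E[ phi(X, U) ( r(X, U) + gamma max_u Q_theta(X', u) - Q_theta(X, U) ) ] = 0, with the expectation taken in steady state under the training input. The generic Python sketch below runs Q-learning with linear function approximation and epsilon-greedy exploration, whose stochastic-approximation target is a root of that equation; the environment interface (phi, env_step, reset) is an assumption of the sketch, not something specified in the talk.

    import numpy as np

    def q_learning_linear(phi, n_actions, env_step, reset, gamma=0.95,
                          epsilon=0.2, n_steps=10_000, step_size=0.01, seed=0):
        # Q-learning with Q_theta(x, u) = theta . phi(x, u), trained with an
        # epsilon-greedy exploration policy.  phi(x, u) returns a feature vector,
        # env_step(x, u) returns (x_next, reward, done), reset() returns a state.
        rng = np.random.default_rng(seed)
        theta = np.zeros(phi(reset(), 0).shape[0])
        x = reset()
        for _ in range(n_steps):
            if rng.random() < epsilon:
                u = int(rng.integers(n_actions))                              # explore
            else:
                u = int(np.argmax([theta @ phi(x, a) for a in range(n_actions)]))
            x_next, r, done = env_step(x, u)
            q_next = max(theta @ phi(x_next, a) for a in range(n_actions))
            td_error = r + gamma * q_next - theta @ phi(x, u)
            theta = theta + step_size * td_error * phi(x, u)                  # SA update
            x = reset() if done else x_next
        return theta

Whether the parameter iterates of such a recursion stay bounded (the stability property mentioned above) is exactly what the recent result establishes under the linear-class and small-epsilon assumptions.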

Éric Moulines (École Polytechnique) - Slides

Finite sample analysis of linear stochastic approximation and TD learning

Abstract: In this talk, we consider the problem of obtaining sharp bounds for linear stochastic approximation. We then apply these results to temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov decision processes. We show that a simple algorithm with a universal, instance-independent step size together with Polyak-Ruppert tail averaging is sufficient to obtain near-optimal variance and bias terms. We also provide the corresponding sample complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation together with a novel stability result for the product of random matrices that arise from the TD-type recurrence. We will also discuss how these results extend to the distributed/federated setting.
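
A minimal Python sketch of the estimator described above (constant, instance-independent step size plus Polyak-Ruppert tail averaging), under assumed interfaces: trajectory is a list of (x, r, x_next) transitions collected from the evaluated policy and phi(x) returns a feature vector. The specific constants are placeholders, not the ones analyzed in the talk.

    import numpy as np

    def td0_tail_averaged(trajectory, phi, gamma=0.99, step_size=0.5, burn_in_frac=0.5):
        # TD(0) with linear function approximation V_theta(x) = theta . phi(x),
        # a constant ("universal") step size, and Polyak-Ruppert tail averaging:
        # only the iterates produced after a burn-in phase are averaged.
        theta = np.zeros(phi(trajectory[0][0]).shape[0])
        iterates = []
        for x, r, x_next in trajectory:
            td_error = r + gamma * (theta @ phi(x_next)) - theta @ phi(x)
            theta = theta + step_size * td_error * phi(x)
            iterates.append(theta.copy())
        burn_in = int(burn_in_frac * len(iterates))
        return np.mean(iterates[burn_in:], axis=0)   # tail-averaged estimate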

Emmanuel Rachelson (ISAE-SUPAERO) - Slides

Lipschitz Lifelong Reinforcement Learning: transferring value functions across MDPs

Abstract: How close are the optimal value functions of two Markov decision processes that share the same state and action spaces but have different dynamics and rewards? In this talk, we will consider the problem of knowledge transfer when an agent faces a series of reinforcement learning (RL) tasks. We will introduce a novel metric between Markov decision processes (MDPs) and establish that close MDPs have close optimal value functions. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with an improved convergence rate. Beyond value transfer, the talk will conclude with challenges and opportunities arising from such an analysis.
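
The transfer mechanism can be pictured with a short, purely illustrative Python sketch: if the new task is at (pseudo-)distance d from a solved source task in a metric under which optimal Q-values are L-Lipschitz, then the source values plus an L*d slack form an admissible optimistic initialization, never looser than the trivial bound Vmax used by PAC-MDP algorithms. The dictionary interface and names below are assumptions, not the paper's API.

    def transferred_upper_bound(q_source, mdp_distance, lipschitz_const, v_max):
        # Lipschitz value transfer (illustrative form): optimal Q-values of two
        # close MDPs differ by at most lipschitz_const * mdp_distance, so the
        # source task's values plus that slack upper-bound the new task's values.
        slack = lipschitz_const * mdp_distance
        return {sa: min(q + slack, v_max) for sa, q in q_source.items()}

Intuitively, a tighter admissible upper bound can reduce the amount of optimistic exploration needed on the new task.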

R. Srikant (UIUC) - Slides

Learning and Control in Countable State Spaces

Abstract: We will consider policy optimization methods in reinforcement learning where the state space is countably infinite. The motivation arises from control problems in communication networks and matching markets. We consider an algorithm called Natural Policy Gradient (NPG), which is a popular algorithm for finite state spaces, and show three results in the context of countable state spaces: (i) in the case where perfect policy evaluation is possible, we show that standard NPG converges with a small modification; (ii) if the error in policy evaluation is within a factor of the true value function, we show that one can obtain bounds on the performance of the NPG algorithm; and (iii) we will discuss the ability of neural network-based function approximations to satisfy the condition in (ii) above.
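
For concreteness, one NPG step for a softmax-parameterized policy has the well-known closed form pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta * Q^{pi_t}(s, a)). The Python sketch below implements that update on a finite truncation of the state space (arrays over a countably infinite space are of course not possible); the Q-values fed in may come from exact or approximate policy evaluation, as in points (i)-(iii) above.

    import numpy as np

    def npg_softmax_update(policy, q_values, eta):
        # One Natural Policy Gradient step for a softmax policy class:
        #   pi_{t+1}(a | s)  proportional to  pi_t(a | s) * exp(eta * Q(s, a)).
        # `policy` and `q_values` are arrays of shape (n_states, n_actions).
        unnormalized = policy * np.exp(eta * q_values)
        return unnormalized / unnormalized.sum(axis=1, keepdims=True)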

Adam Wierman (Caltech) - Slides

Learning Augmented Algorithms for MDPs

Abstract: Making use of modern black-box AI tools such as deep reinforcement learning is potentially transformational for sustainable systems such as data centers, electric vehicles, and the electricity grid. However, such machine-learned algorithms typically do not have formal guarantees on their worst-case performance, stability, or safety. So, while their performance may improve upon traditional approaches in “typical” cases, they may perform arbitrarily worse in scenarios where the training examples are not representative due to, e.g., distribution shift. Thus, a challenging open question emerges: Is it possible to provide guarantees that allow black-box AI tools to be used in safety-critical applications? In this talk, I will provide an overview of an emerging area studying learning-augmented algorithms that seeks to answer this question in the affirmative. I will survey recent results in the area, focusing on online optimization and MDPs, and then describe applications of these results to the design of sustainable data centers and electric vehicle charging.
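
One common template in this literature (illustrative only, not the specific scheme from the talk) combines the untrusted black-box recommendation with a worst-case-safe baseline through a single trust parameter, trading "consistency" (performance when the ML advice is good) against "robustness" (guarantees when it is not). A minimal Python sketch, with all names hypothetical:

    def learning_augmented_action(ml_action, robust_action, trust):
        # Play a convex combination of the action suggested by a learned policy
        # and the action of a worst-case safe baseline controller.
        # trust = 1 recovers the ML policy (consistency);
        # trust = 0 recovers the baseline and its guarantees (robustness).
        return trust * ml_action + (1.0 - trust) * robust_action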