1–3 juil. 2024
Institut de Mathématiques de Toulouse
Fuseau horaire Europe/Paris
The slides and the book of abstracts are available.

Program, abstracts and slides

The abstract and the slides for each talk can be accessed by clicking on the speaker's name.

 

                  Monday, July 1st     Tuesday, July 2nd    Wednesday, July 3rd
08:30 --- 09:00   Registration
09:00 --- 09:15   Opening
09:15 --- 10:45   Tutorial (Sorin)     Tutorial (Branzei)   Grand-Clement + Tran-Thanh
10:45 --- 11:15   Coffee break         Coffee break         Coffee break
11:15 --- 12:00   Hart                 Kamgarpour           Mertikopoulos
12:00 --- 13:30   Lunch break          Lunch break          Lunch break
13:30 --- 15:00   Bichler + Mohri      Anantharam + Jin     Ashkenazi-Golan + Jindani
15:00 --- 15:30   Coffee break         Coffee break         Coffee break
15:30 --- 17:00   Flash talks          Cesari + Perchet     Scarsini + Tarbush
17:15 --- 18:15   Posters
18:15 ---         Cocktail

 

Simina Branzei

Slides
Multiplayer bandit learning
The classic theory of multi-armed bandits shows how a single decision maker can make decisions sequentially; the optimal strategies highlight the tradeoff between exploration and exploitation.

I will describe the basic model for stochastic multi-armed bandits with one decision maker and explain the optimal (Gittins index) strategy. Then I will discuss the extension of the model to multiple players, and describe a set of results from the paper "Multiplayer bandit learning, from competition to cooperation" (https://arxiv.org/abs/1908.01135, joint with Peres, COLT 2021).

In the multi-player multi-armed bandit model, there are two players and two arms; one arm has a known success probability (i.e., it is safe) and the other does not (i.e., it is risky). In each round, each player pulls an arm, receives the reward from that arm, and observes the action of the other player but not their reward.
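
For concreteness, here is a minimal simulation sketch (in Python) of this two-player safe/risky-arm model. The myopic posterior-mean rule used below is only an illustrative heuristic, not one of the optimal or equilibrium strategies analysed in the talk, and all parameter values are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    SAFE_P = 0.5    # known success probability of the safe arm
    RISKY_P = 0.7   # true success probability of the risky arm (unknown to the players)
    T = 200         # number of rounds

    def run_episode():
        # Each player keeps a Beta(1, 1) posterior over the risky arm,
        # updated only with the rewards they observe themselves.
        alpha, beta, total = np.ones(2), np.ones(2), np.zeros(2)
        for t in range(T):
            # Illustrative heuristic: pull the risky arm if its posterior mean
            # exceeds the known safe probability (myopic, not optimal).
            pulls_risky = alpha / (alpha + beta) > SAFE_P
            rewards = np.where(pulls_risky,
                               rng.random(2) < RISKY_P,
                               rng.random(2) < SAFE_P).astype(float)
            total += rewards
            # A player sees the other's action but not their reward,
            # so only their own posterior is updated.
            alpha += pulls_risky * rewards
            beta += pulls_risky * (1.0 - rewards)
        return total

    print("total rewards of the two players:", run_episode())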

The main results are that competing players explore less than a single player, but they are not completely myopic. Thus under optimal play, information (acquired by experimenting with the risky arm) is less valuable in the zero-sum game than in the single player setting. This leads to reduced exploration compared to the one player optimum; but information still has positive value. In contrast, cooperating players explore more than a single player, while neutral players learn from each other, getting strictly higher total rewards than they would playing alone. Finally, competing and neutral players eventually settle on the same arm in every Nash equilibrium, while this can fail for cooperating players.

Back to top

Sylvain Sorin

Slides
No-regret Algorithms and Games
No-regret algorithms appeared in game dynamics, prediction of sequences and convex optimization. We will describe the basic framework, the main tools and the fundamental results.
Part I will deal with the initial discrete (in space) and random procedures. Part II will be concerned with continuous deterministic algorithms.

Part I

  1. Introduction: on-line algorithm, external and internal regret
    Blackwell theorem
    Existence of no-regret algorithms
    Calibration
    Extensions: Bandit, signals; expected version; experts, wide range; from external to internal regret.
  2. Application to finite games
    Unilateral procedures
    External regret and Hannan set
    Internal regret and correlated equilibrium distributions (dual viewpoint)
    Other approaches: smooth fictitious play, stochastic approximation, continuous time and replicator dynamics

Part II

  1. Comparison:
    1. Continuous/discrete time
    2. On-line learning, vector field, convex optimization
    3. Gradient descent, mirror descent, dual averaging
  2. Extensions and recent advances: optimistic procedures, adaptive stepsize.
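
As a concrete companion to the outline above, here is a minimal sketch (in Python) of the exponential-weights algorithm and of the external regret it controls against an arbitrary bounded loss sequence; it is only meant to fix notation, not to reproduce any particular result from the tutorial.

    import numpy as np

    rng = np.random.default_rng(0)

    K, T = 5, 1000                    # number of actions, horizon
    eta = np.sqrt(np.log(K) / T)      # standard stepsize choice
    losses = rng.random((T, K))       # arbitrary loss sequence in [0, 1]

    weights = np.ones(K)
    cum_alg_loss = 0.0
    for t in range(T):
        p = weights / weights.sum()           # learner's mixed action
        cum_alg_loss += p @ losses[t]         # expected loss this round
        weights *= np.exp(-eta * losses[t])   # exponential-weights update

    best_fixed = losses.sum(axis=0).min()     # best single action in hindsight
    external_regret = cum_alg_loss - best_fixed
    print(f"average external regret: {external_regret / T:.4f}")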

Back to top

Venkat Anantharam

Slides
Game theory for cumulative-prospect-theoretic agents
Classical game theory is based on modeling the agents as maximizers of an expected utility. Empirical research on human behavior has demonstrated that humans are not well modeled as utility maximizers. An attractive theory that appears to be capable of capturing many of the peculiarities of human behavior, and which includes expected utility theory as a special case, is the cumulative prospect theory (CPT) of Kahneman and Tversky.
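
For orientation (this is a textbook formulation, not a result from the talk), the CPT value of a lottery with outcomes $x_1 \ge \dots \ge x_k \ge 0 \ge x_{k+1} \ge \dots \ge x_n$ (measured relative to a reference point) and probabilities $p_1, \dots, p_n$ is

$$V \;=\; \sum_{i=1}^{k}\Big[w^{+}\Big(\sum_{j\le i}p_j\Big)-w^{+}\Big(\sum_{j<i}p_j\Big)\Big]\,v(x_i)\;+\;\sum_{i=k+1}^{n}\Big[w^{-}\Big(\sum_{j\ge i}p_j\Big)-w^{-}\Big(\sum_{j>i}p_j\Big)\Big]\,v(x_i),$$

where $v$ is a reference-dependent value function and $w^{\pm}$ are probability weighting functions; taking $v$ linear and $w^{\pm}(p)=p$ recovers expected utility, which is the sense in which CPT includes it as a special case.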

This talk gives an exposition of the doctoral thesis work of Soham Phade, in which we aimed to develop versions of some of the core results of classical game theory in the setting where agents have CPT preferences. We will discuss the relationship between Nash and correlated equilibria in this context, develop an analog of the calibrated-learning justification of correlated equilibria for CPT agents, and build a theory of CPT mechanism design for implementing social choice functions. A potential application of some of these techniques in communication networking will also be discussed.

Back to top

Galit Ashkenazi-Golan

Slides
Policy Gradient Methods in Repeated Games
Policy gradient methods have emerged as powerful tools for optimising complex decision-making processes. These methods utilise gradient ascent to iteratively enhance policies based on observed rewards. While they frequently achieve local convergence, global convergence guarantees remain elusive outside of specific classes of games. We illustrate this phenomenon through examples, illuminating the challenges surrounding global optimisation.
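
As a toy illustration (not the algorithms analysed in the talk), the sketch below runs simultaneous projected-gradient-ascent updates on mixed strategies in a one-shot two-player matrix game; the game, stepsize and horizon are arbitrary choices of ours.

    import numpy as np

    def project_simplex(v):
        """Euclidean projection of v onto the probability simplex."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
        return np.maximum(v - css[rho] / (rho + 1), 0.0)

    # Player 1 gets x^T A y, player 2 gets x^T B y (matching pennies here).
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    B = -A
    x, y = np.array([0.9, 0.1]), np.array([0.2, 0.8])
    eta = 0.05

    for t in range(500):
        gx = A @ y        # gradient of player 1's payoff with respect to x
        gy = B.T @ x      # gradient of player 2's payoff with respect to y
        x = project_simplex(x + eta * gx)   # simultaneous gradient ascent
        y = project_simplex(y + eta * gy)

    # On this zero-sum example the iterates typically cycle around the mixed
    # equilibrium rather than converging, which illustrates why global
    # convergence guarantees are delicate outside special classes of games.
    print(x, y)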

Following that, we explore the utilisation of policy gradient methods in repeated games and define the set of policies that are learnable using these methods. Lastly, we provide a preview of our follow-up presentation, wherein we will present a Folk theorem result for learning in repeated games.

Back to top

Martin Bichler

Slides
Learning in Bayesian Games
Auctions are modeled as Bayesian games with continuous type and action spaces. Determining equilibria in auction games is computationally hard in general and no exact solution theory is known. We introduce an algorithmic framework in which we discretize type and action space and then learn distributional strategies via online optimization algorithms. We show that the equilibrium of the discretized game approximates an equilibrium of the continuous game. In a wide variety of auction games, we provide empirical evidence that the approach closely approximates the analytical (pure) Bayes Nash equilibrium. In standard models where agents are symmetric, we find an equilibrium in seconds. The method allows for interdependent valuations and different types of utility functions, and it provides a foundation for broadly applicable equilibrium solvers that can push the boundaries of equilibrium analysis in auction markets and beyond.
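
As a rough illustration of the discretize-then-learn idea, here is a minimal self-play sketch for a symmetric two-bidder first-price auction with uniform values; the grid sizes, stepsize and the plain multiplicative-weights update are our own choices, not the specific online optimization algorithms studied in the talk.

    import numpy as np

    K, M = 11, 11                        # grid sizes for values and bids (hypothetical)
    values = np.linspace(0.0, 1.0, K)    # discretized types, uniform prior
    bids = np.linspace(0.0, 1.0, M)      # discretized actions
    prior = np.full(K, 1.0 / K)

    # Distributional strategy: strat[i, j] = Pr(bid = bids[j] | value = values[i]).
    strat = np.full((K, M), 1.0 / M)
    avg = np.zeros_like(strat)
    eta, n_iter = 0.5, 2000

    for it in range(n_iter):
        bid_dist = prior @ strat                          # opponent's marginal bid distribution
        win_prob = np.cumsum(bid_dist) - 0.5 * bid_dist   # ties split evenly
        # Expected utility of each (value, bid) pair in a first-price auction.
        util = (values[:, None] - bids[None, :]) * win_prob[None, :]
        # Multiplicative-weights update of each type's bid distribution (self-play).
        strat *= np.exp(eta * util)
        strat /= strat.sum(axis=1, keepdims=True)
        avg += strat / n_iter

    # Sanity check: with two bidders and uniform values the analytical symmetric
    # equilibrium bids v / 2, which the learned (time-averaged) strategy can be
    # compared against.
    print(np.round(bids[avg.argmax(axis=1)], 2))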

Back to top

Tommaso Cesari

Slides
The Role of Transparency in Repeated First-Price Auctions with Unknown Valuations
I will discuss the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder discovers the item’s value only if the auction is won. In particular, I will show how minimax regret rates change with the amount of information on competing bids disclosed by the auctioneer at the end of each auction. Results will be presented under different assumptions (stochastic, adversarial, and their smoothed variants) on the environment generating the bidder’s valuations and competing bids. These minimax rates reveal how the interplay between transparency and the nature of the environment affects how fast one can learn to bid optimally in first-price auctions.
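
Concretely (the notation is ours), in round $t$ a bidder with valuation $v_t$ who bids $b_t$ against a highest competing bid $m_t$ receives

$$u_t \;=\; (v_t - b_t)\,\mathbb{1}\{b_t \ge m_t\},$$

observes $v_t$ only when the auction is won, and learns about $m_t$ only to the extent that the auctioneer discloses it; the transparency regimes differ precisely in how much of $m_t$ is revealed after each round.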

Back to top

Julien Grand-Clement

Slides
Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms
Self-play via online learning is one of the premier ways to solve large-scale zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages, including logarithmic dependence on the size of the payoff matrix and $\tilde{O}(1/T)$ convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of $O(1/\sqrt{T})$, while the best existing last-iterate convergence for OMWU depends on some game-dependent constant that could be arbitrarily large. This raises the question: is this potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small $\delta>0$, there exists a $2\times 2$ matrix game such that the algorithm admits a constant duality gap even after $1/\delta$ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms.
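
The following sketch contrasts the two updates on a $2\times 2$ zero-sum game and reports the duality gap of the last iterate; the game, stepsize and horizon are arbitrary choices of ours, meant only to make the definitions concrete.

    import numpy as np

    def project_simplex(v):
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
        return np.maximum(v - css[rho] / (rho + 1), 0.0)

    def duality_gap(A, x, y):
        # max over x' of x'^T A y  minus  min over y' of x^T A y'
        return (A @ y).max() - (x @ A).min()

    A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # x maximizes x^T A y, y minimizes
    eta, T = 0.1, 2000

    # Optimistic multiplicative weights update (OMWU).
    x, y = np.full(2, 0.5), np.array([0.9, 0.1])
    gx_prev, gy_prev = A @ y, A.T @ x
    for t in range(T):
        gx, gy = A @ y, A.T @ x
        x = x * np.exp(eta * (2 * gx - gx_prev)); x /= x.sum()
        y = y * np.exp(-eta * (2 * gy - gy_prev)); y /= y.sum()
        gx_prev, gy_prev = gx, gy
    print("OMWU last-iterate duality gap:", duality_gap(A, x, y))

    # Optimistic gradient descent-ascent (OGDA).
    x, y = np.full(2, 0.5), np.array([0.9, 0.1])
    gx_prev, gy_prev = A @ y, A.T @ x
    for t in range(T):
        gx, gy = A @ y, A.T @ x
        x = project_simplex(x + eta * (2 * gx - gx_prev))
        y = project_simplex(y - eta * (2 * gy - gy_prev))
        gx_prev, gy_prev = gx, gy
    print("OGDA last-iterate duality gap:", duality_gap(A, x, y))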

This is joint work with Y. Cai, G. Farina, C. Kroer, H. Luo, C.-W. Lee and W. Zheng.

Back to top

Sergiu Hart

Slides
"Calibeating": Beating Forecasters at Their Own Game
In order to identify expertise, forecasters should not be tested by their calibration score, which can always be made arbitrarily small, but rather by their Brier score. The Brier score is the sum of the calibration score and the refinement score; the latter measures how good the sorting into bins with the same forecast is, and thus attests to "expertise." This raises the question of whether one can gain calibration without losing expertise, which we refer to as "calibeating." We provide an easy way to calibeat any forecast, by a deterministic online procedure. We moreover show that calibeating can be achieved by a stochastic procedure that is itself calibrated, and then extend the results to simultaneously calibeating multiple procedures, and to deterministic procedures that are continuously calibrated.
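
In the binary-outcome case this decomposition reads as follows (a standard identity, stated only to fix terms): if the $T$ periods are sorted into bins $b$ according to the forecast value $p_b$, with $n_b$ periods per bin and empirical frequency $\bar a_b$ of the event in bin $b$, then

$$\underbrace{\frac{1}{T}\sum_{t=1}^{T}(p_t-a_t)^2}_{\text{Brier score}} \;=\; \underbrace{\sum_b \frac{n_b}{T}\,(p_b-\bar a_b)^2}_{\text{calibration score}} \;+\; \underbrace{\sum_b \frac{n_b}{T}\,\bar a_b\,(1-\bar a_b)}_{\text{refinement score}},$$

so calibeating a forecast amounts to lowering the first term without paying for it in the second.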

Back to top

Chi Jin

Slides
Beyond Equilibrium Learning
While classical game theory primarily focuses on finding equilibria, modern machine learning applications introduce a series of new challenges where standard equilibrium notions are no longer sufficient, and the development of new efficient algorithmic solutions is urgently needed. In this talk, we will demonstrate two such scenarios: (1) a natural goal in multiagent learning is to learn rationalizable behavior, which avoids iteratively dominated actions. Unfortunately, such rationalizability is not guaranteed by standard equilibria, especially when approximation errors are present. Our work presents the first line of efficient algorithms for learning rationalizable equilibria with sample complexities that are polynomial in all problem parameters, including the number of players; (2) In multiplayer symmetric constant-sum games like Mahjong or Poker, a natural baseline is to achieve an equal share of the total reward. We demonstrate that the self-play meta-algorithms used by existing state-of-the-art systems can fail to achieve this simple baseline in general symmetric games. We will then discuss the new principled solution concept required to achieve this goal.

Back to top

Sam Jindani

Slides
Learning efficient equilibria in repeated games
The folk theorem tells us that a wide range of payoffs can be sustained as equilibria in an infinitely repeated game. Existing results about learning in repeated games suggest that players may converge to an equilibrium, but do not address selection between equilibria. I propose a stochastic learning rule that selects a subgame-perfect equilibrium of the repeated game in which payoffs are efficient. The exact payoffs selected depend on how players experiment; two natural specifications yield the Kalai–Smorodinsky and maxmin bargaining solutions, respectively.

Back to top

Maryam Kamgarpour

Slides
Learning equilibria in games with bandit feedback
A significant challenge in managing large-scale engineering systems, such as energy and transportation networks, lies in enabling autonomous decision-making among interacting agents. Game theory offers a framework for modeling and analyzing these types of problems. In many practical applications, like power markets, each player only has partial information about the cost functions and actions of others. Therefore, a decentralized learning approach is essential to devise optimal strategies for each player.

My talk will focus on recent advances in decentralized learning algorithms in games under bandit feedback. The first part will discuss conditions for the convergence of decentralized learning to a Nash equilibrium in continuous action static games. The second part will explore Markov games, presenting our methods for decentralized learning of Nash equilibria in zero-sum Markov games and coarse-correlated equilibria in general-sum Markov games. I will demonstrate the practical applications of the developed algorithms using real-world problems.
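
For readers unfamiliar with the bandit-feedback setting, a standard device in this literature (stated generically here, not as the specific estimator used in the papers below) is the one-point gradient estimate: player $i$ perturbs their action to $a_i + \delta w_i$, with $w_i$ drawn uniformly from the unit sphere in $\mathbb{R}^{d_i}$, and forms

$$\hat g_i \;=\; \frac{d_i}{\delta}\, u_i\big(a_i + \delta w_i,\, a_{-i}\big)\, w_i,$$

which is an unbiased estimator of the gradient of a $\delta$-smoothed version of the payoff $u_i$, so gradient-based learning can proceed from payoff observations alone, at the price of higher variance as $\delta \to 0$.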

This presentation is primarily based on the following papers: [1], [2], [3], [4].

Back to top

Panayotis Mertikopoulos

Slides
Potential vs. harmonic games, convergence vs. recurrence, regret vs. optimism
The long-run behavior of multi-agent learning – and, in particular, no-regret learning – is relatively well-understood in potential games, where players have common interests. By contrast, in harmonic games – the strategic counterpart of potential games, where players have conflicting interests – very little is known outside the narrow subclass of 2-player zero-sum games with a fully-mixed equilibrium. Our paper seeks to partially fill this gap by focusing on the full class of (generalized) harmonic games and examining the convergence properties of "follow-the-regularized-leader" (FTRL), the most widely studied class of no-regret learning schemes. As a first result, we show that the continuous-time dynamics of FTRL are Poincaré recurrent, that is, they return arbitrarily close to their starting point infinitely often, and hence fail to converge. In discrete time, the standard, "vanilla" implementation of FTRL may lead to even worse outcomes, eventually trapping the players in a perpetual cycle of best-responses. However, if FTRL is augmented with a suitable extrapolation step – which includes as special cases the optimistic and mirror-prox variants of FTRL – we show that learning converges to a Nash equilibrium from any initial condition, and all players are guaranteed at most O(1) regret. These results provide an in-depth understanding of no-regret learning in harmonic games, nesting prior work on 2-player zero-sum games, and showing at a high level that harmonic games are the canonical complement of potential games, not only from a strategic, but also from a dynamic viewpoint.
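
Schematically (the notation below is ours, not necessarily the paper's), FTRL plays $x_t = Q(y_t)$, where $Q(y) = \arg\max_{x \in \mathcal{X}}\{\langle y, x\rangle - h(x)\}$ is the mirror map induced by a regularizer $h$ and $y_t$ aggregates past payoff gradients $v(x_s)$; the extrapolation step mentioned above can then be written as

$$x_{t+1/2} \;=\; Q\big(y_t + \gamma\, v(x_t)\big), \qquad y_{t+1} \;=\; y_t + \gamma\, v\big(x_{t+1/2}\big),$$

so the leading state $x_{t+1/2}$ is used only to probe the payoff field before the actual update, which is the mechanism behind the convergence result stated above.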

Back to top

Mehryar Mohri

Slides
Pseudonorm Approachability and Applications to Regret Minimization
Blackwell's approachability theory is a powerful tool for various learning problems. While traditionally studied under the Euclidean distance ($\ell_2$), we argue that the $\ell_\infty$-metric is more suitable for many applications like regret minimization. However, existing $\ell_\infty$-approachability algorithms suffer from high dimensionality. We present a framework to overcome this issue by converting high-dimensional $\ell_\infty$ problems into low-dimensional pseudonorm problems. This enables efficient algorithms with convergence independent of the original problem's dimension. We also provide a logarithmic convergence algorithm. Finally, we demonstrate the advantages of our framework in several regret minimization problems.
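
To fix terms (these are generic definitions, not results from the paper): given a vector-valued payoff $r(x_t, y_t) \in \mathbb{R}^d$ and a target convex set $C$, a strategy approaches $C$ in a metric $\|\cdot\|$ if

$$\operatorname{dist}_{\|\cdot\|}\Big(\tfrac{1}{T}\textstyle\sum_{t=1}^{T} r(x_t, y_t),\; C\Big) \;\longrightarrow\; 0 \qquad \text{as } T \to \infty,$$

uniformly over the opponent's play. Regret minimization corresponds to approaching the nonpositive orthant, and the $\ell_\infty$ distance to that orthant is exactly the largest positive regret component, which is why the $\ell_\infty$ metric is the natural one for such applications.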

Back to top

Vianney Perchet

Slides
Optimizing the coalition gain in Online Auctions with Greedy Structured Bandits
Motivated by online display advertising, this work considers repeated second-price auctions, where agents sample their value from an unknown distribution with cumulative distribution function F. In each auction t, a decision-maker bound by limited observations selects n_t agents from a coalition of N to compete for a prize with p other agents, aiming to maximize the cumulative reward of the coalition across all auctions. The problem is framed as an N-armed structured bandit, where each number n of players sent is an arm, with expected reward r(n) fully characterized by F and p+n. We present two algorithms, Local-Greedy (LG) and Greedy-Grid (GG), both achieving constant problem-dependent regret. This relies on three key ingredients:
  1. an estimator of r(n) from feedback collected from any arm k,
  2. concentration bounds of these estimates for k within an estimation neighborhood of n and
  3. the unimodality property of r under standard assumptions on F.
Additionally, GG exhibits problem-independent guarantees on top of the best problem-dependent guarantees. However, by avoiding reliance on confidence intervals, LG outperforms GG in practice, as well as standard unimodal bandit algorithms such as OSUB and generic multi-armed bandit algorithms.

Back to top

Marco Scarsini

Slides
Best-response dynamics on a large class of random games
Several papers study games with random payoffs with the goal of finding the distribution of the number of pure Nash equilibria and of determining the behavior of best-response dynamics. In most of the existing literature the payoffs are assumed to be independent and identically distributed (i.i.d.) with a continuous distribution function. In this paper we consider classes of random games where the payoffs are not i.i.d. but have a structure that is more suitable for game-theoretic applications. We will focus our analysis on the behavior of best-response dynamics when games have multiple equilibria.
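
As a toy illustration of the object under study (with i.i.d. uniform payoffs purely for simplicity; the talk is precisely about moving beyond this case), the sketch below draws a random two-player game and runs best-response dynamics until it either reaches a pure Nash equilibrium or cycles.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5                              # actions per player
    A = rng.random((n, n))             # payoffs of player 1 (i.i.d. uniform)
    B = rng.random((n, n))             # payoffs of player 2

    def best_response_dynamics(i, j, max_steps=10_000):
        """Alternating myopic best responses starting from profile (i, j)."""
        seen = set()
        for _ in range(max_steps):
            if A[:, j].argmax() == i and B[i, :].argmax() == j:
                return "pure Nash equilibrium", (i, j)
            if (i, j) in seen:
                return "cycle", (i, j)
            seen.add((i, j))
            i = int(A[:, j].argmax())   # player 1 best responds to j
            j = int(B[i, :].argmax())   # player 2 best responds to the new i
        return "no termination", (i, j)

    print(best_response_dynamics(0, 0))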

Back to top

Bassel Tarbush

Slides
Game connectivity and adaptive dynamics
We analyse the typical structure of games in terms of the connectivity properties of their best-response graphs. Our central result shows that almost every game that is 'generic' (without indifferences) and has a pure Nash equilibrium and a 'large' number of players is connected, meaning that every action profile that is not a pure Nash equilibrium can reach every pure Nash equilibrium via best-response paths. This has important implications for dynamics in games. In particular, we show that there are simple, uncoupled, adaptive dynamics for which period-by-period play converges almost surely to a pure Nash equilibrium in almost every large generic game that has one (which contrasts with the known fact that there is no such dynamic that leads almost surely to a pure Nash equilibrium in every generic game that has one). We build on recent results in probabilistic combinatorics for our characterisation of game connectivity.
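
To make the connectivity notion concrete, here is a minimal sketch of ours for a random two-player game (the paper's results concern generic games with many players): it builds the best-response graph and checks whether every profile that is not a pure Nash equilibrium can reach every pure Nash equilibrium.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)
    n = 4
    A, B = rng.random((n, n)), rng.random((n, n))   # generic payoffs (no ties a.s.)

    # Directed best-response graph: an edge from (i, j) to any profile obtained
    # by letting one player deviate to their unique best response.
    edges = {}
    for i, j in product(range(n), range(n)):
        bi, bj = int(A[:, j].argmax()), int(B[i, :].argmax())
        succ = set()
        if bi != i:
            succ.add((bi, j))
        if bj != j:
            succ.add((i, bj))
        edges[(i, j)] = succ

    pne = [s for s, succ in edges.items() if not succ]   # no profitable deviation

    def reachable(src):
        stack, seen = [src], {src}
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Connectivity in the sense above (assumes the realized game has a pure equilibrium).
    connected = all(all(e in reachable(s) for e in pne)
                    for s in edges if s not in pne)
    print("pure Nash equilibria:", pne, "connected:", connected)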

This is joint work with T. Johnston, M. Savery, and A. Scott.

Back to top

Long Tran-Thanh

Slides
Learning against No-Regret Learners: Multi-agent Learning with Strategic Agents
In this talk I will discuss our recent results on how to efficiently exploit the fact that we are playing against no-regret learning agents. In particular, I will show how to achieve sub-linear forward and dynamic regret, notions that are stronger than (external) regret, when playing against an agent who uses a no-regret learning algorithm to minimise their (external) regret. I will also show how these results can be extended to online Markov decision processes. Finally, I will discuss how to achieve last-round/last-iterate convergence against generic no-regret learners.

Back to top