8–11 Oct 2024
Aussois
Time zone: Europe/Paris

Dynamic metastability in the self-attention model

11 Oct 2024, 11:10
45m
Aussois

Centre de Vacances CAES du CNRS Paul Langevin, 24 Rue du Coin, 73500 Aussois

Speaker

Mr. Borjan Geshkovski

Description

The pure self-attention model is a simplification of the celebrated Transformer architecture that neglects the multi-layer perceptron layers and retains only a single inverse temperature parameter. Despite its apparent simplicity, the model exhibits qualitative behavior across layers that is remarkably similar to what is observed empirically in a pre-trained Transformer. Viewing layers as a time variable, the self-attention model can be interpreted as an interacting particle system on the unit sphere. We show that when the temperature is sufficiently high, all particles collapse into a single cluster exponentially fast. On the other hand, when the temperature falls below a certain threshold, we show that although the particles eventually collapse into a single cluster, the time required is at least exponentially long. This is a manifestation of dynamic metastability: particles remain trapped in a "slow manifold" consisting of several clusters for exponentially long periods of time. Our proofs make use of the fact that the self-attention model can be written as the gradient flow of a specific interaction energy functional previously found in combinatorics.
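
As a rough illustration of the particle-system picture described above, the following sketch simulates n points on the unit sphere. The abstract does not state the equations, so everything here is an assumption: the sketch uses an unnormalized variant of the dynamics, dx_i/dt = P_{x_i}((1/n) Σ_j exp(β⟨x_i, x_j⟩) x_j), where P_x is the projection onto the tangent space at x and β is the inverse temperature, together with the energy (1/(2βn²)) Σ_{i,j} exp(β⟨x_i, x_j⟩), which this flow should increase. The function names, the explicit Euler step with renormalization, and all parameter values are hypothetical choices made for illustration, not the speaker's method.

```python
import numpy as np


def project_to_tangent(x, v):
    """Project each row of v onto the tangent space of the sphere at the matching row of x."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x


def energy(x, beta):
    """Assumed interaction energy: (1 / (2 * beta * n^2)) * sum_{i,j} exp(beta * <x_i, x_j>)."""
    n = x.shape[0]
    return np.exp(beta * (x @ x.T)).sum() / (2.0 * beta * n**2)


def step(x, beta, dt):
    """One explicit Euler step of the assumed unnormalized dynamics, then renormalize."""
    n = x.shape[0]
    weights = np.exp(beta * (x @ x.T))   # (n, n) attention-like weights
    drift = weights @ x / n              # (n, d) weighted average of the particles
    x_new = x + dt * project_to_tangent(x, drift)
    # Renormalize so every particle stays on the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(n=16, d=3, beta=1.0, dt=0.05, steps=200, seed=0):
    """Run the sketch from a random initial configuration; return final points and energies."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    energies = [energy(x, beta)]
    for _ in range(steps):
        x = step(x, beta, dt)
        energies.append(energy(x, beta))
    return x, np.array(energies)


if __name__ == "__main__":
    for beta in (1.0, 9.0):
        x, energies = simulate(beta=beta)
        # Smallest pairwise inner product: close to 1 means a single cluster.
        print(beta, energies[0], energies[-1], (x @ x.T).min())
```

Comparing the smallest pairwise inner product for a small and a large β over the same time horizon is one way to explore the contrast between fast collapse and long-lived multi-cluster configurations. What such a run shows depends on the chosen step size and horizon.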

Presentation materials

No materials.