Statistical analysis of the optimal transport problem

by Alberto González (TOULOUSE III)

Amphithéâtre Laurent Schwartz, bâtiment 1R3 (Institut de Mathématiques de Toulouse)

118 route de Narbonne 31062 Toulouse Cedex 9

Optimal transportation is a resource allocation problem that arises in fields such as economics, finance, physics and artificial intelligence. From a probabilistic point of view, the optimal transport cost endows the space of probability measures with a metric topology. In particular, this topology is equivalent to the weak topology of probability measures together with the convergence of moments. This makes the transport cost an appropriate tool for measuring discrepancies between distributions. The solution of the transport problem, in turn, is known as the optimal plan: an unambiguous way to relate two distributions according to an optimality criterion. When this optimal plan is deterministic, it is called a transport map.
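
For reference, the transport cost and plan described above can be written in the standard Kantorovich form. The notation is an assumption made for this note and is not taken from the thesis: $c$ denotes a cost function, $\Pi(P,Q)$ the set of couplings of $P$ and $Q$, and $\mathcal{T}_c$ the transport cost.

```latex
\[
  \mathcal{T}_c(P, Q) \;=\; \inf_{\pi \in \Pi(P, Q)} \int c(x, y)\, \mathrm{d}\pi(x, y),
  \qquad
  \Pi(P, Q) \;=\; \{\pi : \pi \text{ has marginals } P \text{ and } Q\}.
\]
% A minimiser \pi is an optimal plan. It is deterministic, i.e. induced by a
% transport map T, when \pi = (\mathrm{Id} \times T)_{\#} P.
```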

However, in many cases the probability distribution is a theoretical, unattainable entity. It is visible to the practitioner only through its empirical version, i.e. a finite data set of size $n$. This work examines the asymptotic behaviour of the transport problem in its empirical version. In other words, we study the limits of the empirical cost and plans as the sample size $n$ grows to infinity. It is well known that the empirical transport cost converges to the population one. Moreover, for continuous measures the rate of convergence becomes slower as the dimension increases. In this thesis we prove the consistency of the transport map using the topology of set-valued maps. This leads, indirectly, to the result that the fluctuations (the difference between the empirical cost and its expectation) approach zero at the parametric rate $n^{-1/2}$, irrespective of the dimension. Moreover, these fluctuations multiplied by $n^{1/2}$ converge in distribution to a Gaussian random variable. In economics the transportation problem often appears in its semi-discrete version, i.e. one of the probability distributions is discrete. In this case, we show that the rate at which the empirical transport cost converges to the population one does not depend on the dimension.
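
In symbols, the fluctuation result stated above can be sketched as follows. The notation is an assumption made for this note: $P_n$ is the empirical measure of a sample of size $n$ from $P$, $\mathcal{T}_c(P_n, Q)$ is the empirical transport cost for a cost function $c$, and the display is written for the one-sample case. The limiting variance $\sigma^2$ is not given in this abstract and is left unspecified.

```latex
\[
  \sqrt{n}\,\bigl(\mathcal{T}_c(P_n, Q) - \mathbb{E}\,\mathcal{T}_c(P_n, Q)\bigr)
  \;\xrightarrow{\;d\;}\; N(0, \sigma^2)
  \qquad \text{as } n \to \infty .
\]
% The centring is the expected empirical cost, not the population cost
% \mathcal{T}_c(P, Q); the latter is approached at a dimension-dependent rate.
```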

We also show that the well-known entropic regularization (or Sinkhorn regularization), apart from simplifying the computation of the transport problem by giving it a differentiable structure, has highly satisfactory statistical properties. In particular, its bias and the divergence that the regularization defines converge faster than the parametric rate; the empirical regularized plans converge to the population ones at rate $n^{-1/2}$ and, after rescaling, tend to a Gaussian process.
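
To illustrate how entropic regularization simplifies computation, here is a minimal sketch of the classical Sinkhorn iterations for an empirical regularized plan. It is an illustration under stated assumptions, not the method of the thesis: the samples are one-dimensional with uniform weights, the cost is the squared distance, and the names `sinkhorn_plan`, `eps` and `n_iter` are hypothetical.

```python
import math


def sinkhorn_plan(x, y, eps=0.5, n_iter=500):
    """Entropic transport plan between uniform empirical measures.

    Assumptions (illustrative only): x and y are lists of floats,
    the cost is the squared distance, eps > 0 is the regularization
    strength, and n_iter is a fixed number of Sinkhorn sweeps.
    """
    n, m = len(x), len(y)
    a = [1.0 / n] * n  # uniform weights on the first sample
    b = [1.0 / m] * m  # uniform weights on the second sample

    # Cost matrix and Gibbs kernel K = exp(-cost / eps).
    cost = [[(xi - yj) ** 2 for yj in y] for xi in x]
    kernel = [[math.exp(-cij / eps) for cij in row] for row in cost]

    # Alternately rescale rows and columns to match the marginals.
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    # The regularized plan has the form diag(u) K diag(v).
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, cost


if __name__ == "__main__":
    plan, cost = sinkhorn_plan([0.0, 1.0, 2.0], [0.5, 1.5])
    transport_cost = sum(
        plan[i][j] * cost[i][j] for i in range(3) for j in range(2)
    )
    print("row sums:", [sum(row) for row in plan])
    print("transport cost under the regularized plan:", transport_cost)
```

Each sweep involves only matrix-vector products and divisions, which is what makes the regularized problem cheap to solve. For very small `eps` the kernel entries underflow, and a log-domain variant would be preferable.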

The transport map endows a probability measure $P$ with an order with respect to a given reference measure. This property leads to the successful definition of M. Hallin's multivariate distribution function, obtained by choosing the spherical uniform distribution as the reference measure. This thesis provides sufficient conditions under which this function defines a homeomorphism between the support of the probability measure $P$ and the unit ball, i.e. the support of the spherical uniform distribution. Finally, we provide a conditional version of the multivariate distribution function, with applications to quantile regression.