Machine learning and data science algorithms typically require, in their final stage, the optimization of complex random functions in very high dimensions. Simple algorithms like Stochastic Gradient Descent (with small batches) are used very effectively. I will concentrate on trying to understand why these simple tools can still work in such complex and heavily over-parametrized regimes.
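To fix ideas, here is a minimal sketch of what “SGD with small batches” means in the simplest possible setting, a synthetic high-dimensional least-squares problem; the dimensions, step size, and batch size below are illustrative choices only, not the regimes studied in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: high-dimensional linear regression
# (the dimensions and noise level are arbitrary choices).
d, n_samples = 500, 10_000
theta_star = rng.standard_normal(d) / np.sqrt(d)        # ground-truth parameter
X = rng.standard_normal((n_samples, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n_samples)

# Mini-batch SGD on the empirical squared loss.
theta = np.zeros(d)
lr, batch_size = 0.05, 32
for step in range(5_000):
    idx = rng.integers(0, n_samples, size=batch_size)    # small random batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size          # stochastic gradient estimate
    theta -= lr * grad

print("final training MSE:", np.mean((X @ theta - y) ** 2))
```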
I will first introduce the general framework for non-experts, from the structure of the typical tasks to the natural architectures of the neural nets used in standard contexts. I will then briefly cover the classical setting of SGD in finite dimensions. I will then survey recent work with Reza Gheissari (Northwestern) and Aukosh Jagannath (Waterloo), which gives a general picture of the existence of projected “effective dynamics” for “summary statistics” living in much smaller dimensions, which nevertheless govern the performance of the very high dimensional system. These effective dynamics (and their so-called “critical regime”) define a dynamical system in finite dimensions which may be quite complex, and which governs the performance of the learning algorithm.
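In schematic form (this is only a caricature; the actual results come with precise assumptions on the loss, the data, and the step-size scaling), the type of statement involved is the following.

```latex
% Caricature of the "effective dynamics" statement for summary statistics;
% the precise assumptions and scalings are in the papers, not reproduced here.
\[
  X_{\ell+1} \;=\; X_\ell \;-\; \delta\,\nabla_X L(X_\ell;\,Y_{\ell+1}),
  \qquad X_\ell \in \mathbb{R}^d,\ d \gg 1 \quad \text{(online SGD)},
\]
\[
  u \colon \mathbb{R}^d \to \mathbb{R}^k, \qquad k = O(1)
  \quad \text{(the summary statistics)},
\]
\[
  u\bigl(X_{\lfloor t/\delta \rfloor}\bigr) \;\longrightarrow\; u_t
  \quad (d \to \infty),
  \qquad
  \mathrm{d}u_t \;=\; h(u_t)\,\mathrm{d}t \;+\; \Sigma(u_t)^{1/2}\,\mathrm{d}B_t,
\]
% with the diffusive correction appearing only in the critical step-size regime.
```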
The next step is to understand how the system finds these “summary statistics”. This is the subject of subsequent work with the same authors and with Jiaoyang Huang (Wharton, U-Penn). It is based on a dynamical spectral transition from Random Matrix Theory: along the trajectory of the optimization path, the Gram matrix or the Hessian matrix develops outliers which carry these effective dynamics. I will naturally first come back to the Random Matrix Theory tools needed here (the behavior of the edge of the spectrum and the BBP transition) in a much broader context, and then illustrate the use of this point of view on a few central examples of ML: multilayer neural nets for classification (of Gaussian mixtures) and the XOR example, for instance.
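As a numerical illustration of the BBP-type phenomenon, in the simplest rank-one spiked Wigner setting (only a proxy for the Gram or Hessian matrices arising along the SGD trajectory), one can check that an outlier eigenvalue detaches from the bulk exactly when the signal strength crosses a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

def top_eig_spiked_goe(theta, n, rng):
    """Largest eigenvalue of W/sqrt(n) + theta * v v^T, with W a GOE matrix."""
    A = rng.standard_normal((n, n))
    W = (A + A.T) / np.sqrt(2.0)          # GOE: bulk spectrum of W/sqrt(n) fills [-2, 2]
    v = np.zeros(n)
    v[0] = 1.0                            # unit-norm spike direction
    M = W / np.sqrt(n) + theta * np.outer(v, v)
    return np.linalg.eigvalsh(M)[-1]      # eigenvalues come back in ascending order

for theta in (0.5, 1.5, 3.0):
    lam = top_eig_spiked_goe(theta, n, rng)
    pred = theta + 1.0 / theta if theta > 1 else 2.0
    print(f"theta = {theta:.1f}: top eigenvalue = {lam:.3f}, prediction = {pred:.3f}")
```

Below the threshold the top eigenvalue sticks to the bulk edge at 2; above it, an outlier emerges near theta + 1/theta. It is this kind of outlier, appearing dynamically along the optimization path, that carries the effective dynamics.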