Big Data: Modeling, Estimation and Selection

Grand Amphithéâtre (Ecole Centrale Lille)

Campus Lille 1 à Villeneuve d'Ascq


The expression Big Data refers to the problem of mining huge, heterogeneous datasets to perform machine learning or to detect interesting links and unsuspected signals. Beyond the latest scientific trends, it raises many different mathematical problems, which will be addressed during this two-day workshop. The purpose of the workshop is to present up-to-date solutions to the problems posed by huge datasets and, at the same time, to pave the way for new research programs and collaborations.

Two-day workshop:

  • first day: tutorials on big data issues
  • second day: research talks

Invited speakers (to be completed):

Pierre Alquier, Gilles Blanchard, Emmanuel Chazard, Manuel Davy, Nial Friel, Jérémie Mary, Eric Moulines, Alejandro Murua, Yann Ollivier, Iuri Paixao, Norman Poh, Khalid Saad-Zaghloul, Gilbert Saporta

Organizing committee:

Rémi Bardenet, Christophe Biernacki, Gwenaëlle Castellan, Emmanuel Chazard, Alain Celisse, Alain Duhamel, Guillemette Marot, Sophie Dabo-Niang, Benjamin Guedj, Serge Iovleff, Philippe Preux, Emeline Schmisser and Nicolas Wicker


Registration is free but mandatory.


  • Aboubacar Amiri
  • Adrien Hardy
  • Alain Célisse
  • Alejandro Murua
  • amadou khalilou sow
  • Andressa Cerqueira
  • Astha Gupta
  • Aurore Lavigne
  • Aurélien Bellet
  • Ayoub Kontar
  • Azzouz Dermoune
  • Baptiste BELLEVILLE
  • Ben Hamida Sana
  • Benjamin Guedj
  • Brigitte Gelein
  • Broze Laurence
  • Cecile Lecoeur
  • Charline TESSEREAU
  • Christian Derquenne
  • Christophe Biernacki
  • Claire Burny
  • Clément ELVIRA
  • Céline CARIA
  • Dabo Sophie
  • Daoud Ounaissi
  • Deividas Sabonis
  • Diala WEHBE
  • Dimitri HAINAUT
  • Djamel Zitouni
  • Dominique Crié
  • Dorian Michaud
  • duret virginie
  • Emad Aldeen DRWESH
  • Emmanuel Chazard
  • Eric Moulines
  • Fabrice ELEGBEDE
  • Faicel Chamroukhi
  • GASSAMA Mamadou
  • Genane Youness
  • Gharbi Zied
  • Ghislain Rocheleau
  • Gilbert Saporta
  • Gilles Blanchard
  • Guillaume Demazeux
  • Guillemette Marot
  • Hamza Cherkaoui
  • Hanane DALIMI
  • Heinrich Philippe
  • Hiba Alawieh
  • Hulin Ludovic
  • ilyes abid
  • Isabelle Massa-Turpin
  • Ismail JABRI
  • Iuri Paixao
  • Jacques Allard
  • Jingwen WU
  • Jules J. S. de TIBEIRO
  • Julien Flamant
  • Julien HAMONIER
  • Julien Pérolat
  • Jérémie Kellner
  • Jérémie Mary
  • Jérôme Collet
  • Karina Shitikova
  • Khalid Saad Zaghloul
  • klein john
  • Libo Li
  • Linh Nguyen
  • Lopes Renaud
  • Manuel Davy
  • Mathilde Boissel
  • Maxime Baelde
  • Mehdi Ameziane
  • Michal Valko
  • Mickael Canouil
  • Minh Huong Ngo
  • Minh Thanh NGO
  • Mouloud CHATER
  • Mourad Bouneffa
  • N'GUESSAN Assi
  • Nadji Rahmania
  • Nial Friel
  • Nicolas Wicker
  • Norman Poh
  • Olivier Burkart
  • Olivier Torres
  • Oulad Ameziane Mehdi
  • Pascal YIM
  • Paul Poncet
  • philippe preux
  • Pierre Alquier
  • Pierre CHAINAIS
  • Pierre Laffitte
  • Pierre-Victor Chaumier
  • Raoul Dekou
  • Romaric Gaudel
  • Régis BIGOT
  • Rémi Bardenet
  • Salim Bouzebda
  • Sara Babouche
  • Serge Iovleff
  • Sidayne Rihabe
  • Stephane Chretien
  • stephanie combes
  • Thibaut Balcaen
  • Tyberghein Jean-Pierre
  • Van Ha Hoang
  • Vartan Ohanes Choulakian
  • Vincent Devlaminck
  • Vincent Vandewalle
  • Yann Ollivier
  • Yoann Dufresne
  • Yves ANDRE
    • 1
      Which analytic methods for Big Data?
      With massive data, there is no sampling error: statistical tests and confidence intervals become useless. Generative models are often less important than predictive models. Closed-form and parsimonious models are replaced by algorithms. Statistical Learning Theory, initiated by V. Vapnik and the late A. Chervonenkis, provides the conceptual framework for machine learning algorithms. The use of black-box models, including ensemble models, is a challenge for scientific users since they are difficult to interpret. We will conclude with the need to combine statistical and machine learning tools with causal inference to get better predictions and avoid the confusion between correlation and causality.
      Speaker: Gilbert Saporta (CNAM Paris)
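
One way to read the claim about sampling error is through the width of a confidence interval, which shrinks like 1/sqrt(n). The sketch below is only an illustration under assumed conditions (independent observations, a known standard deviation of 1.0, the usual normal approximation); the function name and the sample sizes are hypothetical and do not come from the talk.

```python
import math

def ci_half_width(sigma: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% confidence interval for a mean."""
    return z * sigma / math.sqrt(n)

# Hypothetical sample sizes, from a classical survey to a "big data" table.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,d}  half-width = {ci_half_width(1.0, n):.6f}")
```

The interval only accounts for sampling error; it says nothing about bias in how the data were collected, which does not shrink with n.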
    • 2
      Advances and open questions for neural networks
      Since 2010, under the name "Deep Learning", neural networks have become more and more popular and have met with success in a wide range of applications: computer Go, image and sound categorization, dialog, … This tutorial gives a global presentation of the underlying techniques, including stochastic gradient descent and convolutional networks. Some links with wavelet decompositions and some open questions will be presented, as well as some demonstrations on pictures and texts.
      Speaker: Jérémie Mary (University of Lille)
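
As a reminder of what stochastic gradient descent does, here is a minimal sketch on the simplest possible model, a one-variable linear regression, rather than a neural network. The data, learning rate and function name are assumptions chosen for illustration.

```python
import random

def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)           # visit the samples in a random order
        for x, y in samples:
            err = (w * x + b) - y      # prediction error on one sample
            w -= lr * err * x          # gradient of 0.5 * err**2 w.r.t. w
            b -= lr * err              # gradient of 0.5 * err**2 w.r.t. b
    return w, b

# Hypothetical noiseless data generated from y = 2x + 1.
points = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_regression(points)
print(round(w, 3), round(b, 3))
```

Each update uses a single sample, which is what makes the method usable when the dataset is too large to process in one pass per step.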
    • 3
      Reuse of big data in healthcare: presentation, transformation and analysis of the data extracted from electronic health records
      Routine care of hospitalized patients generates and stores huge amounts of data. Typical datasets are made of medico-administrative data including encoded diagnoses and procedures, laboratory results, drug administrations and free-text reports. The exploitation of those data raises issues of data quality, confidentiality, data aggregation, and expert interpretation. Due to the structure of those data (for instance, each inpatient stay may have 1 to n diagnostic codes, among about 35,000 possible codes), the data aggregation process has a critical impact on the analysis. This aggregation requires skills in programming and statistics, but also a deep knowledge of the data collection process and of medical analysis. This presentation will also show 3 examples of successful data mining and data reuse: adverse drug event detection and prevention, scheduling of patient admissions in elective surgery, and hospital billing improvement.
      Speaker: Emmanuel Chazard (Université Lille 2)
    • 3:45 PM
    • 4
      Machine Learning approaches for stock management in the retail industry
      Speaker: Manuel Davy (Vékia)
    • 5
      Big Data, myths & opportunities for the consumer finance industry
      The digital edge, which offers access to a wide variety of structured and unstructured data in large volumes, is transforming the consumer finance industry. BNP Paribas Personal Finance, European leader of the industry and introducer of scoring techniques in Europe, is engaged in this transformation. The presentation will start with a vision of the digital transformation of our processes: how data management (in the sense of processing, modelling and operational use) is the strategic lever, and how the alliance between technology and analytics can better sustain business development. Beyond the buzz, the presentation will focus on the business opportunities that Big Data techniques offer the consumer finance industry.
      Speakers: Iuri Paixao (BNP Paribas), Khalid Saad-Zaghloul (BNP Paribas)
    • 6:00 PM

      Free dinner for all participants.

    • 6
      What can we learn from modelling millions of patient records? A machine learning perspective
      Increasing healthcare costs, coupled with an ageing population in both the developing and developed worlds, mean that it is important to understand disease demographic profiles in order to better optimize resources for quality health and care. Using Chronic Kidney Disease (CKD) as a case study, I will present challenges related to understanding, modelling and predicting the progression of CKD, and show how machine learning techniques can be used to address them. Examples include calibration of estimated Glomerular Filtration Rate (eGFR), modelling of eGFR, automatic selection of clinically relevant variables, and non-linear dimensionality reduction for data discovery.
      Speaker: Norman Poh (University of Surrey)
    • 7
      Invariance principles for robust learning. An illustration with recurrent neural networks
      The optimization methods used to learn models of data are often not invariant under simple changes in the representation of data or of intermediate variables. For instance, for neural networks, using neural activities in [0;1] or in [-1;1] can lead to very different final performance even though the two representations are isomorphic. Here we show how information theory, together with a Riemannian geometric viewpoint emphasizing independence from the details of data representation, leads to new, scalable algorithms for training models of sequential data, which detect more complex patterns and use fewer training samples. For the talk, no familiarity will be assumed with Riemannian geometry, neural networks, information theory, or statistical learning.
      Speaker: Yann Ollivier (Paris-Sud University)
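
The lack of invariance mentioned in the abstract can be seen on a one-dimensional toy problem. The sketch below takes one plain gradient step on the same quadratic loss, once with the variable coded in [0, 1] and once with the equivalent coding in [-1, 1], and compares the results on a common scale. A Newton step is included only as a simple stand-in for a representation-independent update; it is not the Riemannian or information-theoretic method of the talk, and the loss, target and learning rate are all assumptions.

```python
def loss_grad_a(a, target=0.8):
    """Gradient of L(a) = (a - target)**2 in the [0, 1] representation."""
    return 2.0 * (a - target)

def gd_step_in_a(a, lr):
    """Plain gradient step taken directly on a."""
    return a - lr * loss_grad_a(a)

def gd_step_in_b(a, lr):
    """Same loss, but the step is taken on b = 2a - 1, which lives in [-1, 1]."""
    b = 2.0 * a - 1.0
    grad_b = loss_grad_a(a) * 0.5       # chain rule: da/db = 1/2
    b_new = b - lr * grad_b
    return (b_new + 1.0) / 2.0          # map back to the [0, 1] scale

def newton_step_in_a(a):
    return a - loss_grad_a(a) / 2.0     # second derivative of L in a is 2

def newton_step_in_b(a):
    b = 2.0 * a - 1.0
    grad_b = loss_grad_a(a) * 0.5
    hess_b = 2.0 * 0.25                 # second derivative picks up (da/db)**2
    b_new = b - grad_b / hess_b
    return (b_new + 1.0) / 2.0

a0, lr = 0.2, 0.1
print(gd_step_in_a(a0, lr), gd_step_in_b(a0, lr))    # the two codings disagree
print(newton_step_in_a(a0), newton_step_in_b(a0))    # the two codings agree
```

With these values the gradient step moves four times further in the [0, 1] coding than in the [-1, 1] coding, although the two representations carry the same information.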
    • 10:30 AM
    • 8
      High-dimensional data classification with mixtures of sphere-hardening distances
      We develop a classification model for high-dimensional data that takes into account two main problems in high dimensions: the curse of dimensionality and the empty space phenomenon. We overcome these obstacles by modeling the distribution of distances involving feature vectors instead of modeling directly the distribution of feature vectors. The model is based on the sphere-hardening result, which states that, in high dimensions, data cluster in shells. Based on asymptotics on the dimension parameter, we show that under simple sampling conditions the distances of data points to their means are distributed as a variant of generalized gamma variables. We propose using mixtures of these distributions for both supervised and unsupervised classification of high-dimensional data. The paradigm is extended to low-dimensional data by embedding the data into higher-dimensional spaces by means of the kernel trick. Part of this work (a) has been done in collaboration with Bertrand Saulnier (Université de Montréal) and Nicolas Wicker (Université de Lille 1; Murua and Wicker, 2014), and (b) was inspired by a conversation with François Léonard (Hydro-Québec; Leonard and Gauvin, 2013).
      Speaker: Alejandro Murua (Université de Montréal)
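
The sphere-hardening phenomenon is easy to observe numerically. The sketch below is a simulation under an assumed model (independent standard Gaussian coordinates, distances measured to the true mean, which is the origin), not the classification model of the talk. It reports the spread of the distances relative to their average for a few dimensions.

```python
import math
import random

def relative_spread_of_norms(dim, n_points=300, seed=0):
    """Coefficient of variation (std / mean) of the distances to the origin
    for standard Gaussian vectors, whose mean is the origin."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_points):
        sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        norms.append(math.sqrt(sq))
    mean = sum(norms) / n_points
    var = sum((r - mean) ** 2 for r in norms) / n_points
    return math.sqrt(var) / mean

# The relative spread shrinks as the dimension grows: points sit in a thin shell.
for dim in (2, 20, 200, 2000):
    print(dim, round(relative_spread_of_norms(dim), 3))
```

In this Gaussian setting the relative spread behaves roughly like 1/sqrt(2 * dim), so the shell becomes thinner and thinner compared with its radius.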
    • 9
      Construction of tight wavelet-like frames on graphs for denoising
      We construct a frame (redundant dictionary) for the space of real-valued functions defined on a neighborhood graph constructed from data points. This frame is adapted to the underlying geometrical structure (e.g. the points belong to an unknown low-dimensional manifold), has finitely many elements, and these elements are localized in frequency as well as in space. This construction follows the ideas of Hammond et al. (2011), with the key point that we construct a tight (or Parseval) frame. This means we have a very simple, explicit reconstruction formula for every function defined on the graph from the coefficients given by its scalar product with the frame elements. We use this representation in the setting of denoising, where we are given noisy observations of a function defined on the graph. By applying a thresholding method to the coefficients in the reconstruction formula, we define an estimate of this function whose risk satisfies a tight oracle inequality.
      Speaker: Gilles Blanchard (University of Potsdam)
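
The reconstruction formula for a Parseval frame can be checked on a small example. The sketch below does not build a frame on a graph; it uses a classical redundant Parseval frame of the plane (three vectors at 120 degrees, scaled by sqrt(2/3)) as an assumed stand-in, and a plain hard-thresholding rule whose threshold is hypothetical.

```python
import math

# Three vectors at 120 degrees, scaled by sqrt(2/3): a classical Parseval
# frame of the plane (redundant: three elements for a 2-D space).
FRAME = [
    (math.sqrt(2.0 / 3.0) * math.cos(2.0 * math.pi * k / 3.0),
     math.sqrt(2.0 / 3.0) * math.sin(2.0 * math.pi * k / 3.0))
    for k in range(3)
]

def analyze(x):
    """Frame coefficients: scalar products of x with each frame element."""
    return [x[0] * e[0] + x[1] * e[1] for e in FRAME]

def synthesize(coeffs):
    """Reconstruction x = sum_k c_k e_k, valid because the frame is tight."""
    return (sum(c * e[0] for c, e in zip(coeffs, FRAME)),
            sum(c * e[1] for c, e in zip(coeffs, FRAME)))

def hard_threshold(coeffs, t):
    """Zero out coefficients whose magnitude is at most t."""
    return [c if abs(c) > t else 0.0 for c in coeffs]

x = (0.7, -1.3)
print(synthesize(analyze(x)))                        # recovers x up to rounding error
print(synthesize(hard_threshold(analyze(x), 0.5)))   # thresholded reconstruction
```

Denoising in this spirit applies `analyze` to the noisy observations, thresholds the coefficients, and then applies `synthesize`; no matrix inversion is needed because the frame is tight.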
    • 10
      On the Properties of Variational Approximations of Gibbs Posteriors
      PAC-Bayesian bounds are useful tools to control the prediction risk of aggregated estimators. When dealing with the exponentially weighted aggregate (EWA), these bounds lead in some settings to a proof that the predictions are minimax-optimal. EWA is usually computed through Monte Carlo methods. However, in many practical applications, the computational cost of Monte Carlo methods is prohibitive. It is thus tempting to replace them with (faster) optimization algorithms that aim at approximating EWA: we will refer to these methods as variational Bayes (VB) methods. In this talk I will show, thanks to a PAC-Bayesian theorem, that VB approximations are well founded, in the sense that the loss incurred in terms of prediction risk is negligible in some classical settings such as linear classification and ranking. These approximations are implemented in the R package pac-vb (written by James Ridgway), which I will briefly introduce. I will especially dwell on the proof of the PAC-Bayesian theorem in order to explain how this result can be extended to other settings.
      Speaker: Pierre Alquier (ENSAE)
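
For a finite set of candidate predictors, the exponentially weighted aggregate can be written in a few lines. The sketch below covers only that finite case, in which no Monte Carlo or variational approximation is needed; the risks, predictions and temperature value are hypothetical, and the code is in Python, unrelated to the R package mentioned in the abstract.

```python
import math

def ewa_weights(risks, temperature, prior=None):
    """Weights proportional to prior_i * exp(-temperature * risk_i)."""
    n = len(risks)
    prior = prior if prior is not None else [1.0 / n] * n
    shift = min(risks)  # subtract the minimum risk for numerical stability
    raw = [p * math.exp(-temperature * (r - shift)) for p, r in zip(prior, risks)]
    total = sum(raw)
    return [w / total for w in raw]

def ewa_predict(predictions, weights):
    """Aggregated prediction: weighted average of the candidates' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

risks = [0.10, 0.25, 0.40]        # hypothetical empirical risks of 3 predictors
predictions = [1.0, 0.0, -1.0]    # their hypothetical predictions at a new point
weights = ewa_weights(risks, temperature=10.0)
print([round(w, 3) for w in weights], round(ewa_predict(predictions, weights), 3))
```

When the candidates are indexed by a continuous parameter, the sum becomes an integral, which is where sampling or optimization-based approximations come in.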
    • 11
      Stochastic optimization and high-dimensional sampling: when Moreau inf-convolution meets Langevin diffusion
      Recently, the problem of designing MCMC samplers adapted to high-dimensional Bayesian inference with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems and aggregation of experts, among others. When the density is L-smooth (the log-density is continuously differentiable and its derivative is Lipschitz), we will advocate the use of a “rejection-free” algorithm, based on the Euler discretization of the Langevin diffusion with either constant or decreasing stepsizes. We will present several new results establishing convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillations on a compact set and super-exponential in the tails, to strong concavity). When the log-density is not smooth (a problem which typically appears when using sparsity-inducing priors, for example), we still suggest using an Euler discretization, but of the Moreau envelope of the non-smooth part of the log-density. An importance sampling correction may later be applied to correct the target. Several numerical illustrations will be presented to show that this algorithm (named MYULA) can be used in practice in a high-dimensional setting. Finally, non-asymptotic convergence bounds (in total variation and Wasserstein distances) are derived.
      Speaker: Eric Moulines (Télécom ParisTech)
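
The two ingredients of the abstract, an Euler-discretized Langevin chain without accept/reject step and the Moreau envelope of a non-smooth term, can be sketched in one dimension. This is only an illustration under assumptions: the targets (a standard Gaussian, then a Gaussian times a Laplace-type factor exp(-|x|)), the step size and the envelope parameter are hypothetical, and the importance sampling correction mentioned in the abstract is not shown.

```python
import math
import random

def soft_threshold(x, lam):
    """Proximal operator of lam * |x| (soft-thresholding)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def moreau_grad_abs(x, lam):
    """Gradient of the Moreau envelope of g(x) = |x| with parameter lam."""
    return (x - soft_threshold(x, lam)) / lam

def langevin_chain(grad_potential, x0, step, n_iter, seed=0):
    """Euler discretization of the Langevin diffusion, without any
    accept/reject step: x <- x - step * grad U(x) + sqrt(2 * step) * noise,
    where U = -log(target density)."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n_iter):
        x = x - step * grad_potential(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Smooth case: standard Gaussian target, U(x) = x**2 / 2.
smooth = langevin_chain(lambda x: x, x0=0.0, step=0.05, n_iter=50_000)

# Non-smooth case: U(x) = x**2 / 2 + |x|; the |x| term is replaced by its
# Moreau envelope before discretizing.
lam = 0.1
nonsmooth = langevin_chain(lambda x: x + moreau_grad_abs(x, lam),
                           x0=0.0, step=0.05, n_iter=50_000)
```

Because there is no rejection step, the chain targets a slightly perturbed distribution; for the Gaussian example the stationary variance of the discretized chain is 1 / (1 - step / 2) rather than 1, which is the kind of bias that non-asymptotic bounds quantify.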
    • 3:30 PM
    • 12
      Approximate Bayesian inference for large datasets
      Light and Widely Applicable (LWA-) MCMC is a novel approximation of the Metropolis-Hastings kernel targeting a posterior distribution defined on a large number of observations. Inspired by Approximate Bayesian Computation, we design a Markov chain whose transition makes use of an unknown but fixed fraction of the available data, where the random choice of sub-sample is guided by the fidelity of this sub-sample to the observed data, as measured by summary (or sufficient) statistics. LWA-MCMC is a generic and flexible approach, as illustrated by the diverse set of examples which we explore. In each case LWA-MCMC yields excellent performance and in some cases a dramatic improvement compared to existing methodologies.
      Speaker: Nial Friel (Dublin University)