Name: Big Data: Modeling, Estimation and Selection
Start: 2016-06-09T11:35:00+02:00
End: 2016-06-10T17:20:00+02:00
Location: Ecole Centrale Lille

Big Data: Modeling, Estimation and Selection

From Thursday 9 June 2016 (11:35) to Friday 10 June 2016 (17:20)

Thursday 9 June 2016
13:30 Which analytic methods for Big Data? - Gilbert Saporta (CNAM Paris)
13:30 - 14:15
Room: Grand Amphithéâtre
With massive data, there are no sampling errors: statistical tests and confidence intervals become useless. Generative models are often less important than predictive models. Closed-form and parsimonious models are replaced by algorithms. Statistical Learning Theory, initiated by V. Vapnik and the late A. Chervonenkis, provides the conceptual framework for machine learning algorithms. The use of black-box models, including ensemble models, is a challenge for scientific users since they are difficult to interpret. We will conclude with the necessity of combining statistical and machine learning tools with causal inference to get better predictions and avoid the confusion between correlation and causality.
14:15 Advances and open questions for neural networks - Jérémie Mary (University of Lille)
14:15 - 15:00
Room: Grand Amphithéâtre
Since 2010, under the name "Deep Learning", neural networks have become increasingly popular and have achieved success in a wide range of applications: computer Go, image and sound categorization, dialog, etc. This tutorial is a global presentation of the underlying techniques, including stochastic gradient descent and convolutional networks. Some links with wavelet decompositions and open questions will be presented, as well as some demonstrations on pictures and texts.
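As a rough illustration of the stochastic gradient descent mentioned in this abstract, here is a minimal sketch that is not taken from the tutorial itself. It fits a one-dimensional linear model, updating the parameters after each individual sample. The toy data, the learning rate and the function name `sgd_linear_regression` are all assumptions made for the example.

```python
import random

def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)           # visit the samples in a random order
        for x, y in samples:
            err = (w * x + b) - y      # prediction error on a single sample
            w -= lr * err * x          # gradient of 0.5 * err**2 with respect to w
            b -= lr * err              # gradient of 0.5 * err**2 with respect to b
    return w, b

# Noise-free toy data generated from y = 2x + 1 (an illustrative assumption).
points = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_regression(points)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Deep networks use the same update rule, with the gradient obtained by backpropagation and usually averaged over small batches of samples rather than computed on one sample at a time.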
15:00 Reuse of big data in healthcare: presentation, transformation and analyze of the data extracted from electronic health records - Emmanuel Chazard (Université Lille 2)
15:00 - 15:45
Room: Grand Amphithéâtre
Routine care of hospitalized patients generates and stores huge amounts of data. Typical datasets are made of medico-administrative data including encoded diagnoses and procedures, laboratory results, drug administrations and free-text reports. The exploitation of those data raises issues of data quality, confidentiality, data aggregation, and expert interpretation. Due to the structure of those data (for instance, each inpatient stay may have 1 to n diagnostic codes, among about 35,000 possible codes), the data aggregation process has a critical impact on the analysis. This aggregation requires skills in programming and statistics, but also a deep knowledge of the data collection process and the medical analysis. This presentation will also show three examples of successful data mining and data reuse: detection and prevention of adverse drug events, scheduling of patient admissions in elective surgery, and hospital billing improvement.
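To make the aggregation issue concrete, the sketch below turns records with a variable number of diagnostic codes per stay into one fixed-length indicator vector per stay. It is only an illustration: the stay identifiers, the code strings and the rule of grouping codes by their first three characters are assumptions, not the speaker's actual pipeline.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per (stay, diagnostic code).
records = [
    ("stay1", "I10"), ("stay1", "E11.9"), ("stay1", "E11.6"),
    ("stay2", "N18.3"),
    ("stay3", "I10"), ("stay3", "N18.4"),
]

def code_group(code):
    """Coarsen a code to its three-character category (an assumed grouping rule)."""
    return code[:3]

def aggregate(rows):
    """Return (sorted list of groups, {stay: 0/1 indicator vector over the groups})."""
    groups_by_stay = defaultdict(set)
    for stay, code in rows:
        groups_by_stay[stay].add(code_group(code))
    groups = sorted({g for gs in groups_by_stay.values() for g in gs})
    table = {
        stay: [1 if g in gs else 0 for g in groups]
        for stay, gs in groups_by_stay.items()
    }
    return groups, table

groups, table = aggregate(records)
for stay in sorted(table):
    print(stay, dict(zip(groups, table[stay])))
```

The choice of grouping rule decides which stays look alike in the resulting table, which is one reason the aggregation step can change the outcome of the analysis.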
15:45 Break
15:45 - 16:05
Room: Grand Amphithéâtre
16:05 Machine Learning approaches for stock management in the retail industry - Manuel Davy (Vékia)
16:05 - 16:50
Room: Grand Amphithéâtre
16:50 Big Data, myths & opportunities for the consumer finance industry - Iuri Paixao (BNP Paribas), Khalid Saad-Zaghloul (BNP Paribas)
16:50 - 17:35
Room: Grand Amphithéâtre
The digital edge, which offers access to a wide variety of structured and unstructured data in large volumes, is transforming the consumer finance industry. BNP Paribas Personal Finance, European leader of the industry and introducer of scoring techniques in Europe, is engaged in this transformation. The presentation will start with a vision of the digital transformation of our processes, how data management (in the sense of treatment, modelling and operational use) is the strategic lever, and how the alliance between technology and analytics can better sustain business development. Behind the buzz, the presentation will focus on the business opportunities that Big Data techniques offer the consumer finance industry.
18:00 Dinner
18:00 - 20:00
Room: Grand Amphithéâtre
Friday 10 June 2016
09:00 What can we learn from modelling millions of patient records? A machine learning perspective - Norman Poh (University of Surrey)
09:00 - 09:45
Room: Grand Amphithéâtre
Increasing healthcare costs, coupled with an ageing population in both the developing and developed worlds, mean that it is important to understand disease demographic profiles in order to better optimize resources for quality health and care. Using Chronic Kidney Disease (CKD) as a case study, I will present challenges related to understanding, modelling and predicting the progression of CKD, and how machine learning techniques can be used to solve them. Examples include calibration of estimated Glomerular Filtration Rate (eGFR), modelling of eGFR, automatic selection of clinically relevant variables, and non-linear dimensionality reduction for data discovery.
09:45 Invariance principles for robust learning. An illustration with recurrent neural networks - Yann Ollivier (Paris-Sud University)
09:45 - 10:30
Room: Grand Amphithéâtre
The optimization methods used to learn models of data are often not invariant under simple changes in the representation of data or of intermediate variables. For instance, for neural networks, using neural activities in [0, 1] or in [-1, 1] can lead to very different final performance even though the two representations are isomorphic. Here we show how information theory, together with a Riemannian geometric viewpoint emphasizing independence from the details of data representation, leads to new, scalable algorithms for training models of sequential data, which detect more complex patterns and use fewer training samples. For the talk, no familiarity will be assumed with Riemannian geometry, neural networks, information theory, or statistical learning.
10:30 Break
10:30 - 10:50
Room: Grand Amphithéâtre
10:50 High-dimensional data classification with mixtures of sphere-hardening distances - Alejandro Murua (Université de Montréal)
10:50 - 11:35
Room: Grand Amphithéâtre
We develop a classification model for high-dimensional data that takes into account two main problems in high dimensions: the curse of dimensionality and the empty space phenomenon. We overcome these obstacles by modeling the distribution of distances involving feature vectors instead of directly modeling the distribution of feature vectors. The model is based on the sphere-hardening result, which states that, in high dimensions, data cluster in shells. Based on asymptotics on the dimension parameter, we show that under simple sampling conditions the distances of data points to their means are distributed as a variant of generalized gamma variables. We propose using mixtures of these distributions for both supervised and unsupervised classification of high-dimensional data. The paradigm is extended to low-dimensional data by embedding the data into higher-dimensional spaces by means of the kernel trick. Part of this work (a) has been done in collaboration with Bertrand Saulnier (Université de Montréal) and Nicolas Wicker (Université de Lille 1; Murua and Wicker, 2014), and (b) was inspired by a conversation with François Léonard (Hydro-Québec; Leonard and Gauvin, 2013).
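The sphere-hardening effect can be observed numerically with a small experiment. This is a sketch under simple assumptions (independent standard Gaussian coordinates, distances measured to the true mean at the origin) and does not reproduce the generalized gamma model of the talk. It estimates the mean and the standard deviation of the distances for several dimensions.

```python
import math
import random

def norm_spread(dim, n_points=1000, seed=0):
    """Return (mean, std) of the Euclidean norms of standard Gaussian vectors in R^dim."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_points):
        squared = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        norms.append(math.sqrt(squared))
    mean = sum(norms) / n_points
    std = math.sqrt(sum((r - mean) ** 2 for r in norms) / n_points)
    return mean, std

for dim in (2, 20, 200):
    mean, std = norm_spread(dim)
    print(f"dim={dim:4d}  mean distance={mean:7.3f}  relative spread={std / mean:.3f}")
```

As the dimension grows, the mean distance grows like the square root of the dimension while the spread stays of order one, so the points concentrate in a thin shell relative to its radius.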
11:35 Construction of tight wavelet-like frames on graphs for denoising - Gilles Blanchard (University of Potsdam)
11:35 - 12:20
Room: Grand Amphithéâtre
We construct a frame (redundant dictionary) for the space of real-valued functions defined on a neighborhood graph constructed from data points. This frame is adapted to the underlying geometrical structure (e.g. the points belong to an unknown low-dimensional manifold), has finitely many elements, and these elements are localized in frequency as well as in space. This construction follows the ideas of Hammond et al. (2011), with the key point that we construct a tight (or Parseval) frame. This means we have a very simple, explicit reconstruction formula for every function f defined on the graph from the coefficients given by its scalar product with the frame elements. We use this representation in the setting of denoising, where we are given noisy observations of a function f defined on the graph. By applying a thresholding method to the coefficients in the reconstruction formula, we define an estimate of f whose risk satisfies a tight oracle inequality.
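The reconstruction formula and the thresholding estimate described in this abstract can be written out as follows. The notation is an assumption made for illustration: a graph with n vertices, a function f on the vertices, frame elements ψ_1, ..., ψ_N, noisy observations y, and a soft-thresholding rule S_t with level t. The abstract does not specify which thresholding rule is used. The snippet assumes the amsmath and amssymb packages.

```latex
% Parseval (tight) frame property and the equivalent reconstruction formula.
\[
  \sum_{i=1}^{N} \langle f, \psi_i \rangle^2 = \|f\|^2
  \quad\Longleftrightarrow\quad
  f = \sum_{i=1}^{N} \langle f, \psi_i \rangle \, \psi_i
  \qquad \text{for all } f \in \mathbb{R}^n .
\]
% Denoising from y = f + \varepsilon by thresholding the frame coefficients.
% The soft-thresholding rule S_t and the level t are illustrative assumptions.
\[
  \hat f = \sum_{i=1}^{N} S_t\bigl(\langle y, \psi_i \rangle\bigr) \, \psi_i ,
  \qquad
  S_t(c) = \operatorname{sign}(c) \, \max(|c| - t, \, 0).
\]
```

With a frame that is not tight, reconstruction would instead require a dual frame, which is what makes the Parseval property convenient here.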
14:00 On the Properties of Variational Approximations of Gibbs Posteriors - Pierre Alquier (ENSAE)
14:00 - 14:45
Room: Grand Amphithéâtre
PAC-Bayesian bounds are useful tools to control the prediction risk of aggregated estimators. When dealing with the exponentially weighted aggregate (EWA), these bounds lead in some settings to the proof that the predictions are minimax-optimal. EWA is usually computed through Monte Carlo methods. However, in many practical applications, the computational cost of Monte Carlo methods is prohibitive. It is thus tempting to replace these by (faster) optimization algorithms that aim at approximating EWA: we will refer to these methods as variational Bayes (VB) methods. In this talk I will show, thanks to a PAC-Bayesian theorem, that VB approximations are well founded, in the sense that the loss incurred in terms of prediction risk is negligible in some classical settings such as linear classification, ranking, etc. These approximations are implemented in the R package pac-vb (written by James Ridgway), which I will briefly introduce. I will especially insist on the proof of the PAC-Bayesian theorem in order to explain how this result can be extended to other settings.
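For readers unfamiliar with the exponentially weighted aggregate, the sketch below computes it for a finite dictionary of predictors, where the Gibbs weights are available in closed form. This is a simplifying assumption: in the continuous settings targeted by the talk, the weights cannot be enumerated, which is where Monte Carlo or variational approximations come in. The function names, the risk values and the temperature `lam` are hypothetical.

```python
import math

def ewa_weights(risks, prior=None, lam=1.0):
    """Gibbs weights: w_j is proportional to prior_j * exp(-lam * risk_j)."""
    m = len(risks)
    prior = prior if prior is not None else [1.0 / m] * m
    r_min = min(risks)  # shifting the risks avoids underflow and leaves the weights unchanged
    raw = [p * math.exp(-lam * (r - r_min)) for p, r in zip(prior, risks)]
    total = sum(raw)
    return [v / total for v in raw]

def ewa_predict(predictions, weights):
    """Aggregate the individual predictions with the given weights."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical empirical risks of three candidate predictors.
risks = [0.10, 0.30, 0.90]
weights = ewa_weights(risks, lam=5.0)
aggregate = ewa_predict([1.0, 0.8, -0.5], weights)
print([round(w, 4) for w in weights], round(aggregate, 4))
```

A larger `lam` concentrates the weights on the predictor with the smallest empirical risk, while `lam = 0` returns the prior.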
14:45 Stochastic optimization and high-dimensional sampling: when Moreau inf-convolution meets Langevin diffusion - Eric Moulines (Télécom ParisTech)
14:45 - 15:30
Room: Grand Amphithéâtre
Recently, the problem of designing MCMC samplers adapted to high-dimensional Bayesian inference with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems, and aggregation of experts, among others. When the density is L-smooth (the log-density is continuously differentiable and its derivative is Lipschitz), we will advocate the use of a “rejection-free” algorithm, based on the Euler discretization of the Langevin diffusion with either constant or decreasing stepsizes. We will present several new results establishing convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillations on a compact set and super-exponential in the tails, to the strongest, strong concavity). When the log-density is not smooth (a problem which typically appears when using sparsity-inducing priors, for example), we still suggest using an Euler discretization, but of the Moreau envelope of the non-smooth part of the log-density. An importance sampling correction may later be applied to correct the target. Several numerical illustrations will be presented to show that this algorithm (named MYULA) can be used in practice in a high-dimensional setting. Finally, non-asymptotic convergence bounds (in total variation and Wasserstein distances) are derived.
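The idea of discretizing a Langevin diffusion after smoothing the non-smooth part by its Moreau envelope can be sketched in one dimension. The example is a hedged illustration, not the speaker's implementation: the target exp(-x²/2 - |x|), the step size `gamma`, the smoothing parameter `lam` and the function names are all assumptions. It relies on the standard fact that the gradient of the Moreau envelope of g with parameter `lam` is (x - prox(x)) / lam, where prox is the proximal operator of `lam * g`.

```python
import math
import random

def soft_threshold(x, t):
    """Proximal operator of t * |x|."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def myula_chain(n_steps, gamma=0.1, lam=0.5, x0=0.0, seed=0):
    """Approximate samples from pi(x) proportional to exp(-x**2 / 2 - |x|)."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        grad_smooth = x                                    # gradient of f(x) = x**2 / 2
        grad_envelope = (x - soft_threshold(x, lam)) / lam  # gradient of the Moreau envelope of |x|
        noise = math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
        x = x - gamma * (grad_smooth + grad_envelope) + noise  # Euler step, no accept/reject
        chain.append(x)
    return chain

chain = myula_chain(20000)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

Because there is no accept/reject step, the chain targets a slightly biased distribution. The bias depends on `gamma` and `lam`, which is the kind of error the non-asymptotic bounds mentioned in the abstract are meant to quantify.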
15:30 Break
15:30 - 15:50
Room: Grand Amphithéâtre
15:50 Approximate Bayesian inference for large datasets - Nial Friel (Dublin University)
15:50 - 16:35
Room: Grand Amphithéâtre
Light and Widely Applicable (LWA-) MCMC is a novel approximation of the Metropolis-Hastings kernel targeting a posterior distribution defined on a large number of observations. Inspired by Approximate Bayesian Computation, we design a Markov chain whose transition makes use of an unknown but fixed fraction of the available data, where the random choice of sub-sample is guided by the fidelity of this sub-sample to the observed data, as measured by summary (or sufficient) statistics. LWA-MCMC is a generic and flexible approach, as illustrated by the diverse set of examples which we explore. In each case LWA-MCMC yields excellent performance and in some cases a dramatic improvement compared to existing methodologies.