Big Data: Modeling, Estimation and Selection
from Thursday, 9 June 2016 (11:35) to Friday, 10 June 2016 (17:20)
Thursday, 9 June 2016
13:30
Which analytic methods for Big Data?
Gilbert Saporta (CNAM Paris)
13:30 - 14:15
Room: Grand Amphithéâtre
With massive data, there are no sampling errors: statistical tests and confidence intervals become useless. Generative models are often less important than predictive models. Closed-form and parsimonious models are replaced by algorithms. Statistical Learning Theory, initiated by V. Vapnik and the late A. Chervonenkis, provides the conceptual framework for machine learning algorithms. The use of black-box models, including ensemble models, is a challenge for scientific users since they are difficult to interpret. We will conclude with the necessity of combining statistical and machine learning tools with causal inference to get better predictions and avoid the confusion between correlation and causality.
14:15
Advances and open questions for neural networks
Jérémie Mary (University of Lille)
14:15 - 15:00
Room: Grand Amphithéâtre
Since 2010, under the name "Deep Learning", neural networks have become increasingly popular and have achieved success in a wide range of applications: computer Go, image and sound categorization, dialog, and more. This tutorial gives an overview of the underlying techniques, including stochastic gradient descent and convolutional networks. Some links with wavelet decompositions and some open questions will be presented, as well as demonstrations on pictures and texts.
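As a minimal illustration of stochastic gradient descent, one of the techniques named in this abstract, the sketch below fits a one-variable linear model by updating the parameters on one randomly chosen example at a time. The toy data, the learning rate and the function name `sgd_linear_fit` are assumptions made for this example and are not taken from the tutorial.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)               # visit the examples in random order
        for i in indices:
            err = (w * xs[i] + b) - ys[i]  # residual on a single example
            w -= lr * err * xs[i]          # gradient of 0.5 * err**2 w.r.t. w
            b -= lr * err                  # gradient of 0.5 * err**2 w.r.t. b
    return w, b

# Noise-free toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Neural network training applies the same update rule to many more parameters, with the per-example gradients obtained by backpropagation.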
15:00
Reuse of big data in healthcare: presentation, transformation and analysis of the data extracted from electronic health records
Emmanuel Chazard (Université Lille 2)
15:00 - 15:45
Room: Grand Amphithéâtre
Routine care of hospitalized patients generates huge amounts of data, which are then stored. Typical datasets consist of medico-administrative data, including encoded diagnoses and procedures, laboratory results, drug administrations and free-text reports. Exploiting those data raises issues of data quality, confidentiality, data aggregation, and expert interpretation. Due to the structure of those data (for instance, each inpatient stay may have 1 to n diagnostic codes, among about 35,000 possible codes), the data aggregation process has a critical impact on the analysis. This aggregation requires skills in programming and statistics, but also a deep knowledge of the data collection process and of medical analysis. This presentation will also show 3 examples of successful data mining and data reuse: adverse drug event detection and prevention, scheduling of patient admissions in elective surgery, and hospital billing improvement.
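The aggregation step described in this abstract can be pictured with a small sketch: a long-format extract with 1 to n diagnostic codes per stay is collapsed into one row of binary indicators per stay. The stay identifiers, the codes, the grouping into categories and the function name `aggregate_stays` are all hypothetical and chosen only for illustration; they do not come from the presentation.

```python
from collections import defaultdict

# Hypothetical long-format extract: one row per (stay, diagnostic code).
rows = [
    ("stay1", "I10"), ("stay1", "E11"), ("stay1", "N18"),
    ("stay2", "E11"),
    ("stay3", "I10"), ("stay3", "I50"),
]

# Hypothetical grouping of raw codes into coarser analysis categories.
code_group = {"I10": "cardiovascular", "I50": "cardiovascular",
              "E11": "diabetes", "N18": "renal"}

def aggregate_stays(rows, code_group):
    """Return one dict of 0/1 category indicators per stay."""
    groups = sorted(set(code_group.values()) | {"other"})
    per_stay = defaultdict(set)
    for stay_id, code in rows:
        # Codes missing from the mapping fall into the "other" category.
        per_stay[stay_id].add(code_group.get(code, "other"))
    return {stay: {g: int(g in cats) for g in groups}
            for stay, cats in per_stay.items()}

features = aggregate_stays(rows, code_group)
```

The choice of grouping determines which variables the analysis can see, which is one way to read the abstract's remark that aggregation has a critical impact on the results.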
15:45
Break
15:45 - 16:05
Room: Grand Amphithéâtre
16:05
Machine Learning approaches for stock management in the retail industry
Manuel Davy (Vékia)
16:05 - 16:50
Room: Grand Amphithéâtre
16:50
Big Data, myths & opportunities for the consumer finance industry
Khalid Saad-Zaghloul (BNP Paribas)
Iuri Paixao (BNP Paribas)
16:50 - 17:35
Room: Grand Amphithéâtre
The digital edge, which offers access to a wide variety of structured and unstructured data in large volumes, is transforming the consumer finance industry. BNP Paribas Personal Finance, the European leader of the industry and the introducer of scoring techniques in Europe, is engaged in this transformation. The presentation will start with a vision of the digital transformation of our processes: how data management (in the sense of processing, modelling and operational use) is the strategic lever, and how the alliance between technology and analytics can better sustain business development. Beyond the buzz, the presentation will focus on the business opportunities that Big Data techniques offer the consumer finance industry.
18:00
Dinner
18:00 - 20:00
Room: Grand Amphithéâtre
Friday, 10 June 2016
09:00
What can we learn from modelling millions of patient records? A machine learning perspective
Norman Poh (University of Surrey)
09:00 - 09:45
Room: Grand Amphithéâtre
Increasing healthcare costs, coupled with an ageing population in both the developing and developed worlds, mean that it is important to understand disease demographic profiles in order to better optimize resources for quality health and care. Using Chronic Kidney Disease (CKD) as a case study, I will present challenges related to understanding, modelling and predicting the progression of CKD, and how machine learning techniques can be used to address them. Examples include calibration of the estimated Glomerular Filtration Rate (eGFR), modelling of eGFR, automatic selection of clinically relevant variables, and nonlinear dimensionality reduction for data discovery.
09:45
Invariance principles for robust learning. An illustration with recurrent neural networks
Yann Ollivier (Paris-Sud University)
09:45 - 10:30
Room: Grand Amphithéâtre
The optimization methods used to learn models of data are often not invariant under simple changes in the representation of data or of intermediate variables. For instance, for neural networks, using neural activities in [0;1] or in [-1;1] can lead to very different final performance even though the two representations are isomorphic. Here we show how information theory, together with a Riemannian geometric viewpoint emphasizing independence from the details of data representation, leads to new, scalable algorithms for training models of sequential data, which detect more complex patterns and use fewer training samples. For the talk, no familiarity will be assumed with Riemannian geometry, neural networks, information theory, or statistical learning.
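The isomorphism between the two activity ranges mentioned in this abstract can be made concrete with the identity tanh(x) = 2·sigmoid(2x) − 1, which maps a unit with values in (0, 1) to one with values in (−1, 1) by an affine change of variables. The sketch below only checks that identity numerically; it is not the training algorithm of the talk, and the function names are chosen for this example.

```python
import math

def sigmoid(x):
    """Logistic activation with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """Rescale a sigmoid unit to (-1, 1) via tanh(x) = 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Compare the rescaled sigmoid with math.tanh at a few arbitrary points.
points = [-3.0, -0.5, 0.0, 0.7, 2.5]
max_gap = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in points)
```

Because the two parameterizations describe the same family of functions, any difference in final performance has to come from the optimization procedure, which is the lack of invariance the abstract refers to.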
10:30
Break
10:30 - 10:50
Room: Grand Amphithéâtre
10:50
High-dimensional data classification with mixtures of sphere-hardening distances
Alejandro Murua (Université de Montréal)
10:50 - 11:35
Room: Grand Amphithéâtre
We develop a classification model for high-dimensional data that takes into account two main problems in high dimensions: the curse of dimensionality and the empty space phenomenon. We overcome these obstacles by modeling the distribution of distances involving feature vectors instead of modeling directly the distribution of feature vectors. The model is based on the sphere-hardening result, which states that, in high dimensions, data cluster in shells. Based on asymptotics on the dimension parameter, we show that under simple sampling conditions the distances of data points to their means are distributed as a variant of generalized gamma variables. We propose using mixtures of these distributions for both supervised and unsupervised classification of high-dimensional data. The paradigm is extended to low-dimensional data by embedding the data into higher-dimensional spaces by means of the kernel trick. Part of this work (a) has been done in collaboration with Bertrand Saulnier (Université de Montréal) and Nicolas Wicker (Université de Lille 1; Murua and Wicker, 2014), and (b) was inspired by a conversation with François Léonard (Hydro-Québec; Leonard and Gauvin, 2013).
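The sphere-hardening effect named in this abstract can be observed in a small simulation: for standard Gaussian vectors, the distance to the mean concentrates around the square root of the dimension, so its relative spread shrinks as the dimension grows. The Gaussian sampling model, the sample sizes and the function name `relative_spread_of_norms` are assumptions made for this sketch; it does not implement the generalized gamma mixture model of the talk.

```python
import math
import random
import statistics

def relative_spread_of_norms(dim, n_points=500, seed=0):
    """Standard deviation of ||x|| divided by its mean, for x ~ N(0, I_dim)."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
             for _ in range(n_points)]
    return statistics.stdev(norms) / statistics.mean(norms)

spread_low = relative_spread_of_norms(dim=2)     # distances vary widely
spread_high = relative_spread_of_norms(dim=500)  # distances sit in a thin shell
```

In low dimension the distances are widely dispersed, while in dimension 500 they lie in a thin shell, which is what motivates modeling distances rather than the feature vectors themselves.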
11:35
Construction of tight wavelet-like frames on graphs for denoising
Gilles Blanchard (University of Potsdam)
11:35 - 12:20
Room: Grand Amphithéâtre
We construct a frame (redundant dictionary) for the space of real-valued functions defined on a neighborhood graph constructed from data points. This frame is adapted to the underlying geometrical structure (e.g. the points belong to an unknown low-dimensional manifold), has finitely many elements, and these elements are localized in frequency as well as in space. This construction follows the ideas of Hammond et al. (2011), with the key point that we construct a tight (or Parseval) frame. This means we have a very simple, explicit reconstruction formula for every function f defined on the graph from the coefficients given by its scalar product with the frame elements. We use this representation in the setting of denoising, where we are given noisy observations of a function f defined on the graph. By applying a thresholding method to the coefficients in the reconstruction formula, we define an estimate of f whose risk satisfies a tight oracle inequality.
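The reconstruction formula for a Parseval frame, f = Σ_k ⟨f, ψ_k⟩ ψ_k, can be checked on a toy case. The sketch below uses three vectors of the plane at 120 degrees from each other, scaled by √(2/3), which form a Parseval frame of R²; this is a stand-in chosen for brevity and is not the graph construction of the talk. The use of soft thresholding and the threshold value are likewise assumptions, since the abstract does not specify the thresholding rule.

```python
import math

# Three unit vectors at 120 degrees, scaled by sqrt(2/3): a Parseval frame of R^2.
SCALE = math.sqrt(2.0 / 3.0)
FRAME = [(SCALE * math.cos(t), SCALE * math.sin(t))
         for t in (math.pi / 2, 7 * math.pi / 6, 11 * math.pi / 6)]

def analyze(f):
    """Frame coefficients: scalar products of f with each frame element."""
    return [f[0] * p[0] + f[1] * p[1] for p in FRAME]

def synthesize(coeffs):
    """Reconstruction formula of a Parseval frame: sum of coeff * element."""
    return tuple(sum(c * p[i] for c, p in zip(coeffs, FRAME)) for i in range(2))

def soft_threshold(c, t):
    """Shrink a coefficient toward zero by t (assumed thresholding rule)."""
    return math.copysign(max(abs(c) - t, 0.0), c)

f = (1.5, -0.5)
reconstructed = synthesize(analyze(f))  # equals f up to rounding
denoised = synthesize([soft_threshold(c, 0.1) for c in analyze(f)])
```

No dual frame or matrix inversion is needed for the reconstruction, which is the practical benefit of tightness that the abstract highlights.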
14:00
On the Properties of Variational Approximations of Gibbs Posteriors
Pierre Alquier (ENSAE)
14:00 - 14:45
Room: Grand Amphithéâtre
PAC-Bayesian bounds are useful tools to control the prediction risk of aggregated estimators. When dealing with the exponentially weighted aggregate (EWA), these bounds lead in some settings to the proof that the predictions are minimax-optimal. EWA is usually computed through Monte Carlo methods. However, in many practical applications, the computational cost of Monte Carlo methods is prohibitive. It is thus tempting to replace these by (faster) optimization algorithms that aim at approximating EWA: we will refer to these methods as variational Bayes (VB) methods. In this talk I will show, thanks to a PAC-Bayesian theorem, that VB approximations are well founded, in the sense that the loss incurred in terms of prediction risk is negligible in some classical settings such as linear classification and ranking. These approximations are implemented in the R package pacvb (written by James Ridgway), which I will briefly introduce. I will especially emphasize the proof of the PAC-Bayesian theorem in order to explain how this result can be extended to other settings.
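To fix ideas about the exponentially weighted aggregate, the sketch below computes it for a finite set of candidate predictors, where each weight is proportional to the prior weight times exp(−λ × empirical risk). The finite candidate set, the risk values, the parameter name `lam` and the function names are assumptions made for illustration. In the continuous settings addressed by the talk these weights cannot be enumerated, which is why Monte Carlo or variational approximations are needed.

```python
import math

def ewa_weights(risks, lam, prior=None):
    """Gibbs weights: w_i proportional to prior_i * exp(-lam * risk_i)."""
    n = len(risks)
    prior = prior or [1.0 / n] * n
    r_min = min(risks)  # shift by the minimum risk for numerical stability
    unnorm = [p * math.exp(-lam * (r - r_min)) for p, r in zip(prior, risks)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def ewa_predict(predictions, weights):
    """Aggregate the candidates' predictions with the given weights."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical empirical risks of three candidate predictors.
risks = [0.30, 0.10, 0.50]
weights = ewa_weights(risks, lam=10.0)
aggregate = ewa_predict([1.0, 2.0, 3.0], weights)
```

Setting `lam` to 0 returns the prior weights, while a large `lam` concentrates the weight on the candidate with the smallest empirical risk.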
14:45
Stochastic optimization and high-dimensional sampling: when Moreau inf-convolution meets Langevin diffusion
Eric Moulines (Télécom ParisTech)
14:45 - 15:30
Room: Grand Amphithéâtre
Recently, the problem of designing MCMC samplers adapted to high-dimensional Bayesian inference with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems, and aggregation of experts, among others. When the density is L-smooth (the log-density is continuously differentiable and its derivative is Lipschitz), we will advocate the use of a "rejection-free" algorithm, based on the Euler discretization of the Langevin diffusion with either constant or decreasing stepsizes. We will present several new results establishing convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillations on a compact set and super-exponential in the tails, to strong concavity). When the log-density is not smooth (a problem which typically appears when using sparsity-inducing priors, for example), we still suggest using an Euler discretization, but of the Langevin diffusion associated with the Moreau envelope of the nonsmooth part of the log-density. An importance sampling correction may later be applied to correct the target. Several numerical illustrations will be presented to show that this algorithm (named MYULA) can be used in practice in a high-dimensional setting. Finally, nonasymptotic convergence bounds (in total variation and Wasserstein distances) are derived.
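A one-dimensional sketch of the scheme described in this abstract may help. It targets a density proportional to exp(−x²/2 − |x|), takes an Euler step on the smooth part x²/2, and replaces the gradient of the nonsmooth part |x| by the gradient of its Moreau envelope, (x − prox(x))/λ, where the proximal operator of |x| is soft thresholding. The target, the step size `gamma`, the smoothing parameter `lam` and the function names are assumptions made for this example. The importance sampling correction mentioned in the abstract is omitted.

```python
import math
import random

def soft_threshold(x, lam):
    """Proximal operator of lam * |x|."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def myula_chain(n_steps, gamma=0.02, lam=0.1, seed=0):
    """Unadjusted Langevin steps on x**2 / 2 plus the Moreau envelope of |x|."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        grad_smooth = x                                 # gradient of x**2 / 2
        grad_env = (x - soft_threshold(x, lam)) / lam   # gradient of the envelope
        noise = math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
        x = x - gamma * (grad_smooth + grad_env) + noise  # no accept/reject step
        samples.append(x)
    return samples

samples = myula_chain(20000)
sample_mean = sum(samples) / len(samples)
```

Each iteration costs one gradient and one proximal evaluation and never rejects a move, at the price of a bias that depends on `gamma` and `lam`.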
15:30
Break
15:30 - 15:50
Room: Grand Amphithéâtre
15:50
Approximate Bayesian inference for large datasets
Nial Friel (Dublin University)
15:50 - 16:35
Room: Grand Amphithéâtre
Light and Widely Applicable (LWA) MCMC is a novel approximation of the Metropolis-Hastings kernel targeting a posterior distribution defined on a large number of observations. Inspired by Approximate Bayesian Computation, we design a Markov chain whose transition makes use of an unknown but fixed fraction of the available data, where the random choice of subsample is guided by the fidelity of this subsample to the observed data, as measured by summary (or sufficient) statistics. LWA-MCMC is a generic and flexible approach, as illustrated by the diverse set of examples which we explore. In each case LWA-MCMC yields excellent performance and in some cases a dramatic improvement compared to existing methodologies.