Rencontres Statistiques Lyonnaises

Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics

by Julyan ARBEL (Inria Grenoble Rhône-Alpes, Mistis team)

Fokko du Cloux (Bâtiment Braconnier)


Description
- Part 1: A tour of Bayesian nonparametric models, from the Dirichlet process to some of its extensions
- Part 2: Bayesian nonparametric inference for discovery probabilities

The longstanding problem of discovery probabilities dates back to World War II, when Alan Turing was breaking the Axis forces' Enigma machine at Bletchley Park. The problem can be sketched simply as follows. An experimenter sampling units (say animals) from a population and recording their type (say species) asks: what is the probability that the next sampled animal belongs to a species already observed a given number of times, or that it is a newly discovered species? Applications are not limited to ecology but span bioinformatics, genetics, machine learning, multi-armed bandits, and so on.

In this talk I describe a Bayesian nonparametric (BNP) approach to the problem and compare it with the original and highly popular Good-Turing estimators. More specifically, I start by recalling some basics about the Dirichlet process, which is the cornerstone of the BNP paradigm. Then I present a closed-form expression for the posterior distribution of discovery probabilities, which naturally leads to simple credible intervals. Next I describe asymptotic approximations of the BNP estimators for large sample sizes, and I conclude by illustrating the proposed results on a benchmark genomic dataset of Expressed Sequence Tags.

(Joint work with Stefano Favaro (University of Torino), Bernardo Nipoti (Trinity College Dublin), and Yee Whye Teh (University of Oxford).)

Manuscript available at https://arxiv.org/abs/1506.04915
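To make the Good-Turing baseline mentioned above concrete, here is a minimal illustrative sketch (in Python, not taken from the talk) of the classical estimator: the probability that the next draw is of a species already observed exactly l times is estimated by (l+1) m_{l+1} / n, where m_j is the number of species seen exactly j times in a sample of size n, and l = 0 gives the probability of discovering a new species.

```python
from collections import Counter

def good_turing_discovery(sample, l=0):
    """Good-Turing estimate of the probability that the next draw
    is of a species observed exactly l times so far (l=0: new species).

    Classical formula: (l+1) * m_{l+1} / n, where m_j is the number of
    species seen exactly j times and n is the current sample size.
    """
    n = len(sample)
    species_counts = Counter(sample)        # occurrences of each species
    m = Counter(species_counts.values())    # m[j] = number of species seen j times
    return (l + 1) * m.get(l + 1, 0) / n

# Toy usage: 10 sampled animals from 6 species
sample = ["a", "a", "a", "b", "b", "c", "d", "e", "e", "f"]
print(good_turing_discovery(sample, l=0))   # 0.3: next animal is a new species
print(good_turing_discovery(sample, l=1))   # 0.4: next animal is a species seen once
```

The BNP estimators discussed in the talk replace these purely frequency-based quantities with posterior expectations under a Dirichlet-process-type prior, which is what makes credible intervals available; the exact expressions are given in the manuscript linked above.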