Methodology and theory

Model-free predictive inference

An important objective of statistics is to predict future outcomes of a phenomenon given relevant past observations. Most existing approaches to this problem either rely on relatively simple parametric models (whose validity is often hard to justify in practice) or are based on complex algorithms that offer no statistical guarantees (and tend to be overconfident about the accuracy of their predictions). I am interested in developing methods for model-free predictive inference that are both generally applicable and as efficient as possible. I pursue this goal by combining statistical exchangeability ideas (e.g., conformal inference) with machine learning.
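To make the exchangeability idea concrete, here is a minimal sketch of split conformal prediction for regression. It is a generic textbook illustration rather than the specific method of any paper listed on this page. The function names, the least-squares base model, and the simulated data are all assumptions chosen for the example; any predictive model could be plugged in instead.

```python
import numpy as np


def fit_least_squares(X, y):
    """Hypothetical base model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta


def split_conformal_interval(fit, X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Prediction intervals with marginal coverage of at least 1 - alpha,
    assuming calibration and test points are exchangeable."""
    predict = fit(X_train, y_train)
    # Conformity scores: absolute residuals on held-out calibration data.
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # Finite-sample corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else np.sort(scores)[k - 1]
    center = predict(X_test)
    return center - q, center + q


# Simulated data (an assumption for illustration): a linear signal with
# heavy-tailed noise, so Gaussian-based intervals would be unreliable.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_t(df=3, size=3000)

X_train, y_train = X[:500], y[:500]
X_cal, y_cal = X[500:1000], y[500:1000]
X_test, y_test = X[1000:], y[1000:]

lower, upper = split_conformal_interval(
    fit_least_squares, X_train, y_train, X_cal, y_cal, X_test, alpha=0.1
)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}")
```

The coverage guarantee does not depend on the base model being correct: a poor model only makes the intervals wider, not invalid.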

Conformal prediction intervals for two test images from the CIFAR-10 data set, with their top two estimated class probabilities computed by the output softmax layer of convolutional neural networks trained to minimize different loss functions. Left: intact image of a ship. Right: corrupted image of a dog. Image from Einbinder, Romano, Sesia, and Zhou (2022).

Relevant papers:

High-dimensional variable selection

Modern data sets often measure thousands of possible explanatory variables, which one would want to leverage to explain a particular phenomenon of interest, although only a small subset of them may be relevant. The challenge for statisticians is to identify which of these variables are truly important by analyzing the data, even though the number of observations may be smaller than the number of variables, and to do so with confidence. I am broadly interested in developing methods that can perform statistically principled variable selection for high-dimensional data, without relying on unrealistic assumptions. If you would like to learn more about this line of research, a good starting point would be to read about knockoffs.

Feature importance measures for unimportant explanatory variables (left) and hidden Markov model knockoffs (right). Knockoffs allow one to single out truly important variables while controlling the false discovery rate. Image from Sesia, Sabatti, and Candès (2019).
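Once a knockoff statistic has been computed for each variable, the selection step that controls the false discovery rate is short enough to sketch. The code below implements the standard "knockoff+" threshold, assuming the statistics `W` are already available (large positive values suggest a variable beats its knockoff copy). Constructing the knockoffs themselves is the hard part and is not shown. The function names and the numbers in `W` are made up for illustration.

```python
import numpy as np


def knockoff_plus_threshold(W, q=0.1):
    """Smallest threshold t at which the estimated false discovery
    proportion (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) is at most q."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold is acceptable: select nothing


def knockoff_select(W, q=0.1):
    """Indices of the variables whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))


# Hypothetical statistics: ten strong signals followed by four small,
# sign-symmetric values of the kind expected from unimportant variables.
W = np.array([5, 4, 3.5, 3, 2.5, 2.2, 2, 1.8, 1.5, 1.2, -0.3, 0.2, -0.1, 0.4])
threshold = knockoff_plus_threshold(W, q=0.2)
selected = knockoff_select(W, q=0.2)
print(threshold, selected)
```

The negative statistics act as a built-in estimate of how many false positives sit above the threshold, which is why no p-values or model for the response are needed.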

Relevant papers:

Other more recent topics

Estimation of coverage probabilities and distinct counts from sketched data:

Methodology and applications

Statistical genetics

Genome-wide association studies measure hundreds of thousands of simple genetic mutations across the entire genome in large numbers of people and compare them to phenotypes of interest (e.g., blood pressure, cholesterol levels, diabetes, and many other traits and diseases), with the goal of better understanding the underlying biology and heritability. From a statistician’s perspective, this problem can at first be seen as a special instance of high-dimensional variable selection, although genetic data are extremely high-dimensional and display a particular dependence structure (well described by hidden Markov models) that raises unique challenges as well as opportunities.

One of my main contributions in this field was the development of KnockoffGWAS, a powerful and versatile statistical method for the analysis of genome-wide association data with population structure.

Visualization of a hidden Markov model for the genetic variables of an offspring conditional on the DNA of the parents. This model can be used to generate synthetic data for rigorous model-free inference. Image from Bates, Sesia, Candès, and Sabatti (2020).

Relevant papers:


I enjoy collaborating on applied data science and statistics problems.

Interpretable classification of bacterial Raman spectra using knockoff-filtered wavelet features. Image from Chia et al. (2021).

Relevant papers:

Language models

Uncertainty estimation in language models is a recent interest of mine, on which I have only just started working.

Relevant papers: