Research
Methodology and theory
Model-free predictive inference
An important objective of statistics is to predict future outcomes of a phenomenon given relevant past observations. Most existing approaches to this problem either rely on relatively simple parametric models (whose validity is often hard to justify in practice) or are based on complex algorithms that offer no statistical guarantees (and tend to be overconfident about the accuracy of their predictions). I am interested in developing methods for model-free predictive inference that are both broadly applicable and as efficient as possible. I pursue this goal by combining ideas based on statistical exchangeability (e.g., conformal inference) with machine learning.
Conformal prediction intervals for two test images from the CIFAR-10 data set, with their top two estimated class probabilities computed by the output softmax layer of convolutional neural networks trained to minimize different loss functions. Left: intact image of a ship. Right: corrupted image of a dog. Image from Einbinder, Romano, Sesia, and Zhou (2022).
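To make the general recipe concrete, here is a minimal sketch of split conformal classification, assuming access to estimated class probabilities such as a network's softmax output; the function names and the simple conformity score below are illustrative choices, not the specific method of any paper listed here.

```python
# A minimal sketch of split conformal classification (illustrative names;
# assumes the calibration data are exchangeable with the test data).
import numpy as np

def conformal_prediction_sets(probs_calib, y_calib, probs_test, alpha=0.1):
    """Prediction sets with marginal coverage at least 1 - alpha.

    probs_calib: (n, K) estimated class probabilities for calibration data
    y_calib:     (n,) integer calibration labels
    probs_test:  (m, K) estimated class probabilities for test data
    """
    n = len(y_calib)
    # Conformity score: one minus the probability assigned to the true label.
    scores = 1.0 - probs_calib[np.arange(n), y_calib]
    # Finite-sample-adjusted empirical quantile of the calibration scores
    # (assumes n is large enough that ceil((n + 1) * (1 - alpha)) <= n).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, n) - 1]
    # Keep every label whose score is within the calibrated threshold.
    return [np.where(1.0 - p <= q)[0] for p in probs_test]
```

The coverage guarantee is marginal and holds for exchangeable data regardless of how well the probabilities are estimated; poor estimates simply yield larger prediction sets.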
Relevant papers:
- Collective outlier detection
- Conformal classification with equalized coverage for adaptively selected groups
- Structured matrix completion
- Forecasting heterogeneous trajectories
- Adaptive conformal classification with noisy labels
- De-randomized FDR control with conformal e-values
- Conformalized early stopping
- Conformalized learning
- Conformalized sketching with coverage for distinct queries
- Conformalized sketching
- Integrative conformal p-values
- Conformal prediction using conditional histograms
- Testing for outliers with conformal p-values
- Valid and adaptive classification
- A comparison of conformal quantile regression methods.
High-dimensional variable selection
Modern data sets often measure thousands of possible explanatory variables that one may want to leverage to explain a particular phenomenon of interest, even though only a small subset of them may be relevant. The challenge for statisticians is to identify, with confidence, which of these variables are truly important, even when the number of observations is smaller than the number of variables. I am broadly interested in developing methods that perform statistically principled variable selection for high-dimensional data without relying on unrealistic assumptions. If you would like to learn more about this line of research, a good starting point is to read about knockoffs.
Feature importance measures for unimportant explanatory variables (left) and hidden Markov model knockoffs (right). Knockoffs allow one to single out truly important variables while controlling the false discovery rate. Image from Sesia, Sabatti, and Candès (2019).
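For a flavor of how knockoffs translate into selections, the sketch below implements the final thresholding step of the knockoff filter; it assumes importance statistics W (symmetric in sign for null variables) have already been computed by some upstream method, which is where the knockoff construction itself comes in.

```python
# A sketch of the knockoff filter's selection step (knockoff+ for offset=1).
# Input: statistics W, one per variable, with large positive values
# suggesting importance and signs symmetric for null variables.
import numpy as np

def knockoff_filter(W, q=0.1, offset=1):
    """Return indices of selected variables, controlling the FDR at level q."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the nonzero magnitudes of the statistics.
    for t in np.sort(np.abs(W[W != 0])):
        # Estimated false discovery proportion at threshold t.
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)  # no threshold works: select nothing
```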
Relevant papers:
- Subgroup-selective knockoff filter
- Transfer learning with knockoffs
- Multi-environment knockoff filter
- Deep knockoffs
- Hidden Markov model knockoffs.
Other more recent topics
Estimation of coverage probabilities and distinct counts from sketched data (a minimal sketching example follows this list):
- BNP frequency recovery from sketched data
- [Frequency and cardinality recovery from sketched data](/publication/2023-frequency-recovery)
- Conformalized sketching under relaxed exchangeability
- Conformalized sketching
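For context, below is a bare-bones count-min sketch, the kind of compressed data summary these papers recover frequencies and counts from; the hashing scheme is a simplified stand-in for proper pairwise-independent hash functions.

```python
# A minimal count-min sketch (illustrative hashing, not production-grade).
import numpy as np

class CountMinSketch:
    def __init__(self, width=2048, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        # One random seed per row, standing in for independent hash functions.
        self.seeds = [int(s) for s in rng.integers(0, 2**31 - 1, size=depth)]

    def _columns(self, item):
        return [hash((s, item)) % self.width for s in self.seeds]

    def add(self, item):
        for row, col in enumerate(self._columns(item)):
            self.table[row, col] += 1

    def query(self, item):
        # Deterministic upper bound on the true count; quantifying the
        # random over-count is the statistical problem studied above.
        return int(min(self.table[row, col]
                       for row, col in enumerate(self._columns(item))))

# Example: after adding "a", "b", "a", query("a") returns at least 2.
```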
Methodology and applications
Statistical genetics
Genome-wide association studies measure hundreds of thousands of genetic variants across the entire genome, from large numbers of people, and relate them to phenotypes of interest (e.g., blood pressure, cholesterol levels, diabetes, and many other diseases), with the goal of better understanding the underlying biology and heritability. From a statistician’s perspective, this problem can at first be seen as a special instance of high-dimensional variable selection, although genetic data are extremely high-dimensional and display a particular structure (well described by hidden Markov models) that raises unique challenges as well as opportunities.
One of my main contributions in this field was the development of KnockoffGWAS, a powerful and versatile statistical method for the analysis of genome-wide association data with population structure.
Visualization of a hidden Markov model for the genetic variables of an offspring conditional on the DNA of the parents. This model can be used to generate synthetic data for rigorous model-free inference. Image from Bates, Sesia, Candès, and Sabatti (2020).
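As a generic illustration of this generative building block, here is a forward sampler for a discrete hidden Markov model; the parameters below are placeholders, not a calibrated model of genotypes.

```python
# A generic forward sampler for a discrete hidden Markov model
# (placeholder parameters; real haplotype models add much more structure).
import numpy as np

def sample_hmm(init, trans, emit, length, rng=None):
    """Sample one sequence of observations from a discrete HMM.

    init:  (K,) initial distribution over hidden states
    trans: (K, K) transition matrix, rows summing to one
    emit:  (K, M) emission probabilities over M observable symbols
    """
    rng = rng or np.random.default_rng()
    states = np.empty(length, dtype=int)
    obs = np.empty(length, dtype=int)
    states[0] = rng.choice(len(init), p=init)
    for t in range(1, length):
        states[t] = rng.choice(trans.shape[1], p=trans[states[t - 1]])
    for t in range(length):
        obs[t] = rng.choice(emit.shape[1], p=emit[states[t]])
    return obs

# Example with two hidden states and binary observations:
# obs = sample_hmm(np.array([0.5, 0.5]),
#                  np.array([[0.9, 0.1], [0.1, 0.9]]),
#                  np.array([[0.8, 0.2], [0.2, 0.8]]), length=100)
```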
Relevant papers:
- Transfer learning with knockoffs
- Multi-environment knockoff filter
- KnockoffGWAS
- KnockoffZoom
- Causal inference from trio data
- Hidden Markov model knockoffs.
Collaborations
I enjoy collaborating on applied data science and statistics problems.
Interpretable classification of bacterial Raman spectra using knockoff-filtered wavelet features. Image from Chia et al. (2021).
Relevant papers:
- Robotic-assisted esophagectomy
- Circulating tumor cells
- Bacterial classification from Raman spectra
- Hyperoxemia among pediatric intensive care unit patients
Language models
Uncertainty estimation in language models is a more recent interest of mine.
Relevant papers: