Poster in Workshop: 2nd Generative AI for Biology Workshop

Self-Supervised Representation Learning for Microbiome Improves Downstream Prediction in Data-Limited Settings and Cross-Cohort Generalizability

Liron Zahavi · Zachary Levine · Eran Segal

Keywords: [ representation learning ] [ limited data ] [ cross-domain transfer ] [ masked autoencoders ] [ cross-cohort generalization ] [ metagenomic data ] [ microbiome ] [ biological data ] [ self-supervised learning ]


Abstract:

The gut microbiome plays a crucial role in human health, but machine learning applications face significant challenges due to limited data availability, high dimensionality, and batch effects across cohorts. We developed self-supervised representation learning methods for gut microbiome metagenomic data by implementing multiple approaches on 85,364 samples, including masked autoencoders and novel cross-domain adaptation of single-cell RNA sequencing models. Systematic benchmarking against the standard practice in microbiome machine learning demonstrated significant advantages of our learned representations in limited-data scenarios, improving prediction for age (r = 0.14 vs. 0.06), Body Mass Index (r = 0.16 vs. 0.11), and drug usage (PR-AUC = 0.81 vs. 0.73). Cross-cohort generalization was enhanced by up to 81%, addressing transferability challenges across different populations and technical protocols. Our approach provides a valuable framework for overcoming data limitations in microbiome research, with particular potential for the many clinical and intervention studies that operate with small cohorts.
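
As a rough illustration of the masked-autoencoder idea mentioned in the abstract, the sketch below hides a random subset of features in a toy abundance matrix and scores a reconstruction only on the hidden entries. The abstract does not describe the masking scheme, loss, or model architecture, so everything here is an assumption for illustration and not the authors' implementation: the function names `mask_features` and `masked_mse`, the 30% mask fraction, the zero-fill masking, and the random toy data are all hypothetical.

```python
import numpy as np


def mask_features(x, mask_fraction=0.15, rng=None):
    """Zero out a random subset of entries (assumed masking scheme).

    Returns the masked copy and a boolean array marking masked entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < mask_fraction
    x_masked = np.where(mask, 0.0, x)
    return x_masked, mask


def masked_mse(x_true, x_pred, mask):
    """Mean squared error computed over masked entries only."""
    if not mask.any():
        return 0.0
    diff = (x_true - x_pred)[mask]
    return float(np.mean(diff ** 2))


# Toy data: 4 samples x 10 taxa with random values (hypothetical, not real abundances).
rng = np.random.default_rng(0)
abundances = rng.random((4, 10))

x_masked, mask = mask_features(abundances, mask_fraction=0.3, rng=rng)

# A stand-in "model" that returns its masked input unchanged gets a nonzero loss
# wherever the hidden values were nonzero; a perfect reconstruction scores zero.
loss_identity = masked_mse(abundances, x_masked, mask)
loss_perfect = masked_mse(abundances, abundances, mask)
```

In a self-supervised setup of this kind, an encoder-decoder network would be trained to minimize such a masked loss, and the encoder output would then serve as the learned representation for downstream prediction.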
