Poster in Workshop: 2nd Generative AI for Biology Workshop
Self-Supervised Representation Learning for Microbiome Improves Downstream Prediction in Data-Limited Settings and Cross-Cohort Generalizability
Liron Zahavi · Zachary Levine · Eran Segal
Keywords: [ representation learning ] [ limited data ] [ cross-domain transfer ] [ masked autoencoders ] [ cross-cohort generalization ] [ metagenomic data ] [ microbiome ] [ biological data ] [ self-supervised learning ]
The gut microbiome plays a crucial role in human health, but machine learning applications face significant challenges due to limited data availability, high dimensionality, and batch effects across cohorts. We developed self-supervised representation learning methods for gut microbiome metagenomic data by implementing multiple approaches on 85,364 samples, including masked autoencoders and novel cross-domain adaptation of single-cell RNA sequencing models. Systematic benchmarking against the standard practice in microbiome machine learning demonstrated significant advantages of our learned representations in limited-data scenarios, improving prediction for age (r = 0.14 vs. 0.06), Body Mass Index (r = 0.16 vs. 0.11), and drug usage (PR-AUC = 0.81 vs. 0.73). Cross-cohort generalization was enhanced by up to 81%, addressing transferability challenges across different populations and technical protocols. Our approach provides a valuable framework for overcoming data limitations in microbiome research, with particular potential for the many clinical and intervention studies that operate with small cohorts.
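As a rough illustration of the masked-autoencoder idea mentioned in the abstract, the sketch below hides a random subset of features in a toy abundance profile and scores a reconstruction only on the hidden positions. The abstract does not describe the authors' architecture, masking ratio, or loss, so everything here is an assumption made for illustration: the function names `mask_features` and `masked_mse`, the 30% masking ratio, the zero mask value, the squared-error loss, and the toy abundance values are all hypothetical.

```python
import random


def mask_features(abundances, mask_ratio=0.3, mask_value=0.0, rng=None):
    """Hide a random subset of features.

    Returns the corrupted vector and the sorted indices that were hidden.
    mask_ratio and mask_value are illustrative choices, not values from the paper.
    """
    rng = rng or random.Random()
    n_masked = max(1, round(mask_ratio * len(abundances)))
    masked_idx = sorted(rng.sample(range(len(abundances)), n_masked))
    corrupted = list(abundances)
    for i in masked_idx:
        corrupted[i] = mask_value
    return corrupted, masked_idx


def masked_mse(reconstruction, target, masked_idx):
    """Mean squared error computed only over the hidden positions."""
    total = sum((reconstruction[i] - target[i]) ** 2 for i in masked_idx)
    return total / len(masked_idx)


# Toy relative-abundance profile (hypothetical values, not data from the study).
sample = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02, 0.00, 0.00, 0.00]
corrupted, masked_idx = mask_features(sample, mask_ratio=0.3, rng=random.Random(0))

# Stand-in for an encoder-decoder network: it simply echoes the corrupted input,
# so the loss reflects how much signal was hidden rather than any learned model.
reconstruction = list(corrupted)
loss = masked_mse(reconstruction, sample, masked_idx)
```

In an actual pretraining loop, a trained encoder-decoder would replace the echo stand-in, and the encoder's output would serve as the sample representation for downstream prediction.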