ICML Poster scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data

Spotlight Poster

Olga Ovcharenko · Florian Barkmann · Philip Toma · Imant Daunhawer · Julia Vogt · Sebastian Schelter · Valentina Boeva

West Exhibition Hall B2-B3 #W-311

Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Self-supervised learning (SSL) has proven to be a powerful approach for extracting biologically meaningful representations from single-cell data. To advance our understanding of SSL methods applied to single-cell data, we present scSSL-Bench, a comprehensive benchmark that evaluates nineteen SSL methods. Our evaluation spans nine datasets and focuses on three common downstream tasks: batch correction, cell type annotation, and missing modality prediction. Furthermore, we systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks scVI and CLAIRE and the fine-tuned scGPT excel at uni-modal batch correction, while generic SSL methods, such as VICReg and SimCLR, demonstrate superior performance in cell typing and multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks, surpassing domain-specific augmentations. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.
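
To illustrate the random masking augmentation named in the abstract, here is a minimal Python sketch that zeroes a random subset of entries in a cell-by-gene expression matrix and produces two differently masked views of the same cells, as contrastive SSL methods typically require. The function names `random_mask` and `two_views`, the default mask fraction of 0.2, and the use of NumPy are assumptions made for illustration; this is not the benchmark's implementation.

```python
import numpy as np


def random_mask(expression, mask_fraction=0.2, rng=None):
    """Return a copy of `expression` with random entries set to zero.

    expression: array of shape (n_cells, n_genes).
    mask_fraction: hypothetical probability of zeroing each entry.
    """
    rng = np.random.default_rng() if rng is None else rng
    expression = np.asarray(expression, dtype=float)
    # Independent keep/drop decision for every (cell, gene) entry.
    keep = rng.random(expression.shape) >= mask_fraction
    return expression * keep


def two_views(expression, mask_fraction=0.2, seed=0):
    """Create two independently masked views of the same cells."""
    rng = np.random.default_rng(seed)
    view_a = random_mask(expression, mask_fraction, rng)
    view_b = random_mask(expression, mask_fraction, rng)
    return view_a, view_b


if __name__ == "__main__":
    # Toy count matrix: 4 cells, 1000 genes (synthetic, for illustration only).
    counts = np.random.default_rng(1).poisson(2.0, size=(4, 1000))
    a, b = two_views(counts)
    print(a.shape, b.shape)
```

The two views share the same underlying cells but differ in which entries are hidden, so an encoder trained to map them to similar embeddings cannot rely on any single gene.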

Lay Summary:

Scientists today study individual cells in unprecedented detail, but analyzing this complex data requires sophisticated methods. Researchers have struggled to identify which methods work best for which tasks. This study created a comprehensive benchmark, scSSL-Bench, that compares nineteen machine learning (ML) methods across nine datasets containing different modalities of cell data and focuses on three key tasks: fixing technical errors when combining data from different experiments, identifying cell types, and filling in missing modalities. We discovered that specialized cell analysis methods excel at fixing technical errors, while general-purpose ML approaches perform better at identifying cell types. By providing evidence-based recommendations for choosing the right cell analysis methods and identifying gaps where new specialized methods are needed, this work enhances reproducibility and gives scientists a roadmap for more effective single-cell analysis, ultimately accelerating discoveries in medicine and biology.
