Poster in Workshop: 2nd Generative AI for Biology Workshop

SynPair: Pairing Unpaired Antibody Chains at Billion-Sequence Scale With Contrastive Learning

Ollie Turnbull · Charlotte Deane

Keywords: [ PLM ] [ synthetic data ] [ antibodies ] [ language model ]


Abstract:

Large-scale antibody sequence datasets, such as the Observed Antibody Space (OAS), contain billions of unpaired heavy (VH) and light (VL) chain sequences but fewer than 0.2% paired sequences, limiting the performance of antibody language models trained on these resources. Existing computational antibody pairing models, such as ImmunoMatch, achieve promising accuracy but rely on computationally intensive cross-encoder architectures, making large-scale synthetic pairing infeasible. Here, we reframe antibody chain pairing as a dense retrieval problem and introduce SynPair, a dual-encoder model trained with a contrastive InfoNCE loss that achieves state-of-the-art pairing accuracy while dramatically reducing computational requirements. SynPair can pair the entire unpaired OAS corpus (over 2 billion sequences) in less than 24 hours on standard HPC resources, a task that was previously computationally intractable. The synthetically paired libraries generated by SynPair closely match naturally occurring antibody pairing distributions, offering the potential for a biologically realistic, massively expanded paired dataset for antibody language model pre-training.
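
The dual-encoder idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the encoders, the temperature, or whether the loss is symmetric, so the function names (`info_nce_loss`, `retrieve_top1`), the temperature of 0.07, the symmetric form of the loss, and the use of in-batch negatives are all assumptions made for illustration. The inputs stand in for embeddings that separate VH and VL encoders would produce.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce_loss(vh_emb, vl_emb, temperature=0.07):
    # Row i of vh_emb and row i of vl_emb are assumed to be a true pair;
    # every other row in the batch acts as a negative (assumption).
    vh = l2_normalize(vh_emb)
    vl = l2_normalize(vl_emb)
    logits = vh @ vl.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(vh))
    loss_h2l = -log_softmax(logits, axis=1)[idx, idx].mean()  # heavy -> light
    loss_l2h = -log_softmax(logits, axis=0)[idx, idx].mean()  # light -> heavy
    return 0.5 * (loss_h2l + loss_l2h)        # symmetric form (assumption)

def retrieve_top1(vh_emb, vl_emb):
    # Dense retrieval: for each heavy chain, index of the most similar light chain.
    scores = l2_normalize(vh_emb) @ l2_normalize(vl_emb).T
    return scores.argmax(axis=1)

# Toy stand-ins for encoder outputs: four perfectly matched pairs.
vh = np.eye(4)
vl_aligned = np.eye(4)
vl_shifted = np.roll(np.eye(4), 1, axis=0)   # every pair deliberately mismatched

aligned_loss = info_nce_loss(vh, vl_aligned)
shifted_loss = info_nce_loss(vh, vl_shifted)
matches = retrieve_top1(vh, vl_aligned)
```

The efficiency argument follows from the structure of `retrieve_top1`: each chain is embedded once and pairing reduces to a similarity search over precomputed vectors, whereas a cross-encoder needs a separate forward pass for every candidate heavy/light combination. At the scale of billions of sequences, an approximate nearest-neighbour index would presumably replace the full similarity matrix used in this toy version.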
