

Spotlight in Workshop: 2nd Generative AI for Biology Workshop

Retrieval Augmented Protein Language Models for Protein Structure Prediction

Pan Li · Xingyi Cheng · Le Song · Eric Xing

Keywords: [ Retrieval-augmentation ] [ Protein Language Model ]


Abstract:

Advances in artificial intelligence have significantly accelerated progress in protein structure prediction. AlphaFold2 set a new benchmark for prediction accuracy by using its Evoformer module to automatically extract co-evolutionary information from multiple sequence alignments (MSAs). To address AlphaFold2’s dependence on MSA depth and quality, we propose two novel models, RAGPLM and RAGFold: pretrained modules for a Retrieval-AuGmented protein language model and for structure prediction, respectively. RAGPLM integrates a pre-trained protein language model with retrieved MSAs, surpassing single-sequence protein language models in perplexity, contact prediction, and fitness prediction. When sufficient MSA data is available, RAGFold achieves TM-scores comparable to AlphaFold2 while running up to eight times faster; when MSA data is insufficient, it significantly outperforms AlphaFold2 (∆TM-score = 0.379, 0.116, and 0.059 with 0, 5, and 10 MSA sequences as input, respectively). Additionally, we developed an MSA retriever based on hierarchical ID generation that is 45 to 90 times faster than traditional methods, expanding the MSA training set for RAGPLM by 32%. Our findings suggest that RAGPLM provides an efficient and accurate solution for protein structure prediction, particularly in scenarios with limited MSA data. The RAGPLM model has been open-sourced and is available at https://huggingface.co/genbio-ai/AIDO.Protein-RAG-3B.
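
The limited-MSA comparison above (0, 5, and 10 MSA sequences as input) can be illustrated with a minimal sketch of capping MSA depth before inference. This is not the authors' pipeline. It assumes the MSA is FASTA/A3M-style text, that the first record is the query, and that "N MSA sequences" means N homologs in addition to the query; the abstract does not say how the sequences were selected. The function names `parse_fasta_like` and `limit_msa_depth` are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header, aligned sequence)


def parse_fasta_like(text: str) -> List[Record]:
    """Parse FASTA/A3M-style text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def limit_msa_depth(records: List[Record], depth: int) -> List[Record]:
    """Keep the query (assumed to be the first record) plus at most `depth` homologs."""
    if not records:
        raise ValueError("MSA must contain at least the query sequence")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return records[: 1 + depth]


if __name__ == "__main__":
    # Toy alignment: one query and three homologs (made-up sequences).
    msa_text = (
        ">query\nMKTAYIAK\n"
        ">hit1\nMKTAYLAK\n"
        ">hit2\nMRTAYIAK\n"
        ">hit3\nMKSAYIAR\n"
    )
    records = parse_fasta_like(msa_text)
    for depth in (0, 5, 10):
        subset = limit_msa_depth(records, depth)
        print(depth, [name for name, _ in subset])
```

With a depth of 0, only the query remains, which corresponds to the single-sequence setting in which the abstract reports the largest ∆TM-score.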
