Skip to yearly menu bar Skip to main content


Poster
in
Workshop: 2nd Generative AI for Biology Workshop

Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN

Mohammadsaleh Refahi · Gavin Hearne · Harrison Muller · Kieran Lynch · Bahrad Sokhansanj · James Brown · Gail Rosen

Keywords: [ Approximate Nearest Neighbor (ANN) ] [ Large Language Model(LLM) ] [ Gene Embeddings ]


Abstract:

The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences.In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries—FAISS and ScaNN—on biologically meaningful gene embeddings. Unlike prior studies, our analysis focuses on bioinformatics-specific embeddings and benchmarks their utility for detecting novel sequences, including those from uncharacterized taxa or genes lacking known homologs. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.

Chat is not available.