ICML Intrinsic Evaluation of DNA Embeddings in Genome Language Models: Insights from Yeast Genomic Sequences

Poster in Workshop: 2nd Generative AI for Biology Workshop

Ruhaib Muhammad · Rajeeva Madhan · Roshan Balaji · Nirav Bhatt

Keywords: [ Genomic Language Models ] [ Interpretability ]

Abstract:

We present a task-independent evaluation of Genome Language Model (gLM) embeddings to understand what contextual and biological information they inherently capture. Through three novel experiments on yeast genomic sequences, we assess how well embeddings reflect sequence similarity, encode evolutionary context, and respond to synthetic point mutations. Our findings reveal that embeddings correlate with sequence similarity, cluster by phylogenetic clade, and show differential robustness between coding and non-coding regions. These results offer new insights into the representational capabilities of gLMs and pave the way for principled interpretability and benchmarking of these models.
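As a rough illustration of what a sequence-similarity check on embeddings could look like, the sketch below applies synthetic point mutations to a random DNA sequence and compares sequence identity against embedding cosine similarity. It is not the authors' pipeline: `kmer_embedding` is a hypothetical stand-in (normalized 3-mer counts) for a real gLM embedding, and the random reference sequence, mutation counts, and Pearson correlation are all assumptions chosen only to keep the example self-contained.

```python
import itertools
import math
import random


def kmer_embedding(seq, k=3):
    """Hypothetical stand-in for a gLM embedding: normalized k-mer counts."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] += 1.0
    total = sum(vec)
    return [v / total for v in vec]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mutate(seq, n, rng):
    """Apply n synthetic point substitutions at distinct positions."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the toy run is reproducible
reference = "".join(rng.choice("ACGT") for _ in range(300))
ref_emb = kmer_embedding(reference)

mutation_counts = [0, 5, 10, 20, 40, 80, 160]
variants = [mutate(reference, n, rng) for n in mutation_counts]

seq_sims = [identity(reference, v) for v in variants]
emb_sims = [cosine(ref_emb, kmer_embedding(v)) for v in variants]
r = pearson(seq_sims, emb_sims)

print("sequence identity:", [round(s, 3) for s in seq_sims])
print("embedding cosine: ", [round(s, 3) for s in emb_sims])
print("Pearson r:", round(r, 3))
```

To probe an actual model, `kmer_embedding` would be replaced by a function returning that model's embedding for a sequence; the rest of the comparison would stay the same under these assumptions.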
