Poster in Workshop: 2nd Generative AI for Biology Workshop
NovoMolGen: Rethinking Molecular Language Model Pretraining
Kamran Chitsaz · Roshan Balaji · Quentin Fournier · Nirav Bhatt · Sarath Chandar
Keywords: [ De Novo Molecular Generation ] [ Molecular Language Model ]
Abstract:
Designing de novo molecules with desired properties requires efficient exploration of an immense chemical space spanning $10^{23}$ to $10^{60}$ potential candidates. Although Molecular Large Language Models (Mol-LLMs) enable scalable exploration using string-based representations, the effects of language modeling practices such as tokenization, model size, and dataset scale on molecular generation performance remain unclear. In this study, we introduce NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules, to systematically investigate these key factors. Our analyses demonstrate a weak correlation between standard pretraining metrics and downstream molecular generation performance, highlighting critical differences compared to general NLP models. NovoMolGen achieves state-of-the-art results, outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecule generation tasks.
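For readers unfamiliar with string-based molecular representations, the sketch below illustrates one common way SMILES strings are split into tokens before language-model pretraining. It is not the NovoMolGen tokenizer; the regex pattern, function name, and special handling of multi-character tokens are assumptions shown purely for illustration.

```python
# Illustrative sketch only: a regex-based SMILES tokenizer of the kind often
# used for molecular language models. NOT the NovoMolGen pipeline; the pattern
# below is an assumption for demonstration purposes.
import re

# Match multi-character SMILES tokens first (bracket atoms, Cl, Br, ring-bond
# labels like %12, stereo markers), then fall back to single characters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[A-Za-z]|[0-9]|[()=#+\-\\/.$:~*])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must cover the whole string without loss.
    assert "".join(tokens) == smiles, f"untokenized characters in {smiles!r}"
    return tokens

if __name__ == "__main__":
    # Aspirin as an example molecule.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Tokenized sequences like this one would then be fed to an autoregressive transformer trained to predict the next token, which is the general setup the abstract refers to when discussing how tokenization choices affect generation quality.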