Poster in Workshop: 2nd Generative AI for Biology Workshop
NovoMolGen: Rethinking Molecular Language Model Pretraining
Kamran Chitsaz · Roshan Balaji · Quentin Fournier · Nirav Bhatt · Sarath Chandar
Keywords: [ De Novo Molecular Generation ] [ Molecular Language Model ]
Abstract:
Designing de novo molecules with desired properties requires efficient exploration of an immense chemical space spanning $10^{23}$ to $10^{60}$ potential candidates. Although Molecular Large Language Models (Mol-LLMs) enable scalable exploration using string-based representations, the effects of language modeling practices such as tokenization, model size, and dataset scale on molecular generation performance remain unclear. In this study, we introduce NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules, to systematically investigate these key factors. Our analyses demonstrate a weak correlation between standard pretraining metrics and downstream molecular generation performance, highlighting critical differences compared to general NLP models. NovoMolGen achieves state-of-the-art results, outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecule generation tasks.
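For readers unfamiliar with string-based molecular representations, the sketch below illustrates one common way SMILES strings are split into tokens before language-model pretraining. It is not the NovoMolGen tokenizer; the regex pattern, function name, and special handling of multi-character tokens are assumptions shown purely for illustration.

```python
# Illustrative sketch only: a regex-based SMILES tokenizer of the kind often
# used for molecular language models. NOT the NovoMolGen pipeline; the pattern
# below is an assumption for demonstration purposes.
import re

# Match multi-character SMILES tokens first (bracket atoms, Cl, Br, ring-bond
# labels like %12, stereo markers), then fall back to single characters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[A-Za-z]|[0-9]|[()=#+\-\\/.$:~*])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must cover the whole string without loss.
    assert "".join(tokens) == smiles, f"untokenized characters in {smiles!r}"
    return tokens

if __name__ == "__main__":
    # Aspirin as an example molecule.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Tokenized sequences like this one would then be fed to an autoregressive transformer trained to predict the next token, which is the general setup the abstract refers to when discussing how tokenization choices affect generation quality.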