Spotlight in Workshop: 2nd Generative AI for Biology Workshop
DePO: Elicit Chemical Reasoning Capability via Demonstration-Guided Policy Optimization
Xuan Li · Zhanke Zhou · Zongze Li · Jiangchao Yao · Yu Rong · Lu Zhang · Bo Han
Keywords: [ LLM Reasoning ] [ Molecular Optimization ] [ Large Language Model ]
Abstract:
Large language models (LLMs) have demonstrated impressive mathematical reasoning capabilities when trained with reinforcement learning with verifiable rewards (RLVR), particularly through Group Relative Policy Optimization (GRPO). However, extending these methods to scientific domains such as $\textit{molecular optimization}$ is challenging, as LLMs often lack the necessary domain-specific reasoning skills. Molecular optimization requires improving molecular properties while preserving structural similarity, yielding a complex combinatorial search. Existing models struggle due to conflicting objectives, limited chemical reasoning, and the scarcity of datasets with intermediate reasoning steps, which hinders the learning of effective strategies. To address these issues, we introduce $\textbf{De}$monstration-guided $\textbf{P}$olicy $\textbf{O}$ptimization (DePO), a framework that leverages reference molecules as demonstrations to guide model exploration toward promising regions of chemical space. Specifically, DePO incorporates demonstrations as supervised signals for each reasoning chain, regularizing the search direction while preserving the model's reasoning capabilities. Experiments show that DePO significantly outperforms both supervised fine-tuning and GRPO across key molecular optimization metrics and excels at balancing the competing optimization objectives. DePO also shows generalization capabilities and inference-scaling properties.
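To make the idea of combining a GRPO-style group-relative advantage with a demonstration-based supervised signal concrete, below is a minimal illustrative sketch, not the paper's actual formulation. The function name `depo_style_objective`, the weighting parameter `beta`, and all toy values are assumptions introduced here for illustration only.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style advantage: standardize rewards within a group of rollouts.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def depo_style_objective(sample_logprobs, rewards, demo_logprob, beta=0.1):
    # Toy objective: group-relative policy term plus a demonstration
    # log-likelihood term that nudges exploration toward the reference molecule.
    # `beta` (hypothetical knob) weighs the supervised demonstration signal.
    adv = group_relative_advantages(rewards)
    policy_term = np.mean(adv * np.asarray(sample_logprobs, dtype=float))
    demo_term = beta * demo_logprob
    return policy_term + demo_term

# Toy usage: four sampled reasoning chains scored by a verifiable reward
# (e.g., property improvement weighted by structural similarity), plus the
# model's log-likelihood of the reference (demonstration) molecule.
sample_logprobs = [-12.3, -15.1, -11.8, -14.0]  # sequence log-probs of rollouts
rewards = [0.62, 0.10, 0.71, 0.05]              # verifiable rewards per rollout
demo_logprob = -13.5                            # log-prob of the reference molecule
print(depo_style_objective(sample_logprobs, rewards, demo_logprob))
```

Maximizing such an objective would reinforce high-reward reasoning chains while keeping probability mass on the reference demonstration; the exact form of the supervised term in DePO should be taken from the paper itself.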