Poster
Improved Off-policy Reinforcement Learning in Biological Sequence Design
Hyeonah Kim · Minsu Kim · Taeyoung Yun · Sanghyeok Choi · Emmanuel Bengio · Alex Hernandez-Garcia · Jinkyoo Park
West Exhibition Hall B2-B3 #W-612
In biology and chemistry, generative models are increasingly used to propose novel candidates, such as DNA, RNA, or protein sequences, with desired properties. Because real-world experiments are costly, researchers often turn to active learning, where a generative model is trained using feedback from a proxy model that approximates experimental outcomes. After each round, a few candidates are tested in the lab, and the resulting data are used to update the proxy (and the generative model). However, with limited data, proxy models can become unreliable on out-of-distribution inputs, leading to reward hacking, where the generative model exploits proxy errors rather than proposing truly effective candidates. To address this, we introduce a conservative search strategy that adapts the exploration range based on the uncertainty of the proxy model. By constraining how far the model can deviate from known high-quality sequences, especially when predictions are unreliable, our method helps prevent over-optimization on spurious signals. Our experiments demonstrate that this strategy consistently enhances performance across various biological sequence design tasks, including DNA, RNA, and protein optimization. Notably, although exploration is locally restricted within each round, the model eventually discovers novel high-performing candidates over successive active rounds, because each round's proposals become more informative. More broadly, the principle of aligning exploration with model confidence may benefit other AI-driven scientific discovery efforts where data is limited and reliable generalization is critical.
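To illustrate the general idea, the toy sketch below shows one way an uncertainty-adaptive conservative search round could look: the number of positions the generative model is allowed to rewrite shrinks as the proxy's uncertainty on a seed sequence grows, so exploration stays close to known high-quality sequences where the proxy is least reliable. The function names, the stub proxy and policy, and the uncertainty-to-edit-budget mapping are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy sketch of uncertainty-adaptive conservative search (not the paper's code).
rng = np.random.default_rng(0)
ALPHABET = np.array(list("ACGT"))  # e.g., DNA

def proxy_with_uncertainty(seq):
    """Stand-in for a proxy model: returns (predicted score, uncertainty).
    In practice the uncertainty might come from, e.g., an ensemble's std."""
    score = (np.frombuffer(seq.encode(), dtype=np.uint8).sum() % 17) / 17.0
    uncertainty = rng.uniform(0.0, 1.0)
    return score, uncertainty

def policy_fill(seq, positions):
    """Stand-in for the generative model: resamples the selected positions."""
    chars = list(seq)
    for p in positions:
        chars[p] = rng.choice(ALPHABET)
    return "".join(chars)

def conservative_round(seeds, max_edit_frac=0.5):
    """One round: allow fewer edits to a seed when the proxy is more uncertain."""
    proposals = []
    for seed in seeds:
        _, sigma = proxy_with_uncertainty(seed)
        edit_frac = max_edit_frac / (1.0 + sigma)        # uncertainty-adaptive radius
        n_edit = max(1, int(edit_frac * len(seed)))
        positions = rng.choice(len(seed), size=n_edit, replace=False)
        proposals.append(policy_fill(seed, positions))
    return proposals

seeds = ["ACGTACGTACGT", "TTGACCATGGCA"]  # known high-quality sequences
print(conservative_round(seeds))
```

In a full active-learning loop, the proposed candidates would be scored in the lab, and both the proxy and the generative policy would be updated before the next round.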