

Poster in Workshop: 2nd AI for Math Workshop @ ICML 2025

On the Limits of RLVR: Support, Entropy, and the Illusion of Reasoning

Fang Wu · Yejin Choi


Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a popular paradigm for fine-tuning large language models (LLMs) toward formal correctness, achieving notable gains in pass@1 accuracy and sampling efficiency. Yet it remains unclear whether RLVR fundamentally expands a model’s reasoning capabilities or merely sharpens its existing strengths. This paper provides a rigorous theoretical and empirical study of RLVR’s intrinsic limits. We prove that RLVR predominantly preserves the support of the base model, inherently restricting its ability to discover solutions outside the original distribution. Moreover, we show that while RLVR systematically reduces entropy and thereby enhances precision, it also tends to concentrate probability mass on narrower subsets of correct solutions, occasionally excluding valid alternatives that were previously accessible to the base model. Viewing RLVR through the lens of KL projections reveals why it acts as a conservative reweighting mechanism rather than a catalyst for new reasoning modes. Extensive experiments across multi-domain reasoning benchmarks corroborate these insights, highlighting RLVR’s role in improving precision within existing capabilities while underscoring the need for explicit exploration to drive genuine reasoning expansion.
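The support-preservation claim can be illustrated with the standard KL-regularized RL objective; this is a sketch using conventional notation (reward $r(x,y)$, regularization strength $\beta$, reference policy $\pi_{\mathrm{ref}}$), not necessarily the paper's exact formulation:

$$
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\left[ r(x,y) \right] \;-\; \beta\, \mathrm{KL}\!\left( \pi(\cdot\mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot\mid x) \right)
\;\;\Longrightarrow\;\;
\pi^{*}(y\mid x) \;\propto\; \pi_{\mathrm{ref}}(y\mid x)\, \exp\!\big( r(x,y)/\beta \big).
$$

Under this view, any output $y$ with $\pi_{\mathrm{ref}}(y\mid x) = 0$ also has $\pi^{*}(y\mid x) = 0$: the optimizer can only reweight probability mass within the base model's support, which is the conservative reweighting behavior described in the abstract.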
