ICML Token-Efficient RL for LLM Reasoning

Poster in Workshop: Tiny Titans: The next wave of On-Device Learning for Foundation Models (TTODLer-FM)

Alan Lee · Harry Tong

Fri 18 Jul, 3:00 p.m. to 3:45 p.m. PDT

Abstract:

We propose reinforcement learning (RL) strategies tailored for reasoning in large language models (LLMs) under strict memory and compute limits, with a particular focus on compatibility with LoRA fine-tuning. Rather than relying on full-sequence updates or separate critic networks, we design critic-free methods that operate on a small, informative subset of output tokens to reduce memory usage and stabilize training. We introduce S-GRPO, a stochastic variant of Group Relative Policy Optimization, and T-SPMO, a token-level prefix matching approach for fine-grained credit assignment. Applied to Qwen2-1.5B, our methods raise accuracy on the SVAMP benchmark from 46% to over 70% and show strong performance on multi-digit multiplication. Surprisingly, full-token GRPO under LoRA fails to improve over the base model, suggesting that selective token-level optimization may act as an implicit regularizer in low-parameter training regimes.
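The abstract does not spell out how S-GRPO selects tokens, so the following is only a rough sketch of the general idea of a critic-free, group-relative update restricted to a random subset of output tokens. It assumes the standard GRPO advantage (each reward standardized against the other completions sampled for the same prompt) and a hypothetical selection rule that keeps each token independently with probability `keep_prob`. The loss is a plain REINFORCE-style surrogate, not the clipped GRPO objective, and all function names are illustrative rather than taken from the paper.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    No critic network is needed: the group mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_token_mask(length, keep_prob, rng):
    """Hypothetical selection rule: keep each token position with prob keep_prob."""
    return [rng.random() < keep_prob for _ in range(length)]


def masked_policy_gradient_loss(logprobs, advantages, masks):
    """REINFORCE-style surrogate averaged over the kept tokens only.

    logprobs[i][t] is the log-probability of token t in completion i,
    advantages[i] is the sequence-level advantage of completion i,
    masks[i][t] says whether token t of completion i contributes.
    """
    total, count = 0.0, 0
    for seq_logprobs, adv, mask in zip(logprobs, advantages, masks):
        for lp, keep in zip(seq_logprobs, mask):
            if keep:
                total += -adv * lp
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [1.0, 0.0, 1.0, 0.0]          # e.g. answer correct / incorrect
    logprobs = [[-0.5, -1.0, -0.2]] * 4     # made-up per-token log-probs
    advantages = group_relative_advantages(rewards)
    masks = [sample_token_mask(len(lp), 0.5, rng) for lp in logprobs]
    print(masked_policy_gradient_loss(logprobs, advantages, masks))
```

In a real training loop only the kept positions would need to retain activations for backpropagation, which is where a memory saving of the kind the abstract describes would come from; this sketch works on plain floats and does not model that.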
