

Oral in Workshop: Tiny Titans: The Next Wave of On-Device Learning for Foundation Models (TTODLer-FM)

Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement

Ishan Jindal · Jayant Taneja · Badrinath Chandana · Vikas Kapur · Sachin Sharma

Fri 18 Jul 2:15 p.m. PDT — 2:30 p.m. PDT

Abstract:

Large language models (LLMs) demonstrate strong reasoning capabilities but are expensive to run at inference time, limiting their practical deployment. We propose Offloaded Reasoning (OR), a modular strategy where a lightweight model generates intermediate reasoning traces that are then used by a larger model to produce the final answer. We further introduce Offloaded Reasoning with Refinement (ORR), where the large model first edits or improves the reasoning trace before answering. Unlike token-level acceleration methods, OR and ORR operate at the reasoning level and require no retraining of the large model. Experiments on GSM8K and Math500 show that OR achieves up to 8x faster inference than full large-model reasoning with minimal accuracy loss, while ORR recovers or exceeds full accuracy at substantially lower cost. Our results highlight the potential of modular, delegation-based reasoning for building more efficient and adaptable LLM systems.
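The delegation flow described above can be sketched as follows. The function names and stub model calls are hypothetical, introduced only to illustrate the structure of OR and ORR; in a real system each stub would wrap an actual small or large LLM inference call.

```python
# Illustrative sketch of Offloaded Reasoning (OR) and Offloaded Reasoning
# with Refinement (ORR). All model functions below are hypothetical stubs.

def small_model_trace(question: str) -> str:
    """Stub: a lightweight model drafts an intermediate reasoning trace."""
    return f"draft reasoning trace for: {question}"

def large_model_refine(trace: str) -> str:
    """Stub: ORR's extra step, where the large model edits the trace."""
    return trace + " [refined]"

def large_model_answer(question: str, trace: str) -> str:
    """Stub: the large model conditions on the trace to produce the answer."""
    return f"answer to '{question}' given trace of {len(trace)} chars"

def offloaded_reasoning(question: str, refine: bool = False) -> str:
    trace = small_model_trace(question)        # cheap reasoning step (OR)
    if refine:                                 # optional refinement (ORR)
        trace = large_model_refine(trace)
    return large_model_answer(question, trace) # single large-model call

print(offloaded_reasoning("What is 12 * 7?"))
print(offloaded_reasoning("What is 12 * 7?", refine=True))
```

The key design point is that the large model is invoked once per question (twice under ORR, counting the refinement pass), rather than generating the full reasoning chain itself, which is where the reported inference savings come from.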
