
Poster

Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Xintong Sun · Chi Wei · Minghao Tian · Shiwen Ni

East Exhibition Hall A-B #E-2302
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring that their outputs conform to strict structural or grammatical constraints remains challenging, a requirement that is critical for function calling and domain-specific language (DSL) generation. Constrained decoding with a context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of every token in the LLM vocabulary at every decoding step, which often incurs significant overhead in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time, significantly reducing the memory footprint of the Earley algorithm's states. This in turn enables us to use a state cache to speed up structured generation across a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron, which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed of up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
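
To make the masking step described in the abstract concrete, below is a minimal, self-contained Python sketch of grammar-constrained decoding with a token logits mask. It is illustrative only: masked_decode_step, grammar_accepts, and the toy digit grammar are hypothetical names, not Formatron's API, and the full-vocabulary scan it performs at each step is exactly the per-step overhead that ZapFormat's pruning targets.

    import math

    def masked_decode_step(logits, vocab, prefix, grammar_accepts):
        # Build the per-step token mask: any token whose addition would make
        # the output prefix unparseable gets its logit forced to -inf.
        masked = list(logits)
        for tok_id, tok_text in enumerate(vocab):
            # Naive engines repeat this validity check over the whole
            # vocabulary at every decoding step.
            if not grammar_accepts(prefix + tok_text):
                masked[tok_id] = -math.inf
        return masked

    # Toy "grammar": the output must be a (possibly empty) string of digits.
    def toy_accepts(text):
        return text == "" or text.isdigit()

    vocab = ["1", "2", "a", "}", " "]
    logits = [0.3, 0.1, 2.0, 0.5, 0.4]
    print(masked_decode_step(logits, vocab, "4", toy_accepts))
    # -> [0.3, 0.1, -inf, -inf, -inf]: only digit continuations remain legal

In a real engine the accept/reject decision comes from the state of a grammar parser (here, an Earley parser) rather than from re-parsing the whole prefix, which is why the size of the parser's state sets matters for speed and memory.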

Lay Summary:

Large language models (LLMs) can draft essays, write code, and answer questions, but they sometimes stray from a required format, and staying in format is essential when an LLM is integrated into a larger system. Today’s “constrained decoding” techniques keep LLMs inside the lines by checking every possible next word against a set of grammar rules, yet that safeguard slows things down because the model must re-check its entire vocabulary at every step.

We introduce ZapFormat, a program that spots and discards impossible word sequences on the fly, using a classic parsing method called the Earley algorithm. By pruning away dead-end paths early, ZapFormat slashes the bookkeeping the model must do and lets us reuse its previous work through a cache.

Built into our new program Formatron, this approach keeps answers perfectly in-format while making constrained decoding for structured output, such as valid JSON, code snippets, or database queries, up to twice as fast. The technique works across many different LLM architectures, paving the way for quicker and more reliable AI agents in data-critical fields such as finance, healthcare, and software development.
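
A tiny, hypothetical Python sketch of the two ideas above (not Formatron's actual implementation): discard parser states that can no longer lead to a valid completion, and cache the token mask computed for each surviving set of states so that later queries reaching the same state reuse earlier work. The names prune_and_mask, is_dead_end, and compute_mask are illustrative stand-ins.

    mask_cache = {}  # frozenset of parser states -> previously computed token mask

    def prune_and_mask(states, is_dead_end, compute_mask):
        # Dynamic pruning: keep only states that can still reach a valid output.
        live = frozenset(s for s in states if not is_dead_end(s))
        # State cache: identical surviving state sets share one mask computation.
        if live not in mask_cache:
            mask_cache[live] = compute_mask(live)
        return live, mask_cache[live]

    # Toy use: states are just labels; the dead branch is dropped before masking.
    live, mask = prune_and_mask(
        {"expr@0", "dead_branch", "term@1"},
        is_dead_end=lambda s: s == "dead_branch",
        compute_mask=lambda live: sorted(live),  # stand-in for a real token mask
    )
    print(live, mask)

Pruning shrinks the state sets, which both cuts memory and makes identical states recur more often, so the cache gets more hits across queries.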