Poster
Earley-Driven Dynamic Pruning for Efficient Structured Decoding
Xintong Sun · Chi Wei · Minghao Tian · Shiwen Ni
East Exhibition Hall A-B #E-2302
Large language models (LLMs) can draft essays, write code, and answer questions, but they sometimes stray from a required output format, and adhering to that format is essential when LLMs are integrated into a larger system. Today’s “constrained decoding” techniques keep LLMs inside the lines by checking every possible next word against a set of grammar rules, yet that safeguard slows things down because the model must re-check its entire vocabulary at every step.

We introduce ZapFormat, a method that spots and discards impossible word sequences on the fly, using a classic parsing method called the Earley algorithm. By pruning away dead-end paths early, ZapFormat slashes the bookkeeping the model must do and lets us reuse previous work through a cache.

Built into our new engine Formatron, this approach keeps answers perfectly in-format while making constrained decoding for structured output, such as valid JSON, code snippets, or database queries, up to twice as fast. The technique works across many different LLM architectures, paving the way for quicker and more reliable AI agents in data-critical fields such as finance, healthcare, and software development.
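To make the grammar-checking idea concrete, here is a minimal sketch of how an Earley chart can tell a constrained decoder which tokens are legal next, so every other token can be masked out of the vocabulary before sampling. The toy bracketed-list grammar and the function `earley_allowed` are illustrative assumptions, not Formatron’s actual grammar, API, or pruning strategy:

```python
# Toy grammar: nonterminal -> list of productions (tuples of symbols).
# This bracketed-list grammar is purely illustrative.
GRAMMAR = {
    "S": [("[", "LIST", "]"), ("NUM",)],
    "LIST": [("S",), ("S", ",", "LIST")],
}
TERMINALS = {"[", "]", ",", "NUM"}

def earley_allowed(tokens):
    """Run a textbook Earley recognizer over `tokens` and return the set
    of terminals that may legally come next; a constrained decoder can
    mask every other token's logit before sampling."""
    # An Earley item is (lhs, rhs, dot, origin).
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in GRAMMAR["S"]:
        chart[0].add(("S", rhs, 0, 0))
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in TERMINALS:
                    # Scan: consume a matching input token.
                    if i < len(tokens) and tokens[i] == sym:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:
                    # Predict: expand the nonterminal at this position.
                    for prod in GRAMMAR[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
            else:
                # Complete: advance items that were waiting on `lhs`.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    # Terminals sitting right after a dot in the last chart column are
    # the only viable continuations; everything else is a dead end.
    return {rhs[dot] for lhs, rhs, dot, _ in chart[-1]
            if dot < len(rhs) and rhs[dot] in TERMINALS}

# After the prefix "[ NUM", only "," or "]" can follow under this grammar.
print(earley_allowed(["[", "NUM"]))
```

Items that can never reach a completion are exactly the “dead-end paths” the abstract describes: dropping them shrinks the chart, and because the chart for a shared prefix never changes, it can be cached and reused across decoding steps.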