Poster
Flexible and Efficient Grammar-Constrained Decoding
Kanghee Park · Timothy Zhou · Loris D'Antoni
West Exhibition Hall B2-B3 #W-313
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that would provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can "align" with the tokens used by a given context-free grammar, and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
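To make the masking idea concrete, here is a minimal, hedged sketch of the GCD principle described above. It is not the paper's algorithm: the "grammar" is a toy language of balanced parentheses, tokens are single characters, and viability is checked directly, whereas real GCD must align multi-character subword tokens with grammar terminals and precompute masks efficiently. All function names here are illustrative.

```python
def viable(prefix: str, max_len: int) -> bool:
    """Check whether `prefix` can still be extended into a balanced-paren
    string of length <= max_len (a toy stand-in for CFG membership)."""
    depth = 0
    for ch in prefix:
        depth += 1 if ch == "(" else -1
        if depth < 0:          # closed a paren that was never opened
            return False
    # we still need `depth` closing parens; they must fit in the budget
    return len(prefix) + depth <= max_len

def allowed_tokens(prefix: str, vocab, max_len: int):
    """Compute a GCD-style token mask: keep only tokens whose addition
    leaves the prefix extendable into a valid string."""
    return [t for t in vocab if viable(prefix + t, max_len)]

vocab = ["(", ")"]
print(allowed_tokens("((", vocab, 4))  # only ")" keeps the prefix viable
print(allowed_tokens("",   vocab, 4))  # only "(" can start a valid string
```

At decode time, a sound GCD system applies such a mask to the LLM's logits at every step, so invalid tokens can never be sampled; the paper's contribution is making the expensive offline part of this computation fast.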
Large language models (LLMs) are often used to produce structured text, such as computer code or data in specific formats. But getting them to always follow strict rules, like the grammar of a programming language, can be tricky.

To help with this, researchers use a technique called grammar-constrained decoding (GCD), which ensures that the model only generates text that follows the rules. However, existing methods for doing this take a long time to prepare, often tens of minutes, before they can be used.

Our work introduces a faster and more efficient approach. We developed a new algorithm and tool that prepares these grammar rules much more quickly, significantly reducing setup time, while keeping the actual text generation just as fast and accurate as before.

This improvement makes it easier and more practical to use LLMs in real-world applications where following strict formats is essential, such as writing code or generating structured data for websites or databases.