Skip to yearly menu bar Skip to main content


Poster

Flexible and Efficient Grammar-Constrained Decoding

Kanghee Park · Timothy Zhou · Loris D'Antoni

West Exhibition Hall B2-B3 #W-313
[ ] [ ]
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs matches such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can ``align'' with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.

Lay Summary:

Large language models (LLMs) are often used to produce structured text, such as computer code or data in specific formats. But getting them to always follow strict rules — like grammar in programming languages — can be tricky.To help with this, researchers use a technique called grammar-constrained decoding (GCD), which ensures that the model only generates text that follows the rules. However, existing methods for doing this take a long time to prepare — often tens of minutes — before they can be used.Our work introduces a faster and more efficient approach. We developed a new algorithm and tool that can prepare these grammar rules much more quickly — significantly reducing setup time — while keeping the actual text generation just as fast and accurate as before.This improvement makes it easier and more practical to use LLMs in real-world applications where following strict formats is essential, such as writing code or generating structured data for websites or databases.

Chat is not available.