
Poster with Prerecorded Video
in
Workshop: Tokenization Workshop (TokShop)

Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation

Marco Cognetta · David Pohl · Junyoung Lee · Naoaki Okazaki

Keywords: [ tokenization ] [ automata theory ] [ constrained generation ]

[ Project Page ]
Fri 18 Jul 1:50 p.m. PDT — 3 p.m. PDT

Abstract:

Constrained generation, where language models are forced to output text that adheres to a specified format, is a powerful tool for many tasks, and several libraries implement variants of it as the foundation for a larger feature set. While implementing our own version, we uncovered many subtle problems (some of which are present in existing libraries) that can affect the downstream performance of models that use constrained decoding. Here, we describe the process and common pitfalls of implementing robust constrained generation, using the Llama 2 tokenizer as a running example, though the techniques extend to all major tokenizers. We also describe favorable properties of our character-to-canonical pipeline (ease of use, efficiency, modularity, etc.). We hope this work guides you and your tokens to reliably correct constrained outputs.
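To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of automata-based subword-level constraints: a character-level DFA is lifted to the token level by checking, for each candidate subword token, whether the DFA has a valid path over all of its characters from the current state. The DFA and vocabulary below are toy assumptions for illustration.

```python
from typing import Optional

# Toy character-level DFA: accepts one or more digits.
START, DIGITS = 0, 1
ACCEPTING = {DIGITS}

def step_char(state: int, ch: str) -> Optional[int]:
    """Character-level transition; None means the DFA rejects."""
    return DIGITS if ch.isdigit() else None

def step_token(state: int, token: str) -> Optional[int]:
    """Lift the character DFA to a whole subword token by
    consuming its characters one at a time."""
    for ch in token:
        nxt = step_char(state, ch)
        if nxt is None:
            return None
        state = nxt
    return state

# Toy subword vocabulary (a real system would use the tokenizer's vocab
# and precompute token-level transitions for every state).
VOCAB = ["12", "3", "ab", "4", "x9"]

def allowed_tokens(state: int) -> list[str]:
    """The decoding-time mask: tokens with a valid DFA path
    from the current state."""
    return [t for t in VOCAB if step_token(state, t) is not None]

print(allowed_tokens(START))  # ['12', '3', '4']
```

At each decoding step, the logits of disallowed tokens are masked out before sampling; the subtleties the abstract alludes to (e.g. tokens spanning constraint boundaries, non-canonical tokenizations) arise precisely in how this lifting interacts with the tokenizer.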
