Poster in Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding
Measuring Rule-Following in Language Models
Benjamin Laufer · Jon Kleinberg
Keywords: [ Rule following ] [ Language Theory ] [ Myhill-Nerode boundary ]
Abstract:
There are many instances in which we might want a language model to follow rules, including grammatical constraints and the avoidance of hateful or toxic sentiments. We introduce a formal framework for evaluating rule-following in language models, focusing on rules that can be expressed using deterministic finite automata (DFAs). We say a language model $(\epsilon, \delta)$-matches a DFA $A$ if 1) it behaves similarly on all prefixes that lead to the same state in the DFA (up to divergence level $\epsilon$) and 2) it distinguishes between prefixes that lead to different DFA states by exhibiting at least $\delta$ divergence in probability mass to suffixes that are treated differently by the automaton. To formalize this definition, we introduce Myhill-Nerode Divergence (MND) to measure a model's ability to distinguish between valid and invalid suffixes. Our approach defines rule compliance in probabilistic terms, offering a principled way to evaluate model behavior in light of specified language-generation constraints. We implement an experimental setup in which we train transformer models on synthetic languages and evaluate the compliance of their outputs. We discuss generalizations of our framework -- including to non-regular languages -- and the implications for training, evaluating and controlling language models.
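As a rough illustration of the $(\epsilon, \delta)$-matching condition described in the abstract, the sketch below checks both requirements for a toy parity DFA. The total-variation divergence for condition (1), the restriction of condition (2) to suffixes the two DFA states classify differently, the bounded suffix length, and the `model(prefix, suffix)` continuation-probability interface are all illustrative assumptions; they stand in for, rather than reproduce, the paper's exact construction of Myhill-Nerode Divergence.

```python
from itertools import product

ALPHABET = "ab"

# Toy DFA over {a, b}: accepts strings with an even number of 'a'.
# States 0 (even, accepting) and 1 (odd); start state 0.
DFA_START = 0
DFA_ACCEPT = {0}
DFA_DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}


def run_dfa(state, string):
    """Return the DFA state reached after reading `string` from `state`."""
    for ch in string:
        state = DFA_DELTA[(state, ch)]
    return state


def accepts_from(state, suffix):
    """Whether the DFA accepts `suffix` when started in `state`."""
    return run_dfa(state, suffix) in DFA_ACCEPT


def suffix_distribution(model, prefix, max_len=3):
    """Normalized probability the model assigns to each suffix of length <= max_len
    after `prefix`. `model(prefix, suffix)` is a hypothetical interface standing in
    for a real language model's continuation probability."""
    suffixes = ["".join(s) for n in range(1, max_len + 1)
                for s in product(ALPHABET, repeat=n)]
    raw = {s: model(prefix, s) for s in suffixes}
    z = sum(raw.values()) or 1.0
    return {s: p / z for s, p in raw.items()}


def total_variation(p, q):
    """Total variation distance between two suffix distributions."""
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))


def mn_divergence(dist_u, dist_v, state_u, state_v):
    """Disagreement in probability mass, restricted to suffixes the DFA treats
    differently from the two states (a rough reading of the divergence in
    condition (2), not the paper's exact MND definition)."""
    return sum(abs(dist_u.get(s, 0.0) - dist_v.get(s, 0.0))
               for s in set(dist_u) | set(dist_v)
               if accepts_from(state_u, s) != accepts_from(state_v, s))


def epsilon_delta_matches(model, prefixes, eps, delta):
    """Check the two conditions: prefixes reaching the same DFA state diverge by
    at most eps; prefixes reaching different states place at least delta mass on
    suffixes that distinguish those states."""
    dists = {p: suffix_distribution(model, p) for p in prefixes}
    states = {p: run_dfa(DFA_START, p) for p in prefixes}
    for u, v in product(prefixes, repeat=2):
        if u == v:
            continue
        if states[u] == states[v]:
            if total_variation(dists[u], dists[v]) > eps:
                return False
        elif mn_divergence(dists[u], dists[v], states[u], states[v]) < delta:
            return False
    return True


# Example: a model that puts mass only on continuations the DFA accepts from the
# current state should pass the check for small eps and moderate delta.
def rule_following_model(prefix, suffix):
    return 1.0 if accepts_from(run_dfa(DFA_START, prefix), suffix) else 0.0


print(epsilon_delta_matches(rule_following_model, ["", "a", "ab", "aa"],
                            eps=0.01, delta=0.5))  # True
```

In this toy setup, prefixes that land in the same parity state induce identical continuation distributions (divergence 0), while prefixes in different states place all their mass on disjoint, distinguishing suffixes, so the check passes; a model that ignored the rule would fail condition (2).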