Poster in Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding
Measuring Rule-Following in Language Models
Benjamin Laufer · Jon Kleinberg
Keywords: [ Rule following ] [ Language Theory ] [ Myhill-Nerode boundary ]
Abstract:
There are many instances in which we might want a language model to follow rules, including grammatical constraints and the avoidance of hateful or toxic sentiments. We introduce a formal framework for evaluating rule-following in language models, focusing on rules that can be expressed using deterministic finite automata (DFAs). We say a language model $(\epsilon, \delta)$-matches a DFA $A$ if 1) it behaves similarly on all prefixes that lead to the same state in the DFA (up to divergence level $\epsilon$) and 2) it distinguishes between prefixes that lead to different DFA states by exhibiting at least $\delta$ divergence in probability mass to suffixes that are treated differently by the automaton. To formalize this definition, we introduce Myhill-Nerode Divergence (MND) to measure a model's ability to distinguish between valid and invalid suffixes. Our approach defines rule compliance in probabilistic terms, offering a principled way to evaluate model behavior in light of specified language-generation constraints. We implement an experimental setup in which we train transformer models on synthetic languages and evaluate the compliance of their outputs. We discuss generalizations of our framework -- including to non-regular languages -- and the implications for training, evaluating and controlling language models.
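As a rough illustration of the $(\epsilon, \delta)$-matching condition described in the abstract, the sketch below checks both requirements for a toy parity DFA. The total-variation divergence for condition (1), the restriction of condition (2) to suffixes the two DFA states classify differently, the bounded suffix length, and the `model(prefix, suffix)` continuation-probability interface are all illustrative assumptions; they stand in for, rather than reproduce, the paper's exact construction of Myhill-Nerode Divergence.

```python
from itertools import product

ALPHABET = "ab"

# Toy DFA over {a, b}: accepts strings with an even number of 'a'.
# States 0 (even, accepting) and 1 (odd); start state 0.
DFA_START = 0
DFA_ACCEPT = {0}
DFA_DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}


def run_dfa(state, string):
    """Return the DFA state reached after reading `string` from `state`."""
    for ch in string:
        state = DFA_DELTA[(state, ch)]
    return state


def accepts_from(state, suffix):
    """Whether the DFA accepts `suffix` when started in `state`."""
    return run_dfa(state, suffix) in DFA_ACCEPT


def suffix_distribution(model, prefix, max_len=3):
    """Normalized probability the model assigns to each suffix of length <= max_len
    after `prefix`. `model(prefix, suffix)` is a hypothetical interface standing in
    for a real language model's continuation probability."""
    suffixes = ["".join(s) for n in range(1, max_len + 1)
                for s in product(ALPHABET, repeat=n)]
    raw = {s: model(prefix, s) for s in suffixes}
    z = sum(raw.values()) or 1.0
    return {s: p / z for s, p in raw.items()}


def total_variation(p, q):
    """Total variation distance between two suffix distributions."""
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))


def mn_divergence(dist_u, dist_v, state_u, state_v):
    """Disagreement in probability mass, restricted to suffixes the DFA treats
    differently from the two states (a rough reading of the divergence in
    condition (2), not the paper's exact MND definition)."""
    return sum(abs(dist_u.get(s, 0.0) - dist_v.get(s, 0.0))
               for s in set(dist_u) | set(dist_v)
               if accepts_from(state_u, s) != accepts_from(state_v, s))


def epsilon_delta_matches(model, prefixes, eps, delta):
    """Check the two conditions: prefixes reaching the same DFA state diverge by
    at most eps; prefixes reaching different states place at least delta mass on
    suffixes that distinguish those states."""
    dists = {p: suffix_distribution(model, p) for p in prefixes}
    states = {p: run_dfa(DFA_START, p) for p in prefixes}
    for u, v in product(prefixes, repeat=2):
        if u == v:
            continue
        if states[u] == states[v]:
            if total_variation(dists[u], dists[v]) > eps:
                return False
        elif mn_divergence(dists[u], dists[v], states[u], states[v]) < delta:
            return False
    return True


# Example: a model that puts mass only on continuations the DFA accepts from the
# current state should pass the check for small eps and moderate delta.
def rule_following_model(prefix, suffix):
    return 1.0 if accepts_from(run_dfa(DFA_START, prefix), suffix) else 0.0


print(epsilon_delta_matches(rule_following_model, ["", "a", "ab", "aa"],
                            eps=0.01, delta=0.5))  # True
```

In this toy setup, prefixes that land in the same parity state induce identical continuation distributions (divergence 0), while prefixes in different states place all their mass on disjoint, distinguishing suffixes, so the check passes; a model that ignored the rule would fail condition (2).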