Poster in Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding

On the Emergence of "Useless" Features in Next Token Predictors

Mark Rofin · Jalal Naghiyev · Michael Hahn

Keywords: [ interpretability ] [ Transformers ]


Abstract:

Why do models trained for next token prediction (NTP) learn to compute abstract features that appear to be useless for this task? We formalize three mechanisms of feature development in Transformers, which differ in the role they play in NTP, and propose a method to estimate the influence of each mechanism on the emergence of specific features. We study these mechanisms experimentally by analyzing the representations of models trained on synthetic tasks, as well as those of an LLM. Our findings shed light on how Transformers develop and use hidden features, and on how the NTP objective affects the training outcome.