

Invited Talk
in
Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding

Invited Talk 1 (Naomi Saphra: And Nothing Between - Using Categorical Differences to Understand and Predict Model Behavior)

Fri 18 Jul 10 a.m. PDT — 10:40 a.m. PDT

Abstract:

While years of scientific research on model training and scaling have assumed that learning is a gradual, continuous process, breakthroughs on specific capabilities have drawn wide attention. Why are breakthroughs so exciting? Because humans don't naturally think in continuous gradients, but in discrete conceptual categories. If artificial language models likewise learn discrete conceptual categories, then model understanding may be within our grasp. I will describe what we know of categorical learning in language models, and how discrete concepts can be identified through empirical training dynamics and through random variation between training runs. These concepts span syntax learning, weight mechanisms, and interpretable patterns, all of which can predict model behavior. By leveraging categorical learning, we can ultimately understand a model's natural conceptual structure and evaluate our understanding through testable predictions.
