Poster in Workshop: Actionable Interpretability
What Kind of User Are You? Uncovering User Models in LLM Chatbots
Yida Chen · Aoyu Wu · Trevor DePodesta · Catherine Yeh · Lena Armstrong · Kenneth Li · Nicholas Marin · Oam Patel · Jan Riecke · Shivam Raval · Olivia Seow · Martin Wattenberg · Fernanda Viégas
Mounting evidence suggests that LLM-based chatbots customize their output in response to cues about the user's identity. Here we investigate the internal representations that mediate this behavior. We analyze multiple open-weight LLM chatbots and show consistent evidence that they contain an interpretable "User Model," in the form of linear directions in the residual stream that correspond to implicit inferences about key aspects of the user: gender, age, education, and socioeconomic status. The analysis is delicate, since measuring the behavior of interest (treating the user according to some implicit attribute inferred by the chatbot) requires reading subtle cues in the presence of confounding variables. We describe a set of experimental protocols for handling this challenge. Causal mediation analysis suggests that the User Model plays a causal role in determining the chatbot's responses. Furthermore, the User Model lends itself to natural "steering" interventions that can be used to control both the style and content of the chatbot's output. We suggest that the User Model is a basic mechanism of implicit social cognition that, under the surface, shapes chatbot behavior.
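
As a concrete illustration of the kind of steering intervention the abstract describes, the sketch below adds a single learned direction to a chatbot's residual stream during generation. It is a minimal sketch only: the model name, layer index, scale, and the random placeholder direction are assumptions for illustration, not the authors' actual setup; in practice the direction would come from a probe trained on activations labeled with the inferred user attribute.

```python
# A minimal sketch of a residual-stream steering intervention of the kind the
# abstract describes. The model name, layer index, scale, and the random
# placeholder direction are illustrative assumptions, not the authors' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed open-weight chat model
LAYER_IDX = 16                                # assumed intervention layer
SCALE = 4.0                                   # assumed steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype="auto", device_map="auto"
)
model.eval()

# Placeholder for a learned attribute direction; in practice this would come
# from a linear probe trained to predict an inferred user attribute (e.g. age)
# from residual-stream activations.
user_direction = torch.randn(model.config.hidden_size)
user_direction = user_direction / user_direction.norm()

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the scaled direction to every position's residual-stream vector.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * user_direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "Can you recommend a weekend activity for me?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    steered = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(steered[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified chatbot
```

Reading the User Model, rather than writing to it, would follow the same pattern under these assumptions: capture the residual-stream activations with a hook and score them against the probe direction instead of adding it back in.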