Poster in Workshop: Methods and Opportunities at Small Scale (MOSS)
Understanding How Chess-Playing Language Models Compute Linear Board Representations
Aaron Mei
Keywords: [ World Models ] [ Mechanistic Interpretability ] [ Chess ] [ Language Models ]
The field of mechanistic interpretability seeks to understand the internal workings of neural networks, particularly language models. While previous research has demonstrated that language models trained on games can develop linear board representations, the mechanisms by which these representations arise remain unknown. This work investigates the internals of a GPT-2 style transformer trained on chess PGNs and proposes an algorithm for how the model computes the board state.
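As an illustration of what "linear board representation" means in this line of work, the sketch below trains a linear probe to decode per-square piece identity from residual-stream activations. It is not the author's code: the dimensions, the 13-class square encoding, and the activations themselves (random stand-ins so the script runs) are all assumptions for demonstration purposes.

```python
# Minimal probing sketch (assumed setup, not the paper's implementation).
# A linear probe maps residual-stream activations to a predicted board state;
# high held-out accuracy would indicate a linear board representation.
import torch
import torch.nn as nn

d_model, n_squares, n_classes = 512, 64, 13   # 12 piece types + empty (assumed encoding)
n_positions = 2048

# Stand-in data: a real experiment would cache activations from the chess model
# at a fixed layer and token position, paired with the true board after each move.
acts = torch.randn(n_positions, d_model)
boards = torch.randint(0, n_classes, (n_positions, n_squares))

# One linear map per square, packed into a single Linear(d_model, 64 * 13).
probe = nn.Linear(d_model, n_squares * n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    logits = probe(acts).view(n_positions, n_squares, n_classes)
    loss = loss_fn(logits.reshape(-1, n_classes), boards.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Per-square accuracy (meaningless on random data; reported on held-out
# positions in real probing experiments).
preds = probe(acts).view(n_positions, n_squares, n_classes).argmax(-1)
print(f"probe accuracy (synthetic data): {(preds == boards).float().mean():.3f}")
```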