

Poster in Workshop: Actionable Interpretability

Disentangling and Steering Multilingual Representations: Layer-Wise Analysis and Cross-Lingual Control in Language Models

Abir HARRASSE · Florent Draye · Bernhard Schölkopf · Zhijing Jin


Abstract:

While multilingual large language models (LLMs) handle many languages effectively, how the internal structure of their representations shapes cross-lingual generalization remains poorly understood. Prior work suggests a strong English-centric bias in middle layers, but the mechanisms underlying multilingual behavior remain less explored. In this work, we use Sparse Autoencoders (SAEs) to analyze feature distributions in Gemma2-2b across early, middle, and late layers in five languages: English, French, German, Arabic, and Chinese. We find that early layers are dominated by shared multilingual features, while middle layers encode both multilingual and language-specific circuits. Contrary to prior work suggesting an English-centric bias, we find that language-specific features exist for all five languages and are not exclusive to English. Through analysis of performance on the Indirect Object Identification (IOI) task, we show that Arabic underperformance arises from sparse feature activation and tokenization fragmentation. We further show that linear steering, especially in early layers, nudges Arabic representations into a more multilingual subspace, explaining the observed performance gains. Our findings provide a mechanistic explanation for steering success and highlight the role of layer-wise feature structure in enabling actionable interventions in multilingual LLMs.
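
The linear-steering intervention described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes a HuggingFace `transformers` setup with the public `google/gemma-2-2b` checkpoint, a hypothetical early-layer index and steering strength, and a random placeholder direction standing in for the SAE-derived multilingual direction the paper would use. The sketch adds that direction to the residual stream at one early decoder layer via a forward hook.

```python
# Minimal sketch of early-layer linear steering (illustration only, not the paper's code).
# Assumptions: HuggingFace transformers, the "google/gemma-2-2b" checkpoint,
# a hypothetical layer index and steering strength, and a random placeholder
# direction instead of an SAE-derived multilingual direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b"  # assumed checkpoint name
LAYER = 3                    # hypothetical "early" decoder layer
ALPHA = 4.0                  # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder steering direction; in the paper's setting this would come from
# SAE features (e.g. a multilingual-minus-Arabic mean activation difference).
steer = torch.randn(model.config.hidden_size, dtype=model.dtype)
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the scaled direction to the residual stream and pass the rest through.
    hidden_states = output[0] + ALPHA * steer.to(output[0].device)
    return (hidden_states,) + tuple(output[1:])

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    # Standard English IOI-style prompt as an illustration.
    prompt = "When Mary and John went to the store, John gave a drink to"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=10)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later forward passes are unsteered
```

In the paper's setup the same kind of hook would be applied to Arabic prompts, with the direction chosen so that early-layer activations move toward the shared multilingual subspace.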
