

Poster in Workshop: Methods and Opportunities at Small Scale (MOSS)

On the Emergence of Position Bias in Transformers

Xinyi Wu · Yifei Wang · Stefanie Jegelka · Ali Jadbabaie

Keywords: [ position bias ] [ positional encoding ] [ deep learning theory ] [ attention mechanism ] [ transformers ]


Abstract:

Recent studies have revealed various manifestations of position bias in transformer architectures, from the “lost-in-the-middle” phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights. First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers, coupled with the causal mask, leads to a trade-off between long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
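
To make the two insights concrete, here is a minimal NumPy sketch of the kind of multi-layer cumulative-attention experiment the abstract describes. It is a toy construction of our own, not the authors' code: the random logits, the sequence length, the depth, and the decay strength 0.5 are all illustrative assumptions. It composes random causal attention maps across layers and measures each source position's total influence on the final representations, with and without a distance-based decay on the logits.

import numpy as np

rng = np.random.default_rng(0)
n, n_layers = 16, 6  # sequence length and number of attention layers (assumed)

def causal_attention(decay=0.0):
    """One random row-stochastic attention map with a causal mask and an
    optional distance-based decay penalty on the logits."""
    logits = rng.normal(size=(n, n))
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]  # dist[i, j] = i - j
    logits = logits - decay * np.maximum(dist, 0)          # penalize distant past
    logits[dist < 0] = -np.inf                             # causal mask: no future
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

for decay in (0.0, 0.5):  # 0.5 is an arbitrary decay strength for the demo
    influence = np.eye(n)
    for _ in range(n_layers):
        influence = causal_attention(decay) @ influence    # compose layers
    # Column j sums the weight all output tokens place on source position j.
    cols = influence.sum(axis=0)
    print(f"decay={decay}: first 4 positions {cols[:4].round(2)}, "
          f"last 4 positions {cols[-4:].round(2)}")

In this toy setting, with decay=0 the cumulative influence concentrates on the earliest positions as depth grows, mirroring the causal-mask bias; turning on the decay pulls each layer's attention toward recent tokens, and the printed column sums exhibit the trade-off between long-term decay and the cumulative importance of early positions described above.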
