

Poster

Consensus Is All You Get: The Role of Attention in Transformers

Alvaro Rodriguez Abella · João Pedro Silvestre · Paulo Tabuada

West Exhibition Hall B2-B3 #W-716
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

A key component of transformers is the attention mechanism, which orchestrates how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
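
As a rough illustration of the convergence phenomenon described in the abstract, the sketch below (not the authors' code) iterates a stripped-down self-attention update with an identity value matrix and no feed-forward or normalization blocks; all dimensions and weight matrices are arbitrary placeholders. Under these assumptions each layer replaces every token by a convex combination of all tokens, so the spread of the tokens is expected to shrink as depth grows.

# Minimal sketch of token collapse under repeated self-attention.
# Assumptions (not from the paper): single head, identity value matrix,
# no feed-forward or normalization blocks, random placeholder weights.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 8, 16                          # sequence length, embedding dim
X = rng.normal(size=(n_tokens, d))           # initial token embeddings
W_q = rng.normal(size=(d, d)) / np.sqrt(d)   # illustrative query weights
W_k = rng.normal(size=(d, d)) / np.sqrt(d)   # illustrative key weights

def attention_step(X):
    """One layer: X <- softmax(X W_q (X W_k)^T / sqrt(d)) X."""
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-stochastic attention matrix
    return A @ X                                  # convex combinations of tokens

for layer in range(60):
    X = attention_step(X)
    if layer % 10 == 0:
        spread = np.linalg.norm(X - X.mean(axis=0), axis=1).max()
        print(f"layer {layer:2d}: max distance to mean token = {spread:.4f}")

Because every softmax row is strictly positive, each update is a genuine averaging step over the token set, which is why the printed spread decreases toward zero with depth in this simplified setting.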

Lay Summary:

The self-attention mechanism is the key component of transformers, the main architecture behind modern Large Language Models (LLMs). What role does self-attention play in transformers? The previous literature suggests that self-attention forces tokens (words in a sentence or characters in a word) to collapse, that is, all tokens become the same as the depth of a transformer increases.

In this work, we mathematically prove that tokens collapse. Our results are empirically shown to hold, even when our modeling assumptions are not satisfied, using the GPT-2 and GPT-Neo models.

Having established that tokens collapse, our paper suggests the need to study the optimal transformer depth: transformers that are too shallow do not adequately use the context provided by a prompt, whereas transformers that are too deep ignore the prompt since all tokens become the same.
