Poster
LASER: Attention with Exponential Transformation
Sai Surya Duvvuri · Inderjit Dhillon
West Exhibition Hall B2-B3 #W-915
Transformers have had a tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor backpropagation of the gradient signal can lead to inefficient learning of the parameters preceding the attention operation. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 7.7 billion parameters, with an average improvement of up to 1.44% over standard attention on downstream evaluations and a 1.65% improvement in fine-tuning. Additionally, LASER demonstrates generalization performance improvements across a variety of tasks (vision, text, and speech): Vision Transformer (ViT) on ImageNet, Conformer on LibriSpeech speech-to-text, and BERT with 2.2 billion parameters.
We identified a key bottleneck in the attention mechanism used by transformers, which weakens the backpropagation signal and makes training inefficient. Our solution, LASER, applies a simple exponential transformation to the representations before the attention step, which strengthens the gradient signal. This method requires only minimal code changes and results in consistent performance improvements across text, image, and speech models.
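Below is a minimal sketch of what such a "small modification" could look like, assuming the exponential transformation amounts to exponentiating the value representations before the standard softmax-weighted sum and taking a logarithm afterwards, with a max shift for numerical stability. The function name, tensor shapes, and the stability trick are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def laser_attention(q, k, v):
    """Hypothetical sketch of LASER-style attention.

    Assumes the exponential transformation is applied to the value vectors
    before the usual softmax-weighted sum, with a log taken afterwards.
    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    # Standard softmax attention weights, unchanged.
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    # Shift v by its per-dimension max over the sequence so exp() does not overflow.
    v_max = v.amax(dim=-2, keepdim=True)
    weighted = attn @ torch.exp(v - v_max)  # attention applied to exp-transformed values
    return torch.log(weighted) + v_max      # undo the shift in log space


if __name__ == "__main__":
    b, h, n, d = 2, 4, 16, 32
    q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
    print(laser_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 32])
```

Because only the value path is wrapped in exp/log, the change slots into an existing attention implementation without touching the query-key softmax itself, which is consistent with the claim that only minimal code changes are required.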