Poster
Understanding and Improving Length Generalization in Recurrent Models
Ricardo Buitrago Ruiz · Albert Gu
East Exhibition Hall A-B #E-2106
Recently, recurrent models have emerged as alternative deep learning architectures with strong performance in many areas such as text, vision, and audio. Their main advantage over the widely used Transformers is their ability to process long sequences more efficiently. In practice, however, their performance on long sequences is as poor as that of Transformers, so this advantage remains unrealized. In this work, we study why recurrent models, despite their theoretical ability to process arbitrarily long sequences, fail to perform well on them. We then propose simple and inexpensive interventions that enable good performance on long sequences, allowing recurrent models to process sequences more than 64 times longer than before and thus realize their advantage over Transformers. Finally, we show that these interventions also improve performance on complex tasks that require long-context reasoning, such as answering a question whose answer is hidden in a very long context.
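To illustrate the efficiency claim above, here is a minimal, generic sketch of a linear recurrence: the model carries a fixed-size state across time steps, so processing a length-T sequence costs O(T) time and constant memory per step, regardless of how long the sequence is. This is not the paper's architecture or its proposed interventions; all names (linear_rnn, d_state, the toy matrices) are illustrative assumptions.

```python
import numpy as np

def linear_rnn(x, A, B, C):
    """Simple linear recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    d_state = A.shape[0]
    h = np.zeros(d_state)              # fixed-size recurrent state
    ys = []
    for x_t in x:                      # one pass over the sequence: O(T) time
        h = A @ h + B @ x_t            # update depends only on the previous state
        ys.append(C @ h)
    return np.stack(ys)

if __name__ == "__main__":
    T, d_in, d_state = 100_000, 8, 16  # a very long sequence, constant state size
    rng = np.random.default_rng(0)
    A = 0.99 * np.eye(d_state)         # stable dynamics, for illustration only
    B = 0.1 * rng.normal(size=(d_state, d_in))
    C = rng.normal(size=(4, d_state))
    x = rng.normal(size=(T, d_in))
    y = linear_rnn(x, A, B, C)
    print(y.shape)                     # (100000, 4): memory does not grow with T
```

By contrast, self-attention compares every position with every other, giving quadratic cost in sequence length; the recurrent formulation above is what makes very long inputs tractable in principle, and the paper asks why this theoretical advantage does not translate into good long-sequence performance in practice.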