Poster
Non-Asymptotic Length Generalization
Thomas Chen · Tengyu Ma · Zhiyuan Li
East Exhibition Hall A-B #E-2506
We study Length Generalization, an empirical phenomenon observed in the training of Large Language Models (LLMs). Length generalization occurs when a model fitted to data up to a certain length also performs well on longer data. For example, a model trained to multiply 20-digit by 20-digit numbers may also learn to multiply 100-digit by 100-digit numbers correctly. Length generalization is a non-trivial phenomenon, and it is useful for training models efficiently.

While most length generalization studies to date have been empirical, our paper provides a theoretical perspective. Its aim is to contribute to a better understanding of when length generalization does or does not occur.

We derive, with proof, mathematical expressions for the smallest training-data length that guarantees length generalization, in an idealized setting and with an idealized learning algorithm. We prove such guarantees for several classes of ground-truth functions that one may want a model to learn. Our main results concern ground-truth functions expressible as C-RASP programs, a theoretical abstraction of the transformer, the architecture underlying LLMs. For a subclass of C-RASP, we give explicit expressions for the smallest training-data length that guarantees length generalization.
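To make the flavor of a C-RASP-style program concrete, the Python sketch below implements a toy computation built from a prefix count and a position-wise threshold comparison, which is the kind of structure such programs use. The task (prefix-majority of 1s over 0s) and the function name are illustrative assumptions, not an example taken from the paper.

```python
# Hypothetical illustration of a C-RASP-style computation: each position's output
# is determined by counting over the prefix and applying a threshold comparison.
# The task (prefix-majority) is an invented example, not one from the paper.

def prefix_majority(bits: list[int]) -> list[int]:
    """Output 1 at position i iff the number of 1s in bits[:i+1] exceeds the number of 0s."""
    out = []
    ones = 0
    for i, b in enumerate(bits):
        ones += b                                # counting operation over the prefix
        zeros = (i + 1) - ones
        out.append(1 if ones > zeros else 0)     # position-wise threshold comparison
    return out

if __name__ == "__main__":
    # The same program is defined for inputs of any length: its counting and
    # comparison operations do not depend on a maximum sequence length.
    print(prefix_majority([1, 0, 1, 1, 0]))          # short input (e.g., "training length")
    print(prefix_majority([1, 0] * 50 + [1] * 20))   # much longer input
```

In this idealized setting, a learner that recovers the correct counting-and-comparison program from short inputs will compute the same function on arbitrarily long inputs, which is the sense of length generalization the guarantees address.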