

Poster

In-Context Learning and Occam's Razor

Eric Elmoznino · Tom Marty · Tejas Kasetty · Léo Gagnon · Sarthak Mittal · Mahan Fathi · Dhanya Sridhar · Guillaume Lajoie

East Exhibition Hall A-B #E-1603
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
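To make the claimed equivalence concrete: prequential coding encodes a dataset sequentially, paying a code cost of -log p(x_t | x_{<t}) nats for each observation under a model fit only on the preceding observations, so the total code length is exactly the cumulative next-token prediction loss. Below is a minimal sketch in Python illustrating this on a toy Bernoulli sequence; it is not the authors' implementation, and the smoothed count model and function names are illustrative assumptions.

```python
# Minimal sketch of prequential coding (illustrative, not the paper's code).
# Each bit is encoded under a model "learned in context" from the prefix;
# the total code length equals the summed next-token negative log-likelihood.
import math

def prequential_code_length(bits):
    """Code length (in nats) of a binary sequence under prequential coding."""
    ones, total, code_length = 0, 0, 0.0
    for b in bits:
        # Predict the next bit from the prefix (Laplace rule of succession).
        p_one = (ones + 1) / (total + 2)
        p = p_one if b == 1 else 1.0 - p_one
        code_length += -math.log(p)  # next-token NLL = incremental code cost
        # Update the model with the newly observed bit before moving on.
        ones += b
        total += 1
    return code_length

seq = [1, 1, 0, 1, 1, 1, 0, 1]
print(f"prequential code length: {prequential_code_length(seq):.3f} nats")
```

A model that adapts quickly from few observations (i.e., a simple one that fits the data) assigns high probability to upcoming tokens early, yielding a short code; this is the sense in which minimizing next-token loss jointly minimizes training error and implicit model complexity.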

Lay Summary:

A major challenge in machine learning (ML) is teaching models to make accurate predictions on data they've never seen before, a capability known as generalization. Typically, ML methods attempt this by not only fitting training data well but also keeping models simple—a principle called Occam's razor. However, most current techniques only indirectly encourage simplicity, often resulting in overly complex models that don't generalize well.

Our research uncovers a direct connection between Occam's razor and a popular ML technique known as "in-context learning," where large sequence models like the ones that power modern AI chatbots can quickly adapt to new tasks simply from examples provided in their prompt (without additional training). We show mathematically that the training method used by these models inherently balances fitting the data and keeping models simple. Specifically, this approach resembles a data compression method called "prequential coding," where simpler models better predict future data.

These insights explain why in-context learning works well and suggest ways to improve it, potentially leading to more reliable and efficient ML models.
