Poster
Nonlinear transformers can perform inference-time feature learning
Naoki Nishikawa · Yujin Song · Kazusato Oko · Denny Wu · Taiji Suzuki
West Exhibition Hall B2-B3 #W-909
Modern language models can learn new tasks at test time simply by observing a few examples, a phenomenon known as in-context learning. While it is well known that these models can implement a variety of algorithms in this way, the mechanism by which gradient-based training gives rise to such test-time adaptability remains poorly understood. Our research addresses this gap by studying a class of tasks in which the outcome depends on an unknown low-dimensional feature. We demonstrate that pretrained transformers can adaptively recover this feature during inference, without any retraining. These findings provide new theoretical insight into the sample efficiency of transformers at test time, along with provable guarantees that explain how this capability emerges from training.
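To make the task class concrete, below is a minimal sketch of how such an in-context prediction problem can be generated: each prompt hides a fresh low-dimensional feature direction, and the model must infer it from the context examples alone, with no weight updates. The NumPy code, the function name, and the ReLU link function are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def sample_icl_task(d=64, n_context=32, rng=None):
    """Sample one in-context task: the label depends on a single
    unknown feature direction w (a single-index-style model,
    used here only as an illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    # Unknown low-dimensional feature, drawn fresh for each task.
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    # Hypothetical link function (ReLU chosen purely for illustration).
    g = lambda z: np.maximum(z, 0.0)
    # Context examples the transformer observes at inference time.
    X = rng.standard_normal((n_context, d))
    y = g(X @ w) + 0.1 * rng.standard_normal(n_context)
    # Query input whose label must be predicted in context.
    x_query = rng.standard_normal(d)
    y_query = g(x_query @ w)
    return X, y, x_query, y_query

# The transformer receives (X, y, x_query) as a prompt; no retraining occurs.
X, y, x_query, y_query = sample_icl_task()
print(X.shape, y.shape, x_query.shape)  # (32, 64) (32,) (64,)
```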