

Poster
in
Workshop: Actionable Interpretability

Understanding Synthetic Context Extension via Retrieval Heads

Xinyu Zhao · Fangcong Yin · Greg Durrett

Sat 19 Jul 1 p.m. PDT — 2 p.m. PDT

Abstract:

Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of the "needle" concepts to be retrieved and the diversity of the surrounding "haystack" context, ranging from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of the learned retrieval heads and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
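To make the retrieval-head analysis concrete, below is a minimal sketch of how a per-head retrieval score can be estimated in the spirit of Wu et al. (2024): a head is credited whenever the position it attends to most strongly holds exactly the needle token being copied into the output. This is not the authors' code; the function name, data layout, and variable names are illustrative assumptions, and it presumes attention weights have already been extracted from the model.

```python
import numpy as np

def retrieval_scores(step_attentions, context_ids, generated_ids, needle_span):
    """Per-head retrieval score sketch (assumed interface, not the paper's code).

    step_attentions: list of arrays, one per decoding step, each of shape
                     (num_heads, context_len) -- attention from the current
                     position over the context tokens.
    context_ids:     array of shape (context_len,) with context token ids.
    generated_ids:   array of shape (num_steps,) with generated token ids.
    needle_span:     (start, end) indices of the needle inside the context.
    Returns an array of shape (num_heads,): fraction of copied needle tokens
    for which each head's top-attended position contains that exact token.
    """
    num_heads = step_attentions[0].shape[0]
    hits = np.zeros(num_heads)
    copy_steps = 0
    start, end = needle_span
    needle_tokens = set(context_ids[start:end].tolist())

    for attn, tok in zip(step_attentions, generated_ids):
        if tok not in needle_tokens:          # only score steps that copy from the needle
            continue
        copy_steps += 1
        top_pos = attn.argmax(axis=-1)        # top-attended context position per head
        hits += (context_ids[top_pos] == tok) # credit heads pointing at the copied token

    return hits / max(copy_steps, 1)
```

Aggregating these scores over many needle-in-a-haystack examples gives a ranking of heads; the abstract's claims about head overlap and the recall-performance correlation operate on statistics of this kind.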
