

Invited Talk in Workshop: Methods and Opportunities at Small Scale (MOSS)

Beyond benchmarks: the case for spherical cows in LLM research

Aditi Raghunathan

Sat 19 Jul 9:10 a.m. PDT — 9:55 a.m. PDT

Abstract:

Real-world benchmarks drive a lot of progress, but they cannot capture all key aspects of real-world deployment. How does one study notions like creativity and adaptability to new domains? These are messy, subjective, and cannot be captured easily in static benchmarks. I will argue that carefully constructed minimal examples and stylized settings---"spherical cows"---offer a powerful answer, helping surface important blind spots and limits in current paradigms.

For adaptation, I will walk through the story of how we discovered a surprising phenomenon of catastrophic overtraining, where pre-training on more tokens can hurt downstream fine-tuning. This challenges the core machine-learning belief that "more data is better". While the phenomenon was initially conceptualized via stylized models, we put it to the test at scale on real-world datasets and observe that an OLMo-1B model pre-trained on 3T tokens performs worse after fine-tuning than its less-trained 2.3T-token counterpart.
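To make the experimental protocol concrete, here is a minimal sketch of the checkpoint comparison in Python. Everything specific in it is an assumption: the checkpoint identifiers, the toy data, and the hyperparameters are hypothetical placeholders standing in for the actual OLMo-1B intermediate checkpoints and fine-tuning recipe.

```python
# Minimal sketch (not the authors' exact setup) of the comparison behind
# catastrophic overtraining: fine-tune two pre-training checkpoints of the
# same model under identical hyperparameters, then compare held-out loss.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = {
    "2.3T tokens": "org/lm-1b-2300B",  # hypothetical checkpoint id
    "3.0T tokens": "org/lm-1b-3000B",  # hypothetical checkpoint id
}
TRAIN = ["Q: What is the capital of France?\nA: Paris."]     # stand-in data
HELD_OUT = ["Q: What is the capital of Italy?\nA: Rome."]    # stand-in data

def finetune_then_eval(model_id: str, steps: int = 100, lr: float = 2e-5) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    opt = AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):  # identical fine-tuning recipe for every checkpoint
        batch = tok(TRAIN[step % len(TRAIN)], return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()
    with torch.no_grad():
        losses = []
        for text in HELD_OUT:
            batch = tok(text, return_tensors="pt")
            losses.append(model(**batch, labels=batch["input_ids"]).loss.item())
    return sum(losses) / len(losses)

for label, model_id in CHECKPOINTS.items():
    print(label, finetune_then_eval(model_id))
# The claim under test: the 3T-token checkpoint ends with *higher* held-out
# loss after identical fine-tuning than its 2.3T-token counterpart.
```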

For creativity, we construct minimal examples, inspired by cognitive science, that capture distinct notions of creativity. These examples show that next-token prediction is fundamentally myopic, underperforming multi-token approaches like teacherless training and diffusion models in generating diverse, original outputs. These settings also highlight that standard techniques such as temperature sampling may be suboptimal compared to injecting randomness through input prefixes; a sketch of that contrast follows.
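The sketch below, written against the Hugging Face transformers API with GPT-2 as a stand-in model, contrasts the two sources of diversity: temperature sampling, where randomness enters at every decoding step, versus a deterministic greedy decode whose randomness comes entirely from a random token prefix prepended to the input. The prefix scheme here is an illustrative stand-in for the idea described in the talk, not the authors' exact recipe.

```python
# Two ways to get diverse generations from the same model (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "Write a four-word phrase:"

def sample_with_temperature(n: int, temperature: float = 1.0):
    # Randomness enters at every decoding step via the softmax temperature.
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, temperature=temperature,
                         max_new_tokens=8, num_return_sequences=n,
                         pad_token_id=tok.eos_token_id)
    return [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in out]

def sample_with_random_prefix(n: int):
    # Randomness enters only through the *input*: a short random token prefix,
    # while decoding itself is greedy (deterministic).
    outs = []
    for _ in range(n):
        prefix = torch.randint(0, tok.vocab_size, (1, 4))
        ids = torch.cat([prefix, tok(prompt, return_tensors="pt").input_ids], dim=1)
        out = model.generate(ids, do_sample=False, max_new_tokens=8,
                             pad_token_id=tok.eos_token_id)
        outs.append(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    return outs

print(sample_with_temperature(3))
print(sample_with_random_prefix(3))
```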
