

Poster in Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding

Evaluating Forecasting is More Difficult than Other LLM Evaluations

Daniel Paleka · Shashwat Goel · Jonas Geiping · Florian Tramer

Keywords: [ data ] [ evaluation ] [ leakage ] [ forecasting ] [ LLMs ] [ criticism ]


Abstract:

Benchmarking Large Language Models (LLMs) on their ability to forecast world events holds potential as an evaluation of whether they truly possess effective world models. Recent work has claimed that LLMs achieve human-level forecasting performance. In this position paper, we argue that evaluating LLM forecasters presents unique challenges beyond those faced in standard LLM evaluations, raising concerns about the trustworthiness of current and future performance claims. We identify two broad categories of challenges: (1) difficulty in trusting evaluation results due to temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting ability. Through systematic analysis of these issues and concrete examples from prior work, we demonstrate how evaluation flaws can lead to overly optimistic assessments of LLM forecasting capabilities.
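To make the first category (temporal leakage) concrete, a common hygiene step when assembling a forecasting benchmark is to check that each question both closes and resolves after the evaluated model's training cutoff; otherwise the outcome, or news reporting on it, may already appear in the training data. The sketch below is illustrative only and is not taken from the paper: the field names, the example records, and the `training_cutoff` value are all assumptions.

```python
from datetime import date

# Hypothetical question records; in practice these would come from a
# forecasting platform's export (fields and values are illustrative).
questions = [
    {"id": "q1", "text": "Will X happen by 2024-06-30?",
     "close_date": date(2024, 6, 30), "resolution_date": date(2024, 7, 1)},
    {"id": "q2", "text": "Will Y happen by 2023-01-31?",
     "close_date": date(2023, 1, 31), "resolution_date": date(2023, 2, 1)},
]

# Assumed knowledge cutoff of the model being evaluated.
training_cutoff = date(2023, 12, 31)

def is_temporally_clean(q, cutoff):
    """Keep only questions whose outcome could not be in the training data:
    both the close date and the resolution date must postdate the cutoff."""
    return q["close_date"] > cutoff and q["resolution_date"] > cutoff

clean = [q for q in questions if is_temporally_clean(q, training_cutoff)]
leaked = [q["id"] for q in questions if not is_temporally_clean(q, training_cutoff)]

print(f"Kept {len(clean)} questions; flagged as potentially leaked: {leaked}")
```

Such a filter only rules out the most direct form of leakage; as the paper argues, trusting the resulting scores still requires ruling out subtler channels, such as retrieval systems surfacing post-cutoff information.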
