Poster in Workshop: TerraBytes: Towards global datasets and models for Earth Observation
The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation
Jeffrey Cardille · Renee Johnston · Simon Ilyushchenko · Johan Kartiwa · Zahra Shamsi · Matthew Abraham · Khashayar Azad · Kainath Ahmed · Emma Quick · Nuala Caughie · Noah Jencz · Karen Dyson · Andrea Nicolau · Maria Lopez-Ornelas · David Saah · Michael Brenner · Subhashini Venugopalan · Sameera Ponda
Sat 19 Jul 9 a.m. PDT — 5:30 p.m. PDT
With the increasing skill and adoption of Large Language Models (LLMs) in the sciences, evaluating their capabilities across a wide variety of application domains is crucial. This work focuses on evaluating LLM-based agents on Earth Observation tasks, particularly those involving the analysis of satellite imagery and geospatial data. We introduce the Cloud-Based Geospatial Benchmark (CBGB), a set of challenges designed to measure how well LLMs can generate code that produces short numerical answers to 45 practical scenarios in geography and environmental science. While the benchmark questions are framed to assess broadly applicable geospatial data analysis skills, their implementation is most readily achieved using the extensive data catalogs and powerful APIs of platforms like Earth Engine. The questions and reference solutions in CBGB were curated by experts with both domain familiarity in Earth Observation and programming expertise. We also estimate and report the difficulty of each problem. We evaluate the performance of frontier LLMs on these tasks with and without access to an execution environment that provides error-correction feedback. Using the benchmark, we assess how LLMs handle practical Earth Observation questions across a range of difficulty levels. We find that models given error-correction feedback, which mirrors the iterative development process common in geospatial analysis, perform consistently better, with the highest score reaching 71%; reasoning variants of models also outperform their non-thinking counterparts. Finally, we share detailed guidelines on curating such practical scenarios and assessing their suitability for evaluating agents in the geospatial domain.
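To make the execute-and-retry setup described above concrete, the sketch below shows one plausible shape of an evaluation loop in which the model drafts code, the harness runs it, and any traceback is returned to the model for correction. This is an illustrative assumption, not the actual CBGB harness: the `query_llm` callable, the retry budget `MAX_ATTEMPTS`, the `answer` variable convention, and the numeric tolerance are all hypothetical placeholders.

```python
# Hypothetical sketch of an LLM evaluation loop with execution-based
# error-correction feedback (not the actual CBGB harness).
import traceback

MAX_ATTEMPTS = 3       # assumed retry budget per question
REL_TOLERANCE = 0.01   # assumed relative tolerance for numeric answers


def run_candidate(code: str) -> float:
    """Execute generated code and return the value it binds to `answer`."""
    namespace: dict = {}
    exec(code, namespace)  # in practice this would run in a sandboxed environment
    return float(namespace["answer"])


def evaluate_question(question: str, reference: float, query_llm) -> bool:
    """Return True if the model produces a numerically correct answer.

    `query_llm(prompt) -> str` is a placeholder for whatever model API is used.
    """
    prompt = (
        "Write Python (e.g. using the Earth Engine API) that computes the "
        "answer to the question below and stores it in a variable `answer`.\n\n"
        + question
    )
    for _ in range(MAX_ATTEMPTS):
        code = query_llm(prompt)
        try:
            predicted = run_candidate(code)
        except Exception:
            # Error-correction feedback: append the traceback and ask for a fix.
            prompt += (
                "\n\nYour code raised:\n" + traceback.format_exc() + "\nPlease fix it."
            )
            continue
        return abs(predicted - reference) <= REL_TOLERANCE * max(abs(reference), 1.0)
    return False
```

Under these assumptions, the no-feedback baseline corresponds to setting MAX_ATTEMPTS to 1, while larger budgets mirror the iterative refinement that the abstract credits for the stronger results.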