Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Think with Moderation: Reasoning Models and Confidence Calibration in the Climate Domain
Romain Lacombe · Kerrie Wu · Eddie Dilworth
Keywords: [ evaluation ] [ reasoning models ] [ calibration ]
The rapid deployment of Large Language Models (LLMs) as question-answering tools calls for nuanced evaluations, particularly in the climate science domain, where online information shapes public opinion. Recent reasoning LLMs have made strides in reducing hallucinations, but still struggle with confidence calibration. We expand on previous work by Lacombe et al. (2023) and use the ClimateX dataset, a curated natural-language corpus of 8094 statements sourced from the Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6), expert-labeled for confidence levels by climate scientists. Using this dataset, we show that recent LLMs can assess human expert confidence in climate statements, reaching 48.7% accuracy with reasoning, and essentially solve the task at 89.3% accuracy with search-augmented generation, well exceeding the performance of non-expert humans. However, we also show that increasing the reasoning budget yields diminishing and eventually negative returns in performance. Our results point to information retrieval as a more powerful lever than test-time scaling for grounding reasoning models.
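To make the evaluation setup concrete, below is a minimal sketch of how exact-match accuracy against expert confidence labels might be computed. It assumes the ClimateX statements are available as (statement, expert label) pairs on the IPCC confidence scale (here taken as low / medium / high / very high), and uses a placeholder `model_confidence` function standing in for whichever LLM call produces a predicted label; the example statements and labels are illustrative, not actual dataset entries.

```python
# Sketch of label-accuracy evaluation on expert-annotated climate statements.
# Assumptions: a four-level IPCC-style confidence scale, and a placeholder
# model call; neither is taken from the paper's released code.

IPCC_LEVELS = ["low", "medium", "high", "very high"]


def model_confidence(statement: str) -> str:
    """Placeholder for an LLM call that maps a statement to one confidence level.

    A real evaluation would prompt the model (optionally with retrieved
    context) and parse its answer into one of IPCC_LEVELS.
    """
    return "high"  # trivial constant baseline, only to keep the sketch runnable


def exact_match_accuracy(dataset: list[tuple[str, str]]) -> float:
    """Fraction of statements whose predicted level equals the expert label."""
    correct = sum(model_confidence(stmt) == label for stmt, label in dataset)
    return correct / len(dataset)


if __name__ == "__main__":
    # Illustrative placeholder items, not real ClimateX entries.
    toy_split = [
        ("Example climate statement A.", "high"),
        ("Example climate statement B.", "very high"),
    ]
    print(f"Exact-match accuracy: {exact_match_accuracy(toy_split):.1%}")
```

In this framing, comparing a reasoning model, a search-augmented model, and non-expert humans reduces to running the same accuracy computation over the same held-out statements with different prediction functions.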