Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Think with Moderation: Reasoning Models and Confidence Calibration in the Climate Domain
Romain Lacombe · Kerrie Wu · Eddie Dilworth
Keywords: [ evaluation ] [ reasoning models ] [ calibration ]
The rapid deployment of Large Language Models (LLMs) as question-answering tools calls for nuanced evaluations, particularly in the climate science domain, where online information shapes public opinion. Recent reasoning LLMs have made strides in reducing hallucinations, but still struggle with confidence calibration. We expand on previous work by Lacombe et al. (2023) and use the ClimateX dataset, a curated natural-language corpus of 8094 statements sourced from the Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6), expert-labeled for confidence levels by climate scientists. Using this dataset, we show that recent LLMs can assess human expert confidence in climate statements, reaching 48.7% accuracy with reasoning, and essentially solve the task at 89.3% accuracy with search-augmented generation, well exceeding the performance of non-expert humans. However, we also show that increasing the reasoning budget yields diminishing and eventually negative returns in performance. Our results point to information retrieval as a more powerful lever than test-time scaling for grounding reasoning models.
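To make the evaluation setup concrete, below is a minimal sketch of how exact-match accuracy against expert confidence labels might be computed. It assumes the ClimateX statements are available as (statement, expert label) pairs on the IPCC confidence scale (here taken as low / medium / high / very high), and uses a placeholder `model_confidence` function standing in for whichever LLM call produces a predicted label; the example statements and labels are illustrative, not actual dataset entries.

```python
# Sketch of label-accuracy evaluation on expert-annotated climate statements.
# Assumptions: a four-level IPCC-style confidence scale, and a placeholder
# model call; neither is taken from the paper's released code.

IPCC_LEVELS = ["low", "medium", "high", "very high"]


def model_confidence(statement: str) -> str:
    """Placeholder for an LLM call that maps a statement to one confidence level.

    A real evaluation would prompt the model (optionally with retrieved
    context) and parse its answer into one of IPCC_LEVELS.
    """
    return "high"  # trivial constant baseline, only to keep the sketch runnable


def exact_match_accuracy(dataset: list[tuple[str, str]]) -> float:
    """Fraction of statements whose predicted level equals the expert label."""
    correct = sum(model_confidence(stmt) == label for stmt, label in dataset)
    return correct / len(dataset)


if __name__ == "__main__":
    # Illustrative placeholder items, not real ClimateX entries.
    toy_split = [
        ("Example climate statement A.", "high"),
        ("Example climate statement B.", "very high"),
    ]
    print(f"Exact-match accuracy: {exact_match_accuracy(toy_split):.1%}")
```

In this framing, comparing a reasoning model, a search-augmented model, and non-expert humans reduces to running the same accuracy computation over the same held-out statements with different prediction functions.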