

Poster

RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

David Reber · Sean Richardson · Todd Nief · Cristina Garbacea · Victor Veitch

East Exhibition Hall A-B #E-2902
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
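To make the double-rewrite idea concrete, here is a minimal sketch of the estimator as the abstract describes it. The interfaces are hypothetical (a `reward_model` callable that scores a response and a `rewrite` callable that uses an LLM to set a binary attribute on or off); this is an illustration of the technique, not the paper's released implementation.

```python
import statistics

def rate_effect(reward_model, rewrite, responses):
    """Estimate the causal effect of a binary attribute on a reward model's
    score via double rewrites.

    Hypothetical interfaces (assumptions, not the paper's code):
      reward_model(text) -> float       scores a response
      rewrite(text, target) -> str      LLM rewrite setting the attribute to `target`
      responses: list of (text, has_attribute) pairs
    """
    effects = []
    for text, has_attribute in responses:
        if has_attribute:
            # Rewrite the attribute off, then back on. Both texts compared
            # below have passed through the rewriter the same number of
            # times minus one, so rewrite artifacts largely cancel.
            off = rewrite(text, target=False)
            back_on = rewrite(off, target=True)
            effects.append(reward_model(back_on) - reward_model(off))
        else:
            # Symmetric case: rewrite the attribute on, then back off.
            on = rewrite(text, target=True)
            back_off = rewrite(on, target=False)
            effects.append(reward_model(on) - reward_model(back_off))
    # Average per-response contrasts to estimate the attribute's effect.
    return statistics.mean(effects)
```

The key design point is that a naive estimator would compare a rewritten response against the original, so any systematic artifact of the rewriting process (e.g., changes to length or style) would be attributed to the target attribute; comparing a rewrite against a rewrite-of-a-rewrite instead puts that artifact on both sides of the contrast.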

Lay Summary:

Reward models are used to align large language models (LLMs) with human preferences, for example, preferring responses that are more helpful, polite, or concise. But these models are often opaque: it's hard to tell which features of a response they are actually rewarding. Are they responding to helpfulness, or simply to shorter answers? We introduce RATE, which measures how much specific attributes, such as sentiment or length, influence a reward model's scores. RATE works by rewriting responses to isolate individual attributes and observing how the model's score changes. Because these rewrites are imperfect, RATE applies a correction based on rewriting twice, which reduces the resulting measurement error. This way, when attempts to make an LLM helpful, polite, or concise fail, we can home in on whether the reward model was at fault or something else in the alignment process went wrong.
