Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)
Robust Reward Modeling via Causal Rubrics
Pragya Srivastava · Harman Singh · Rahul Madhavan · Gandharv Patil · Sravanti Addepalli · Arun Sai Suggala · Rengarajan Aravamudhan · Soumya Sharma · Anirban Laha · Aravindan Raghuveer · Karthikeyan Shanmugam · Doina Precup
Reward models (RMs) for LLM alignment often exhibit reward hacking, mistaking spurious correlates (e.g., length, format) for causal quality drivers (e.g., factuality, relevance), which makes the resulting models brittle. We introduce CROME (Causally Robust Reward Modeling), a causally grounded framework that mitigates this with targeted augmentations. CROME employs: (1) Causal Augmentations, pairs isolating specific causal attribute changes, to enforce sensitivity, and (2) Neutral Augmentations, tie-labeled pairs varying spurious attributes while preserving causal content, to enforce invariance. Crucially, the augmentations target LLM-identified causal rubrics, requiring no prior knowledge of spurious factors. CROME significantly outperforms baselines on RewardBench (Avg +5.4%, Safety +13.2%, Reasoning +7.2%) and demonstrates enhanced robustness via improved Best-of-N performance across RewardBench, WildGuardTest, and GSM8k.
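As a rough illustration of the idea described above, the sketch below shows how causal (preference-labeled) and neutral (tie-labeled) augmentation pairs could be combined in a reward-model training objective: a Bradley-Terry term enforces sensitivity to causal edits, while a tie term penalizes reward differences on spurious-only edits. This is a minimal sketch under assumed choices (a scalar reward model, a squared-difference tie penalty, and the function name `crome_style_loss`); the abstract does not specify CROME's exact objective.

```python
import torch
import torch.nn.functional as F

def crome_style_loss(r_chosen, r_rejected, r_neutral_a, r_neutral_b, tie_weight=1.0):
    """Illustrative combined loss over augmented pairs.

    Causal augmentation pairs: (r_chosen, r_rejected) are scalar rewards for a
    causally improved response and its degraded counterpart; the model should
    rank the improved response higher (sensitivity).

    Neutral augmentation pairs: (r_neutral_a, r_neutral_b) are rewards for two
    responses that differ only in spurious attributes (e.g., length, format);
    the model should assign them near-equal reward (invariance).

    All arguments are tensors of shape (batch,).
    """
    # Sensitivity: Bradley-Terry preference term on causal pairs.
    preference_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Invariance: penalize reward gaps on tie-labeled neutral pairs.
    tie_loss = (r_neutral_a - r_neutral_b).pow(2).mean()
    return preference_loss + tie_weight * tie_loss

if __name__ == "__main__":
    # Toy usage with random rewards standing in for a reward model's outputs.
    batch = 4
    r_c, r_r = torch.randn(batch), torch.randn(batch)
    r_a, r_b = torch.randn(batch), torch.randn(batch)
    print(crome_style_loss(r_c, r_r, r_a, r_b).item())
```

In practice the rewards would come from a learned reward head over LLM responses, and the augmentation pairs would be generated against LLM-identified causal rubrics as the abstract describes; the tie penalty here is just one plausible way to encode the tie label.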