Poster in Workshop: 2nd AI for Math Workshop @ ICML 2025
Reward Under Attack: Evaluating the Sensitivity of Process Reward Models
Udbhav Bamba · Rishabh Tiwari · Heng Yang · Kurt Keutzer · Amir Gholaminejad
Reward models (RMs) supervise large language models (LLMs) by aligning outputs with human preferences. Recently, process reward models (PRMs) have emerged to provide finer-grained evaluation by scoring intermediate reasoning steps. Despite their growing importance, the robustness and biases of PRMs under textual perturbations remain largely unexplored. In this work, we introduce PRMProbe, a framework for systematically auditing the sensitivity of PRMs to input modifications. We augment ProcessBench, a publicly released benchmark of question–answer trajectories, with eight types of controlled perturbations and release this extended benchmark as PRM-BiasBench. The perturbations include semantics-preserving modifications (e.g., rephrasing) and semantics-altering ones (e.g., injected hallucinations). Our analysis reveals that, unlike RMs, which have known biases such as a preference for longer responses, PRMs are generally robust to superficial edits such as rephrasing and verbosity changes, but exhibit varying degrees of vulnerability to semantics-altering attacks. Surprisingly, a substantial fraction of semantically corrupted trajectories still receive unchanged or high rewards, suggesting that PRMs can overlook logical errors as long as a trajectory remains fluent and well structured. These findings expose critical limitations in current PRM designs and underscore the need for more semantically grounded evaluation strategies.
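To make the audit procedure concrete, the following is a minimal sketch of the kind of perturb-and-rescore loop the abstract describes: apply a controlled perturbation to a reasoning trajectory and measure how the PRM's per-step scores shift. The perturbation functions, the `prm_score` interface, and the dummy PRM below are illustrative assumptions, not the authors' released PRMProbe code.

```python
from typing import Callable, List

def rephrase(steps: List[str]) -> List[str]:
    """Semantics-preserving perturbation: superficial rewording (stub)."""
    return [s.replace("Therefore", "Hence") for s in steps]

def inject_hallucination(steps: List[str]) -> List[str]:
    """Semantics-altering perturbation: append an unsupported claim to one step."""
    corrupted = list(steps)
    corrupted[len(steps) // 2] += " Therefore the intermediate value equals 42."
    return corrupted

def reward_shift(
    prm_score: Callable[[str, List[str]], List[float]],  # any PRM wrapper returning one score per step
    question: str,
    steps: List[str],
    perturb: Callable[[List[str]], List[str]],
) -> float:
    """Mean per-step change in PRM reward induced by a perturbation."""
    base = prm_score(question, steps)
    pert = prm_score(question, perturb(steps))
    n = min(len(base), len(pert))
    return sum(p - b for b, p in zip(base[:n], pert[:n])) / n

if __name__ == "__main__":
    def dummy_prm(question: str, steps: List[str]) -> List[float]:
        # Stand-in PRM that rewards every step equally (demonstration only).
        return [0.9 for _ in steps]

    q = "What is 12 * 7?"
    traj = ["We need to multiply 12 by 7.", "Compute 12 * 7 = 84.", "Therefore the answer is 84."]
    print("rephrase shift:", reward_shift(dummy_prm, q, traj, rephrase))
    print("hallucination shift:", reward_shift(dummy_prm, q, traj, inject_hallucination))
```

A robust PRM would show near-zero shift under the semantics-preserving edit and a clearly negative shift under the hallucination injection; the paper's finding is that many corrupted trajectories instead keep high scores.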