Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)
A Unified Perspective on Reward Distillation Through Ratio Matching
Kenan Hasanaliyev · Schwinn Saereesitthipitak · Rohan Sanda
While Direct Preference Optimization (DPO) revolutionized language model alignment by eliminating the need for explicit reward models and reinforcement learning, settings with access to a high-quality reward model (RM) trained on an extensive preference dataset still benefit from leveraging that resource. Reward model distillation techniques such as REBEL have emerged as a class of approaches that do so without the added complexity of reinforcement learning. In this paper, we derive REBEL through a ratio matching framework and relate it to existing preference optimization methods, unifying the different approaches to reward distillation within the broader preference optimization landscape. Empirical evaluation on the HH-RLHF dataset with Pythia 2.8B in an offline setting shows that REBEL achieves twice the reward margin of DPO, demonstrating the advantage of incorporating explicit reward signals when available.
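For orientation only, here is a minimal sketch of the objectives as they are commonly stated in the literature, not as presented in this paper: the paper's ratio-matching derivation may write them differently. The symbols below are assumptions for illustration: \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) a fixed reference policy, \(r\) the reward model, \(\beta\) and \(\eta\) temperature-like coefficients, and \((y_w, y_l)\) or \((y, y')\) a pair of responses to prompt \(x\). DPO scores the preferred response against the dispreferred one through log-ratios to the reference policy, while REBEL regresses the difference of log-ratios onto the difference of rewards, which is the sense in which both can be viewed as matching ratios to a reward signal:
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]
\[
\mathcal{L}_{\mathrm{REBEL}}(\theta) = \mathbb{E}_{(x,\,y,\,y')}\!\left[\left(\frac{1}{\eta}\left(\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} - \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}\right) - \bigl(r(x, y) - r(x, y')\bigr)\right)^2\right]
\]
The contrast makes the abstract's claim concrete: DPO only uses the binary preference label, whereas the REBEL-style regression consumes the explicit reward margin \(r(x,y) - r(x,y')\) from the RM.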