

Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

A Unified Perspective on Reward Distillation Through Ratio Matching

Kenan Hasanaliyev · Schwinn Saereesitthipitak · Rohan Sanda


Abstract:

While Direct Preference Optimization (DPO) revolutionized language model alignment by eliminating the need for explicit reward models and reinforcement learning, settings with access to a high-quality reward model (RM) trained on an extensive preference dataset can still benefit from leveraging that resource. Reward model distillation techniques such as REBEL have emerged as a class of approaches that use such RMs without the added complexity of reinforcement learning. In this paper, we derive REBEL through a ratio matching framework and show its relation to existing preference optimization methods, unifying reward distillation approaches within the broader preference optimization landscape. Empirical evaluation on the HH-RLHF dataset with Pythia 2.8B in the offline setting shows that REBEL achieves twice the reward margin of DPO, demonstrating the advantages of incorporating explicit reward signals when available.
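For context, the two objectives compared in the abstract can be sketched as follows. These are the standard forms from the original DPO and REBEL papers, stated in our own notation as a reference point; the poster's exact ratio-matching formulation may differ. DPO applies a logistic loss to the difference of scaled log-probability ratios between a pair of responses, whereas REBEL regresses that same log-ratio difference directly onto the difference of the reward model's scores, which is the "ratio matching" view referred to above.

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

\[
\mathcal{L}_{\mathrm{REBEL}}(\theta) = \mathbb{E}_{(x,\,y,\,y')}\!\left[\left(\frac{1}{\eta}\left(\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} - \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}\right) - \bigl(r(x, y) - r(x, y')\bigr)\right)^{2}\right]
\]

Here \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) a fixed reference policy (the previous iterate in online REBEL; the SFT model in the offline setting used here), \(r\) the reward model, \(\sigma\) the logistic function, and \(\beta, \eta > 0\) scaling hyperparameters.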
