

Poster

RuleAdapter: Dynamic Rules for training Safety Reward Models in RLHF

Xiaomin Li · Mingye Gao · Zhiwei Zhang · Jingxuan Fan · Weiyu Li

East Exhibition Hall A-B #E-2902
[ Project Page ]
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Reinforcement Learning from Human Feedback (RLHF) is widely used to align models with human preferences, particularly to enhance the safety of responses generated by LLMs. This method traditionally relies on choosing preferred responses from response pairs. However, due to variations in human opinions and the difficulty of making an overall comparison between two responses, there is a growing shift towards a fine-grained annotation approach that assesses responses against multiple specific metrics or rules. Selecting and applying these rules efficiently while accommodating the diversity of preference data remains a significant challenge. In this paper, we introduce a dynamic approach that adaptively selects the most critical rules for each pair of responses. We develop a mathematical framework that leverages the maximum discrepancy within each response pair and theoretically show that this strategy optimizes the mutual information between the rule-based labeling and the hidden ground-truth preferences. We then train an 8B reward model on the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of May 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models.
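To make the selection step concrete, the following is a minimal Python sketch of the adaptive rule-selection idea described in the abstract: score both responses against every candidate safety rule, keep the rules where the two responses differ most, and aggregate only those into the pairwise label. The function names, the scoring interface, and the choice of k = 5 are illustrative assumptions, not the authors' implementation.

```python
# Sketch of discrepancy-based rule selection for labeling a response pair.
# `score_response(prompt, response, rule)` is an assumed rule-scoring
# function (e.g., an LLM judge returning a score in [0, 1]); it is not
# part of the paper's released code.

from typing import Callable, List, Tuple


def select_top_k_rules(
    prompt: str,
    response_a: str,
    response_b: str,
    rules: List[str],
    score_response: Callable[[str, str, str], float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k rules with the largest per-rule score gap |score_a - score_b|."""
    gaps = []
    for rule in rules:
        score_a = score_response(prompt, response_a, rule)
        score_b = score_response(prompt, response_b, rule)
        gaps.append((rule, abs(score_a - score_b)))
    # Rules on which the two responses diverge most carry the most
    # information about which response should be preferred.
    gaps.sort(key=lambda item: item[1], reverse=True)
    return gaps[:k]


def label_preference(
    prompt: str,
    response_a: str,
    response_b: str,
    rules: List[str],
    score_response: Callable[[str, str, str], float],
    k: int = 5,
) -> str:
    """Aggregate scores over only the selected rules into a preference label."""
    selected = select_top_k_rules(prompt, response_a, response_b, rules, score_response, k)
    total_a = sum(score_response(prompt, response_a, rule) for rule, _ in selected)
    total_b = sum(score_response(prompt, response_b, rule) for rule, _ in selected)
    return "A" if total_a >= total_b else "B"
```

Pairs labeled this way can then be used as preference data for reward-model training; the sketch omits the paper's mutual-information analysis and simply illustrates why the largest-discrepancy rules are the natural ones to keep.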

Lay Summary:

Training AI systems to follow human values and avoid harmful behavior is a major challenge. A popular approach is to train these systems using human preferences — for example, by showing them two answers and picking the safer one. But people often disagree, and it's hard to explain what exactly makes one answer better than another.

Our research proposes a better way: instead of comparing answers directly, we judge them based on specific safety rules — like avoiding misinformation or harmful advice — and select the most relevant rules for each situation. We built a system, called the Rule Adapter, that picks the five most important rules for any given example, focusing on the biggest differences between the answers. This makes the training process more efficient and more interpretable.

We used this method to train a new AI safety model, which now ranks No. 1 on a public leaderboard — beating many much larger models. This approach could make it easier to train safer and more responsible AI systems, while reducing the need for expensive human input.
