Invited Talk in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)
The Limits of Preferences: Navigating Human-AI Feedback Tradeoffs in Alignment
Valentina Pyatkin
Preference feedback has become a cornerstone of AI alignment, yet questions remain about its limits and benefits, especially in comparison to more verifiable RL rewards. This talk examines two challenges in preference-based learning and proposes solutions to address them. First, I will explore the inherent disagreements and underspecification in human preference annotations, demonstrating how annotator divergence often arises from pluralistic viewpoints rather than annotation error. I will then present approaches for computationally modeling these diverging preferences and discuss their implications for alignment. Second, I will address the challenge of scaling preference feedback by presenting a routing approach that allocates instances between human and AI annotators, balancing performance and efficiency by combining human insight with automated feedback. I will conclude by discussing the conditions under which preference feedback offers advantages over verifiable rewards, and by outlining when, how, and why to deploy human preference feedback in AI alignment.