

Invited Talk in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

The Limits of Preferences: Navigating Human-AI Feedback Tradeoffs in Alignment

Valentina Pyatkin

Fri 18 Jul 11:30 a.m. PDT — 12:05 p.m. PDT

Abstract:

Preference feedback has become a cornerstone of AI alignment, yet questions remain about its limits and benefits, especially compared to other, more verifiable types of RL rewards. This talk examines two challenges in preference-based learning and proposes solutions to address them. First, I will explore the inherent disagreements and underspecification in human preference annotations, demonstrating how annotator divergence often arises from pluralistic viewpoints rather than annotation error. I will then present approaches for computationally modeling these diverging preferences and discuss their implications for alignment. Second, I will address the challenge of scaling preference feedback by presenting a routing approach that allocates instances between human and AI annotators. This approach balances performance and efficiency by combining human insight with automated feedback. I will conclude by discussing the conditions under which preference feedback provides advantages over verifiable rewards, and outline when, how, and why to deploy human preference feedback in AI alignment.
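The general routing idea mentioned in the abstract can be pictured with a minimal sketch: under a fixed human-annotation budget, send the instances where an automated judge is least confident to human annotators and label the rest automatically. This is a hypothetical illustration of such an allocation scheme, not the method presented in the talk; the `ai_confidence` scorer and `human_budget` parameter are assumptions introduced here for the example.

```python
# Toy sketch (assumption, not the talk's method): route preference instances
# between human and AI annotators under a fixed human-labeling budget.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Instance:
    prompt: str
    response_a: str
    response_b: str


def route_instances(
    instances: List[Instance],
    ai_confidence: Callable[[Instance], float],  # assumed judge confidence in [0, 1]
    human_budget: int,
) -> Tuple[List[Instance], List[Instance]]:
    """Send the `human_budget` lowest-confidence instances to humans;
    label the remaining instances with the AI judge."""
    ranked = sorted(instances, key=ai_confidence)  # least confident first
    to_humans = ranked[:human_budget]
    to_ai = ranked[human_budget:]
    return to_humans, to_ai
```

In this sketch the only design choice is the ranking signal: any per-instance uncertainty or disagreement estimate could stand in for `ai_confidence`.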
