

Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

Mechanism Design for Alignment via Human Feedback

Julian Manyika · Michael Wooldridge · Jiarui Gan


Abstract: Ensuring the faithfulness of human feedback is crucial for effectively aligning large language models (LLMs) using reinforcement learning from human feedback (RLHF), as low-effort or dishonest reporting can significantly undermine the quality of this feedback and, consequently, the alignment process. We address the challenge of eliciting faithful pairwise feedback by framing it as a mechanism design problem. We introduce a new principal-agent model for preference elicitation that incorporates both effort and truthfulness as key aspects of annotator strategies and mirrors the assumptions made in reward modeling for RLHF. We then define three incentive compatibility properties that desirable mechanisms should satisfy: Uninformed Equilibrium Incompatibility, $\omega$-Bayes-Nash Incentive Compatibility, and Effort Competitiveness. We propose a novel mechanism framework called Acyclic Peer Agreement (APA), which we aim to prove satisfies all three incentive compatibility properties. We conclude by discussing next steps and outlining future research directions in the design of robust mechanisms for preference elicitation.
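For context on the reward-modeling assumptions the abstract refers to, a minimal sketch of the Bradley-Terry preference model commonly assumed in RLHF reward modeling is given below; this is standard background rather than the paper's principal-agent model or the APA mechanism, whose definitions are not reproduced here.

$$
P(y_1 \succ y_2 \mid x) \;=\; \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} \;=\; \sigma\big(r(x, y_1) - r(x, y_2)\big),
$$

where $x$ is a prompt, $y_1$ and $y_2$ are candidate responses, $r$ is the learned reward model, and $\sigma$ is the logistic function. Annotators' pairwise comparisons are treated as samples from this distribution, which is why low-effort or untruthful reporting directly corrupts the learned reward signal.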
