

Invited Talk in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

Alignment is social: lessons from human alignment for AI

Gillian Hadfield

Fri 18 Jul 10:25 a.m. PDT — 11 a.m. PDT

Abstract:

Current approaches conceptualize the alignment challenge as one of eliciting individual human preferences and training models to choose outputs that satisfy those preferences. To the extent these approaches consider the fact that the world is composed of many individuals, they do so only by seeking to reconcile or aggregate pluralistic, but still individual, preferences. But these approaches are not grounded in a well-founded theory of how humans and human societies work. Humans are fundamentally social beings, and the challenge of inducing self-interested humans to act in ways that are good for others is the fundamental alignment challenge of human societies. Alignment in human societies is not achieved by inducing the same or average innate preferences in individuals, but by aligning individual behaviors with normative classifications (which behaviors are acceptable and which are not) reached through informal and formal social processes (which we can call institutions). In this talk I'll discuss three ideas for shifting our approaches to AI alignment based on the human model: building normatively competent AI agents; using reinforcement learning to train models to produce aligned justifications for their behaviors that perform well in a discursive social debate context; and developing true jury procedures for democratic human oversight of model behaviors.
