Poster
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar · Vikrant Varma · David Lindner · David Elson · Caleb Biddulph · Ian Goodfellow · Rohin Shah
East Exhibition Hall A-B #E-902
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method that prevents agents from learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even when humans are unable to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not have access to. We study MONA empirically in three settings that model different misalignment failure modes: two-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
As AI agents are increasingly tasked with solving complex multi-step problems, a key risk is "reward hacking"—where an AI finds loopholes to get reward in unintended ways. For instance, an AI coding assistant tasked with first writing test cases and then a solution might write overly simple tests in its first step, just so it can easily "pass" them without actually solving the problem well.

Our method, MONA, addresses this. MONA trains the AI to focus on the quality of its immediate action, as assessed by a supervisor that considers the long-term impact of the AI's actions. This requires the AI to make plans the supervisor understands and approves of, which rules out many "reward hacking" plans.

Our experiments show that MONA reduces reward hacking in different environments, including a coding task and a loan-application decision-making task. The results suggest that MONA can help make AI more reliable and safer by reducing many particularly difficult-to-detect instances of reward hacking.
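The practical difference from ordinary RL lies in how credit is assigned across steps. The sketch below is a simplification under assumed names (ordinary_rl_objective, mona_objective, approval_rewards are illustrative, not the paper's implementation): ordinary RL reinforces each action with the discounted return-to-go, so an action can be reinforced because of what it enables later, whereas MONA-style myopic optimization reinforces each action only with the overseer's immediate, far-sighted approval of that action.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation)
# contrasting ordinary RL credit assignment with MONA-style myopic
# optimization. log_probs are per-step policy log-probabilities; in practice
# these would be differentiable tensors that the objective is backpropagated
# through.

def ordinary_rl_objective(log_probs, rewards, gamma=0.99):
    """Ordinary RL: each action is credited with the discounted return-to-go,
    so a low-reward early step can still be reinforced if it sets up a
    high-reward later step -- the mechanism behind multi-step reward hacks."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return sum(lp * g for lp, g in zip(log_probs, returns))

def mona_objective(log_probs, approval_rewards):
    """MONA-style: each action is credited only with the overseer's approval
    of that action. The approval itself is non-myopic (the overseer judges the
    action's expected long-term value), but no reward earned at later steps
    flows back into the update for earlier actions."""
    return sum(lp * r for lp, r in zip(log_probs, approval_rewards))
```

Under this myopic objective, an action such as writing deliberately weak tests is not reinforced by the easy "pass" it enables later; it is only reinforced to the extent the overseer approves of it on its own merits.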