Poster
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine
East Exhibition Hall A-B #E-706
Making AI Safety More Human-Understandable and Flexible

**The Problem:** Current AI safety systems are often "black boxes": it is hard to understand their decisions, and they cannot easily be adjusted to the different safety needs of different applications and user populations.

**Our Solution:** We created SafetyAnalyst, a system that transparently evaluates potential AI actions. It builds a "harm-benefit tree" detailing who might be affected by a given AI action, the harmful and beneficial consequences to them, and how severe those impacts could be. SafetyAnalyst then aggregates the tree with adjustable weights to compute a "harmfulness score."

**Why It Matters:** This makes AI safety decisions human-understandable and allows them to be tailored to specific rules or community values in a transparent way. Our tests show that SafetyAnalyst identifies unsafe AI prompts more effectively than existing systems, making it a strong tool for building safer, more trustworthy AI that better aligns with human values.
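To make the weighted aggregation idea concrete, here is a minimal illustrative sketch, not the paper's actual implementation: it assumes a harm-benefit tree represented as a flat list of per-stakeholder effects, each scored by likelihood and severity, and combines them with adjustable harm/benefit weights into a single harmfulness score. All class names, fields, and the scoring formula below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    """One harmful or beneficial consequence for a stakeholder (illustrative only)."""
    stakeholder: str
    description: str
    likelihood: float  # 0..1, how likely the effect is
    severity: float    # 0..1, how severe (for harms) or valuable (for benefits)
    is_harm: bool      # True for harms, False for benefits


def harmfulness_score(effects: list[Effect],
                      harm_weight: float = 1.0,
                      benefit_weight: float = 1.0) -> float:
    """Aggregate harm-benefit tree leaves into one score.

    Adjustable weights let a deployer emphasize harms over benefits
    (or vice versa) to reflect application-specific safety policies.
    """
    harms = sum(e.likelihood * e.severity for e in effects if e.is_harm)
    benefits = sum(e.likelihood * e.severity for e in effects if not e.is_harm)
    return harm_weight * harms - benefit_weight * benefits


# Example: score a hypothetical prompt asking how to bypass a software license check.
tree = [
    Effect("software vendor", "revenue loss from piracy", 0.6, 0.5, True),
    Effect("end user", "legal exposure", 0.4, 0.6, True),
    Effect("end user", "learns how licensing mechanisms work", 0.8, 0.1, False),
]
score = harmfulness_score(tree, harm_weight=1.5, benefit_weight=1.0)
print(f"harmfulness score: {score:.2f}")  # higher score -> more likely to flag or refuse
```

Raising `harm_weight` relative to `benefit_weight` (or vice versa) is what makes this kind of moderation steerable: the same harm-benefit tree can yield different decisions under different community or application policies.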