Track: Oral 1A Alignment and Agents

Tue 15 July 10:00 - 10:15 PDT

Multi-agent Architecture Search via Agentic Supernet

Guibin Zhang · Luyang Niu · Junfeng Fang · Kun Wang · LEI BAI · Xiang Wang

Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce \textbf{MaAS}, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\\sim45\\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\\%\sim11.82\\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.

Tue 15 July 10:15 - 10:30 PDT

Training a Generally Curious Agent

Fahim Tajwar · Yiding Jiang · Abitha Thankaraj · Sumaita Rahman · Zico Kolter · Jeff Schneider · Russ Salakhutdinov

Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.

Tue 15 July 10:30 - 10:45 PDT

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley · Daniel Tan · Niels Warncke · Anna Sztyber-Betley · Xuchan Bao · Martín Soto · Nathan Labenz · Owain Evans

We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., security education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?

Tue 15 July 10:45 - 11:00 PDT

Outstanding Paper

CollabLLM: From Passive Responders to Active Collaborators

Shirley Wu · Michel Galley · Baolin Peng · Hao Cheng · Gavin Li · Yao Dou · Weixin Cai · James Zou · Jure Leskovec · Jianfeng Gao

Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responsesusing Multiturn-aware Rewards. By reinforcement fine-tuning these rewards, CollabLLM goes beyond responding to user requests, and actively uncovers user intent and offers insightful suggestions—a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks such as document creation. CollabLLM significantly outperforms our baselines with averages of 18.5% higher task performance and 46.3% improved interactivity by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17.6% and reduces user spent time by 10.4%.

Main Navigation

Oral Sessions

Oral 1A Alignment and Agents

West Exhibition Hall C

Multi-agent Architecture Search via Agentic Supernet

Training a Generally Curious Agent

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

CollabLLM: From Passive Responders to Active Collaborators