Oral in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
From Tasks to Teams: A Risk-First Evaluation Framework for Multi-Agent LLM Systems in Finance
Zichen Chen · Jianda Chen · Jiaao Chen · Misha Sra
Keywords: [ multi-agent LLMs ] [ risk auditing ] [ financial AI ] [ safety-aware evaluation ]
Current financial benchmarks reward large language models (LLMs) for task accuracy and portfolio return, yet remain blind to the risks that emerge once several agents cooperate, share tools, and act on real money. We present M-SAEA, a Multi-agent, Safety-Aware Evaluation Agent that audits an entire team of LLM agents without fine-tuning. M-SAEA issues ten probes spanning four layers (model, workflow, interaction, system) and returns a continuous [0, 100] risk vector plus a natural-language rationale. Across three high-impact task clusters (finance management, webshop automation, transactional services) and six popular models, M-SAEA (i) detects the most unsafe trajectories while raising false alarms on only a small number of safe ones; (ii) exposes latent hazards (temporal staleness, cross-agent race conditions, API-stress fragility) that leaderboard metrics never flag; and (iii) produces actionable, fine-grained scores that allow practitioners to trade off latency and safety before deployment. By turning safety into a measurable, model-agnostic quantity, M-SAEA shifts the evaluation focus from tasks to teams and provides a ready-to-use template for risk-first assessment of agentic AI in finance and beyond.
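To make the output format concrete, here is a minimal sketch of the kind of layered risk report the abstract describes: per-probe scores in [0, 100] with natural-language rationales, aggregated into a per-layer risk vector. The probe names, the `ProbeResult` structure, and mean aggregation are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

# The four audit layers named in the abstract.
LAYERS = ("model", "workflow", "interaction", "system")

@dataclass
class ProbeResult:
    """One probe's verdict: a hypothetical record, not the paper's schema."""
    name: str        # e.g. "temporal_staleness" (assumed probe name)
    layer: str       # one of LAYERS
    score: float     # risk in [0, 100]; higher means riskier
    rationale: str   # natural-language explanation

def risk_vector(results: list[ProbeResult]) -> dict[str, float]:
    """Collapse per-probe scores into a per-layer risk vector.

    Uses a simple per-layer mean (an assumption); layers with no
    probes default to 0.0 risk.
    """
    by_layer: dict[str, list[float]] = {layer: [] for layer in LAYERS}
    for r in results:
        by_layer[r.layer].append(r.score)
    return {
        layer: (sum(scores) / len(scores) if scores else 0.0)
        for layer, scores in by_layer.items()
    }

# Illustrative probes based on hazards the abstract mentions.
report = risk_vector([
    ProbeResult("temporal_staleness", "workflow", 80.0,
                "Agent acted on a quote older than the refresh window."),
    ProbeResult("cross_agent_race", "interaction", 65.0,
                "Two agents wrote to the shared order book concurrently."),
    ProbeResult("api_stress", "system", 55.0,
                "Tool calls degraded sharply under simulated API load."),
])
```

A practitioner could then threshold this vector per layer (e.g. block deployment if any layer exceeds a chosen risk budget), which is the "trade off latency and safety before deployment" use the abstract points to.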