Oral in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
From Tasks to Teams: A Risk-First Evaluation Framework for Multi-Agent LLM Systems in Finance
Zichen Chen · Jianda Chen · Jiaao Chen · Misha Sra
Keywords: [ multi-agent LLMs ] [ risk auditing ] [ financial AI ] [ safety-aware evaluation ]
Current financial benchmarks reward large language models (LLMs) for task accuracy and portfolio return, yet remain blind to the risks that emerge once several agents cooperate, share tools, and act on real money. We present M-SAEA, a Multi-agent, Safety-Aware Evaluation Agent that audits an entire team of LLM agents without fine-tuning. M-SAEA issues ten probes spanning four layers (model, workflow, interaction, system) and returns a continuous [0, 100] risk vector plus a natural-language rationale. Across three high-impact task clusters (finance management, webshop automation, transactional services) and six popular models, M-SAEA (i) detects the most unsafe trajectories while raising false alarms on only a small number of safe ones; (ii) exposes latent hazards (temporal staleness, cross-agent race conditions, API-stress fragility) that leaderboard metrics never flag; and (iii) produces actionable, fine-grained scores that allow practitioners to trade off latency and safety before deployment. By turning safety into a measurable, model-agnostic quantity, M-SAEA shifts the evaluation focus from tasks to teams and provides a ready-to-use template for risk-first assessment of agentic AI in finance and beyond.
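To make the output format concrete, here is a minimal sketch of the kind of layered risk report the abstract describes: per-probe scores in [0, 100] with natural-language rationales, aggregated into a per-layer risk vector. The probe names, the `ProbeResult` structure, and mean aggregation are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

# The four audit layers named in the abstract.
LAYERS = ("model", "workflow", "interaction", "system")

@dataclass
class ProbeResult:
    """One probe's verdict: a hypothetical record, not the paper's schema."""
    name: str        # e.g. "temporal_staleness" (assumed probe name)
    layer: str       # one of LAYERS
    score: float     # risk in [0, 100]; higher means riskier
    rationale: str   # natural-language explanation

def risk_vector(results: list[ProbeResult]) -> dict[str, float]:
    """Collapse per-probe scores into a per-layer risk vector.

    Uses a simple per-layer mean (an assumption); layers with no
    probes default to 0.0 risk.
    """
    by_layer: dict[str, list[float]] = {layer: [] for layer in LAYERS}
    for r in results:
        by_layer[r.layer].append(r.score)
    return {
        layer: (sum(scores) / len(scores) if scores else 0.0)
        for layer, scores in by_layer.items()
    }

# Illustrative probes based on hazards the abstract mentions.
report = risk_vector([
    ProbeResult("temporal_staleness", "workflow", 80.0,
                "Agent acted on a quote older than the refresh window."),
    ProbeResult("cross_agent_race", "interaction", 65.0,
                "Two agents wrote to the shared order book concurrently."),
    ProbeResult("api_stress", "system", 55.0,
                "Tool calls degraded sharply under simulated API load."),
])
```

A practitioner could then threshold this vector per layer (e.g. block deployment if any layer exceeds a chosen risk budget), which is the "trade off latency and safety before deployment" use the abstract points to.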