Poster
in
Workshop: Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures

Are LLMs Generalist Hanabi Agents?

Mahesh Ramesh · Pavan Thodima · Aswinkumar Ramkumar · Kaousheik Jayakumar


Abstract:

The application of Large Language Models (LLMs) to complex reasoning tasks has shown significant promise in domains such as mathematics and coding. Recent efforts have extended these evaluations to interactive game environments. In this paper, we further this line of inquiry by evaluating 13 state-of-the-art (SoTA) LLMs on their ability to play the cooperative card game Hanabi. We conducted experiments across varied player settings (2, 3, 4, and 5 players). Our findings indicate a clear, though narrowing, disparity in the strategic and cooperative reasoning of reasoning models compared to non-reasoning LLMs, with performance that generalizes across the different player settings. We also evaluated Grok 3 mini beta with additional scaffolding to measure the extent to which SoTA LLM performance in Hanabi can be enhanced. To support future research, we are open-sourcing the complete set of LLM inputs and outputs from our 13-model evaluation, which can serve as a dataset for supervised fine-tuning (SFT). Additionally, we are publishing o4 mini's model-generated ratings for all candidate moves, offering a dataset for Reinforcement Learning from AI Feedback (RLAIF).