Spotlight Poster
Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
Chunhui Zhang · Zhongyu Ouyang · Kwonjoon Lee · Nakul Agarwal · Sean Houlihan · Soroush Vosoughi · Shao-Yuan Lo
East Exhibition Hall A-B #E-2612
Theory-of-mind (ToM) enables humans to infer mental states—such as beliefs, desires, and intentions—forming the foundation of social cognition. Existing computational ToM methods rely on structured workflows with ToM-specific priors or on deep model fine-tuning, but struggle to scale in multimodal environments. They remain trapped by multi-step planning complexity, failing to generalize as task demands increase. To overcome these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Meanwhile, weak-to-strong control specializes smaller LMs to refine ToM-specific likelihood estimation, transferring their ToM reasoning behavior to larger LMs (7B to 405B) for social and world knowledge integration. This synergistic approach enables scalability, aligning large-model inference with human mental states through Bayesian principles. Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments.
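To make the stepwise decomposition concrete, the sketch below shows a generic sequential Bayesian update over candidate mental states. This is an illustrative toy, not the authors' implementation: here the prior stands in for a large LM's world-knowledge estimate, and `likelihood` stands in for the small ToM-tuned LM's scoring of each observation; both names are hypothetical.

```python
# Toy sketch of stepwise Bayesian belief updating over candidate mental
# states (illustrative only; not the paper's actual pipeline).
# Assumption: `prior` plays the role of the large LM's world-knowledge
# prior, and `likelihood(state, obs)` the small LM's ToM-tuned score.

def bayesian_step(belief, obs, likelihood):
    """One update: posterior(state) ∝ likelihood(obs | state) × prior(state)."""
    unnorm = {s: likelihood(s, obs) * p for s, p in belief.items()}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

def infer_mental_state(prior, observations, likelihood):
    """Chain simple Bayesian updates over a sequence of observations,
    rather than solving the full multi-step inference in one shot."""
    belief = dict(prior)
    for obs in observations:
        belief = bayesian_step(belief, obs, likelihood)
    return belief

# Usage: two candidate goals; observations repeatedly favor "cabinet".
prior = {"basket": 0.5, "cabinet": 0.5}

def toy_likelihood(state, obs):
    # Hypothetical likelihood: observations matching the goal score higher.
    return 0.9 if obs == state else 0.1

belief = infer_mental_state(prior, ["cabinet", "cabinet"], toy_likelihood)
```

The design point illustrated is that each step is a local, normalized update, so the cost of inference grows linearly with the number of observations instead of compounding with planning depth.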
Humans understand others' thoughts and intentions by watching what they do and where they do it—whether someone reaches into a basket or walks toward a cabinet. Equipping machines with this "theory-of-mind" skill, however, is challenging. Current AI methods either struggle to reason through multiple steps or require costly retraining for each new situation. In our work, we present a two-part solution. The first part breaks down complex reasoning into simple, step-by-step updates. The second part uses a small, task-tuned model to gently guide a larger, world-knowledge model during inference. This combination keeps the system both lightweight and broadly informed, so it remains highly accurate even as tasks become more complex. In simulated environments that combine video and text, our approach outperformed previous methods. This creates a path for AI to better understand human goals and beliefs in the real world.