

Poster in Workshop: Workshop on Computer Use Agents

EARL: Early Intent Recognition in GUI Tasks Using Theory of Mind

Shraddha Vijay Pawar · Balavarun Pedapudi · Pramod Kaushik · Sarath Sivaprasad · Mario Fritz · Shirish Karande


Abstract:

Understanding user intent is essential for building better human-interaction agents, as it enables personalization, co-creation, and contextual adaptation. However, existing approaches are restricted to text environments, rely on human annotation, or merely predict future user actions without reasoning explicitly about user goals. In this work, we introduce EARL (Early Action Reasoning for Latent intent), a theory-of-mind-inspired inference-time algorithm that models user intent as an inverse planning problem, inferring latent goals from observed user actions. EARL hypothesizes potential user intent at multiple stages during task execution, enabling timely intervention and personalization. Evaluating on three diverse benchmarks (Mind2Web, AiTz, and VideoGUI) with two strong LLMs (Gemini-1.5-Pro and GPT-4o), we show that EARL consistently outperforms CoT-based LLM baselines at inferring user intent, especially under partial observations.
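To make the inverse-planning framing concrete, the following is a minimal toy sketch, not the EARL algorithm itself. It assumes a small, hand-written set of candidate goals, each with one hypothetical action plan, and scores each goal by how well its plan explains the actions observed so far (a uniform prior and a simple mismatch penalty). The names `infer_intent`, `GOAL_PLANS`, and `beta`, and all action labels, are illustrative assumptions that do not come from the paper.

```python
import math

# Hypothetical candidate goals, each with one assumed action plan.
GOAL_PLANS = {
    "book a flight": ["open_site", "click_flights", "type_destination", "click_search"],
    "book a hotel": ["open_site", "click_hotels", "type_destination", "click_search"],
    "check flight status": ["open_site", "click_flights", "click_status", "type_flight_number"],
}


def infer_intent(observed_actions, goal_plans, beta=2.0):
    """Return a posterior over goals given a prefix of observed actions.

    Toy Bayesian inverse planning: P(goal | actions) is proportional to
    P(actions | goal) * P(goal), with a uniform prior and a log-likelihood
    of -beta for every observed action that deviates from the goal's plan.
    """
    log_scores = {}
    for goal, plan in goal_plans.items():
        mismatches = sum(
            1
            for step, action in enumerate(observed_actions)
            if step >= len(plan) or plan[step] != action
        )
        log_scores[goal] = -beta * mismatches

    # Normalize with the log-sum-exp trick for numerical stability.
    max_log = max(log_scores.values())
    total = sum(math.exp(s - max_log) for s in log_scores.values())
    return {g: math.exp(s - max_log) / total for g, s in log_scores.items()}


if __name__ == "__main__":
    trajectory = ["open_site", "click_flights", "type_destination"]
    # Re-estimate intent after each new action, i.e. under partial observation.
    for t in range(1, len(trajectory) + 1):
        posterior = infer_intent(trajectory[:t], GOAL_PLANS)
        best = max(posterior, key=posterior.get)
        print(t, best, round(posterior[best], 3))
```

In this sketch the posterior is uniform after the first action, because every plan starts the same way, and it concentrates on a single goal as more actions are observed. This mirrors the idea of hypothesizing intent at multiple stages of a task. A system like the one described would presumably generate and score hypotheses with an LLM rather than a fixed plan table.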
