

Poster

DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making

Ziru Wang · Mengmeng Wang · Jade Dai · Teli Ma · Guo-Jun Qi · Yong Liu · Guang Dai · Jingdong Wang

West Exhibition Hall B2-B3 #W-407
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as a dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive experiments demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.
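To make the frame-scoring idea concrete, here is a minimal sketch assuming PyTorch-style frame and instruction features. The function names (`score_frames`, `select_keyframes`), the fixed weights `w_sem`/`w_sal`, and the feature-difference saliency proxy are illustrative assumptions, not the paper's learned, adaptive FrameScorer.

```python
import torch
import torch.nn.functional as F

def score_frames(frame_feats, instr_feat, w_sem=0.5, w_sal=0.5):
    """Assign an importance score to each frame.

    frame_feats: (T, D) per-frame visual features
    instr_feat:  (D,)   embedding of the language instruction
    Combines semantic consistency with the instruction and a simple
    visual-saliency proxy (feature change between consecutive frames).
    """
    # Semantic consistency: cosine similarity to the instruction embedding.
    sem = F.cosine_similarity(frame_feats, instr_feat.unsqueeze(0), dim=-1)

    # Saliency proxy: how much each frame differs from its predecessor.
    sal = torch.zeros(frame_feats.size(0))
    sal[1:] = (frame_feats[1:] - frame_feats[:-1]).norm(dim=-1)
    sal = sal / (sal.max() + 1e-8)  # normalize to [0, 1]

    return w_sem * sem + w_sal * sal

def select_keyframes(frame_feats, instr_feat, k=8):
    """Keep the k highest-scoring frames as a compact dynamic representation."""
    scores = score_frames(frame_feats, instr_feat)
    idx = scores.topk(min(k, len(scores))).indices.sort().values  # keep temporal order
    return frame_feats[idx], idx
```

In DynaMind the scoring is learned rather than fixed; this heuristic only illustrates how semantic consistency and saliency can be folded into per-frame importance scores that filter redundant content into a compact set of keyframes.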

Lay Summary:

Robots often need to interpret visual scenes and follow language instructions at the same time. However, videos are full of information, while language commands are brief, making it hard for robots to decide what matters most.

To address this, we introduce DynaMind, a new method that helps robots “pick out the important parts” of a video. Much like how people skip the repetitive parts of a tutorial and focus on the key steps, DynaMind automatically assigns importance scores to video frames based on their visual relevance and alignment with the instruction. It then creates a compact summary of the video’s essential moments. Using this summary, the robot predicts what might happen next and decides its next move accordingly.

Our experiments show that DynaMind not only performs well in simulated tasks but also works reliably in real-world settings. This work takes a step toward more intelligent robots by turning raw video into dynamic summaries that inform effective and context-aware actions.
