

Poster in Affinity Workshop: New In ML

Observe-Then-Think: A Reinforcement Learning Framework for Structured Multimodal Understanding

Linrui Xu · Zhanke Zhou · Chentao Cao · Beichen Yu · Haifeng Li · Bo Han


Abstract:

Vision–language models (VLMs) have achieved impressive results in fusing visual and textual inputs, yet they often stumble on tasks demanding complex multimodal reasoning. This imbalance arises from the inherent separation between perception (accurately interpreting sensory data) and reasoning (conducting multi-step, symbolic inference). To bridge this gap, we introduce a novel framework for multimodal reasoning, OTT (Observe-Then-Think), which comprises a two-stage post-training process: supervised fine-tuning (SFT) followed by reinforcement learning (RL). During SFT, the model learns to decouple perceptual understanding from logical inference, mastering structured output formats and ensuring internal coherence. In the RL stage, our Perception-Guided Consistency Optimization (PGCO) algorithm, inspired by human cognition, applies perception-specific rewards to refine visual accuracy and a consistency reward to synchronize reasoning steps with final answers, eliminating logical contradictions without external tool support. Extensive evaluations across seven challenging benchmarks demonstrate that our method consistently outperforms state-of-the-art baselines, delivering both stronger perceptual grounding and more reliable multimodal reasoning in VLMs.
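
To make the reward design concrete, the sketch below shows one plausible way a PGCO-style scalar reward could combine a perception term with a reasoning-answer consistency term. It is a minimal illustration only, not the authors' implementation: the tag names (<observe>, <think>, <answer>), the gold_facts annotations, and the equal weighting are all assumptions, since the abstract does not specify the output format or reward shaping.

    # Hypothetical sketch of a PGCO-style reward. Assumes the model emits a
    # structured output with an observation block, a reasoning block, and a
    # final answer; all names and weights here are illustrative assumptions.
    import re

    def pgco_reward(output: str, gold_facts: set[str], gold_answer: str,
                    w_perc: float = 0.5, w_cons: float = 0.5) -> float:
        """Combine a perception reward with a consistency reward."""
        # Perception reward: fraction of annotated visual facts that appear
        # in the observation block (a crude proxy for visual accuracy).
        obs = re.search(r"<observe>(.*?)</observe>", output, re.S)
        obs_text = obs.group(1) if obs else ""
        r_perc = sum(f in obs_text for f in gold_facts) / max(len(gold_facts), 1)

        # Consistency reward: the final answer must match the gold answer and
        # also appear as the conclusion of the reasoning block, so that the
        # reasoning steps and the answer cannot contradict each other.
        think = re.search(r"<think>(.*?)</think>", output, re.S)
        ans = re.search(r"<answer>(.*?)</answer>", output, re.S)
        r_cons = 0.0
        if think and ans:
            answer = ans.group(1).strip()
            r_cons = float(answer == gold_answer and answer in think.group(1))

        return w_perc * r_perc + w_cons * r_cons

In a policy-optimization loop this scalar would score each sampled rollout; the actual PGCO objective, reward terms, and weighting used in the paper are not described on this page.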
