Poster
TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
Xingcheng Zhou · Konstantinos Larintzakis · Hao Guo · Walter Zimmer · Mingyu Liu · Hu Cao · Jiajie Zhang · Venkatnarayanan Lakshminarasimhan · Leah Strand · Alois Knoll
West Exhibition Hall B2-B3 #W-203
We present TUMTraf VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraf VideoQA unifies three essential tasks—multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding—within a cohesive evaluation framework. We further introduce the TraffiX-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset’s complexity, highlight the limitations of existing models, and position TUMTraf VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
Understanding complex traffic scenes is essential for developing intelligent transportation systems. Yet, most existing AI benchmarks focus on either simple driving environments or isolated tasks, limiting progress in real-world applications. To address this gap, we introduce TUMTraf VideoQA, a new dataset featuring 1,000 real-world roadside videos and over 85,000 multiple-choice questions. It includes detailed annotations for describing and locating objects over time and uniquely combines three tasks: video question answering, referred object captioning, and spatio-temporal grounding within one benchmark. We also present TraffiX-Qwen, a strong baseline that performs well on these tasks and reveals key limitations of current models, particularly in fine-grained spatio-temporal reasoning. TUMTraf VideoQA provides a challenging, unified benchmark to drive next-generation models in traffic understanding and intelligent transportation systems.