

Spotlight Poster

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei · Xiaoran Liu · Yuhang Zang · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Jian Tong · Haodong Duan · Qipeng Guo · Jiaqi Wang · Xipeng Qiu · Dahua Lin

West Exhibition Hall B2-B3 #W-204
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT
 
Oral presentation: Oral 1C Applications in Computer Vision
Tue 15 Jul 10 a.m. PDT — 11 a.m. PDT

Abstract:

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code and model weights will be publicly released.
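
To make the three design ideas named in the abstract more concrete, below is a minimal, illustrative Python sketch of how one might assign 3D (temporal, x, y) position indices to video tokens with an adjustable temporal spacing factor and a diagonal spatial offset, and reserve the lowest-frequency rotary components for the temporal axis. The function names, the `delta` parameter, and the exact frequency split are assumptions for exposition only, not the released VideoRoPE implementation.

import numpy as np

def video_position_ids(num_frames, height, width, delta=2.0, text_offset=0):
    """Return an array of shape (num_frames*height*width, 3) holding an
    illustrative (t, x, y) index for every visual token (assumed layout)."""
    positions = []
    for f in range(num_frames):
        # Adjustable temporal spacing: the frame index is scaled by `delta`,
        # so temporal distances can be tuned independently of spatial ones.
        t = text_offset + delta * f
        for y in range(height):
            for x in range(width):
                # Diagonal layout: spatial indices are shifted by the temporal
                # index so each frame's centre tracks the temporal diagonal.
                positions.append((t, t + x - width // 2, t + y - height // 2))
    return np.array(positions, dtype=np.float32)

def rope_angles(positions, head_dim=64, base=10000.0):
    """Split the rotary frequencies into three groups and give the temporal
    axis the lowest-frequency ones, echoing the low-frequency temporal
    allocation described in the abstract (split sizes are an assumption)."""
    d = head_dim // 2                        # number of rotary frequencies
    inv_freq = base ** (-np.arange(d) / d)   # decreasing: high -> low frequency
    groups = np.array_split(np.argsort(inv_freq), 3)  # sorted low -> high freq
    t_idx, x_idx, y_idx = groups[0], groups[1], groups[2]
    angles = np.zeros((positions.shape[0], d), dtype=np.float32)
    angles[:, t_idx] = positions[:, :1] * inv_freq[t_idx]   # temporal axis
    angles[:, x_idx] = positions[:, 1:2] * inv_freq[x_idx]  # horizontal axis
    angles[:, y_idx] = positions[:, 2:3] * inv_freq[y_idx]  # vertical axis
    return angles  # would later be turned into cos/sin and applied to q, k

pos = video_position_ids(num_frames=4, height=2, width=3)
print(pos.shape, rope_angles(pos).shape)  # (24, 3) (24, 32)

In this sketch, increasing `delta` stretches temporal distances without touching spatial ones, which is one way to decouple temporal and spatial indexing as the abstract describes.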

Lay Summary:

Videos have complex structures that make it hard for models to understand long sequences of information. Adapting earlier methods designed for one-dimensional data (like text) to video has been challenging because video varies across both space and time. Our research introduces a new method called VideoRoPE that improves how models handle video by treating time and space in a more effective way. We find that existing methods fail when distractors (unrelated elements) are added to video tasks, so we design VideoRoPE to reduce errors and handle these distractions better. This method outperforms older ones across various video-related tasks, such as searching for video clips and understanding scenes. Our approach helps machines understand videos better, making them smarter and more reliable.
