Oral Sessions

Oral 1C Applications in Computer Vision

West Ballroom B

Moderator: Bohyung Han

Tue 15 Jul 10 a.m. PDT — 11 a.m. PDT

Tue 15 July 10:00 - 10:15 PDT

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei · Xiaoran Liu · Yuhang Zang · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Jian Tong · Haodong Duan · Qipeng Guo · Jiaqi Wang · Xipeng Qiu · Dahua Lin

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code and model weights will be publicly released.
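
To make the abstract's design choices concrete, here is a minimal, hedged sketch of a 3D rotary embedding in PyTorch. It is not the released VideoRoPE implementation: the channel split, the `delta` temporal-spacing parameter, and the function name are illustrative assumptions, and the diagonal spatial layout is omitted for brevity. The sketch only shows how the lowest-frequency rotary channels can be driven by the frame index while the remaining channels encode height and width.

```python
import torch

def video_rope_3d(x, t_idx, h_idx, w_idx, base=10000.0, delta=2.0):
    """Rotate channel pairs of x (N, D) by axis-specific angles.

    The D/2 rotary frequencies are computed once and partitioned so that the
    slowest-rotating (lowest-frequency) channels follow the temporal index,
    while the rest are split between height and width. `delta` is a
    hypothetical temporal-spacing scale, not the paper's exact formulation.
    """
    N, D = x.shape
    half = D // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    d_t = half // 2                          # half of the channel pairs -> time
    d_h = (half - d_t) // 2                  # remaining pairs split over h and w
    d_w = half - d_t - d_h
    # inv_freq is ordered high -> low frequency, so the lowest frequencies sit at the end
    freq_t, freq_h, freq_w = inv_freq[-d_t:], inv_freq[:d_h], inv_freq[d_h:d_h + d_w]
    angles = torch.cat([
        (t_idx[:, None].float() * delta) * freq_t[None, :],
        h_idx[:, None].float() * freq_h[None, :],
        w_idx[:, None].float() * freq_w[None, :],
    ], dim=-1)                               # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

In this toy form, stretching `delta` spreads temporal angles out relative to spatial ones, which is one simple way to decouple the two kinds of indexing the abstract refers to.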

Tue 15 July 10:15 - 10:30 PDT

ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Shuting He · Guangquan Jie · Changshuo Wang · Yun Zhou · Shuming Hu · Guanbin Li · Henghui Ding

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.
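
As a rough illustration of what "referring segmentation over Gaussians" means operationally, the sketch below scores per-Gaussian feature vectors against an encoded referring expression and thresholds the result. It is a toy example, not ReferSplat itself: the tensor names, the cosine-similarity scoring, and the threshold value are assumptions made only for this sketch.

```python
import torch
import torch.nn.functional as F

def refer_segment(gaussian_feats: torch.Tensor,   # (N, C) per-Gaussian language-aligned features
                  text_embed: torch.Tensor,       # (C,) encoded referring expression
                  threshold: float = 0.3) -> torch.Tensor:
    """Return a boolean mask selecting the Gaussians that match the expression."""
    sim = F.cosine_similarity(gaussian_feats, text_embed[None, :], dim=-1)  # (N,)
    return sim > threshold
```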

Tue 15 July 10:30 - 10:45 PDT

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

Zhiyuan Yan · Jiangming Wang · Peng Jin · Ke-Yue Zhang · Chengchun Liu · Shen Chen · Taiping Yao · Shouhong Ding · Baoyuan Wu · Li Yuan

Detecting AI-generated images (AIGIs), such as natural images or face images, has become increasingly important yet challenging. In this paper, we start from a new perspective to excavate the reason behind the generalization failure in AIGI detection, which we term the asymmetry phenomenon: a naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-rank, which severely limits expressivity and generalization. One potential remedy is incorporating the pre-trained knowledge within vision foundation models (higher-ranked) to expand the feature space, alleviating the model's overfitting to fake patterns. To this end, we employ Singular Value Decomposition (SVD) to decompose the original feature space into two orthogonal subspaces. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning fake patterns. Compared to existing full-parameter and LoRA-based tuning methods, we explicitly ensure orthogonality, enabling a higher rank of the whole feature space, effectively minimizing overfitting and enhancing generalization. We finally identify a crucial insight: our method implicitly learns a vital prior that fakes are derived from the real, indicating a hierarchical relationship rather than independence. Modeling this prior, we believe, is essential for achieving superior generalization. Our code is publicly available at https://github.com/YZY-stack/Effort-AIGI-Detection.
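
A hedged sketch of the general idea described above, not the released code: decompose a pre-trained linear weight with SVD, keep the top-k principal components frozen, and expose only the residual components as trainable parameters. The class name and the choice of k are assumptions for this example, and the paper's explicit orthogonality constraint during training is not reproduced here; the two subspaces are merely initialized to be orthogonal.

```python
import torch
import torch.nn as nn

class OrthogonalSubspaceLinear(nn.Module):
    """Freeze the principal SVD subspace of a pre-trained weight; train the rest."""

    def __init__(self, pretrained_weight: torch.Tensor, k: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        # Frozen principal part: the top-k singular directions carry the
        # pre-trained knowledge and are stored as a non-trainable buffer.
        self.register_buffer("W_principal", U[:, :k] @ torch.diag(S[:k]) @ Vh[:k])
        # Trainable residual part: the remaining directions are free to adapt
        # to the (limited) fake patterns seen during fine-tuning.
        self.U_res = nn.Parameter(U[:, k:].clone())
        self.S_res = nn.Parameter(S[k:].clone())
        self.Vh_res = nn.Parameter(Vh[k:].clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W_res = self.U_res @ torch.diag(self.S_res) @ self.Vh_res
        return x @ (self.W_principal + W_res).T
```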

Tue 15 July 10:45 - 11:00 PDT

VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Hila Chefer · Uriel Singer · Amit Zohar · Yuval Kirstain · Adam Polyak · Yaniv Taigman · Lior Wolf · Shelly Sheynin

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
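
To ground the training-side idea, here is a minimal sketch of a joint appearance-motion objective, assuming motion is represented as optical flow and that a shared backbone feeds two separate prediction heads. The module names, the MSE losses, and the weighting are illustrative assumptions rather than the authors' recipe, and the Inner-Guidance inference mechanism is not shown.

```python
import torch.nn as nn
import torch.nn.functional as F

class JointAppearanceMotionLoss(nn.Module):
    """Predict pixels and motion from one shared representation and sum the losses."""

    def __init__(self, backbone: nn.Module, pixel_head: nn.Module,
                 motion_head: nn.Module, motion_weight: float = 1.0):
        super().__init__()
        self.backbone = backbone
        self.pixel_head = pixel_head
        self.motion_head = motion_head
        self.motion_weight = motion_weight

    def forward(self, noisy_video, target_pixels, target_flow):
        h = self.backbone(noisy_video)                    # single shared representation
        loss_pixels = F.mse_loss(self.pixel_head(h), target_pixels)
        loss_motion = F.mse_loss(self.motion_head(h), target_flow)
        return loss_pixels + self.motion_weight * loss_motion
```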