

Poster in Workshop: Scaling Up Intervention Models

An Object-Attribute Decoupled Approach for Learning Disentangled Representation for Image and Video Analysis

Sanket Sanjaykumar Gandhi · Atul · Samanyu Mahajan · Rushil Gupta · Vishal Sharma · Arnab Mondal · Rohan Paul · Parag Singla


Abstract:

Learning disentangled representations of images and videos in terms of objects and their attributes, without explicit supervision, is an important but challenging task. Recent work \cite{nsb} extends slot-based techniques for object discovery by decomposing slots into blocks, where each block is expressed as a linear combination of a fixed number of learnable concepts. At its core, this approach couples object and attribute discovery, assuming that image encoders innately learn disentangled features, an assumption that our experiments show does not always hold. We propose DeCoupler, a method that separates object discovery from attribute discovery by first using foundation models to extract object masks, and then learning block representations that capture attributes across objects. This leads to improved disentanglement, enabling tasks such as attribute-level interventions and dynamics prediction. We demonstrate these capabilities through experiments on five image and two video datasets, showing superior disentanglement and generalization over prior methods.
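
The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes object masks are already available from some external segmentation model, that per-object features come from averaging a feature map inside each mask, and that the combination weights over concepts come from a softmax over dot products. The abstract only says that each block is a linear combination of learnable concepts, so the weighting scheme and all function names (`masked_pool`, `blocks_from_concepts`, `swap_block`) are hypothetical.

```python
import numpy as np

def masked_pool(features, masks):
    """Average a (H, W, D) feature map inside each of K binary masks -> (K, D)."""
    m = masks.astype(features.dtype)                  # (K, H, W)
    area = np.maximum(m.sum(axis=(1, 2)), 1.0)        # guard against empty masks
    return np.einsum("khw,hwd->kd", m, features) / area[:, None]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)           # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blocks_from_concepts(obj_feats, concepts):
    """Express each block as a convex combination of that block's concepts.

    obj_feats: (K, B*d) per-object features, split into B chunks of size d.
    concepts:  (B, C, d) concept vectors (learned in practice, random here).
    Returns blocks (K, B, d) and combination weights (K, B, C).
    """
    B, C, d = concepts.shape
    K = obj_feats.shape[0]
    queries = obj_feats.reshape(K, B, d)
    # Assumed weighting: scaled dot-product similarity followed by softmax.
    logits = np.einsum("kbd,bcd->kbc", queries, concepts) / np.sqrt(d)
    weights = softmax(logits, axis=-1)
    blocks = np.einsum("kbc,bcd->kbd", weights, concepts)
    return blocks, weights

def swap_block(blocks, i, j, b):
    """Attribute-level intervention: exchange block b between objects i and j."""
    out = blocks.copy()
    out[i, b], out[j, b] = blocks[j, b], blocks[i, b]
    return out

# Toy data: an 8x8 feature map, two objects (top and bottom halves).
rng = np.random.default_rng(0)
H, W, B, C, d = 8, 8, 3, 4, 5
features = rng.normal(size=(H, W, B * d))
masks = np.zeros((2, H, W), dtype=bool)
masks[0, :4, :] = True
masks[1, 4:, :] = True
concepts = rng.normal(size=(B, C, d))

obj_feats = masked_pool(features, masks)              # stage 1: objects
blocks, weights = blocks_from_concepts(obj_feats, concepts)  # stage 2: attributes
swapped = swap_block(blocks, 0, 1, b=1)
```

In this sketch, swapping a single block changes one attribute slot of each object while leaving the other blocks untouched, which is the kind of attribute-level intervention the abstract refers to.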
