Poster in Workshop: Actionable Interpretability
Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
Christy Li · Josep Camuñas · Jake Touchet · Jacob Andreas · Agata Lapedriza · Antonio Torralba · Tamar Rott Shaham
When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended use of specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding reliance on spurious correlations. We introduce an automated framework for detecting these dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about the unintended visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. If inconsistencies are detected, the agent reflects on its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, yielding a significant gain over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.
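To make the iterative protocol concrete, the following is a minimal Python sketch of a hypothesize-test-reflect loop of this kind. It is an illustration under stated assumptions, not the paper's implementation: the callables `propose_hypotheses`, `run_probe`, and `self_evaluate`, and the `Hypothesis` container, are hypothetical placeholders for the agent's components.

```python
"""Minimal sketch of a hypothesize-test-reflect loop.

All names (Hypothesis, propose_hypotheses, run_probe, self_evaluate)
are hypothetical stand-ins, not the paper's actual API.
"""
from dataclasses import dataclass


@dataclass
class Hypothesis:
    attribute: str          # e.g. "background color" or "watermark text"
    evidence: float = 0.0   # probe-based support for the hypothesis


def detect_attribute_reliance(model, propose_hypotheses, run_probe,
                              self_evaluate, max_cycles=5, threshold=0.8):
    """Iteratively hypothesize, test, and self-reflect until the findings
    explain the model's behavior or the cycle budget is exhausted."""
    history = []  # accumulated hypotheses and probe outcomes fed back to the agent
    for cycle in range(max_cycles):
        # 1. Propose candidate visual attributes the model may rely on,
        #    conditioned on everything learned in earlier cycles.
        hypotheses = propose_hypotheses(model, history)

        # 2. Test each hypothesis with targeted probes
        #    (e.g. images where the attribute is added, removed, or varied).
        for h in hypotheses:
            h.evidence = run_probe(model, h)
        history.extend(hypotheses)

        # 3. Self-evaluation: do the well-supported findings accurately
        #    explain the model's observed behavior?
        findings = [h for h in history if h.evidence >= threshold]
        if self_evaluate(model, findings):
            return findings

        # 4. Inconsistencies detected: reflect and begin another cycle,
        #    carrying the accumulated history forward.
    return [h for h in history if h.evidence >= threshold]
```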