

Invited Talk
Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding

Invited Talk 2 (Shiry Ginosar: What Do Vision and Vision-Language Models Really Know About the World?)

Fri 18 Jul 10:40 a.m. PDT — 11:20 a.m. PDT

Abstract:

Title: What Do Vision and Vision-Language Models Really Know About the World?

Abstract: Large pretrained vision and vision-language models never see the world directly. Instead, they learn from human-made artifacts: images, videos, and drawings that depict the world through our chosen lenses, perspectives, and biases. These filtered glimpses shape what models come to know about the world. But what kind of understanding do they build from such curated input? And how can we tell? In this talk, I explore how we might evaluate the internal world models these systems construct. I propose a set of affordance-based criteria, drawing on Kenneth Craik's classic idea that a mental model should enable an organism to reason over prior experience, forecast future events, and generalize to novel situations. Using this lens, I will examine what pretrained models capture across a range of visual domains: human pose, visual summarization, visual forecasting, and visual analogy. Along the way, I will suggest methods for assessing these affordances, with an eye toward both understanding current models and guiding the development of more grounded ones.
