

Poster

Spatial Reasoning with Denoising Models

Christopher Wewer · Bartlomiej Pogodzinski · Bernt Schiele · Jan Eric Lenssen

East Exhibition Hall A-B #E-3412
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

We introduce Spatial Reasoning Models (SRMs), a framework for reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations of a set of unobserved variables, given observations of observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination when the target distribution is complex. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework lets us report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies used during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we increase the accuracy of specific reasoning tasks from <1% to >50%. Our project website provides additional videos, code, and the benchmark datasets.

Lay Summary:

Today’s image-generating AI can produce beautiful art, but it often hallucinates details that don’t make sense. It might invent objects that cannot exist or that do not follow simple logic. These mistakes become especially clear when we ask models to solve visual logic puzzles, like Sudoku. Most models fail completely, because they try to fill in everything at once, without thinking through what’s plausible or consistent. In our work, we show that they can do much better by reasoning step by step: deciding which part of the image to complete first. We introduce a new method called Spatial Reasoning Models (SRMs) that treats image regions like puzzle pieces and learns a smart order in which to fill them in, even when many clues are missing. We tested this on visual versions of hard Sudoku puzzles and found that regular image generative models get nearly everything wrong. But when we let the model decide the order based on uncertainty, filling in the most obvious parts first, it solves over half of the puzzles correctly. Our method turns artistic generators into better reasoners, and we believe this can help future AI systems that need to construct complex information, such as physics simulations, 3D scenes, or medical data.
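
The idea of filling in the most obvious parts first can be illustrated with a toy example. The sketch below is not the SRM method and involves no denoising network. It assumes a 4x4 Sudoku grid and uses the number of values still allowed in a cell as a hand-made stand-in for the uncertainty that a model would predict. The names `candidates` and `fill_most_certain_first` are hypothetical and chosen only for this illustration.

```python
import math
from typing import List, Set, Tuple

Grid = List[List[int]]  # 0 marks an unobserved cell


def candidates(grid: Grid, r: int, c: int) -> Set[int]:
    """Values still consistent with the row, column, and box of cell (r, c)."""
    n = len(grid)
    b = math.isqrt(n)  # box side length, e.g. 2 for a 4x4 grid
    used = set(grid[r]) | {grid[i][c] for i in range(n)}
    r0, c0 = (r // b) * b, (c // b) * b
    used |= {grid[i][j] for i in range(r0, r0 + b) for j in range(c0, c0 + b)}
    return set(range(1, n + 1)) - used


def fill_most_certain_first(grid: Grid) -> Tuple[Grid, List[Tuple[int, int]]]:
    """Greedily fill the least uncertain empty cell until the grid is full.

    Uncertainty proxy (an assumption of this sketch): the number of
    remaining candidate values. There is no backtracking, so the function
    raises if the most certain cell is still ambiguous or has no candidates.
    """
    grid = [row[:] for row in grid]  # work on a copy, leave the input intact
    n = len(grid)
    order: List[Tuple[int, int]] = []
    while True:
        empty = [(r, c) for r in range(n) for c in range(n) if grid[r][c] == 0]
        if not empty:
            return grid, order
        # Pick the empty cell with the fewest remaining candidates.
        r, c = min(empty, key=lambda rc: len(candidates(grid, *rc)))
        options = candidates(grid, r, c)
        if len(options) != 1:
            raise ValueError(f"cell {(r, c)} is not determined: {sorted(options)}")
        grid[r][c] = options.pop()
        order.append((r, c))


puzzle = [
    [1, 0, 0, 4],
    [0, 4, 0, 0],
    [2, 0, 0, 3],
    [0, 3, 2, 0],
]
solved, order = fill_most_certain_first(puzzle)
for row in solved:
    print(row)
print("fill order:", order)
```

In this puzzle, cell (1, 2) initially allows two values and only becomes determined after its neighbours are filled, which is why the order matters even in such a small example. A learned model would replace the candidate count with its own predicted uncertainty per region.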
