Poster
How Far Is Video Generation from World Model: A Physical Law Perspective
Bingyi Kang · Yang Yue · Rui Lu · Zhijie Lin · Yang Zhao · Kaixin Wang · Gao Huang · Jiashi Feng
East Exhibition Hall A-B #E-3207
Current video generation models are powerful and can produce high-fidelity videos, which raises the hope that they could one day serve as world models that simulate the future of the real world. However, it is unclear whether such models, trained solely by watching videos, truly learn the fundamental physical laws that govern it.

In this paper, we explore whether modern AI video models can discover and generalize basic physics principles simply by watching videos. We test their ability to predict object motion in three regimes: familiar cases (similar to the training data), completely unfamiliar situations, and new combinations of known elements.

Using simplified computer-generated videos of moving and colliding objects, we systematically train and evaluate video prediction models. Our findings show that these models handle familiar situations perfectly, and they perform reasonably well when combining known factors in new ways. However, they struggle significantly when faced with situations they have never encountered before.

Further investigation reveals two important insights. First, these models do not truly learn general physics principles; instead, they rely heavily on memorizing specific examples seen during training. Second, when faced with new situations, the models prioritize visual attributes in a fixed order: first color, then size, then speed, and finally shape.

Overall, our research highlights that simply training larger AI models with more videos is not enough. Creating AI systems that genuinely understand physics requires new approaches beyond just scaling up existing methods.
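To make the three evaluation regimes concrete, below is a minimal sketch of how training and test attribute combinations might be split. The attribute names, values, and the held-out combination are hypothetical illustrations, not the paper's actual configuration.

```python
# Minimal sketch of the three generalization regimes (hypothetical values).
import itertools
import random

train_speeds = [1.0, 2.0, 3.0]  # attribute values seen during training
train_sizes = [0.5, 1.0]

# Training set: the full attribute grid minus one held-out combination.
full_grid = list(itertools.product(train_speeds, train_sizes))
held_out_combo = (3.0, 0.5)
train_set = [c for c in full_grid if c != held_out_combo]

# In-distribution: an attribute combination seen during training.
in_dist_case = random.choice(train_set)

# Combinatorial: each attribute value appeared in training,
# but this exact combination never did.
combinatorial_case = held_out_combo

# Out-of-distribution: attribute values outside the training range entirely.
out_of_dist_case = (5.0, 2.0)
```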
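The synthetic data setup can likewise be sketched: the videos are rendered from simple ground-truth dynamics such as uniform motion and collisions. The toy simulator below, assuming a 1D elastic collision between two balls, illustrates the kind of trajectory data involved; it is not the paper's actual simulator, and all parameter names and values are illustrative.

```python
# Toy sketch (not the paper's pipeline): two balls on a 1D track that
# collide elastically, producing the kind of ground-truth trajectories
# such synthetic physics videos are rendered from.
import numpy as np

def simulate_collision(x1, x2, v1, v2, m1, m2, r, dt=0.02, steps=100):
    """Return a (steps, 2) array of the two balls' positions over time."""
    xs = np.empty((steps, 2))
    for t in range(steps):
        xs[t] = (x1, x2)
        # Elastic collision: exchange momenta per conservation of momentum
        # and kinetic energy when the balls overlap while approaching.
        if abs(x1 - x2) <= 2 * r and (v1 - v2) * (x1 - x2) < 0:
            v1, v2 = (
                ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2),
                ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2),
            )
        x1, x2 = x1 + v1 * dt, x2 + v2 * dt
    return xs

traj = simulate_collision(x1=0.0, x2=1.0, v1=2.0, v2=-1.0,
                          m1=1.0, m2=1.0, r=0.05)
print(traj.shape)  # (100, 2)
```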