

Poster

Teaching Physical Awareness to LLMs through Sounds

Weiguo Wang · Andy Nie · Wenrui Zhou · Yi Kai · Chengchen Hu

East Exhibition Hall A-B #E-2401
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness: an understanding of real-world physical phenomena. In this work, we present ACORN, a framework that teaches LLMs physical awareness through sound, focusing on fundamental physical phenomena such as the Doppler effect, the multipath effect, and spatial relationships. To overcome data scarcity, ACORN introduces a physics-based simulator that combines real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results on both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for LLMs that understand the physical world.
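To make the idea of a controlled physical channel concrete, the sketch below shows one plausible way to apply such a channel to a dry recording: a Doppler shift approximated by resampling, plus a single delayed, attenuated reflection as a minimal multipath model. The function name, parameter values, and use of SciPy resampling are illustrative assumptions, not the actual ACORN simulator.

import numpy as np
from scipy.signal import resample

def simulate_channel(source, fs, doppler_ratio=1.01,
                     echo_delay_s=0.012, echo_gain=0.4, direct_gain=1.0):
    """Apply a toy physical channel to a dry source recording.

    doppler_ratio > 1 models a source moving toward the listener
    (all frequencies scale up); the echo models one reflection
    (multipath). Names and values are illustrative only.
    """
    # Doppler effect: approximate a constant radial velocity by
    # resampling the waveform, which scales every frequency.
    n_out = int(round(len(source) / doppler_ratio))
    shifted = resample(source, n_out)

    # Multipath: direct path plus one delayed, attenuated reflection.
    delay = int(round(echo_delay_s * fs))
    out = np.zeros(len(shifted) + delay)
    out[:len(shifted)] += direct_gain * shifted
    out[delay:delay + len(shifted)] += echo_gain * shifted
    return out / np.max(np.abs(out))  # normalize to avoid clipping

# Example: a 1 s, 440 Hz tone passed through the toy channel.
fs = 16000
t = np.arange(fs) / fs
dry = np.sin(2 * np.pi * 440 * t)
wet = simulate_channel(dry, fs)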

Lay Summary:

Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, but they struggle with a fundamental challenge: understanding the physical world. In contrast, humans rely heavily on sound to perceive their surroundings: we can hear whether someone is nearby, whether a car is moving toward us, or whether a voice is coming from behind a wall. We present ACORN, a framework that teaches LLMs to understand the physical world through sound and to develop human-like hearing. Instead of collecting expensive real-world recordings, we simulate how sound travels through space, capturing physical effects like motion, echo, and direction, and generate a large dataset of sound-based questions and answers. We also design a new audio encoder that mimics how humans process both the intensity and timing of sound. Our results show that LLMs can learn to detect whether a voice is blocked by a wall, estimate motion via Doppler shifts, and locate where sounds are coming from. This opens the door to safer voice-controlled systems, smarter robots, and AI that responds more naturally to the real world, making these models not just smart, but also physically aware.
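As an illustration of processing both the intensity and the timing of sound, the sketch below extracts paired magnitude and phase spectrograms from a waveform, the kind of joint features an encoder could consume. The framing parameters and the plain-NumPy STFT are assumptions for illustration; the paper's actual encoder architecture is not reproduced here.

import numpy as np

def stft_mag_phase(signal, frame_len=512, hop=256):
    """Compute per-frame magnitude and phase spectra.

    Magnitude carries intensity; phase carries fine timing cues
    (e.g. delays that reveal direction). This is a generic sketch,
    not the ACORN encoder itself.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spec = np.fft.rfft(frames, axis=-1)   # complex spectrogram
    return np.abs(spec), np.angle(spec)   # magnitude, phase

# Example: features for 1 s of noise at 16 kHz.
fs = 16000
x = np.random.randn(fs)
mag, phase = stft_mag_phase(x)
print(mag.shape, phase.shape)             # (n_frames, 257) each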
