Poster
LLMs can see and hear without any training
Kumar Ashutosh · Yossi Gandelsman · Xinlei Chen · Ishan Misra · Rohit Girdhar
East Exhibition Hall A-B #E-2702
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbuing multimodal capabilities into your favorite LLM. Leveraging LLMs' innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually converging to a solution for the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state of the art on emergent zero-shot image, video, and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
This work introduces MILS, a method that allows large language models (LLMs) to interpret images, videos, and audio without any additional training or task-specific data. MILS connects an LLM with a scoring model that evaluates how well each proposed caption or description matches a given input, such as an image or audio clip. The LLM generates multiple candidate responses, receives feedback from the scorer, and refines its outputs iteratively.

MILS demonstrates strong zero-shot performance across a wide range of tasks: captioning visual and audio inputs, enhancing text-to-image generation, performing style transfer, and even combining information across modalities. It does all this using only pre-trained models and test-time reasoning, avoiding any fine-tuning or supervised training.

By leveraging the native reasoning ability of LLMs and the representational power of multimodal models, MILS shows that powerful multimodal understanding and generation can emerge without explicit supervision. Its simplicity and flexibility open new possibilities for building general-purpose AI systems that operate across modalities.
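To make the generate-score-feedback loop concrete, here is a minimal, hypothetical Python sketch of the idea, not the released implementation. The function `mils_loop` and the toy generator and scorer below are illustrative stand-ins: in the actual applications, `generate` would be an LLM proposing captions or prompts conditioned on the previously scored candidates, and `score` would be a pre-trained multimodal similarity model (e.g., a CLIP-style image-text scorer). All hyperparameters shown are assumptions.

```python
from typing import Callable, List, Tuple
import random

def mils_loop(
    generate: Callable[[List[Tuple[float, str]], int], List[str]],
    score: Callable[[str], float],
    steps: int = 10,
    candidates_per_step: int = 50,
    keep: int = 10,
) -> str:
    """Return the highest-scoring text found by iterative, gradient-free refinement."""
    best: List[Tuple[float, str]] = []
    for _ in range(steps):
        # 1. GENERATOR: propose new candidates, conditioned on the
        #    scored survivors from the previous iteration.
        texts = generate(best, candidates_per_step)
        # 2. SCORER: rate each candidate against the target input
        #    (e.g., image-text similarity for zero-shot captioning).
        scored = [(score(t), t) for t in texts]
        # 3. FEEDBACK: keep only the top-scoring candidates and repeat.
        best = sorted(scored, reverse=True)[:keep]
    return best[0][1]

# Toy usage: "discover" the string "42" purely by proposing and scoring,
# with no gradients or training -- the same principle MILS relies on.
target = "42"

def toy_generate(prev: List[Tuple[float, str]], n: int) -> List[str]:
    # Re-propose the survivors plus fresh random guesses.
    survivors = [t for _, t in prev]
    fresh = ["".join(random.choice("0123456789") for _ in range(2))
             for _ in range(n)]
    return survivors + fresh

def toy_score(t: str) -> float:
    # Higher is better: negative digit-wise distance to the target.
    return -sum(abs(ord(a) - ord(b)) for a, b in zip(t, target))

print(mils_loop(toy_generate, toy_score))  # almost always prints "42"
```

The key design point this sketch illustrates is that the optimization signal comes entirely from the scorer at test time: swapping in different generators and scorers retargets the same loop from captioning to prompt rewriting or embedding inversion, with no parameter updates anywhere.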