Poster
Unifying Specialized Visual Encoders for Video Language Models
Jihoon Chung · Tyler Zhu · Max Gonzalez Saez-Diez · Juan Carlos Niebles · Honglu Zhou · Olga Russakovsky
West Exhibition Hall B2-B3 #W-300
Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.
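To make the three-stage pipeline (spatio-temporal alignment, projection to a unified structure, cross-attention fusion) concrete, here is a minimal PyTorch sketch. It is not the authors' released implementation; the class name, dimensions, and the use of learnable queries are illustrative assumptions, and temporal alignment (resampling all encoders to a common frame count) is assumed to happen upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiEncoderFusion(nn.Module):
    """Sketch of multi-encoder video feature fusion: spatial alignment,
    per-encoder projection to a shared width, then cross-attention fusion.
    Names and hyperparameters are illustrative, not MERV's exact settings."""

    def __init__(self, encoder_dims, d_model=1024, target_hw=(16, 16), num_heads=8):
        super().__init__()
        self.target_hw = target_hw
        # One linear projector per encoder maps heterogeneous widths to d_model.
        self.projectors = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)
        # Learnable queries attend over the concatenated encoder tokens.
        self.queries = nn.Parameter(torch.randn(target_hw[0] * target_hw[1], d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, encoder_feats):
        # encoder_feats: list of (B, T, H_i, W_i, C_i) tensors, one per encoder,
        # already sampled to the same number of frames T (temporal alignment assumed).
        tokens = []
        for feat, proj in zip(encoder_feats, self.projectors):
            B, T, H, W, C = feat.shape
            # Spatial alignment: resize every encoder's grid to a common size.
            feat = feat.permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
            feat = F.interpolate(feat, size=self.target_hw, mode="bilinear",
                                 align_corners=False)
            feat = feat.reshape(B, T, C, -1).permute(0, 1, 3, 2)  # (B, T, HW, C_i)
            tokens.append(proj(feat))                             # (B, T, HW, d_model)
        # Concatenate aligned tokens from all encoders along the token axis.
        kv = torch.cat(tokens, dim=2)                             # (B, T, N_total, d_model)
        B, T, N, D = kv.shape
        kv = kv.reshape(B * T, N, D)
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)                     # (B*T, HW, d_model)
        return fused.reshape(B, T, -1, D)  # unified video tokens for the LLM
```

Because each encoder is projected independently before fusion, the per-encoder forward passes can run in parallel, which is consistent with the abstract's note that visual processing is parallelized.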
When researchers design AI chatbots that understand videos, they often use a single pre-existing vision-understanding module. However, we found that such modules are often specialized, e.g. for understanding colors but not movements, or vice versa, and extending their capabilities in a simple way proved difficult. We design a method for harnessing multiple specialized vision modules together, even if each was created with different designs and goals. Our AI chatbot, MERV, answers a greater breadth of video-related questions because each module covers for the others' weaknesses. As one of the first papers to work on multi-module chatbots, we hope this demonstrates the potential of having multiple vision-understanding modules in one system.