Poster
Unifying Specialized Visual Encoders for Video Language Models
Jihoon Chung · Tyler Zhu · Max Gonzalez Saez-Diez · Juan Carlos Niebles · Honglu Zhou · Olga Russakovsky
West Exhibition Hall B2-B3 #W-300
Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.
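To make the three-stage pipeline (spatio-temporal alignment, projection to a unified structure, cross-attention fusion) concrete, here is a minimal PyTorch sketch. It is not the authors' released implementation; the class name, dimensions, and the use of learnable queries are illustrative assumptions, and temporal alignment (resampling all encoders to a common frame count) is assumed to happen upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiEncoderFusion(nn.Module):
    """Sketch of multi-encoder video feature fusion: spatial alignment,
    per-encoder projection to a shared width, then cross-attention fusion.
    Names and hyperparameters are illustrative, not MERV's exact settings."""

    def __init__(self, encoder_dims, d_model=1024, target_hw=(16, 16), num_heads=8):
        super().__init__()
        self.target_hw = target_hw
        # One linear projector per encoder maps heterogeneous widths to d_model.
        self.projectors = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)
        # Learnable queries attend over the concatenated encoder tokens.
        self.queries = nn.Parameter(torch.randn(target_hw[0] * target_hw[1], d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, encoder_feats):
        # encoder_feats: list of (B, T, H_i, W_i, C_i) tensors, one per encoder,
        # already sampled to the same number of frames T (temporal alignment assumed).
        tokens = []
        for feat, proj in zip(encoder_feats, self.projectors):
            B, T, H, W, C = feat.shape
            # Spatial alignment: resize every encoder's grid to a common size.
            feat = feat.permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
            feat = F.interpolate(feat, size=self.target_hw, mode="bilinear",
                                 align_corners=False)
            feat = feat.reshape(B, T, C, -1).permute(0, 1, 3, 2)  # (B, T, HW, C_i)
            tokens.append(proj(feat))                             # (B, T, HW, d_model)
        # Concatenate aligned tokens from all encoders along the token axis.
        kv = torch.cat(tokens, dim=2)                             # (B, T, N_total, d_model)
        B, T, N, D = kv.shape
        kv = kv.reshape(B * T, N, D)
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)                     # (B*T, HW, d_model)
        return fused.reshape(B, T, -1, D)  # unified video tokens for the LLM
```

Because each encoder is projected independently before fusion, the per-encoder forward passes can run in parallel, which is consistent with the abstract's note that visual processing is parallelized.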
When researchers design AI chatbots that understand videos, they often use a single pre-existing vision-understanding module. However, we found that such modules are often specialized, e.g. for understanding colors but not movements, or vice versa, and extending their capabilities in a simple way proved difficult. We design a method for harnessing multiple specialized vision modules together, even if each was created with different designs and goals. Our AI chatbot, MERV, answers a greater breadth of video-related questions because each module covers for the others' weaknesses. As one of the first papers to work on multi-module chatbots, we hope this demonstrates the potential of having multiple vision-understanding modules in one system.