

Poster in Affinity Workshop: New In ML

The Inevitable Divide: Why Vision and Language Diverge in MLLMs and How Self-Distillation Bridges the Gap

Xiantao Zhang


Abstract:

Despite the success of Multimodal Large Language Models (MLLMs), a fundamental modal discrepancy persists between visual and textual representations, a challenge for which intra-modal self-distillation has proven an effective remedy. This paper presents a unified theoretical framework that explains both phenomena, arguing that the discrepancy is an inevitable result of the MLLM's intrinsic design. Through the lens of Information Bottleneck theory and Game Theory, we demonstrate that the standard next-token prediction objective creates an asymmetric system, forcing visual features into a compressed, utilitarian representation that is functionally sufficient but not representationally aligned with its textual counterpart. We then show that self-distillation succeeds precisely because it introduces a direct internal supervisory signal that corrects this foundational asymmetry, guiding the model toward true representational alignment. Our work provides a principled understanding of a core challenge in developing more robustly aligned MLLMs.
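The abstract characterizes intra-modal self-distillation as adding a direct internal supervisory signal on the visual representations, alongside the next-token prediction objective. The sketch below is not the paper's formulation; it is a minimal, generic illustration under assumed details (PyTorch, a hypothetical `intra_modal_self_distillation_loss` function, assumed tensor shapes, and a cosine-alignment objective) of how such an internal signal is often implemented: a detached teacher copy of the model's visual features supervises the student's visual features directly.

```python
import torch
import torch.nn.functional as F


def intra_modal_self_distillation_loss(student_visual: torch.Tensor,
                                        teacher_visual: torch.Tensor) -> torch.Tensor:
    """Generic intra-modal self-distillation loss (illustrative sketch only).

    student_visual, teacher_visual: [batch, num_visual_tokens, dim] features,
    e.g. the student branch and a frozen or EMA teacher branch of the same MLLM.
    """
    # Stop gradients through the teacher branch so it acts purely as a target.
    teacher_visual = teacher_visual.detach()

    # Normalize so the objective measures representational direction, not scale.
    s = F.normalize(student_visual, dim=-1)
    t = F.normalize(teacher_visual, dim=-1)

    # Cosine-alignment loss averaged over tokens and the batch; this is the
    # "direct internal supervisory signal" applied inside the visual modality.
    return (1.0 - (s * t).sum(dim=-1)).mean()


# Toy usage with random tensors standing in for MLLM visual token features.
if __name__ == "__main__":
    student = torch.randn(2, 16, 768, requires_grad=True)
    teacher = torch.randn(2, 16, 768)
    loss = intra_modal_self_distillation_loss(student, teacher)
    loss.backward()
    print(f"self-distillation loss: {loss.item():.4f}")
```

In practice a term like this would be weighted and added to the standard next-token prediction loss; the specific teacher construction and distance measure used in the paper are not given in the abstract.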
