Poster in Affinity Workshop: New In ML
The Inevitable Divide: Why Vision and Language Diverge in MLLMs and How Self-Distillation Bridges the Gap
Xiantao Zhang
Despite the success of Multimodal Large Language Models (MLLMs), a fundamental modal discrepancy persists between visual and textual representations, a challenge for which intra-modal self-distillation has proven to be an effective remedy. This paper presents a unified theoretical framework that explains both phenomena, arguing that the discrepancy is an inevitable consequence of the MLLM's intrinsic design. Through the lens of Information Bottleneck theory and game theory, we demonstrate that the standard next-token prediction objective creates an asymmetric system, forcing visual features into a compressed, utilitarian representation that is functionally sufficient but not representationally aligned with its textual counterpart. We then show that self-distillation succeeds precisely because it introduces a direct internal supervisory signal that corrects this foundational asymmetry, guiding the model toward true representational alignment. Our work provides a principled understanding of a core challenge in developing more robustly aligned MLLMs.
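To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's own formulation: the notation (visual input X_v, visual representation Z_v, text targets Y, trade-off weight beta, distillation weight lambda, student encoder f_theta, teacher view g_thetabar, stop-gradient sg) is assumed here for illustration only.

% Illustrative sketch; symbols are assumptions, not taken from the paper.
% Information Bottleneck view of the visual pathway: next-token prediction pressures
% the visual representation Z_v to compress the image X_v, retaining only what
% predicts the text targets Y (a functionally sufficient but compressed code).
\min_{p(z_v \mid x_v)} \; I(X_v; Z_v) \;-\; \beta \, I(Z_v; Y)

% A generic intra-modal self-distillation term adds a direct internal supervisory
% signal on top of the next-token prediction (NTP) loss: student visual features
% are pulled toward a stop-gradient teacher view of the same visual input.
\mathcal{L} \;=\; \mathcal{L}_{\text{NTP}} \;+\; \lambda \,\big\lVert f_{\theta}(x_v) - \operatorname{sg}\!\big[\, g_{\bar{\theta}}(x_v) \,\big] \big\rVert_2^2

Under this reading, the first objective explains why the discrepancy arises (compression is rewarded, alignment is not), and the second shows where a corrective internal signal can enter the training objective.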