Poster
DS-VLM: Diffusion Supervision Vision Language Model
Zhen Sun · Yunhang Shen · Jie Li · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
West Exhibition Hall B2-B3 #W-1103
Vision-Language Models (VLMs) face two critical limitations in visual representation learning: degraded supervision due to information loss during gradient propagation, and the inherent semantic sparsity of textual supervision compared to visual data. We propose the Diffusion Supervision Vision-Language Model (DS-VLM), a plug-and-play framework that introduces diffusion-based direct supervision for vision-language alignment. By reconstructing input images through a diffusion model conditioned on the outputs of the visual encoder and the connector, our method establishes a short-path gradient propagation channel from pixel space to visual features. This approach preserves high-level semantic alignment through conventional text supervision while enhancing visual feature quality via pixel-level reconstruction constraints. Extensive experiments conducted across visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
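To make the described objective concrete, the sketch below shows one way a diffusion-based reconstruction loss can sit alongside the usual text loss during training, so that gradients from the pixel-level loss reach the visual encoder and connector without passing through the LLM. This is a minimal illustration, not the released DS-VLM implementation: the module classes (ToyVisionEncoder, ToyConnector, ToyDenoiser), the noise schedule, the dummy text loss, and the weighting factor lambda_diff are all assumptions made for the example.

# Illustrative sketch of diffusion-supervised training for a VLM.
# Everything here is a toy placeholder; a real system would use an actual
# visual encoder, connector, LLM text loss, and diffusion decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionEncoder(nn.Module):
    """Stand-in for the visual encoder (e.g., a ViT)."""
    def __init__(self, img_dim=3 * 32 * 32, feat_dim=256):
        super().__init__()
        self.proj = nn.Linear(img_dim, feat_dim)
    def forward(self, images):                      # images: (B, 3, 32, 32)
        return self.proj(images.flatten(1))         # (B, feat_dim)

class ToyConnector(nn.Module):
    """Stand-in for the vision-to-LLM connector (e.g., an MLP projector)."""
    def __init__(self, feat_dim=256, llm_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
    def forward(self, feats):
        return self.mlp(feats)                      # (B, llm_dim)

class ToyDenoiser(nn.Module):
    """Toy noise-prediction network conditioned on connector features."""
    def __init__(self, img_dim=3 * 32 * 32, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + cond_dim + 1, 1024),
                                 nn.SiLU(), nn.Linear(1024, img_dim))
    def forward(self, noisy_images, t, cond):
        x = torch.cat([noisy_images.flatten(1), cond, t[:, None].float()], dim=-1)
        return self.net(x).view_as(noisy_images)    # predicted noise

def diffusion_supervision_loss(images, cond, denoiser, num_steps=1000):
    """DDPM-style noise-prediction loss; gradients flow from pixel space
    directly into `cond` (the visual features), the short supervision path."""
    b = images.size(0)
    t = torch.randint(0, num_steps, (b,), device=images.device)
    # Simple linear alpha-bar schedule, purely for illustration.
    alpha_bar = (1.0 - (t.float() + 1) / num_steps).view(b, 1, 1, 1).clamp(min=1e-3)
    noise = torch.randn_like(images)
    noisy = alpha_bar.sqrt() * images + (1 - alpha_bar).sqrt() * noise
    return F.mse_loss(denoiser(noisy, t, cond), noise)

# One hypothetical training step combining both supervision signals.
encoder, connector, denoiser = ToyVisionEncoder(), ToyConnector(), ToyDenoiser()
images = torch.randn(4, 3, 32, 32)                  # dummy image batch
feats = connector(encoder(images))                  # visual features fed to the LLM

# Placeholder for the usual autoregressive text loss computed by the LLM on the
# multimodal sequence; a real system would use the LLM's token logits here.
text_loss = feats.pow(2).mean()

lambda_diff = 0.5                                   # assumed loss weighting
loss = text_loss + lambda_diff * diffusion_supervision_loss(images, feats, denoiser)
loss.backward()                                     # pixel-level gradients reach the
                                                    # encoder/connector without passing
                                                    # through the LLM

At inference time only the encoder, connector, and LLM are used, which is consistent with the abstract's point that the diffusion branch adds training-time supervision without changing the deployed model.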
Today’s AI systems that connect pictures and words often miss fine image details because they learn mainly from text feedback that must travel through a very large language model. We created a simple training add-on that asks the computer to “redraw” each picture, pixel by pixel, with a generative “diffusion” engine. This direct exercise gives the vision part of the model much clearer guidance—like letting an art student repaint a scene instead of only hearing comments—while the usual text guidance still keeps words and images in sync. After this training, the same models answer image-based questions more accurately on ten public tests, yet they run just as fast as before because the extra diffusion step is needed only during training. The result is a practical path toward sharper, more reliable visual understanding in everyday AI tools.