

Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

Mugilan Ganesan · Shane Segal · Ankur Aggarwal · Nish Sinnadurai · Sean Lie · Vithursan Thangarasa


Abstract:

Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x compared to conventional text-only drafting baselines on visually-grounded tasks.
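The core mechanism is the draft-and-verify loop: in each round, the small drafter proposes a short block of tokens, and the target model checks all of them in a single parallel forward pass, keeping the longest prefix it agrees with. Below is a minimal, hypothetical sketch of one greedy round; the toy stand-in models and all names are illustrative assumptions, not the paper's implementation.

```python
import torch

VOCAB = 100

def toy_lm(seed: int):
    # Stand-in for a causal LM: next-token logits depend only on the
    # current token id (a fixed lookup table), enough to run the loop.
    g = torch.Generator().manual_seed(seed)
    table = torch.randn(VOCAB, VOCAB, generator=g)
    def logits_fn(ids: torch.Tensor) -> torch.Tensor:
        return table[ids]  # [seq_len] -> [seq_len, VOCAB]; position i predicts token i+1
    return logits_fn

draft_lm = toy_lm(1)   # small, fast drafter
target_lm = toy_lm(2)  # large model whose outputs must be reproduced exactly

def speculative_step(ids: torch.Tensor, k: int = 4):
    """One greedy speculative-decoding round: the drafter proposes k tokens
    sequentially; the target verifies all of them in one parallel pass."""
    proposal = ids.clone()
    for _ in range(k):                       # k cheap draft forward passes
        nxt = draft_lm(proposal)[-1].argmax()
        proposal = torch.cat([proposal, nxt.view(1)])
    # One target pass scores every drafted position at once.
    tgt = target_lm(proposal)[len(ids) - 1:].argmax(dim=-1)  # k+1 predictions
    accepted = 0
    for i in range(k):                       # keep the longest matching prefix
        if proposal[len(ids) + i] != tgt[i]:
            break
        accepted += 1
    # Append the target's token at the first mismatch (or the free "bonus"
    # token when all k draft tokens were accepted).
    new_ids = torch.cat([proposal[:len(ids) + accepted],
                         tgt[accepted:accepted + 1]])
    return new_ids, accepted

ids = torch.tensor([3, 14, 15])
ids, n_accepted = speculative_step(ids, k=4)
```

The number of draft tokens accepted per round (the "accepted length") governs the speedup: every accepted token is one sequential target forward pass saved. MASSV raises it on visually grounded prompts by letting the drafter see and imitate the target. Phase one could look like the following hypothetical module: a small trainable MLP that projects the target VLM's frozen vision-encoder features into the draft model's embedding space so the drafter can condition on image tokens (the class name and structure are assumptions, not the paper's code).

```python
import torch
import torch.nn as nn

class DraftVisionProjector(nn.Module):
    """Illustrative phase-one adapter: map the target VLM's vision-encoder
    features into the draft model's token-embedding space."""
    def __init__(self, vision_dim: int, draft_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, draft_dim),
            nn.GELU(),
            nn.Linear(draft_dim, draft_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: [num_patches, vision_dim] -> [num_patches, draft_dim]
        return self.proj(vision_feats)
```

Phase two then fine-tunes the adapted drafter on responses generated by the target VLM itself (self-data distillation), pushing the drafter's token distribution toward the target's and lengthening the accepted prefix.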
