

Poster in Affinity Workshop: New In ML

DeFT: Decoupled and Feedback-Guided Tokenization for Efficient Multimodal Long-Context Modeling

Dong Liu · Yanxuan Yu


Abstract:

Tokenization plays a central role in scaling foundation models to long-context and multimodal scenarios, yet most existing approaches conflate semantic abstraction with token compression, resulting in rigid and modality-specific pipelines. In this work, we propose DeFT, a unified framework that decouples semantic abstraction from compression and introduces a feedback-guided mechanism for adaptive token filtering. DeFT processes text, image, and video inputs through modality-specific encoders and projects them into a shared semantic space, where token importance is estimated via a hybrid scoring function that combines learned saliency and gradient-based task feedback. Tokens below a dynamic threshold are pruned but are optionally stored in a recoverable token dictionary that enables selective reconstruction at inference time. Our approach enables efficient and robust modeling of long-context multimodal inputs with minimal loss in task performance. Extensive experiments on QA, captioning, and retrieval tasks across Ego4D, ChartQA, and M4C-TextVQA demonstrate that DeFT achieves superior trade-offs between compression ratio and accuracy compared to prior token pruning and merging baselines.
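For readers who want a concrete picture of the filtering step described above, the sketch below illustrates one way the hybrid scoring (learned saliency plus gradient-based task feedback), dynamic-threshold pruning, and recoverable token dictionary could fit together. It is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the function names, the additive fusion of the two scores, and the top-k form of the dynamic threshold are all illustrative choices.

```python
# Hypothetical sketch of feedback-guided token filtering (not DeFT's released code).
import torch

def hybrid_score(tokens, saliency_head, task_loss):
    """Combine a learned saliency score with gradient-based task feedback per token."""
    saliency = saliency_head(tokens).squeeze(-1)              # (B, N) learned saliency
    grads, = torch.autograd.grad(task_loss, tokens, retain_graph=True)
    feedback = grads.norm(dim=-1)                             # (B, N) gradient magnitude as task feedback
    return saliency + feedback                                # additive fusion (assumed)

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Drop low-scoring tokens; stash the rest in a dictionary for optional recovery."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))                           # dynamic threshold realized as top-k (assumed)
    keep_idx = scores.topk(k, dim=1).indices                  # (B, k) indices of retained tokens
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    # Recoverable token dictionary: pruned tokens keyed by their original positions,
    # so they can be selectively reconstructed at inference time.
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)
    token_dict = {"positions": mask.nonzero(), "tokens": tokens[mask]}
    return kept, token_dict
```

In this sketch, `tokens` are the projected representations in the shared semantic space and must carry gradients so the task-feedback term can be computed; how the two score components are weighted, and how the threshold adapts per input, are details the abstract leaves to the paper.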
