Poster
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Qinsi Wang · Hancheng Ye · Ming-Yu Chung · Yudong Liu · Yueqian Lin · Martin Kuo · Mingyuan Ma · Jianyi Zhang · Yiran Chen
East Exhibition Hall A-B #E-2504
Modern AI models that understand both images and language are powerful but often slow and resource-hungry. To make them faster, researchers have tried two main strategies: focusing only on the most important pieces of the input (like key words in a sentence or key regions of an image) and reducing the amount of brain-like activity inside the model. So far, these two ideas have mostly been studied separately.

In our work, we ask a simple but important question: what if these two strategies actually help each other? We find that the most useful parts of the input and the most important parts of the model tend to match up, and this connection can be exploited to make the model even more efficient.

Based on this insight, we designed a new method called CoreMatching, which jointly selects the key inputs and the key model components at the same time. This yields much faster inference with almost no drop in performance. Our approach works well across many vision tasks and devices; on one common graphics card, it runs up to 10 times faster than current methods.
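To make the joint-selection idea concrete, here is a minimal toy sketch (not the paper's exact algorithm): token importance is measured by how strongly each token activates a feed-forward layer, and neuron importance is then measured only on the kept "core" tokens, so the two selections reinforce each other. All sizes, scores, and thresholds below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_hidden = 16, 32, 64
X = rng.normal(size=(n_tokens, d_model))        # token activations entering the layer
W = rng.normal(size=(d_model, d_hidden))        # FFN input projection

H = np.maximum(X @ W, 0.0)                      # hidden activations (ReLU)

# Token importance: how strongly each token activates the layer.
token_score = np.linalg.norm(H, axis=1)
keep_tokens = np.argsort(token_score)[-8:]      # keep the top-8 tokens

# Neuron importance, measured only on the kept (core) tokens:
# neurons that fire for important tokens are retained, reflecting
# the co-adaptation idea that the two prunings help each other.
neuron_score = np.linalg.norm(H[keep_tokens], axis=0)
keep_neurons = np.argsort(neuron_score)[-32:]   # keep the top-32 neurons

# Sparse forward pass: only core tokens through core neurons.
H_sparse = np.maximum(X[keep_tokens] @ W[:, keep_neurons], 0.0)
print(H_sparse.shape)                           # (8, 32)
```

In a real model the same selection would be applied across layers and amortized over decoding steps; this sketch only shows why pruning tokens first makes the neuron scores cheaper and more focused.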