Poster
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao (Tommy) Chang · Shivaram Venkataraman
East Exhibition Hall A-B #E-3405
AI models that understand videos (imagine movies!) are extremely large, often too big to fit on a single computer. To handle them, researchers typically split the workload across multiple computers. However, this can be very slow because the computers must exchange large amounts of data while they work. A major bottleneck is a part of these models called cross-attention, which connects visual information (like video frames) to language. In our work, we introduce a new way of splitting the cross-attention workload, called LV-XAttn, that greatly reduces how much data the computers need to exchange without changing the model's output. The key observation is that the visual data (the attention keys and values) is far larger than the text-side data (the queries), so we keep the large visual pieces local to each computer and pass around only the small text-side pieces, cutting the time spent on communication. We also design a memory-efficient technique that lets the model handle even longer videos. Our approach works seamlessly with several popular AI models and can make them up to 10 times faster. This makes it more practical to train and deploy powerful AI models that can understand long videos.
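To make the "keep the large data local, move the small data" idea concrete, below is a minimal single-process sketch in NumPy. It simulates a ring of workers that each hold one large shard of the visual keys and values locally, while a small query block visits each shard in turn and accumulates an exact attention result with a running softmax. The shard sizes, function names, and the single-process loop are illustrative assumptions for this sketch, not the actual distributed LV-XAttn implementation.

```python
# Minimal sketch (assumed shapes and names, single process): the large visual
# key/value shards stay local to each simulated worker; only the small query
# block and its running statistics move from worker to worker.
import numpy as np

def local_partial_attention(q, k, v, running_out, running_max, running_denom):
    """Fold one local KV shard into the running attention result for q."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_kv_local)
    block_max = scores.max(axis=-1, keepdims=True)      # per-query max for stability
    new_max = np.maximum(running_max, block_max)
    correction = np.exp(running_max - new_max)           # rescale earlier shards
    p = np.exp(scores - new_max)                         # unnormalized probs, this shard
    new_denom = running_denom * correction + p.sum(axis=-1, keepdims=True)
    new_out = running_out * correction + p @ v
    return new_out, new_max, new_denom

def simulated_ring_cross_attention(q, kv_shards):
    """Exact cross-attention computed shard by shard, as if q were passed around
    a ring of workers that each hold one large visual KV shard locally."""
    n_q, d = q.shape
    out = np.zeros((n_q, d))
    m = np.full((n_q, 1), -np.inf)
    denom = np.zeros((n_q, 1))
    for k, v in kv_shards:                               # one iteration per "worker"
        out, m, denom = local_partial_attention(q, k, v, out, m, denom)
    return out / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    q = rng.standard_normal((16, d))                     # small text-side query block
    k_full = rng.standard_normal((4096, d))              # long visual key sequence
    v_full = rng.standard_normal((4096, d))
    shards = [(k_full[i:i + 1024], v_full[i:i + 1024]) for i in range(0, 4096, 1024)]

    out = simulated_ring_cross_attention(q, shards)

    # Reference: ordinary monolithic softmax attention; results should match,
    # i.e. rearranging where the data lives does not change the output.
    scores = q @ k_full.T / np.sqrt(d)
    ref = np.exp(scores - scores.max(-1, keepdims=True))
    ref = (ref / ref.sum(-1, keepdims=True)) @ v_full
    print(np.allclose(out, ref))                         # True
```

Because the running-softmax accumulation is mathematically equivalent to ordinary attention, the final check against the monolithic computation prints True: the distributed schedule changes only where the data moves, not what the model computes.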