Poster
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
Shihao Zou · Qingfeng Li · Wei Ji · Jingjing Li · Yongkui Yang · Guoqi Li · Chao Dong
West Exhibition Hall B2-B3 #W-413
Video analysis by computers, such as recognizing human actions or tracking motion, is typically done using powerful AI models called Transformers. However, these models require a lot of energy, which limits their use in devices like drones or wearables. A different kind of AI, called Spiking Neural Networks (SNNs), mimics how the brain works and uses much less energy, but current SNNs don't work well with video data.

Our research introduces SpikeVideoFormer, a new kind of energy-efficient video-processing AI model that combines the strengths of Transformers and SNNs. We designed a special way for this model to "pay attention" to important parts of a video over time using simple brain-like signals rather than complex math (see the sketch below). This keeps processing fast and efficient even for long videos, with a cost that grows only linearly in the number of timesteps.

SpikeVideoFormer achieves excellent performance in tasks like video classification, human pose tracking, and understanding video scenes, matching or beating traditional models while using up to 16 times less energy. This could make smart, energy-efficient video AI possible in more real-world settings.
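To make the "simple brain-like signals" idea concrete, here is a minimal sketch in PyTorch of how spike-driven attention can run in linear time. The function name, tensor shapes, and the reduction of Hamming similarity to dot products are illustrative assumptions, not the paper's exact implementation; in particular, the paper's Hamming attention may score and normalize differently.

```python
import torch


def spike_hamming_attention(q, k, v):
    """Sketch of linear-time attention over binary spike tensors.

    q, k, v: (T, N, d) tensors with values in {0, 1}, where T is the
    number of timesteps, N the tokens per frame, d the feature dim.

    For binary vectors, the Hamming distance satisfies
        H(q_i, k_j) = ||q_i||_1 + ||k_j||_1 - 2 * (q_i . k_j),
    so Hamming similarity is an affine function of the dot product and
    needs no exponentials or divisions. Regrouping Q (K^T V) instead of
    (Q K^T) V makes the cost O(T * N * d^2): linear in T, unlike the
    O(T^2) score matrix of softmax attention.
    """
    T, N, d = q.shape
    q2 = q.reshape(T * N, d)
    k2 = k.reshape(T * N, d)
    v2 = v.reshape(T * N, d)
    kv = k2.t() @ v2    # (d, d) summary shared by all queries
    out = q2 @ kv       # (T*N, d); the (T*N)^2 score map is never formed
    return out.reshape(T, N, d)


if __name__ == "__main__":
    T, N, d = 8, 16, 32
    q = torch.randint(0, 2, (T, N, d)).float()
    k = torch.randint(0, 2, (T, N, d)).float()
    v = torch.randint(0, 2, (T, N, d)).float()
    print(spike_hamming_attention(q, k, v).shape)  # torch.Size([8, 16, 32])
```

Because the inputs are binary spikes, the matrix products above reduce to accumulations rather than multiply-accumulates, which is the standard source of the energy savings claimed for spike-driven designs.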