

Poster

Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models

Chao Huang · Yushu Shi · Jie Wen · Wei Wang · Yong Xu · Xiaochun Cao

West Exhibition Hall B2-B3 #W-303
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

With advances in visual-language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification toward fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection and lack explanations for the anomalies they flag. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, which an LLM then converts into video-level explanations, enhancing the model's explainability. Second, integrating these textual explanations of anomalies with visual information substantially strengthens the model's detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experiments on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.
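For illustration, the sketch below shows how the three stages described in the abstract could fit together: a text embedding of the LLM-generated video-level explanation is fused with pooled frame-level visual features, and the fused feature is both classified into fine-grained anomaly categories and aligned with learnable class embeddings. This is a minimal toy sketch assuming PyTorch; the name ExVADSketch, the additive fusion, the 14-way label space (13 UCF-Crime anomaly types plus normal), and all dimensions are illustrative assumptions, not the authors' implementation, and random tensors stand in for the real VLM/LLM outputs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExVADSketch(nn.Module):
    """Fuses frame-level visual features with an embedding of the
    LLM-generated video-level explanation, then classifies the video
    into fine-grained anomaly categories."""

    def __init__(self, vis_dim=512, txt_dim=512, fuse_dim=256, num_classes=14):
        super().__init__()
        # num_classes=14 assumes the 13 UCF-Crime anomaly types plus "normal".
        self.vis_proj = nn.Linear(vis_dim, fuse_dim)
        self.txt_proj = nn.Linear(txt_dim, fuse_dim)
        self.classifier = nn.Linear(fuse_dim, num_classes)
        # One plausible reading of "label-enhanced alignment": learnable
        # class embeddings that the fused feature is aligned with in training.
        self.label_embed = nn.Parameter(torch.randn(num_classes, fuse_dim))

    def forward(self, vis_feats, txt_feat):
        # vis_feats: (T, vis_dim) per-frame features from a visual backbone.
        # txt_feat:  (txt_dim,) embedding of the video-level explanation.
        v = self.vis_proj(vis_feats).mean(dim=0)  # simple temporal pooling
        t = self.txt_proj(txt_feat)
        fused = F.normalize(v + t, dim=-1)        # additive fusion (assumed)
        logits = self.classifier(fused)           # fine-grained class scores
        # Cosine similarity between the fused feature and each class
        # embedding; during training this could drive an alignment loss.
        align = fused @ F.normalize(self.label_embed, dim=-1).T
        return logits, align

# Toy usage: random tensors stand in for real VLM frame features and the
# encoded LLM explanation.
model = ExVADSketch()
frame_feats = torch.randn(32, 512)   # 32 frames of visual features
explanation_emb = torch.randn(512)   # embedding of the text explanation
logits, align = model(frame_feats, explanation_emb)
print(logits.shape, align.shape)     # torch.Size([14]) torch.Size([14])

In the actual method, the text embedding would come from encoding the LLM's explanation, and the alignment scores would presumably feed a loss against class-name text embeddings rather than freely learned parameters; the paper should be consulted for those details.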

Lay Summary:

Existing video anomaly detection (VAD) methods focus mainly on coarse-grained detection and do not explain the anomalies they find. We propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a visual-language model to extract frame-level captions, which a large language model then converts into video-level explanations, making the model's decisions easier to understand. Second, integrating these textual explanations with visual information substantially strengthens the model's detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experiments on public benchmark datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.
