Poster in Workshop: AI Heard That! ICML 2025 Workshop on Machine Learning for Audio
Visual and Aural Explanations for Transformer-Based Deepfake Detection
Georgia Channing · Juil Sock · Phil Torr · Christian Schroeder de Witt
Detecting out-of-distribution audio deepfakes remains a critical challenge for real-world deployment, especially when models encounter fakes generated by previously unseen techniques. We argue that addressing this challenge requires not only high detection performance but also strong explainability. In this work, we explore novel explainability strategies for state-of-the-art transformer-based deepfake detectors. We show that by listening to and visualizing what models attend to, we can begin to understand the fundamental differences between real and fake vocal audio. We adapt attention rollout and occlusion methods to audio and propose a framework for generating human-interpretable visual and aural explanations. We measure cross-dataset generalization from ASVspoof5 to FakeAVCeleb, revealing that the best-performing models often attend to unintuitive or non-semantic features. We argue that understanding what makes audio sound fake is essential for robust, generalizable deepfake detection.
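To make the attention-rollout adaptation concrete, below is a minimal sketch of rollout over an audio transformer's attention maps, in the spirit of Abnar & Zuidema (2020). The function name, tensor shapes, and the 0.5 residual-mixing weight are illustrative assumptions rather than the authors' exact implementation; per-layer attentions could come from, e.g., a Hugging Face model called with `output_attentions=True`, where tokens correspond to time frames.

```python
# Minimal attention-rollout sketch for an audio transformer (assumptions noted above).
import torch

def attention_rollout(attentions: list[torch.Tensor]) -> torch.Tensor:
    """Propagate attention across layers, accounting for residual connections.

    attentions: per-layer tensors of shape (heads, tokens, tokens),
    ordered from the first layer to the last.
    """
    rollout = torch.eye(attentions[0].size(-1))
    for attn in attentions:
        # Average over heads, then mix in the identity to model the residual path.
        attn_avg = attn.mean(dim=0)
        attn_res = 0.5 * attn_avg + 0.5 * torch.eye(attn_avg.size(-1))
        attn_res = attn_res / attn_res.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn_res @ rollout
    # rollout[i, j] estimates the influence of input frame j on output token i.
    return rollout
```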
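The occlusion adaptation can likewise be sketched as sliding a masked window over a spectrogram's time axis and recording how much the detector's score drops. Everything here is an assumption for illustration: `model` stands in for any scalar-scoring detector, and the window size, stride, and mean-value masking are arbitrary choices, not the paper's settings.

```python
# Minimal spectrogram-occlusion sketch (hypothetical model and parameters).
import torch

@torch.no_grad()
def occlusion_saliency(model, spec: torch.Tensor, win: int = 16, stride: int = 8) -> torch.Tensor:
    """spec: (freq, time) log-mel spectrogram; returns per-frame saliency over time."""
    base = model(spec.unsqueeze(0)).squeeze()  # baseline "fake" score
    saliency = torch.zeros(spec.size(1))
    counts = torch.zeros(spec.size(1))
    fill = spec.mean()                         # neutral masking value
    for start in range(0, spec.size(1) - win + 1, stride):
        occluded = spec.clone()
        occluded[:, start:start + win] = fill  # mask one time window
        drop = base - model(occluded.unsqueeze(0)).squeeze()
        saliency[start:start + win] += drop    # big drop => window mattered
        counts[start:start + win] += 1
    return saliency / counts.clamp(min=1)      # average overlapping windows
```

For an aural rather than visual explanation, one plausible use of such a map is to retain only the high-saliency frames and invert the result back to a waveform (e.g., via Griffin-Lim), so a listener can hear the regions driving the detector's decision.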