

Poster in Workshop: Actionable Interpretability

Posthoc Disentanglement of Textual and Acoustic Features in Self-Supervised Speech Encoders

Hosein Mohebbi · Grzegorz Chrupała · Willem Zuidema · Afra Alishahi · Ivan Titov

[ Project Page ]
Sat 19 Jul 1 p.m. PDT — 2 p.m. PDT

Abstract:

Self-supervised speech encoders build entangled internal representations, which capture a variety of features (e.g., pitch, loudness, syntax, or semantics of an utterance) in a distributed encoding. This entanglement makes it difficult to track how such representations rely on textual and acoustic information when used in downstream applications, limiting their interpretability and transparency. In this paper, we build upon the Information Bottleneck principle to propose a posthoc cascaded disentanglement framework that separates speech representations learned by pre-trained neural speech models into two distinct components: one encoding content (i.e., what can be transcribed as text) and the other encoding all complementary acoustic features relevant to a downstream task. We apply and evaluate our framework on two target tasks, emotion recognition and speaker identification, quantifying the relative contribution of textual and acoustic features at each model layer. Finally, we use our disentanglement framework for feature attribution, allowing us to identify the most salient speech frames from both the textual and acoustic perspectives.
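The cascaded setup described in the abstract can be illustrated with a minimal sketch: a first bottleneck probe compresses the frozen encoder's frame representations into a content code trained against transcript targets, and a second probe then learns a complementary acoustic code for the downstream task. This is a hypothetical illustration, not the authors' implementation; the Gaussian variational bottleneck, the per-frame character targets, the loss weight `beta`, and all dimensions are assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch of a cascaded, IB-style posthoc disentanglement probe.
# NOT the paper's code: the variational bottleneck, loss weights, and all
# layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBBottleneck(nn.Module):
    """Compress frozen encoder frames into a low-dimensional stochastic code z,
    penalizing KL(q(z|h) || N(0, I)) as in the variational Information Bottleneck."""
    def __init__(self, d_in=768, d_z=64):
        super().__init__()
        self.mu = nn.Linear(d_in, d_z)
        self.logvar = nn.Linear(d_in, d_z)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return z, kl

# Stage 1: a "textual" bottleneck trained to predict transcribable content
# (approximated here by per-frame character targets). Stage 2: an "acoustic"
# bottleneck on the same frozen features for the downstream task (e.g., emotion),
# conditioned on the text code so it only adds complementary information.
d_model, n_chars, n_emotions = 768, 32, 4
text_ib, text_head = VIBBottleneck(d_model), nn.Linear(64, n_chars)
acou_ib, acou_head = VIBBottleneck(d_model), nn.Linear(64 + 64, n_emotions)

h = torch.randn(8, 100, d_model)                 # frozen encoder output (B, T, D)
char_targets = torch.randint(0, n_chars, (8, 100))
emo_targets = torch.randint(0, n_emotions, (8,))

opt = torch.optim.Adam([*text_ib.parameters(), *text_head.parameters(),
                        *acou_ib.parameters(), *acou_head.parameters()], lr=1e-3)
beta = 1e-3                                      # IB trade-off weight (assumed)

# Stage 1: content component.
z_text, kl_t = text_ib(h)
loss_text = F.cross_entropy(text_head(z_text).transpose(1, 2), char_targets) + beta * kl_t

# Stage 2: complementary acoustic component; z_text is detached so the acoustic
# code captures only task-relevant information beyond what the text code covers.
z_acou, kl_a = acou_ib(h)
pooled = torch.cat([z_text.detach(), z_acou], dim=-1).mean(dim=1)
loss_acou = F.cross_entropy(acou_head(pooled), emo_targets) + beta * kl_a

(loss_text + loss_acou).backward()
opt.step()
```

Under these assumptions, the relative size of the two losses (or of the information carried by each code) gives a rough per-layer picture of how much the downstream task draws on textual versus acoustic content, in the spirit of the analysis the abstract describes.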
