

Poster

ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior

Zhongweiyang Xu · Xulin Fan · Zhong-Qiu Wang · Xilin Jiang · Romit Roy Choudhury

West Exhibition Hall B2-B3 #W-415
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS, where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to this optimization approximates the room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior, along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and code are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS.
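To make the core mechanism concrete, the sketch below illustrates the diffusion-posterior-sampling idea the abstract builds on: at each noise level, an update combines a prior score term with a likelihood (data-consistency) gradient. Everything here is an illustrative assumption, not the ArrayDPS implementation — a closed-form Gaussian stands in for the learned speech diffusion prior, the forward operator `A` is known (whereas in ArrayDPS the array geometry and transfer functions are unknown and must be approximated by a separate optimization), and the deterministic annealed loop replaces true stochastic sampling.

```python
import numpy as np

def prior_score(x, sigma):
    # Score of a standard-normal prior convolved with noise level sigma:
    # grad_x log N(x; 0, (1 + sigma^2) I). A learned diffusion model
    # would supply this score in the real method.
    return -x / (1.0 + sigma**2)

def dps_estimate(y, A, noise_levels, guidance=50.0, step=0.01, iters=500, seed=0):
    """Deterministic annealed approximation to posterior sampling for y = A x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])  # start from noise
    for sigma in noise_levels:           # anneal from coarse to fine
        for _ in range(iters):
            # prior (score) term + data-consistency (likelihood-gradient) term
            grad = prior_score(x, sigma) - guidance * A.T @ (A @ x - y)
            x = x + step * grad
    return x

A = np.array([[1.0, 0.5]])            # toy mixing operator (known here, unlike BSS)
x_true = np.array([0.8, -0.4])        # toy "sources"
y = A @ x_true                         # observed "mixture"
x_hat = dps_estimate(y, A, noise_levels=[1.0, 0.3, 0.1, 0.03])
print(abs((A @ x_hat - y)[0]))        # small data residual: estimate explains the mixture
```

The design point this toy captures is the interplay the abstract describes: the prior term keeps the iterate plausible under the source model, while the likelihood term pulls it toward consistency with the recorded mixture; ArrayDPS replaces the known `A` with per-iteration estimates of the relative transfer functions.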

Lay Summary:

When multiple voice sources are mixed and recorded by microphones, how can we separate the different sources? Intuitively, if we know what a single speaker's speech sounds like, how would that help with separation? Is it possible to design an algorithm that enables source separation for any microphone array, without any extra model training? To solve this problem, we use a diffusion model that captures the patterns of single-speaker speech. With this prior knowledge, we design a novel posterior sampling algorithm for multi-microphone source separation, which enforces each separated source to follow the single-speaker speech patterns modeled by the diffusion model. Our findings show that, without any supervision, our method achieves superior source separation using only a single-speaker speech diffusion prior. The method generalizes easily to any microphone array and is generative.
