Invited Talk in Workshop: AI Heard That! ICML 2025 Workshop on Machine Learning for Audio
Recomposer: Event-roll-guided generative audio editing (Dan Ellis)
Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes, able to delete, insert, and enhance individual events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. The system is an encoder-decoder transformer operating on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions -- action, class, timing. Our work demonstrates that "recomposition" is an important and practical application.
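The training-data recipe described in the abstract lends itself to a simple sketch. The snippet below is an illustrative, assumption-laden outline (not the authors' actual pipeline) of how (input, desired output) pairs might be assembled by mixing an isolated event into a real-world background for each edit action; the sample rate, gain values, and helper names (mix_at, make_example) are hypothetical.

```python
import numpy as np

SR = 16000  # assumed sample rate for both background and event clips


def mix_at(background, event, onset_s, gain):
    """Return a copy of `background` with `event` (scaled by `gain`) added at `onset_s` seconds."""
    out = background.copy()
    start = min(int(onset_s * SR), len(out))
    end = min(start + len(event), len(out))
    out[start:end] += gain * event[: end - start]
    return out


def make_example(background, event, onset_s, action, cls):
    """Build one (input_audio, target_audio, edit_text, event_roll_entry) training pair."""
    if action == "insert":
        inp = background
        tgt = mix_at(background, event, onset_s, 1.0)
    elif action == "delete":
        inp = mix_at(background, event, onset_s, 1.0)
        tgt = background
    elif action == "enhance":
        # Input contains the event at a low level; target boosts it (gains are illustrative).
        inp = mix_at(background, event, onset_s, 0.3)
        tgt = mix_at(background, event, onset_s, 1.0)
    else:
        raise ValueError(f"unknown action: {action}")
    edit_text = f"{action} {cls}"  # e.g. "enhance Door"
    event_roll_entry = (cls, onset_s, onset_s + len(event) / SR)  # (class, start, end)
    return inp, tgt, edit_text, event_roll_entry
```

The (class, start, end) tuple could then be rasterized into the event-roll conditioning image, while the text string and the SoundStream tokens of `inp` and `tgt` form the model's inputs and targets.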
Dan Ellis is a Research Scientist with Google DeepMind in New York, where he works on general sound understanding and processing. He led the development of AudioSet, the largest open collection of labeled sound events, and continues to investigate AI sound perception. Before Google, Ellis was a Professor of Electrical Engineering at Columbia University. He is also the coordinator of the AUDITORY email list, a discussion forum of over 2000 researchers interested in the perception and cognition of sound.