Poster in Workshop: AI Heard That! ICML 2025 Workshop on Machine Learning for Audio

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Shehzeen Hussain · Paarth Neekhara · Xuesong Yang · Edresson Casanova · Subhankar Ghosh · Roy Fejgin · Mikyas Desta · Rafael Valle · Jason Li

Sat 19 Jul 9 a.m. PDT — 5 p.m. PDT

Abstract:

Autoregressive speech token generation models produce speech with remarkable variety and naturalness but often suffer from hallucinations and undesired vocalizations that do not conform to conditioning inputs. To address these challenges, we introduce Koel-TTS, an encoder-decoder transformer TTS model that improves contextual adherence of speech generation LLMs through preference alignment and classifier-free guidance (CFG). For preference alignment, we design a reward system that ranks model outputs using automatic metrics derived from speech recognition and speaker verification models, encouraging generations that better match the input text and speaker identity. CFG further allows fine-grained control over the influence of conditioning inputs during inference by interpolating conditional and unconditional logits. Notably, applying CFG inference to a preference-aligned model yields additional gains in transcription accuracy and speaker similarity, demonstrating the complementary benefits of both techniques. Koel-TTS achieves state-of-the-art results in zero-shot TTS, outperforming prior LLM-based models on intelligibility, speaker similarity, and naturalness, despite being trained on significantly less data. Audio samples are available on our website (https://koeltts.github.io/).
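The CFG step described in the abstract, combining conditional and unconditional logits at inference time, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cfg_logits` and `softmax`, the toy logit values, and the specific form `uncond + scale * (cond - uncond)` (a common way to write classifier-free guidance) are assumptions made for illustration only. With `scale = 1` the result equals the conditional logits, with `scale = 0` it equals the unconditional logits, and `scale > 1` pushes the distribution further toward what the conditioning inputs favor.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits for one decoding step.

    Assumed form (common CFG formulation, not taken from the paper):
        guided = uncond + scale * (cond - uncond)
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same vocabulary size")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary (hypothetical values).
    cond = [2.0, 0.5, 0.1]    # model run with text and speaker conditioning
    uncond = [1.0, 1.0, 1.0]  # model run with conditioning dropped
    for scale in (0.0, 1.0, 2.0):
        guided = cfg_logits(cond, uncond, scale)
        print(scale, [round(p, 3) for p in softmax(guided)])
```

In an autoregressive decoder this combination would be applied at every step before sampling the next speech token, at the cost of one extra forward pass without the conditioning inputs.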
