Poster
High-Fidelity Simultaneous Speech-To-Speech Translation
Tom Labiausse · Laurent Mazaré · Edouard Grave · Alexandre Défossez · Neil Zeghidour
West Exhibition Hall B2-B3 #W-409
We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which, unlike its consecutive counterpart (where one waits for the end of the source utterance before starting to translate), adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity, and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples at huggingface.co/spaces/kyutai/hibiki-samples, as well as models and inference code at github.com/kyutai-labs/hibiki.
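To make the multistream inference loop concrete, here is a minimal sketch of one synchronous decoding step. It assumes a hypothetical `model` callable that consumes a frame of source audio tokens plus a running decoder state and returns logits for the next target text token and for one frame of target audio codebook tokens; the names, shapes, and temperature value are illustrative stand-ins, not Hibiki's actual API.

```python
import torch

def decode_step(model, state, src_frame, temperature=0.8):
    """One synchronous step of a multistream decoder (illustrative).

    At every frame, the model ingests source audio tokens and emits a
    target text token plus one frame of target audio tokens, so the
    source and target streams advance in lockstep.
    """
    # Hypothetical forward pass: returns text logits of shape (vocab,),
    # audio logits of shape (num_codebooks, codebook_size), and the
    # updated decoder (e.g. KV cache) state.
    text_logits, audio_logits, state = model(src_frame, state)

    # Vanilla temperature sampling, as used at inference time.
    text_probs = torch.softmax(text_logits / temperature, dim=-1)
    text_token = torch.multinomial(text_probs, num_samples=1)

    audio_probs = torch.softmax(audio_logits / temperature, dim=-1)
    audio_tokens = torch.multinomial(audio_probs, num_samples=1).squeeze(-1)

    return text_token, audio_tokens, state
```

Because each step is a single forward pass followed by plain sampling, the same loop can run over a batch of streams, which is what makes batched translation and real-time on-device deployment straightforward.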
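The delay-identification idea can likewise be sketched: for each target word, find the shortest source prefix under which an off-the-shelf text translation model scores the target prefix nearly as well as it does given the full source. Everything here (the `logprob` callback, the `margin` rule, the monotonicity pass) is an assumed stand-in to illustrate the principle, not the paper's exact procedure.

```python
from typing import Callable, List

def find_word_delays(
    src_words: List[str],
    tgt_words: List[str],
    logprob: Callable[[str, str], float],  # log p(target_prefix | source_prefix)
    margin: float = 0.5,
) -> List[int]:
    """For each target word, return the smallest number of source words
    after which the MT model predicts the target prefix almost as
    confidently as it does with the full source sentence."""
    delays = []
    for t in range(1, len(tgt_words) + 1):
        tgt_prefix = " ".join(tgt_words[:t])
        full_score = logprob(" ".join(src_words), tgt_prefix)
        chosen = len(src_words)  # fall back to waiting for the full source
        for s in range(1, len(src_words) + 1):
            src_prefix = " ".join(src_words[:s])
            if full_score - logprob(src_prefix, tgt_prefix) <= margin:
                chosen = s  # enough source context to commit to this word
                break
        delays.append(chosen)
    # Enforce monotone delays: a later word cannot need less context.
    for i in range(1, len(delays)):
        delays[i] = max(delays[i], delays[i - 1])
    return delays
```

Per-word delays of this kind can then be used to time-align synthetic target speech against the source, yielding the aligned training data for supervised learning.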
Most speech translation systems today work only after a person has finished speaking, which is too slow for real-time conversations. Simultaneous translation, where the system starts translating while the speaker is still talking, is much harder: it requires smart, split-second decisions about when to translate, how long to wait, and how to keep the translated voice natural and expressive. Until now, machines have struggled to match the performance of human interpreters in this setting. We created Hibiki, a powerful yet simple system that can simultaneously listen and speak. It learns to balance waiting and translating in real time and generates both written and spoken translations. We also developed techniques to train it on synthetic data that sounds natural and stays aligned with the original speaker’s voice and rhythm. Hibiki outperforms past systems in accuracy, speaker similarity, and naturalness, and is the first model to come close to professional human interpretation. It makes real-time, human-like translation more accessible, as it can even run on a smartphone. We’re sharing our code, models, and a large dataset to help others build on this progress and bring high-fidelity cross-language communication to more people.