

Poster with Prerecorded Video in Workshop: Tokenization Workshop (TokShop)

HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling

Rongkun Xue · Yazhe Niu · Shuai Hu · Zixin Yin · Yongqiang Yao · Jing Yang

Keywords: [ Discrete Speech Tokenization ] [ Speech Codec ]

Fri 18 Jul 1:50 p.m. PDT — 3 p.m. PDT

Abstract:

Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module.
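The 0.3 kbps figure follows directly from the stated token rate once a codebook size is fixed. The sketch below is a minimal illustration, not the paper's implementation: the 8192-entry codebook and 256-dimensional latents are assumed values chosen so the arithmetic lands near 0.3 kbps, and the random "encoder frames" stand in for real encoder outputs. It shows the bandwidth arithmetic and what single-quantizer tokenization reduces to at inference time: one nearest-neighbour lookup per frame, yielding a single stream of 24 integer tokens per second.

```python
import numpy as np

# Hypothetical configuration (not taken from HH-Codec): 2**13 codebook entries
# give 13 bits per token, so 24 tokens/s -> 312 bps, i.e. ~0.3 kbps.
TOKEN_RATE_HZ = 24      # discrete tokens per second of 24 kHz audio (from the abstract)
CODEBOOK_SIZE = 8192    # assumed codebook size
EMBED_DIM = 256         # assumed latent dimension per encoder frame

bits_per_token = np.log2(CODEBOOK_SIZE)
bandwidth_bps = TOKEN_RATE_HZ * bits_per_token
print(f"{bandwidth_bps:.0f} bps = {bandwidth_bps / 1000:.2f} kbps")  # 312 bps = 0.31 kbps

# Single-quantizer VQ at inference: each encoder frame is replaced by the index
# of its nearest codebook vector, so a downstream spoken language model sees one
# token stream rather than parallel streams from multiple residual quantizers.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((CODEBOOK_SIZE, EMBED_DIM))
frames = rng.standard_normal((TOKEN_RATE_HZ, EMBED_DIM))  # one second of mock latents

# Squared Euclidean distances computed without materialising a (24, 8192, 256) tensor.
sq_frames = (frames ** 2).sum(axis=1, keepdims=True)       # (24, 1)
sq_codes = (codebook ** 2).sum(axis=1)                     # (8192,)
dists = sq_frames - 2.0 * frames @ codebook.T + sq_codes   # (24, 8192)
token_ids = dists.argmin(axis=1)
print(token_ids.shape)  # (24,) -> 24 integer tokens for one second of audio
```

In the asymmetric Audio-VQ-Mel-Audio design described in the abstract, these token indices would be decoded into a mel-spectrogram and then vocoded back to a waveform; the sketch above covers only the quantization and bitrate side.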
