Poster in Affinity Workshop: New In ML

A Survey of Audio Language Models: Data, Architecture and Training Strategies


Abstract:

Recent breakthroughs in large language models (LLMs), together with powerful speech models that achieve high zero-shot accuracy (e.g., Whisper), have catalyzed the emergence of Audio LLMs: unified models that bridge the acoustic and linguistic modalities. We present the first systematic review of these models, contrasting them with domain-specific predecessors (e.g., Wav2Vec 2.0 for speech, BERT for text). We analyze audio's dual continuous-and-discrete nature through HuBERT units and expose data biases (e.g., 82% English in Common Voice vs. <3% Swahili). Architecturally, block-sparse attention (BSA) cuts memory use by 40% for 1-hour audio. Alignment strategies such as multimodal prompting achieve 90% voice-cloning similarity from 3-second reference audio. However, challenges remain: 40-60% higher WER in low-resource languages, substantial carbon emissions per 1B-parameter model, and a 300% annual rise in voice spoofing. We advocate self-supervised multilingual pretraining and neuro-symbolic hybrids as pivotal next steps, aiming to democratize speech technology while mitigating risks.
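The "HuBERT units" mentioned above are discrete pseudo-phonetic tokens obtained by clustering HuBERT's continuous frame features. A minimal sketch, assuming torchaudio's HUBERT_BASE pipeline; the layer index, 100-cluster codebook size, and speech.wav file are illustrative choices, not the survey's recipe:

# Sketch: turn speech into discrete "HuBERT units" by extracting
# continuous HuBERT frame features, then quantizing with k-means.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech.wav")  # hypothetical input file
waveform = waveform.mean(0, keepdim=True)     # mix down to mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns per-layer frame features; a middle layer
    # is a common (assumed) choice for unit discovery.
    features, _ = model.extract_features(waveform)
    frames = features[6].squeeze(0).numpy()   # (num_frames, feature_dim)

# Cluster frames into a small discrete vocabulary; the cluster indices
# are the "units" a language model can consume as tokens.
units = KMeans(n_clusters=100, n_init=10).fit_predict(frames)
print(units[:20])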
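To illustrate where the BSA memory saving comes from, here is a minimal block-local self-attention sketch in PyTorch. It restricts each frame to attend within its own fixed-size block, so the score matrix grows with seq_len x block_size rather than seq_len squared. The function name and block size are illustrative assumptions, not the survey's exact BSA variant.

# Sketch: block-local (block-sparse) self-attention.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64):
    """Attend only within non-overlapping blocks along the time axis.

    q, k, v: (batch, seq_len, dim), with seq_len divisible by block_size.
    Attention scores cost seq_len * block_size memory instead of
    seq_len ** 2, which matters for hour-long audio sequences.
    """
    b, t, d = q.shape
    nb = t // block_size
    # Reshape to (batch, num_blocks, block_size, dim) so each block
    # attends only to itself.
    q = q.view(b, nb, block_size, d)
    k = k.view(b, nb, block_size, d)
    v = v.view(b, nb, block_size, d)
    scores = torch.matmul(q, k.transpose(-1, -2)) / d ** 0.5
    weights = F.softmax(scores, dim=-1)
    out = torch.matmul(weights, v)
    return out.view(b, t, d)

# A 1-hour clip at 50 frames/s is ~180k frames: dense attention would
# need a 180k x 180k score matrix, block-local only 180k x block_size.
x = torch.randn(1, 1024, 256)
print(block_sparse_attention(x, x, x).shape)  # torch.Size([1, 1024, 256])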
