Speech Language Model

Also known as: SLM, Audio Language Model, Speech Foundation Model

A class of large neural models that process both speech and text in a single end-to-end framework, integrating tasks that traditionally required separate modular systems: automatic speech recognition, spoken language understanding, dialogue, and speech generation. Examples include OpenAI Whisper, Meta SeamlessM4T, Google Gemini's audio tower, and open-source models such as Qwen-Audio. Speech language models are increasingly embedded in accessibility tools (voice assistants, meeting captioners, AI-powered augmentative and alternative communication, or AAC) and raise new fairness concerns: errors can arise in understanding or generation rather than in transcription alone, and established metrics used to assess fairness, such as Word Error Rate, do not straightforwardly extend to their outputs.
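To make the Word Error Rate point concrete, the following is a minimal sketch of how WER is commonly computed and compared across speaker groups. The function names (`word_edit_distance`, `word_error_rate`, `wer_by_group`) and the sample data are hypothetical and chosen only for illustration; they are not taken from the cited source. The sketch assumes every utterance has one reference transcript, which is the assumption that breaks down when a model's output is an answer or generated speech rather than a transcription.

```python
from collections import defaultdict


def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcript must contain at least one word")
    return word_edit_distance(reference, hypothesis) / n_ref


def wer_by_group(samples):
    """Pooled WER per group from (group, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        errors[group] += word_edit_distance(reference, hypothesis)
        words[group] += len(reference.split())
    return {group: errors[group] / words[group] for group in errors}


if __name__ == "__main__":
    # Hypothetical transcripts for two speaker groups, for illustration only.
    samples = [
        ("group_a", "turn on the light", "turn on the light"),
        ("group_b", "turn on the light", "turn on light"),
    ]
    print(wer_by_group(samples))
```

A gap between the per-group values is the kind of disparity a WER-based fairness audit reports. When a model replies to a spoken question instead of transcribing it, there is no single reference string to align against, so this calculation has no direct analogue.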

Category: Artificial Intelligence · Speech Technology · Machine Learning · Emerging Technology

Related: Automatic speech recognition · Word error rate · Large language model · Text-to-speech

Sources

https://arxiv.org/abs/2504.08528