Text-to-Sound

Also known as: Text-to-Audio, TTA, Sound Generation from Text

A class of generative AI models that synthesize non-speech audio (sound effects, ambient environments, foley, or short music clips) from a natural-language description such as 'a door creaking shut' or 'cloth ruffling as a coat is removed'. Unlike text-to-speech systems, which produce spoken language, text-to-sound models (e.g. AudioLDM, Stable Audio, and AudioGen from Meta's AudioCraft) target non-verbal sound. In accessibility contexts they are increasingly used to generate diegetic sound effects that convey on-screen actions to blind and low-vision viewers without the latency or cost of human foley, and to produce ambient cues for non-visual exploration interfaces. Quality and controllability remain limited compared with human-produced sound design, and generated sounds can be mistaken for part of the original soundtrack if they are not mixed carefully.

Category: AI and accessibility · Audio · Sound Design · Video Accessibility · Generative AI

Related: Audio description · Diegetic Sound · Sound Effect · Text-to-speech
