AudioLDM 2 prompt guide
AudioLDM 2 is an open latent-diffusion text-to-audio model. Its sweet spot is realistic sound effects and environmental audio (rain, footsteps, machinery, ambience), and it can also generate short musical loops. Because it is a diffusion model, two settings matter as much as the words: the guidance scale (how strictly it follows your prompt) and the number of inference steps (how many denoising passes it spends refining the output). Getting those right is half the battle.
How it works
You write a short, concrete description and AudioLDM 2 denoises a latent representation into audio over a number of steps. Guidance scale controls the trade-off between prompt fidelity and natural variety: low values produce realistic but looser results, high values lock onto the prompt at the risk of sounding stiff. Steps control refinement — more steps polish detail but with diminishing returns. The guide below pairs your prompt with sensible starting values for both.
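As a rough illustration of how the two settings travel with a prompt, here is a minimal sketch. It assumes the Hugging Face `diffusers` library and its `AudioLDM2Pipeline` with the `cvssp/audioldm2` checkpoint, neither of which this guide prescribes. `RenderSettings`, `DRAFT`, `FINAL`, and `generate` are hypothetical names made up for the example, and the numbers are simply the starting values suggested in the tips in this guide.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RenderSettings:
    """Hypothetical container for the two diffusion settings."""
    guidance_scale: float      # prompt fidelity versus natural variety
    num_inference_steps: int   # number of denoising steps


# Starting values taken from this guide's tips; adjust them per prompt.
DRAFT = RenderSettings(guidance_scale=3.5, num_inference_steps=50)
FINAL = RenderSettings(guidance_scale=3.5, num_inference_steps=200)


def generate(prompt: str, settings: RenderSettings, seconds: float = 5.0):
    """Render one clip and return its waveform as a NumPy array.

    Assumption: `torch` and `diffusers` are installed and the
    `cvssp/audioldm2` checkpoint can be downloaded.
    """
    import torch
    from diffusers import AudioLDM2Pipeline

    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    result = pipe(prompt, audio_length_in_s=seconds, **asdict(settings))
    return result.audios[0]
```

A typical loop would call `generate("A single bell ringing in a large hall", DRAFT)` while you iterate on wording, then switch to `FINAL` once the prompt is settled. Keeping the settings in one small object makes it easy to change guidance and steps together without touching the prompt.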
Tips for clean AudioLDM 2 output
- Keep prompts short and literal. “A single bell ringing in a large hall” outperforms a paragraph of mood description.
- Name one acoustic detail. The environment (“in a forest”, “in a small room”) or a defining texture anchors the realism.
- Start guidance around 3 to 4. Drop toward 2 for natural variety, raise toward 5+ only when the model ignores your prompt.
- Use moderate steps. Aim for around 100 to 200 for final renders, and 25 to 50 for fast drafts while you iterate on wording.
- One sound per prompt. Diffusion models blur stacked, unrelated sounds — generate layers separately and mix them.
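The last tip, generating layers separately and mixing them, can be sketched in a few lines. This is only an illustration under simple assumptions: each layer is a mono clip at the same sample rate, stored as a list of floats in the range -1.0 to 1.0. `mix_layers` is a hypothetical helper, not part of AudioLDM 2 or any library.

```python
def mix_layers(layers, gains=None):
    """Sum mono layers sample by sample and return the mixed clip.

    Shorter layers are treated as silent after they end. If the summed
    signal exceeds full scale (1.0), the whole mix is scaled down so its
    peak is exactly 1.0, which avoids clipping.
    """
    if not layers:
        return []
    gains = gains or [1.0] * len(layers)
    if len(gains) != len(layers):
        raise ValueError("need exactly one gain per layer")

    length = max(len(layer) for layer in layers)
    mix = [0.0] * length
    for layer, gain in zip(layers, gains):
        for i, sample in enumerate(layer):
            mix[i] += gain * sample

    peak = max((abs(s) for s in mix), default=0.0)
    if peak > 1.0:
        mix = [s / peak for s in mix]
    return mix
```

For example, you might render "rain on a tin roof" and "distant thunder" as two clips, then call `mix_layers([rain, thunder], gains=[1.0, 0.6])` to keep the thunder in the background. In practice a NumPy array or an audio editor does the same job faster; the point is that the layering happens after generation, not inside the prompt.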