What guidance scale should I use?

A guidance scale around 3 to 4 is a good default. Lower values (around 2) give more natural, varied output that may stray from the prompt; higher values (5 or more) follow the text strictly but can sound stiff or artifact-heavy. Tune per prompt.

How many inference steps do I need?

Around 100 to 200 steps is common for final-quality output, while 25 to 50 works for fast drafts. More steps refine detail with diminishing returns, so start moderate and raise the count only if the output sounds rough.

How should I word AudioLDM 2 prompts?

Keep prompts short and concrete. "Rain falling on leaves in a forest" or "a single bell ringing in a large hall" works better than long abstract descriptions. Name the sound, the environment, and one acoustic detail, and avoid stacking unrelated sounds in one prompt.

What is the AudioLDM 2 Prompt Guide?

It is a guide to AudioLDM 2 for text-to-audio generation. It covers prompt vocabulary for sound effects, environmental audio, and musical loops, plus recommended guidance scale and inference-step settings for clean output. It runs free in your browser on Gera Tools, with nothing uploaded.

AudioLDM 2 Prompt Guide

Name: AudioLDM 2 Prompt Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

AudioLDM 2 is an open latent-diffusion text-to-audio model. Its sweet spot is realistic sound effects and environmental audio — rain, footsteps, machinery, ambience — and it can also generate short musical loops. Because it is a diffusion model, two settings matter as much as the words: the guidance scale (how strictly it follows your prompt) and the inference steps (how much it refines the output). Getting those right is half the battle.

How it works

You write a short, concrete description and AudioLDM 2 denoises a latent representation into audio over a number of steps. Guidance scale controls the trade-off between prompt fidelity and natural variety: low values produce realistic but looser results, high values lock onto the prompt at the risk of sounding stiff. Steps control refinement — more steps polish detail but with diminishing returns. The guide below pairs your prompt with sensible starting values for both.
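As a rough illustration of where these two settings enter a generation call, here is a minimal sketch that assumes the Hugging Face diffusers library, its AudioLDM2Pipeline class, and the cvssp/audioldm2 checkpoint. That setup is an assumption for illustration only; the browser tool described here may work differently. GenerationSettings, to_pipeline_kwargs, and generate are hypothetical helper names, and the default values are simply picked from the ranges this guide recommends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for one AudioLDM 2 run."""
    prompt: str
    guidance_scale: float = 3.5      # middle of the guide's 3 to 4 default
    num_inference_steps: int = 100   # lower end of the guide's 100 to 200 range
    audio_length_in_s: float = 10.0  # upper end of the guide's 2.5 to 10 s range
    seed: int = 0


def to_pipeline_kwargs(settings: GenerationSettings) -> dict:
    """Map the settings onto keyword arguments for the pipeline call."""
    return {
        "prompt": settings.prompt,
        "guidance_scale": settings.guidance_scale,
        "num_inference_steps": settings.num_inference_steps,
        "audio_length_in_s": settings.audio_length_in_s,
    }


def generate(settings: GenerationSettings, model_id: str = "cvssp/audioldm2"):
    """Run one generation. Assumes torch and diffusers are installed."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    from diffusers import AudioLDM2Pipeline

    pipe = AudioLDM2Pipeline.from_pretrained(model_id)
    # A seeded generator makes the run repeatable for a given prompt and settings.
    generator = torch.Generator("cpu").manual_seed(settings.seed)
    result = pipe(generator=generator, **to_pipeline_kwargs(settings))
    return result.audios[0]  # waveform for the first (only) generated clip
```

Calling generate(GenerationSettings(prompt="Heavy rain falling on a corrugated iron roof")) would download the model on first use, so it is best tried on a machine with enough disk space and memory.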

Prompt structure that works

AudioLDM 2 responds best to prompts written as a literal sound description rather than a mood or creative brief. A useful template:

[The sound] + [in/on a specific environment] + [one acoustic detail]

For example:

“Heavy rain falling on a corrugated iron roof” — sound, environment, texture
“Footsteps on dry gravel, slow and deliberate” — sound, surface, pace
“A diesel engine idling in a large concrete car park” — sound, environment, room character
“Birdsong with a distant stream in a temperate forest” — layered natural sounds in one scene

Avoid abstract emotional cues like “ominous” or “peaceful” without anchoring them in physical detail. “An ominous sound” produces inconsistent results; “distant thunder rumbling with rain” produces a specific, repeatable output.
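The template and the advice about abstract cues can be sketched as a small helper. This is an illustrative sketch, not part of any AudioLDM 2 tooling: build_prompt, abstract_cues_in, and the ABSTRACT_CUES set are hypothetical names, and the set contains only the two mood words mentioned above.

```python
# Mood words the guide names as unreliable on their own; extend as needed.
ABSTRACT_CUES = {"ominous", "peaceful"}


def build_prompt(sound: str, environment: str, detail: str = "") -> str:
    """Assemble [the sound] + [environment] + [one acoustic detail]."""
    parts = [sound.strip(), environment.strip()]
    prompt = " ".join(p for p in parts if p)
    if detail.strip():
        prompt = f"{prompt}, {detail.strip()}"
    return prompt


def abstract_cues_in(prompt: str) -> list[str]:
    """Return any listed mood words found in the prompt, sorted."""
    words = {w.strip(".,;:!?\"'").lower() for w in prompt.split()}
    return sorted(words & ABSTRACT_CUES)


print(build_prompt("Footsteps", "on dry gravel", "slow and deliberate"))
print(abstract_cues_in("An ominous sound"))
```

A flagged word is only a hint to add physical detail; a prompt such as "distant thunder rumbling with rain" passes the check because it describes the sound itself.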

Guidance scale effects in practice

Scale	Effect	When to use
1.5–2.5	Loose interpretation, natural texture, may drift from prompt	Ambient beds, when naturalness matters more than exactness
3–4	Good prompt adherence with natural quality (recommended starting point)	Most sound-effect and Foley work
5–7	Strict prompt following, can sound slightly synthetic or stiff	When specific sound identity is critical
8+	Very literal, risk of artifacts	Rarely worth it; try rephrasing the prompt instead
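For scripted experiments, the table above can be encoded as a simple lookup. This is a minimal sketch with hypothetical names (GUIDANCE_BANDS, describe_guidance); the table leaves gaps between bands (for example 2.5 to 3), and the sketch returns None there rather than guessing.

```python
from typing import Optional

# (low, high, effect) rows copied from the guidance scale table.
GUIDANCE_BANDS = [
    (1.5, 2.5, "Loose interpretation, natural texture, may drift from prompt"),
    (3.0, 4.0, "Good prompt adherence with natural quality (recommended starting point)"),
    (5.0, 7.0, "Strict prompt following, can sound slightly synthetic or stiff"),
    (8.0, float("inf"), "Very literal, risk of artifacts"),
]


def describe_guidance(scale: float) -> Optional[str]:
    """Return the table's effect text for a scale, or None between bands."""
    for low, high, effect in GUIDANCE_BANDS:
        if low <= scale <= high:
            return effect
    return None


print(describe_guidance(3.5))
```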

Tips for clean AudioLDM 2 output

Keep prompts short and literal. “A single bell ringing in a large hall” outperforms a paragraph of mood description.
Name one acoustic detail. The environment (“in a forest”, “in a small room”) or a defining texture anchors the realism.
Start guidance around 3 to 4. Drop toward 2 for natural variety, raise toward 5+ only when the model ignores your prompt.
Use moderate steps. Around 100 to 200 for final renders; 25 to 50 for fast drafts while you iterate on wording.
One sound per prompt. Diffusion models blur stacked, unrelated sounds — generate layers separately and mix them.
Vary the seed. If a prompt produces good structure but the wrong texture, keep the prompt and run several seeds — results vary significantly at fixed settings.
AudioLDM 2 generates fixed-length clips. Plan your prompt around a defined output duration (typically 2.5 to 10 seconds) and loop or layer clips in your editor for longer beds.
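Two of the tips above lend themselves to a quick sketch: planning a seed sweep at draft or final step counts, and estimating how many clips a longer bed needs. The names below (seed_sweep, clips_needed, DRAFT_STEPS, FINAL_STEPS) are hypothetical, the step counts are picked from the ranges in this guide, and the 0.5 second crossfade is an assumed value for overlapping clips in an editor.

```python
import math

DRAFT_STEPS = 50    # upper end of the guide's 25 to 50 draft range
FINAL_STEPS = 200   # upper end of the guide's 100 to 200 final range


def seed_sweep(prompt, seeds, guidance_scale=3.5, final=False):
    """Build one settings dict per seed, keeping prompt and guidance fixed."""
    steps = FINAL_STEPS if final else DRAFT_STEPS
    return [
        {
            "prompt": prompt,
            "seed": seed,
            "guidance_scale": guidance_scale,
            "num_inference_steps": steps,
        }
        for seed in seeds
    ]


def clips_needed(target_s, clip_s=10.0, crossfade_s=0.5):
    """Clips required to cover target_s seconds when neighbours overlap."""
    if crossfade_s < 0 or clip_s <= crossfade_s:
        raise ValueError("clip_s must be longer than a non-negative crossfade_s")
    if target_s <= clip_s:
        return 1
    # Every clip after the first adds clip_s minus the overlapped crossfade.
    return 1 + math.ceil((target_s - clip_s) / (clip_s - crossfade_s))


runs = seed_sweep("A single bell ringing in a large hall", seeds=[0, 1, 2])
print(len(runs), runs[0]["num_inference_steps"])
print(clips_needed(60))
```

Each dict in runs describes one generation to try; once a seed gives the right texture, the same prompt and seed can be rerun with final=True for the polished render.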