What does the prosody tag control?

The SSML <prosody> tag controls three aspects of synthesized speech: pitch (how high or low), rate (how fast), and volume (how loud). Wrapping text in a <prosody> tag with these attributes lets you shape delivery without changing the words.

Should I use presets or relative values?

Presets (x-low to x-high, x-slow to x-fast) are simple and portable. Relative values — semitones for pitch, percent for rate, dB for volume — give finer control and are well supported across AWS Polly and Azure Speech. Use relative values when a preset is too coarse.

Is the output compatible with both AWS Polly and Azure?

Yes. The builder emits standard SSML using widely supported prosody attributes (relative semitone pitch, percent rate, relative dB volume, and named presets), which both AWS Polly and Azure Speech accept. Some engines support extra attributes not generated here.

Why is my text being escaped?

SSML is XML, so characters like &, <, and > must be escaped or they break the markup. The builder escapes your text automatically, so the output is always valid XML you can paste directly.
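
As a rough illustration of that escaping step, the sketch below uses Python's standard-library xml.sax.saxutils.escape. The function name escape_for_ssml is hypothetical, and the builder's own implementation may differ.

```python
from xml.sax.saxutils import escape


def escape_for_ssml(text: str) -> str:
    """Escape the characters that would otherwise be read as XML markup."""
    # escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
    return escape(text)


print(escape_for_ssml("Q&A: is 3 < 5?"))
# Q&amp;A: is 3 &lt; 5?
```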

What is the SSML Prosody Builder?

A visual builder for SSML <prosody> tags. Set pitch (presets or relative semitones), rate (presets or percent), and volume (presets or relative dB) and get valid SSML you can paste into AWS Polly or Azure Speech for fine-grained TTS control. It runs free in your browser on Gera Tools, with nothing uploaded.

SSML Prosody Builder

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Building SSML prosody tags

Text-to-speech engines read plain text in a flat, neutral voice by default. SSML prosody tags let you shape the delivery — raising pitch for a question, slowing the rate for emphasis, or lowering volume for an aside — without changing a single word. This builder assembles a valid <prosody> tag from your inputs and escapes the text so the markup never breaks.

When to use prosody tags

Plain text is fine for simple voice readout. Prosody becomes essential when the content demands expressive delivery:

IVR phone systems: a hold message delivered at a slightly slower rate and lower pitch sounds warmer and more professional than a flat, robotic read.
Audiobooks and long-form narration: varying rate between action scenes and reflective passages keeps listeners engaged.
Navigation and alerts: an urgent alert benefits from slightly higher volume and a faster rate; a calm instruction benefits from a slower rate and lower pitch.
Accessibility players: users who prefer slightly faster delivery can be served with a rate="110%" wrapper around body text.
Character voices in games: different NPCs can have distinct pitch offsets without recording multiple voice actors.

How it works

You enter a text segment and set three controls: pitch, rate, and volume. Each offers a preset mode (named values like high or x-slow) and a relative mode — semitones for pitch (+2st), percent for rate (120%), and dB for volume (+6dB). The tool wraps your escaped text in a <speak><prosody> block with only the attributes you set, producing clean SSML compatible with AWS Polly and Azure Speech.
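
The two input modes can be pictured as a small validation step. The sketch below is one possible check, not the builder's actual code: the preset names follow this page and the SSML specification, while the regular expressions and the function name is_valid_value are assumptions made for illustration.

```python
import re

# Named presets for each control.
PRESETS = {
    "pitch": {"x-low", "low", "medium", "high", "x-high"},
    "rate": {"x-slow", "slow", "medium", "fast", "x-fast"},
    "volume": {"silent", "x-soft", "soft", "medium", "loud", "x-loud"},
}

# Assumed patterns for relative mode: signed semitones, unsigned percent, signed dB.
RELATIVE = {
    "pitch": re.compile(r"[+-]\d+(\.\d+)?st"),
    "rate": re.compile(r"\d+(\.\d+)?%"),
    "volume": re.compile(r"[+-]\d+(\.\d+)?dB"),
}


def is_valid_value(attribute: str, value: str) -> bool:
    """Return True if value is a known preset or matches the relative format."""
    if value in PRESETS[attribute]:
        return True
    return RELATIVE[attribute].fullmatch(value) is not None


print(is_valid_value("pitch", "+2st"))   # True
print(is_valid_value("rate", "x-slow"))  # True
print(is_valid_value("volume", "6"))     # False
```

An engine may accept more forms than these patterns allow (or fewer), so a check like this is only a first filter before previewing in the target service.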

What each attribute does

Pitch

Pitch controls how high or low the voice sounds. The named presets (x-low, low, medium, high, x-high) correspond to broad steps. Relative semitone values give precise control: +2st raises pitch by two semitones, which is audible but subtle; +6st is a noticeable rise that works for an excited exclamation. Negative semitones lower the pitch.
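
To put numbers on those steps: in equal temperament, one semitone multiplies frequency by the twelfth root of two, so an offset of n semitones scales pitch by 2^(n/12). The helper below sketches that arithmetic only; semitone_ratio is a hypothetical name, and how closely a given voice follows the ratio depends on the engine.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency multiplier for a pitch offset given in semitones."""
    return 2 ** (semitones / 12)


for offset in (2, 6, -2, 12):
    print(f"{offset:+d}st -> x{semitone_ratio(offset):.3f}")
# +2st -> x1.122
# +6st -> x1.414
# -2st -> x0.891
# +12st -> x2.000
```

So +2st is roughly a 12% rise in frequency, while +6st is about 41%, which matches the difference between a subtle lift and a clearly noticeable one.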

Rate

Rate controls speaking speed, with 100% as the voice’s default pace. As a rough guide, the slow preset is about 80% and fast is about 120%, though the exact speed behind each preset is set by the engine. Percent values let you fine-tune: 90% gives a slightly deliberate delivery without sounding sluggish, while 115% is brisk but still clear. For dense technical content, 85-90% gives listeners time to process. For a time-sensitive system message, 115% works well.
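
One way to reason about a rate value is its effect on duration. If the engine applies the percentage as a straight speed multiplier, which is an assumption here, a passage that takes t seconds at 100% takes t / (rate / 100) seconds. The helper estimated_duration is hypothetical, and real engines also handle pauses in their own way, so treat the result as an estimate.

```python
def estimated_duration(base_seconds: float, rate_percent: float) -> float:
    """Estimate playback time, assuming rate acts as a plain speed multiplier."""
    if rate_percent <= 0:
        raise ValueError("rate_percent must be positive")
    return base_seconds / (rate_percent / 100)


# A passage that runs 60 seconds at the default pace.
for rate in (85, 90, 100, 115):
    print(f"{rate}% -> {estimated_duration(60, rate):.1f}s")
# 85% -> 70.6s
# 90% -> 66.7s
# 100% -> 60.0s
# 115% -> 52.2s
```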

Volume

Volume controls loudness. Named presets range from silent through x-loud. Relative dB values adjust from the voice’s default: +3dB is a modest boost, -6dB is noticeably quieter. Use volume sparingly — the listener’s own device controls overall level, so dB offsets just shift the relative loudness within a piece.
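
Decibels are logarithmic, so it can help to convert an offset into a linear amplitude ratio using 10^(dB/20). The sketch below shows only that conversion; db_to_amplitude_ratio is a hypothetical name, and perceived loudness will not track the ratio exactly.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Linear amplitude multiplier for a relative volume change in dB."""
    return 10 ** (db / 20)


for offset in (3, 6, -6):
    print(f"{offset:+d}dB -> x{db_to_amplitude_ratio(offset):.3f}")
# +3dB -> x1.413
# +6dB -> x1.995
# -6dB -> x0.501
```

In other words, +6dB roughly doubles the signal amplitude and -6dB roughly halves it.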

Output format

The builder generates output like:

<speak>
  <prosody pitch="+2st" rate="90%" volume="+3dB">
    Welcome back. Here is your summary.
  </prosody>
</speak>

Only the attributes you actually set are included — unset attributes are omitted so the engine uses its defaults for those dimensions.
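
The omit-if-unset behavior can be sketched in a few lines of Python. This is an illustrative approximation rather than the builder's source: build_prosody and its parameters are hypothetical names, and the indentation simply mirrors the sample output above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_prosody(text, pitch=None, rate=None, volume=None):
    """Wrap escaped text in <speak><prosody>, emitting only attributes that are set."""
    settings = {"pitch": pitch, "rate": rate, "volume": volume}
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in settings.items()
        if value is not None  # unset controls are left to the engine's defaults
    )
    return (
        "<speak>\n"
        f"  <prosody{attrs}>\n"
        f"    {escape(text)}\n"
        "  </prosody>\n"
        "</speak>"
    )


print(build_prosody("Tom & Jerry are back.", rate="90%"))
# <speak>
#   <prosody rate="90%">
#     Tom &amp; Jerry are back.
#   </prosody>
# </speak>
```

Calling build_prosody("Welcome back. Here is your summary.", pitch="+2st", rate="90%", volume="+3dB") reproduces the three-attribute sample shown earlier.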

Tips and notes

Relative semitones are the most musical pitch control. +2st shifts pitch predictably; the named presets are coarser steps.
Percent rate beats presets for fine pacing. 90% is a subtle slowdown that slow would overshoot.
Keep segments short. Apply prosody to the specific phrase that needs it rather than a whole paragraph, so the rest reads naturally.
Nest with care. SSML allows nested prosody tags, but some engines cap nesting depth. Test in your target engine if you nest.
Test in your engine. Most prosody attributes are portable across AWS Polly and Azure Speech, but always preview in the actual TTS voice — engines interpret extremes differently and some voices respond more dramatically than others to the same values.
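
The "keep segments short" tip can be sketched as wrapping just one phrase and leaving the rest of the sentence untouched. The helper below is a hypothetical illustration, not part of the builder: emphasize_phrase, its parameters, and the default rate of 90% are all assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def emphasize_phrase(sentence: str, phrase: str, rate: str = "90%") -> str:
    """Wrap only the first occurrence of phrase in a prosody tag."""
    before, found, after = sentence.partition(phrase)
    if not found:
        # Phrase not present: return the sentence with escaping only.
        return f"<speak>{escape(sentence)}</speak>"
    return (
        f"<speak>{escape(before)}"
        f"<prosody rate={quoteattr(rate)}>{escape(found)}</prosody>"
        f"{escape(after)}</speak>"
    )


print(emphasize_phrase("Your code is 4 7 2 9. Do not share it.", "4 7 2 9"))
# <speak>Your code is <prosody rate="90%">4 7 2 9</prosody>. Do not share it.</speak>
```

Only the digits are slowed here; the surrounding words keep the voice's default pacing.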