What is SSML?

Speech Synthesis Markup Language (SSML) is an XML-based standard that lets you control how a TTS engine speaks text. With it you can insert pauses, emphasize words, change speaking rate, pitch, and volume, and spell out abbreviations.

Do AWS Polly, Google, and Azure all support the same SSML?

They share a common core (speak, break, emphasis, prosody, say-as) but each adds engine-specific tags. This builder outputs the common subset that works across all three. Always check provider docs before using vendor-only tags.

How do I add a pause?

Use a break tag with a time value, such as <break time="500ms"/> for half a second or <break time="1s"/> for one second. The time attribute accepts milliseconds (500ms) or seconds (1s). Alternatively, use the strength attribute with a keyword such as weak, medium, or strong.
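A minimal sketch of how a script might choose between the two attributes. The helper name `break_tag` is hypothetical and not part of any provider SDK; the strength keywords are the ones defined in the W3C SSML specification, and individual engines may accept only some of them.

```python
import re

# Durations look like "500ms", "1s", or "1.5s".
_TIME_RE = re.compile(r"^\d+(\.\d+)?(ms|s)$")
# Strength keywords from the W3C SSML specification.
_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}


def break_tag(value: str) -> str:
    """Return a <break/> tag for a duration or a strength keyword (hypothetical helper)."""
    if _TIME_RE.match(value):
        return f'<break time="{value}"/>'
    if value in _STRENGTHS:
        return f'<break strength="{value}"/>'
    raise ValueError(f"not a duration or strength keyword: {value!r}")


print(break_tag("500ms"))   # <break time="500ms"/>
print(break_tag("medium"))  # <break strength="medium"/>
```

Rejecting anything else up front avoids sending a malformed attribute to the engine, which providers typically answer with an error rather than a pause.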

Why isn't my SSML being read as markup?

Most engines require you to flag the input as SSML rather than plain text — for Polly set TextType to ssml, for Azure use the SSML request body, for Google set input.ssml. If you pass SSML as plain text the tags are read aloud literally.

Can I nest prosody and emphasis tags?

Yes. You can wrap an emphasized word inside a prosody block to combine stress with a slower rate or higher pitch. Keep nesting shallow and always close tags in the reverse order you opened them.
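One way to guarantee that nested tags close in the right order is to build the markup as a tree instead of concatenating strings. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `nested_ssml` and the sample sentence are illustrative only.

```python
import xml.etree.ElementTree as ET


def nested_ssml() -> str:
    """Build <speak><prosody><emphasis> as a tree so closing order is handled for us."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow"})
    prosody.text = "Please read this "
    emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
    emphasis.text = "carefully"
    emphasis.tail = "."  # text that follows </emphasis> inside <prosody>
    return ET.tostring(speak, encoding="unicode")


print(nested_ssml())
# <speak><prosody rate="slow">Please read this <emphasis level="strong">carefully</emphasis>.</prosody></speak>
```

Because the serializer writes the closing tags, an unbalanced document cannot be produced by accident.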

TTS SSML Markup Builder

AI voice generators sound robotic when you feed them raw text. SSML (Speech Synthesis Markup Language) is the XML dialect that lets you control how the voice speaks — where it pauses, which words it stresses, and how fast, high, or loud it sounds. This builder turns your plain text into valid, copy-ready SSML for AWS Polly, Google Cloud TTS, and Azure Speech.

How it works

SSML wraps your text in a root <speak> element. Inside it you add tags:

<speak>
  <prosody rate="slow" pitch="+10%" volume="loud">
    Welcome to <emphasis level="strong">our service</emphasis>.
    <break time="500ms"/> Let's begin.
  </prosody>
</speak>

<break time="500ms"/> inserts a pause. Give time a value in ms or s, or use the strength attribute with a keyword (weak, medium, strong) instead.
<emphasis level="strong">word</emphasis> stresses a word.
<prosody rate pitch volume> controls speed, pitch, and loudness for the enclosed text.

The builder applies global prosody to your whole text, then lets you drop in break and emphasis tags so the output is always well-formed and balanced.
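The flow described above can be sketched in a few lines of Python. This is not the builder's actual code: the `build_ssml` function and its `(kind, value)` segment format are assumptions made for illustration. The one detail worth copying is that plain text is XML-escaped before it is wrapped, so characters like `&` and `<` cannot break the markup.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(segments, rate="medium", pitch="medium", volume="medium"):
    """Wrap segments in <speak><prosody>. Segment format is hypothetical:
    ("text", str), ("emphasis", str), or ("break", duration)."""
    parts = []
    for kind, value in segments:
        if kind == "text":
            parts.append(escape(value))  # escapes &, <, >
        elif kind == "emphasis":
            parts.append(f'<emphasis level="strong">{escape(value)}</emphasis>')
        elif kind == "break":
            parts.append(f"<break time={quoteattr(value)}/>")
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    attrs = f"rate={quoteattr(rate)} pitch={quoteattr(pitch)} volume={quoteattr(volume)}"
    body = "".join(parts)
    return f"<speak><prosody {attrs}>{body}</prosody></speak>"


segments = [
    ("text", "Welcome to "),
    ("emphasis", "our service"),
    ("text", "."),
    ("break", "500ms"),
    ("text", " Let's begin."),
]
print(build_ssml(segments, rate="slow", pitch="+10%", volume="loud"))
```

Run on the segments shown, this reproduces the sample document from this section on a single line.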

SSML tag reference

| Tag | Attributes | Effect |
| --- | --- | --- |
| `<break>` | `time="500ms"` or `strength="medium"` | Inserts a pause |
| `<emphasis>` | `level="strong\|moderate\|reduced"` | Stresses a word or phrase |
| `<prosody>` | `rate`, `pitch`, `volume` | Changes speaking speed, pitch, loudness |
| `<say-as>` | `interpret-as="cardinal\|date\|spell-out"` | Controls how a value is read aloud |
| `<sub>` | `alias="alternative text"` | Substitutes spoken text for displayed text |
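The last two rows of the reference have no example elsewhere on this page, so here is a small sketch of how they might be generated. The helper names `say_as` and `sub` are hypothetical, and the set of supported `interpret-as` values varies by provider.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(value: str, interpret_as: str) -> str:
    """Wrap a value in <say-as> (hypothetical helper)."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(value)}</say-as>"


def sub(text: str, alias: str) -> str:
    """Wrap displayed text in <sub> so the alias is spoken instead (hypothetical helper)."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


print(say_as("SSML", "spell-out"))
# <say-as interpret-as="spell-out">SSML</say-as>
print(sub("W3C", "World Wide Web Consortium"))
# <sub alias="World Wide Web Consortium">W3C</sub>
```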

Worked example — product announcement

Plain text that sounds flat:

New features are available. Check the update. You must restart to apply changes.

With SSML, the same content sounds deliberate and clear:

<speak>
  <prosody rate="medium">
    <emphasis level="strong">New features</emphasis> are now available.
    <break time="400ms"/>
    Check the update.
    <break time="600ms"/>
    <prosody rate="slow">You must restart to apply changes.</prosody>
  </prosody>
</speak>

The pause after each sentence lets the listener absorb each point. The slower rate on the final sentence signals it is the most important instruction.

Provider setup — how to pass SSML

Each major provider requires you to signal that the input is SSML, not plain text:

AWS Polly — set TextType: "ssml" in the SynthesizeSpeech API call
Google Cloud TTS — use input.ssml instead of input.text in the request body
Azure Speech — send the SSML directly as the request body to the /cognitiveservices/v1 endpoint

If you omit these flags the engine reads the XML tags aloud as literal text, which is the most common SSML mistake.
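The three switches can be summarized as request payloads. The sketch below only builds dictionaries and makes no network calls; the function names, the `Joanna` voice, the `en-US` language code, and the MP3 output settings are example choices, and authentication is omitted. For Azure, the SSML content type header is what marks the body as markup. Azure's documentation also describes additional attributes on `<speak>` and a `<voice>` element, so check it before sending builder output unchanged.

```python
def polly_params(ssml: str, voice_id: str = "Joanna") -> dict:
    """Keyword arguments for boto3's polly.synthesize_speech(**params)."""
    return {
        "Text": ssml,
        "TextType": "ssml",  # without this, Polly treats the input as plain text
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
    }


def google_body(ssml: str, language_code: str = "en-US") -> dict:
    """JSON body for the Cloud Text-to-Speech text:synthesize REST method."""
    return {
        "input": {"ssml": ssml},  # input.ssml instead of input.text
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def azure_request(ssml: str, output_format: str = "audio-16khz-128kbitrate-mono-mp3") -> dict:
    """Headers and body for a POST to the /cognitiveservices/v1 endpoint (auth omitted)."""
    return {
        "headers": {
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": output_format,
        },
        "body": ssml.encode("utf-8"),
    }


ssml = "<speak>Hello<break time=\"500ms\"/> world.</speak>"
print(polly_params(ssml)["TextType"])              # ssml
print(list(google_body(ssml)["input"]))            # ['ssml']
print(azure_request(ssml)["headers"]["Content-Type"])  # application/ssml+xml
```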

Tips for natural-sounding speech

Pause after clauses, not every word. A 300–500ms break after a comma and a 600–800ms break at a full stop sound natural; longer or more frequent pauses sound halting.
Use emphasis sparingly. Stressing one keyword per sentence lands; stressing three flattens the effect.
Subtle prosody wins. A rate of slow or fast, or a pitch shift of ±10%, is enough; larger shifts sound cartoonish.
Test on multiple voices. The same SSML produces different results across different voice models; always preview with the voice you will ship.
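The pause guideline above can be automated as a first pass. This is a rough sketch, not a feature of the builder: the function name `add_pauses` and the default durations (400 ms after a comma, 700 ms after a sentence) are assumptions picked from within the suggested ranges. The simple regular expressions will also fire on abbreviations such as "e.g. ", so review the result by ear.

```python
import re
from xml.sax.saxutils import escape


def add_pauses(text: str, comma_ms: int = 400, stop_ms: int = 700) -> str:
    """Insert <break> tags after commas and sentence ends (hypothetical helper)."""
    out = escape(text)  # escape &, <, > before adding markup
    # Sentence-ending punctuation followed by whitespace gets the longer pause.
    out = re.sub(r"([.!?])\s+", rf'\1<break time="{stop_ms}ms"/> ', out)
    # Commas followed by whitespace get the shorter pause.
    out = re.sub(r",\s+", f',<break time="{comma_ms}ms"/> ', out)
    return f"<speak>{out}</speak>"


print(add_pauses("New features are available, so check the update. Restart to apply changes."))
# <speak>New features are available,<break time="400ms"/> so check the update.<break time="700ms"/> Restart to apply changes.</speak>
```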