Why do parenthetical cues like (laughing) work?

Many expressive TTS models are trained on audio whose transcripts mark non-speech sounds and stage directions. Including a parenthetical cue nudges the model toward that delivery, though support varies by engine.

Does every platform support the same cues?

No. ElevenLabs responds well to inline parenthetical and punctuation cues, OpenAI TTS takes a tone instruction in the request, and SSML engines need explicit tags such as <prosody> and <emphasis>. The guide formats its output for the platform you pick.

How does punctuation change delivery?

Ellipses add hesitation, em dashes create abrupt breaks, exclamation marks raise energy, and ALL CAPS can increase emphasis. Combined with emotion cues they shape pacing without any special markup.

Will cues always be read aloud?

No, but it does happen: some models speak the cue text instead of acting on it. Test short clips first, and if a cue is read literally, switch to punctuation and pacing techniques.

What is the TTS Emotion & Tone Prompt Guide?

A guide and generator for eliciting emotional delivery from AI text-to-speech. Pick an emotion and a platform to get ready-to-use direction techniques, formatted for your TTS engine: parenthetical cues like (laughing), interjections, and pacing markup. It runs free in your browser on Gera Tools, with nothing uploaded.

TTS Emotion & Tone Prompt Guide

Name: TTS Emotion & Tone Prompt Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

TTS emotion and tone prompt guide

Default AI voice output is competent but flat. The difference between a robotic read and a believable performance is direction — the same techniques a voice actor gets from a script. Expressive TTS models respond to emotional cues, interjections, and pacing markup, but each engine wants them in a different form. This guide turns the emotion you want into the cues your specific platform understands.

How it works

Pick a target emotion and your TTS platform. Each emotion maps to a set of reliable techniques: parenthetical direction such as (warmly) or (laughing), interjection words like “oh” and “hmm” that force a natural breath, and punctuation pacing (ellipses for hesitation, em dashes for abrupt stops, exclamation marks for energy). The tool then formats an example line for your engine: inline cues for ElevenLabs, a tone instruction for OpenAI TTS, or SSML <prosody> and <emphasis> tags for generic engines.
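The mapping described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the `TECHNIQUES` table, the `format_line` function, and the tone sentences are hypothetical names and values, while the cue and prosody values are copied from the reference table in this guide.

```python
from xml.sax.saxutils import escape

# Hypothetical technique map; cue and prosody values mirror this guide's table,
# the "tone" sentences are invented for illustration.
TECHNIQUES = {
    "cheerful": {
        "cue": "(cheerfully)",
        "tone": "Speak in a bright, cheerful tone.",
        "prosody": {"rate": "fast", "pitch": "+10%"},
    },
    "sad": {
        "cue": "(sadly)",
        "tone": "Speak in a quiet, sad tone.",
        "prosody": {"rate": "slow", "pitch": "-10%"},
    },
}


def format_line(emotion: str, platform: str, text: str):
    """Return the text formatted for one platform (sketch, not the real tool)."""
    technique = TECHNIQUES[emotion]
    if platform == "elevenlabs":
        # Inline stage direction placed ahead of the spoken text.
        return f"{technique['cue']} {text}"
    if platform == "openai":
        # Tone goes in a separate instruction, not in the spoken text.
        return {"instructions": technique["tone"], "input": text}
    if platform == "ssml":
        attrs = " ".join(f'{k}="{v}"' for k, v in technique["prosody"].items())
        return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"
    raise ValueError(f"unknown platform: {platform}")
```

Keeping the technique data separate from the per-platform formatting means a new emotion only needs a new table entry.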

Technique reference by platform

ElevenLabs

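ElevenLabs responds to inline parenthetical cues placed directly in the text, and the model interprets them as stage directions. Support varies by model, so confirm which cue form yours honors; Eleven v3, for instance, documents square-bracket audio tags such as [laughs].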

(laughing) I can't believe it worked! — triggers a laughing delivery
(whispering) Don't tell anyone. — produces a hushed, close-mic quality
(sighing) Well... here we go again. — delivers the sigh before the sentence

Punctuation matters too: ellipses slow the pace and imply hesitation; em dashes create an abrupt cut; a question mark at the end of a declarative raises pitch. Combining parentheticals with matching punctuation is more reliable than the cue alone.
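As a rough sketch of how a cued line might be sent, the helper below only assembles the request and makes no network call. The endpoint path, the `xi-api-key` header, and the `model_id` value reflect one reading of the public ElevenLabs REST API and should be checked against the current reference; `build_elevenlabs_request` is a hypothetical name.

```python
import json


def build_elevenlabs_request(text, cue, voice_id, api_key,
                             model_id="eleven_multilingual_v2"):
    """Assemble (but do not send) a text-to-speech request with an inline cue.

    Assumption: endpoint, header name, and model_id follow the public REST API
    as understood at the time of writing; verify before relying on them.
    """
    directed_text = f"({cue}) {text}"  # the cue travels inside the text itself
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "body": json.dumps({"text": directed_text, "model_id": model_id}),
    }
```

The returned dictionary can be handed to any HTTP client. Generating a short clip first makes it easy to hear whether the cue was acted on or spoken.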

OpenAI TTS

OpenAI TTS does not support inline cues in the same way. Instead, pass a tone description in the request's instructions field. That field is documented for gpt-4o-mini-tts; the older tts-1 and tts-1-hd models do not accept it. For example, the instruction "Speak in a warm, encouraging tone, as if coaching a student" shapes the delivery across the whole request.

Within a request, punctuation and sentence rhythm are the main levers — short, punchy sentences read as energetic; long, flowing ones read as calm or formal.
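A minimal sketch of such a request body follows. It assumes the speech endpoint (`POST https://api.openai.com/v1/audio/speech`) and its `model`, `voice`, `input`, and `instructions` fields as publicly documented, with `instructions` understood to apply to gpt-4o-mini-tts rather than tts-1 or tts-1-hd; the function name is hypothetical and nothing is sent over the network.

```python
def build_openai_speech_payload(text, tone,
                                voice="alloy", model="gpt-4o-mini-tts"):
    """Return the JSON body for a speech request with a tone instruction.

    Assumption: field names follow the public speech endpoint; the tone is a
    free-form sentence and applies to the whole request.
    """
    return {
        "model": model,
        "voice": voice,
        "input": text,          # the words to be spoken
        "instructions": tone,   # how to speak them; never read aloud
    }


payload = build_openai_speech_payload(
    "Nice work. Try the next one.",
    "Speak in a warm, encouraging tone, as if coaching a student.",
)
```

Because the tone lives outside `input`, there is no cue text that could leak into the audio; pacing still comes from the punctuation and sentence length of the input itself.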

Generic SSML

For Amazon Polly, Google Cloud TTS, and Azure Speech, use SSML tags. Supported tags and attribute values differ between engines, so check each engine's SSML reference:

<speak>
  <prosody rate="slow" pitch="-5%">I have some difficult news.</prosody>
  <break time="700ms"/>
  <prosody rate="fast" volume="loud">But everything is going to be fine!</prosody>
</speak>

SSML gives precise, reliable control but requires more markup overhead. It is the right choice for production audio where consistency matters.
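When SSML is generated from user text, escaping matters: a stray & or < makes the document invalid. The sketch below builds markup like the example above from a list of segments, using only the Python standard library. The segment format and the `build_ssml` name are assumptions made for this illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def build_ssml(segments):
    """Build a <speak> document from segment dicts (hypothetical format).

    A segment is either {"break": "700ms"} or
    {"text": "...", plus any prosody attributes such as rate, pitch, volume}.
    """
    parts = ["<speak>"]
    for seg in segments:
        if "break" in seg:
            parts.append(f"<break time={quoteattr(seg['break'])}/>")
            continue
        attrs = "".join(
            f" {name}={quoteattr(value)}"
            for name, value in seg.items() if name != "text"
        )
        # escape() protects &, <, and > in the spoken text.
        parts.append(f"<prosody{attrs}>{escape(seg['text'])}</prosody>")
    parts.append("</speak>")
    return "".join(parts)


ssml = build_ssml([
    {"text": "I have some difficult news.", "rate": "slow", "pitch": "-5%"},
    {"break": "700ms"},
    {"text": "But everything is going to be fine!", "rate": "fast", "volume": "loud"},
])
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
```

Parsing the result before sending it catches malformed markup locally, though it does not confirm that a given engine supports every attribute.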

Emotions and their cues

| Emotion | ElevenLabs cue | Punctuation technique | SSML approach |
| --- | --- | --- | --- |
| Cheerful | `(cheerfully)` | `!` frequently, short sentences | `rate="fast" pitch="+10%"` |
| Sad | `(sadly)` | Ellipses, long sentences | `rate="slow" pitch="-10%"` |
| Excited | `(excitedly)` | `!`, ALL CAPS on key words | `rate="fast" volume="loud"` |
| Whispering | `(whispering)` | No `!`, soft punctuation | `volume="soft" rate="slow"` |
| Laughing | `(laughing)` | "Ha — anyway…" | No direct SSML equivalent |
| Authoritative | `(firmly)` | Short sentences, full stops | `rate="medium" pitch="-5%"` |

Tips for expressive delivery

Layer cues with punctuation. A (somber) cue plus ellipses and shorter sentences reads as genuine sadness; the cue alone is often not enough.
Keep emotional spans short. Models hold an emotion better over a sentence or two than across a long paragraph — break up monologues.
Test before committing. Some engines speak the cue text aloud. Generate a few seconds first and fall back to pacing-only techniques if a cue leaks into the audio.
Match energy to content. Forcing high energy onto somber copy sounds uncanny — let the emotion follow the words, not fight them.
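Two of these tips lend themselves to small helpers: removing cues when an engine reads them aloud, and breaking a monologue into short spans. The sketch below is one possible approach using regular expressions; the function names are hypothetical, and the patterns are deliberately simple (any all-lowercase parenthetical is treated as a cue, and a sentence boundary is assumed only before a capital letter, quote, or opening parenthesis).

```python
import re

# Matches cues such as "(laughing)" or "(very softly)" plus trailing spaces.
CUE = re.compile(r"\(([a-z][a-z ]*)\)\s*")

# Sentence boundary: end punctuation, whitespace, then a capital, quote, or "(".
# An ellipsis followed by a lowercase word ("Well... here") is not a boundary.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")


def strip_cues(text: str) -> str:
    """Fallback for engines that speak cues: keep only the pacing punctuation."""
    return CUE.sub("", text).strip()


def split_spans(text: str, max_sentences: int = 2) -> list[str]:
    """Group sentences into short spans so each can carry its own cue."""
    sentences = BOUNDARY.split(text.strip())
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]
```

For example, `strip_cues("(sighing) Well... here we go again.")` leaves the ellipsis in place, so the hesitation survives even without the cue.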