How does prompt caching save money?

Providers cache the leading, unchanging portion of a prompt and bill cached input tokens at a steep discount on subsequent requests. Anthropic charges roughly 10% of the input rate for cache reads; OpenAI applies about a 50% discount on cached input.

Why must static context come first?

Caches match on an exact prefix. Any change near the start of the prompt invalidates everything after it, so putting the unchanging system and reference context first, before dynamic content, keeps the cacheable prefix as large as possible.

What is the minimum cacheable size?

Yes, there is a floor. Anthropic requires roughly 1,024 tokens in the cached prefix, and OpenAI caches automatically above about 1,024 tokens. Below that, caching does not apply, and the tool warns you that the prefix is too short.

Does caching help one-off requests?

No. Caching pays off only when the same prefix is reused before it expires (a few minutes by default). High-volume, repetitive workloads with a shared system prompt benefit most; sporadic unique prompts see little gain.

What hurts my hit rate?

Putting a timestamp, user ID, or any per-request value near the top of the prompt breaks the prefix and zeroes out cache hits. Keep volatile data after the static block and you preserve the cache across requests.

What is the Context Caching Strategy Planner?

It models how prompt structure affects cache hits under Anthropic and OpenAI caching rules, then shows the monthly savings from front-loading static context so you can design the cheapest reliable prompt layout. It runs free in your browser on Gera Tools, with nothing uploaded.

Name: Context Caching Strategy Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Context caching strategy planner

Prompt caching can cut the cost of cached input tokens by 50–90% on repetitive workloads, but only if your prompt is structured so the cache can match a large, unchanging prefix. This planner models your prompt as static context, dynamic context, and a per-request user message, then compares three costs: no caching, caching just the static block, and a fully front-loaded layout. The gap between them shows exactly how much prompt order is worth.

How it works

You enter token counts for the static portion (system prompt, reference docs), the dynamic portion (retrieved chunks that change per request), and the user message, plus daily volume and your provider. The tool applies the provider’s cache-read discount — roughly a 90% reduction for Anthropic cache reads, about 50% for OpenAI cached input — to whatever sits in the cacheable prefix. It checks the minimum cacheable size, computes cost under each layout, and reports the monthly savings and recommended structure.
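The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the planner's actual code: the names PromptShape, steady_state_cost, and monthly_savings are hypothetical, the multipliers are the approximate figures quoted in this section, and costs are expressed in "full-price token equivalents" so that no dollar prices need to be assumed. It also assumes the cache is already warm, so cache writes and expiry are ignored.

```python
from dataclasses import dataclass

# Cache-read cost as a fraction of the normal input rate (approximate figures from the text).
CACHE_READ_MULTIPLIER = {"anthropic": 0.10, "openai": 0.50}
MIN_CACHEABLE_TOKENS = 1024  # approximate floor quoted for both providers


@dataclass
class PromptShape:
    static_tokens: int   # system prompt, reference docs
    dynamic_tokens: int  # retrieved chunks that change per request
    user_tokens: int     # the per-request user message


def steady_state_cost(shape: PromptShape, provider: str, static_first: bool) -> float:
    """Input cost of one request, in full-price token equivalents.

    Assumes a warm cache: cache writes and expiry are not modeled here.
    """
    total = shape.static_tokens + shape.dynamic_tokens + shape.user_tokens
    # Only a static block at the very start of the prompt can form the cached prefix.
    cacheable = shape.static_tokens if static_first else 0
    if cacheable < MIN_CACHEABLE_TOKENS:
        return float(total)  # below the floor, nothing is discounted
    discount = 1.0 - CACHE_READ_MULTIPLIER[provider]
    return total - cacheable * discount


def monthly_savings(shape: PromptShape, provider: str,
                    requests_per_day: int, days: int = 30) -> float:
    """Token equivalents saved per month by front-loading the static block."""
    baseline = steady_state_cost(shape, provider, static_first=False)
    cached = steady_state_cost(shape, provider, static_first=True)
    return (baseline - cached) * requests_per_day * days
```

For a prompt with 3,000 static, 2,000 dynamic, and 50 user tokens, steady_state_cost returns about 2,350 token equivalents per request under the Anthropic multiplier and about 3,550 under the OpenAI one, against 5,050 uncached. Multiply by your provider's input price per token to turn these into currency.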

What changes between providers

Anthropic: Cache writes are billed at a premium (roughly 25% above the normal input rate) on the first call that creates the cache entry, but subsequent reads within the cache window cost only about 10% of the normal input rate, a 90% reduction. The minimum cacheable prefix is approximately 1,024 tokens, and the cache window is about five minutes by default. This makes caching extremely valuable for large, frequently reused system prompts.

OpenAI: Caching is automatic on eligible models above approximately 1,024 tokens — you do not need to mark anything explicitly. Cache reads are billed at roughly 50% of the normal input price. There is no write premium, but the savings per read are also smaller than Anthropic’s.

The planner models both so you can compare them on your specific numbers.
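To see how the write premium and the read discount trade off, here is a small sketch that prices one prefix reused several times within a single cache window. The function names and the RATES table are hypothetical; the multipliers are the approximate figures from this section and are relative to each provider's own normal input rate, so the results compare discount structures, not absolute bills.

```python
# Multipliers relative to each provider's normal input rate (approximate, from the text).
RATES = {
    "anthropic": {"write": 1.25, "read": 0.10},
    "openai": {"write": 1.00, "read": 0.50},  # no write premium
}


def prefix_cost(provider: str, prefix_tokens: int, uses: int) -> float:
    """Cost of sending the same prefix `uses` times within one cache window,
    in full-price token equivalents: one write, then (uses - 1) reads."""
    if uses <= 0:
        return 0.0
    rate = RATES[provider]
    return prefix_tokens * (rate["write"] + (uses - 1) * rate["read"])


def uncached_cost(prefix_tokens: int, uses: int) -> float:
    """Cost of the same traffic with no caching at all."""
    return float(prefix_tokens * uses)
```

With these figures, a 3,000-token prefix used once costs more under the Anthropic-style rates than it would uncached (about 3,750 versus 3,000 token equivalents), but from the second use onward caching is already cheaper (about 4,050 versus 6,000). The write premium is recovered by a single cache read.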

The prefix matching rule in plain terms

Both providers cache by prefix: the cached portion must exactly match the leading segment of the new prompt, token for token. A change to any token inside the static block invalidates the cache from that point onward for that request. This one rule explains almost all cache optimization advice:

Put the longest stable text first (system prompt, reference docs, tool definitions).
Put anything that varies per request last (the user message, retrieved chunks specific to this query, timestamps, session IDs).
If you must include per-request metadata, put it at the very end, after everything else.
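The ordering rules above can be illustrated by measuring how many leading tokens two consecutive requests share. This is a toy sketch: whitespace splitting stands in for a real tokenizer, and STATIC, static_first, and timestamp_first are hypothetical names invented for the example.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokenize(text: str) -> list[str]:
    # Whitespace split as a stand-in for a real tokenizer (an assumption).
    return text.split()


# A 500-"token" static block: 10 words repeated 50 times.
STATIC = "You are a support assistant. Follow the policy manual below. " * 50


def static_first(timestamp: str, question: str) -> list[str]:
    """Recommended layout: stable text first, volatile values last."""
    return tokenize(f"{STATIC} Time: {timestamp} Question: {question}")


def timestamp_first(timestamp: str, question: str) -> list[str]:
    """Problem layout: a per-request value sits ahead of the static block."""
    return tokenize(f"Time: {timestamp} {STATIC} Question: {question}")


good = shared_prefix_len(static_first("10:00", "Where is my order?"),
                         static_first("10:01", "How do I get a refund?"))
bad = shared_prefix_len(timestamp_first("10:00", "Where is my order?"),
                        timestamp_first("10:01", "How do I get a refund?"))
print(good, bad)
```

In the static-first layout the two requests share the whole static block before they diverge at the timestamp. In the timestamp-first layout they diverge at the second token, so none of the static block can be served from cache.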

A quick sizing example

Suppose your system prompt is 3,000 tokens, you retrieve 2,000 tokens of context per request (different chunks each time), and the user message is 50 tokens. Daily volume is 10,000 requests.

Without caching: you pay for all 5,050 tokens of input, 10,000 times per day.

With front-loaded static context: the 3,000-token system prompt is cached. The first request in each cache window pays the write rate for those 3,000 tokens; every later request pays only the 10% read rate for them, plus full price for the remaining 2,050 tokens. At 10,000 requests per day under Anthropic's rates, nearly every request is a cache read, so the static block costs about 90% less than it would uncached, which cuts total input spend by a little more than half.

If you mistakenly put the dynamic chunks before the system prompt: the prefix changes every request, cache hit rate is essentially zero, and you pay full price for everything. The planner shows this exact cost gap.
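The arithmetic behind this example can be checked with a short script. It is a sketch under stated assumptions: the multipliers are the approximate Anthropic-style figures quoted in this article, costs are in full-price token equivalents, and the cache_writes parameter (a hypothetical name) is how many requests per day have to recreate the cache entry. The default of 1 assumes traffic steady enough that the entry never expires.

```python
STATIC, DYNAMIC, USER = 3_000, 2_000, 50
REQUESTS_PER_DAY = 10_000
WRITE, READ = 1.25, 0.10  # approximate multipliers from the text


def daily_cost(static_first: bool, cache_writes: int = 1) -> float:
    """Daily input cost in full-price token equivalents."""
    uncached_per_request = DYNAMIC + USER
    if not static_first:
        # The prefix changes on every request, so everything is billed at full price.
        return float((STATIC + uncached_per_request) * REQUESTS_PER_DAY)
    reads = REQUESTS_PER_DAY - cache_writes
    static_cost = STATIC * (WRITE * cache_writes + READ * reads)
    return static_cost + uncached_per_request * REQUESTS_PER_DAY


no_cache = daily_cost(static_first=False)
front_loaded = daily_cost(static_first=True)
print(f"no caching or wrong order: {no_cache:,.0f}")
print(f"static block first:        {front_loaded:,.0f}")
print(f"saved: {1 - front_loaded / no_cache:.1%} of daily input cost")
```

Raising cache_writes shows how much an expiring cache costs: even a few hundred rewrites per day barely move the total, because each rewrite adds only the difference between the write and read rates on 3,000 tokens.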

Tips and notes

Front-load everything stable. System prompt, tools, and reference material go first; volatile data goes last.
Never put a timestamp up top. A single changing token near the start invalidates the whole prefix.
Mind the minimum. Below ~1,024 prefix tokens, caching does not engage — the tool warns you.
Caching expires fast. It only helps when the same prefix recurs within the cache window, so it favors steady high-volume traffic.
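The effect of expiry on hit rate can be estimated from request timestamps. The sketch below is a simplification with hypothetical names: it assumes every request shares the same prefix, that each request restarts the expiry timer, and that the window is 300 seconds, which is only an illustrative value. Check your provider's current rules before relying on it.

```python
def count_cache_hits(request_times: list[float], ttl_seconds: float = 300.0) -> int:
    """Count requests that arrive while the cached prefix is still alive.

    Assumes one shared prefix and that every request (hit or miss)
    restarts the expiry timer.
    """
    hits = 0
    expires_at = None
    for t in sorted(request_times):
        if expires_at is not None and t < expires_at:
            hits += 1
        # A miss writes a fresh entry and a hit refreshes it; both reset the timer.
        expires_at = t + ttl_seconds
    return hits


steady = [i * 60.0 for i in range(10)]     # one request per minute
sporadic = [i * 600.0 for i in range(10)]  # one request every ten minutes
print(count_cache_hits(steady), count_cache_hits(sporadic))
```

Ten requests a minute apart produce nine hits after the first write, while ten requests spaced ten minutes apart never hit at all, which is why steady traffic benefits and sporadic traffic does not.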