What is context refreshing?

Long-lived AI assistants cannot keep an entire history in the context window forever. Periodically they summarize the conversation and re-inject that compact summary so the assistant stays coherent. Each cycle costs output tokens to generate the summary and input tokens to re-inject it on subsequent turns.

Why does re-injection cost input tokens every turn?

Once a summary is re-injected, it sits in the prompt for every following request until the next refresh. So the re-injected context is billed as input tokens on each turn, not just once. This calculator models the per-refresh injection cost as a planning baseline.
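To make the difference concrete, the following minimal Python sketch compares the calculator's per-refresh baseline with the input cost when the summary is billed on every turn until the next refresh. The function names and the `turns_between_refreshes` parameter are illustrative assumptions, not part of the calculator.

```python
def injection_cost_baseline(inject_tokens: int, input_price_per_m: float) -> float:
    """Injection cost counted once per refresh (the calculator's planning baseline)."""
    return inject_tokens * input_price_per_m / 1_000_000


def injection_cost_per_turn_billing(
    inject_tokens: int, input_price_per_m: float, turns_between_refreshes: int
) -> float:
    """Injection cost when the summary is billed as input on every turn.

    turns_between_refreshes is a hypothetical parameter: the number of
    requests that carry the summary before the next refresh replaces it.
    """
    return turns_between_refreshes * injection_cost_baseline(
        inject_tokens, input_price_per_m
    )


if __name__ == "__main__":
    # Assumed inputs: a 2,000-token summary at $1 per 1M input tokens.
    baseline = injection_cost_baseline(2_000, 1.0)
    billed = injection_cost_per_turn_billing(2_000, 1.0, turns_between_refreshes=10)
    print(f"baseline per refresh: ${baseline:.4f}")
    print(f"billed over 10 turns: ${billed:.4f}")
```

If a session averages many turns between refreshes, the baseline understates real input spend by roughly that factor, so treat it as a floor.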

How do I reduce context refresh cost?

Refresh less often, summarize more aggressively to shrink the re-injected size, use prompt caching so the stable summary is billed at a discount, and tier sessions so only active ones refresh.

Does prompt caching change the math?

Yes. If your provider supports prompt caching, the re-injected summary can be billed at a fraction of normal input cost on cache hits. Model the cached rate as a lower input price to see the savings.
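One way to derive that lower input price is to blend the full and cached rates by how often the summary is served from cache. The sketch below is a rough approximation under stated assumptions: the discount and hit rate are placeholder values, the parameter names are hypothetical, and it ignores any extra charge for writing to the cache, which some providers apply.

```python
def effective_input_price(
    input_price_per_m: float,
    cached_fraction_of_price: float,
    cache_hit_rate: float,
) -> float:
    """Blend full and cached input prices into one price per 1M tokens.

    cached_fraction_of_price: cached tokens cost this fraction of the normal
        input price (0.1 would mean a 90% discount). Assumed, check your provider.
    cache_hit_rate: share of requests where the summary is read from cache.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if cached_fraction_of_price < 0.0:
        raise ValueError("cached_fraction_of_price must not be negative")
    cached_price = input_price_per_m * cached_fraction_of_price
    return cache_hit_rate * cached_price + (1.0 - cache_hit_rate) * input_price_per_m


if __name__ == "__main__":
    # Assumed: $1 per 1M input tokens, cached reads at 10% of that, 80% hit rate.
    price = effective_input_price(1.0, 0.1, 0.8)
    print(f"effective input price: ${price:.2f} per 1M tokens")
```

Entering the blended figure as the input price in the calculator then shows the saving directly.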

What is a realistic refresh frequency?

It depends on usage. A daily personal assistant might refresh once per day; a busy support bot might refresh every few hours of active conversation. Match the frequency to how fast each session accumulates new context worth summarizing.

What is the Context Refreshing Cost Calculator?

A free calculator for persistent AI assistants that periodically refresh context (daily summarization plus re-injection). It estimates the ongoing token and dollar cost of maintaining session coherence across many active sessions. It runs in your browser on Gera Tools, with nothing uploaded.

Context Refreshing Cost Calculator

Name: Context Refreshing Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/


Persistent AI assistants do not get memory for free. To stay coherent across days or weeks they periodically summarize and re-inject context — and that maintenance has an ongoing token cost that scales with your number of active sessions. This calculator makes that recurring spend visible, broken down by per-refresh, per-session, and whole-fleet costs so you can see where the budget goes and where to optimise.

How it works

Each refresh cycle has two cost components:

Generation cost — output tokens spent producing the compressed summary of recent conversation.
Injection cost — input tokens spent re-inserting that summary into the prompt for the next call (and every subsequent call until the next refresh).

The tool computes:

cost_per_refresh    = (gen_tokens × output_price + inject_tokens × input_price) / 1,000,000
refreshes_per_month = refreshes_per_day × 30
monthly_fleet_cost  = cost_per_refresh × refreshes_per_month × active_sessions

The result is broken into per-refresh, per-session-per-month, and whole-fleet totals.
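The formulas above can be sketched in Python as follows. This is an illustration of the stated arithmetic, not the tool's actual source; the `RefreshCosts` type and the function name are assumptions, and prices are taken to be dollars per 1M tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefreshCosts:
    per_refresh: float        # dollars for one refresh cycle
    per_session_month: float  # dollars per session per 30-day month
    fleet_month: float        # dollars per month across all active sessions


def refresh_costs(
    gen_tokens: int,
    inject_tokens: int,
    input_price: float,
    output_price: float,
    refreshes_per_day: float,
    active_sessions: int,
) -> RefreshCosts:
    """Apply the three formulas; prices are dollars per 1M tokens."""
    cost_per_refresh = (
        gen_tokens * output_price + inject_tokens * input_price
    ) / 1_000_000
    refreshes_per_month = refreshes_per_day * 30
    per_session_month = cost_per_refresh * refreshes_per_month
    return RefreshCosts(
        per_refresh=cost_per_refresh,
        per_session_month=per_session_month,
        fleet_month=per_session_month * active_sessions,
    )


if __name__ == "__main__":
    # Assumed inputs: one refresh a day, 500 generated and 1,000 re-injected
    # tokens, $2 input and $10 output per 1M tokens, 100 active sessions.
    costs = refresh_costs(500, 1_000, 2.0, 10.0, 1, 100)
    print(costs)
```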

Worked example

Consider a support assistant that refreshes every 6 hours of active use (about 4 refreshes/day), re-injects 2,000 tokens and generates 600 tokens per refresh, and pays $1 per 1M input tokens and $5 per 1M output tokens, across 5,000 active sessions:

Cost per refresh: (600 × $5 + 2,000 × $1) / 1,000,000 = $0.005
Per session/month: $0.005 × 120 refreshes (4/day × 30 days) = $0.60
Fleet/month: $0.60 × 5,000 = $3,000

Context maintenance alone is a $36,000/year line item at modest scale — well worth modelling before deployment.
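The worked example can be checked with a few lines of Python. The sketch below uses only the numbers stated above and also splits each refresh into its generation and injection parts; the variable names are illustrative.

```python
# Inputs from the worked example (prices in dollars per 1M tokens).
GEN_TOKENS = 600
INJECT_TOKENS = 2_000
INPUT_PRICE = 1.0
OUTPUT_PRICE = 5.0
REFRESHES_PER_DAY = 4
ACTIVE_SESSIONS = 5_000

# Split one refresh into its two components.
generation_cost = GEN_TOKENS * OUTPUT_PRICE / 1_000_000
injection_cost = INJECT_TOKENS * INPUT_PRICE / 1_000_000
cost_per_refresh = generation_cost + injection_cost

# Scale up to a 30-day month, the whole fleet, and a 12-month year.
per_session_month = cost_per_refresh * REFRESHES_PER_DAY * 30
fleet_month = per_session_month * ACTIVE_SESSIONS
fleet_year = fleet_month * 12
injection_share = injection_cost / cost_per_refresh

print(f"generation per refresh: ${generation_cost:.3f}")
print(f"injection per refresh:  ${injection_cost:.3f}")
print(f"cost per refresh:       ${cost_per_refresh:.3f}")
print(f"per session/month:      ${per_session_month:.2f}")
print(f"fleet/month:            ${fleet_month:,.0f}")
print(f"fleet/year:             ${fleet_year:,.0f}")
print(f"injection share:        {injection_share:.0%}")
```

With these inputs, generation is $0.003 and injection is $0.002 of each $0.005 refresh, so injection accounts for 40% of the modelled cost.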

Where to find the savings

Prompt caching is the single largest lever for re-injection cost. If your provider supports prompt caching (where a stable, unchanged prefix is billed at a fraction of normal input cost), the re-injected summary is exactly the right target: it sits at the start of the prompt and is repeated on every call. Modelling the cached rate instead of the full input price shows the saving. In this calculator's model the saving is capped by injection's share of the total, which is 40% of each refresh in the worked example above ($0.002 of $0.005).

Compress more aggressively. Halving the re-injected token size halves the injection cost, and because the summary is billed again on every call until the next refresh, that saving repeats each turn. A tighter summary format (bullet points rather than narrative prose, omitting resolved items) often loses very little semantic value.

Tier by session activity. Only sessions that received a new message since the last refresh actually need another refresh. Dormant sessions can be skipped entirely, which dramatically reduces effective refresh frequency for large fleets with mixed activity levels.
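The activity check described above can be sketched as a simple filter. The `Session` record and its field names are hypothetical; a real system would read the two timestamps from wherever it stores session state.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Session:
    session_id: str
    last_message_at: datetime  # time of the most recent message
    last_refresh_at: datetime  # time the summary was last regenerated


def sessions_needing_refresh(sessions: list[Session]) -> list[Session]:
    """Keep only sessions that received a message after their last refresh."""
    return [s for s in sessions if s.last_message_at > s.last_refresh_at]


if __name__ == "__main__":
    fleet = [
        # Active: a message arrived after the last refresh.
        Session("active", datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 6, 0)),
        # Dormant: nothing new since the last refresh, so it is skipped.
        Session("dormant", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 2, 6, 0)),
    ]
    due = sessions_needing_refresh(fleet)
    print([s.session_id for s in due])
```

Counting only the sessions this filter returns gives a more realistic figure for the active-sessions input than the total number of sessions ever created.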

Refresh less often for low-velocity sessions. A daily personal assistant that exchanges 5 messages a day does not need a 6-hourly refresh. Match the frequency to how fast the session accumulates new context that is worth summarising.
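One possible way to match frequency to velocity is to refresh only when the unsummarized context reaches a token threshold. The sketch below assumes that policy; the threshold, the tokens-per-message figure, and the function name are illustrative values, not recommendations from the calculator.

```python
def refreshes_per_day(new_tokens_per_day: float, refresh_threshold_tokens: float) -> float:
    """Average refreshes per day if a refresh fires each time the
    unsummarized context reaches refresh_threshold_tokens."""
    if refresh_threshold_tokens <= 0:
        raise ValueError("refresh_threshold_tokens must be positive")
    if new_tokens_per_day < 0:
        raise ValueError("new_tokens_per_day must not be negative")
    return new_tokens_per_day / refresh_threshold_tokens


if __name__ == "__main__":
    # Assumed: 5 messages a day at about 200 tokens each, refresh at 4,000 tokens.
    rate = refreshes_per_day(5 * 200, 4_000)
    print(f"{rate:.2f} refreshes/day, about one every {1 / rate:.0f} days")
```

The resulting rate can be entered as the refreshes-per-day input, which for a low-velocity session like this one falls well below a 6-hourly schedule.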