Context window packing optimizer
When you run an LLM over thousands of short documents (tagging support tickets, classifying reviews, extracting fields from records), sending one API call per document is wasteful. Each call repeats your system prompt, pays network overhead, and consumes rate limit. Packing several documents into a single call fixes that. This tool estimates how many short documents fit in one call and how many calls you save.
How it works
The model’s context window has a fixed size, but you cannot use all of it for documents. You must reserve tokens for your system prompt and for the model’s output. The usable window is:
usable = context_window − reserved_tokens
Each document occupies its own tokens plus a small separator or instruction overhead (a delimiter like --- or a per-item instruction). So the number of documents that fit per call is:
docs_per_call = floor( usable ÷ (doc_tokens + separator_tokens) )
Dividing your total document count by docs_per_call and rounding up gives the packed call count, since a partly filled final batch still needs its own call. The gap between that and one call per document is your saving.
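The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name packing_plan, the returned keys, and the example numbers are all assumptions chosen for the demonstration.

```python
import math


def packing_plan(total_docs, context_window, reserved_tokens,
                 doc_tokens, separator_tokens):
    """Estimate docs per call and calls saved (hypothetical helper)."""
    # Tokens left for documents after the system prompt and output reserve.
    usable = context_window - reserved_tokens
    # Each document costs its own tokens plus the separator overhead.
    per_doc = doc_tokens + separator_tokens
    if usable <= 0 or per_doc <= 0:
        raise ValueError("usable window and per-document cost must be positive")

    docs_per_call = usable // per_doc  # floor division
    if docs_per_call == 0:
        raise ValueError("a single document does not fit in the usable window")

    # Round up: a partly filled final batch still needs its own call.
    packed_calls = math.ceil(total_docs / docs_per_call)
    return {
        "usable": usable,
        "docs_per_call": docs_per_call,
        "packed_calls": packed_calls,
        "calls_saved": total_docs - packed_calls,
    }


# Example numbers are illustrative assumptions, not measurements.
plan = packing_plan(
    total_docs=10_000,
    context_window=8_000,
    reserved_tokens=2_000,
    doc_tokens=150,
    separator_tokens=10,
)
print(plan)
# {'usable': 6000, 'docs_per_call': 37, 'packed_calls': 271, 'calls_saved': 9729}
```

With these inputs, 6,000 usable tokens divided by 160 tokens per document gives 37 documents per call, so 10,000 documents need 271 calls instead of 10,000.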
Tips and notes
Packing reduces calls, not raw content tokens: you still pay for every document’s tokens. The real wins are fewer repeated system prompts, less network round-trip latency, and far less rate-limit pressure. Use clear, consistent separators so the model never confuses one document with the next, and ask for a structured output (one result per document, in order) so you can split the response reliably. Don’t over-pack: extremely long prompts can dilute the model’s attention and make individual results less accurate. Start with a modest batch size, verify that accuracy holds, then scale up.
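One possible way to apply the separator and structured-output advice is sketched below. Everything in it is an assumption for illustration: the helper names build_prompt and split_response, the instruction wording, the "id" and "label" keys, and the choice of a JSON array as the output format. No real API is called; the model reply is a hand-written stand-in.

```python
import json

SEPARATOR = "\n---\n"  # assumed delimiter; any consistent marker works


def build_prompt(docs):
    """Join documents with a fixed separator and number them in order."""
    numbered = [f"[{i}]\n{doc}" for i, doc in enumerate(docs, start=1)]
    instruction = (
        "Classify each document below. Reply with a JSON array containing "
        'exactly one object per document, in order, with keys "id" and "label".'
    )
    return instruction + SEPARATOR + SEPARATOR.join(numbered)


def split_response(raw, expected):
    """Parse the reply and check it has one in-order result per document."""
    results = json.loads(raw)
    if not isinstance(results, list) or len(results) != expected:
        raise ValueError("expected exactly one result per document")
    for position, item in enumerate(results, start=1):
        if item.get("id") != position:
            raise ValueError(f"result {position} is missing or out of order")
    return [item["label"] for item in results]


docs = ["Refund not received", "Love the new update", "App crashes on login"]
prompt = build_prompt(docs)

# Hand-written stand-in for a model reply, used only to exercise the parser.
fake_reply = (
    '[{"id": 1, "label": "billing"},'
    ' {"id": 2, "label": "praise"},'
    ' {"id": 3, "label": "bug"}]'
)
labels = split_response(fake_reply, expected=len(docs))
print(labels)  # ['billing', 'praise', 'bug']
```

Checking the count and the order of the returned ids gives an early signal when a batch is too large: if the model starts dropping or merging items, the parser fails loudly instead of silently misassigning results.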