Chunking Strategy Calculator

Find the optimal chunk size for any LLM and document type

Choosing how to split documents is one of the most consequential decisions in a RAG pipeline. Chunk too large and each retrieved chunk carries irrelevant filler; chunk too small and you lose context and pay for overlap across many more boundaries. This calculator compares the three most common strategies (fixed-size, sentence-boundary, and paragraph) on the same document so you can see the trade-offs in concrete numbers.

How it works

The calculator first converts your document’s word count to tokens using the estimate 1 word ≈ 1.3 tokens, a rough rule of thumb that varies by tokenizer and language. Each strategy assumes a typical chunk size: fixed-size chunks are the most uniform, sentence-boundary chunks are slightly smaller because they round to natural breaks, and paragraph chunks are larger and more variable.
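
The conversion step can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name and the per-strategy chunk sizes are hypothetical placeholders chosen only to reflect the ordering described above (sentence-boundary slightly smaller than fixed-size, paragraph larger).

```python
TOKENS_PER_WORD = 1.3  # rough estimate from the text; real tokenizers vary


def estimate_tokens(word_count: int) -> int:
    """Approximate a token count from a word count."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    return round(word_count * TOKENS_PER_WORD)


# Hypothetical typical chunk sizes in tokens. These are illustrative
# placeholders, not the values the calculator uses.
TYPICAL_CHUNK_TOKENS = {
    "fixed": 512,
    "sentence": 480,
    "paragraph": 700,
}
```

Under these assumptions, a 10,000-word document comes out at an estimated 13,000 tokens. For exact counts, run the tokenizer that belongs to your model instead of relying on the ratio.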

It then applies your overlap fraction. Overlap repeats tokens from the end of one chunk at the start of the next, which improves recall at boundaries but duplicates tokens. The result is three figures: the chunk count, the number of tokens duplicated by overlap, and how many chunks fit in your selected model’s context window in a single retrieval call.
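
The arithmetic behind those three outputs can be sketched as follows. This is one plausible reading of the description, not the calculator's own implementation: it assumes overlap is measured as a fraction of the chunk size, and the names `chunk_stats`, `ChunkStats`, and `reserved_tokens` (room set aside for the prompt and answer) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    chunk_count: int
    overlap_tokens_total: int  # tokens stored twice because of overlap
    chunks_per_call: int       # whole chunks that fit in the context budget


def chunk_stats(total_tokens: int, chunk_size: int, overlap_fraction: float,
                context_window: int, reserved_tokens: int = 0) -> ChunkStats:
    """Estimate chunk count, duplicated tokens, and chunks per retrieval call.

    Assumption: overlap_fraction is relative to chunk_size, so each new
    chunk advances by chunk_size minus the overlap (the stride).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")

    overlap = round(chunk_size * overlap_fraction)
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap leaves no room to advance")

    if total_tokens <= 0:
        count = 0
    elif total_tokens <= chunk_size:
        count = 1
    else:
        # First chunk covers chunk_size tokens; each later chunk adds `stride`.
        remaining = total_tokens - chunk_size
        count = 1 + -(-remaining // stride)  # integer ceiling division

    # Every chunk after the first repeats `overlap` tokens from its predecessor.
    duplicated = max(count - 1, 0) * overlap
    budget = max(context_window - reserved_tokens, 0)
    return ChunkStats(count, duplicated, budget // chunk_size)
```

For example, `chunk_stats(13000, 500, 0.2, 8000, reserved_tokens=1000)` yields 33 chunks, 3,200 duplicated tokens, and 14 chunks per call. Splitters that define overlap relative to the stride rather than the chunk size will give slightly different counts.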

Tips and notes

  • Match chunk size to your queries. Fact-lookup questions favour small, precise chunks; summarisation and reasoning favour larger chunks that keep context intact.
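  • Watch overlap waste at scale. With 20 percent overlap on a million-chunk corpus, you embed and store roughly 20 to 25 percent more tokens than the source text contains, depending on whether your splitter measures overlap against the stride or the full chunk. That is real money in embedding and storage cost.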
  • Sentence and paragraph chunking produce uneven sizes. The figures here are typical averages; in practice some chunks will be much larger than others, so cap maximum chunk size in your splitter.
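
As a rough sketch of the last tip, the function below re-splits any chunk that exceeds a maximum size. It is an illustration under stated assumptions, not a feature of any particular splitter library: the name `cap_chunk_size` is hypothetical, and whitespace-separated words stand in for tokens to keep the example dependency-free.

```python
def cap_chunk_size(chunks: list[str], max_words: int) -> list[str]:
    """Split any chunk longer than max_words into consecutive pieces.

    Assumption: size is measured in whitespace-separated words as a
    stand-in for tokens; swap in a real tokenizer for production use.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")

    capped: list[str] = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_words:
            capped.append(chunk)  # already within the cap; keep unchanged
            continue
        # Oversized chunk: cut it into pieces of at most max_words words.
        for start in range(0, len(words), max_words):
            capped.append(" ".join(words[start:start + max_words]))
    return capped
```

Calling `cap_chunk_size(["a b c d e", "f g"], 2)` returns `["a b", "c d", "e", "f g"]`. Hard cuts like this can break a sentence mid-way, so a real pipeline might prefer to back off to the nearest sentence boundary below the cap.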