Long Context Chunker

Split long documents into context-window-safe chunks with overlap

When a document is larger than a model’s context window, or larger than a sensible retrieval unit, it has to be split. Cutting blindly at a character count slices sentences in half and degrades both embeddings and prompt quality. This chunker splits on sentence or paragraph boundaries, packs units up to your target chunk size, and carries an overlap between adjacent chunks so context that straddles a boundary survives. Everything runs in your browser.

How it works

You set a target chunk size in tokens and an overlap in tokens. The tool splits the text into units (sentences or paragraphs), then greedily fills each chunk until adding the next unit would exceed the target. When a chunk closes, the last few units are repeated at the start of the next chunk to provide overlap. Token counts are estimated at roughly four characters per token, so they are approximate; keep some headroom below your model’s hard limit. A hard character-split mode is available for data with no natural boundaries.
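The packing procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual source: the function names (estimate_tokens, split_units, chunk_text), the naive sentence regex, and the choice to count the joining space toward each unit's cost are all assumptions made for the example.

```python
import re

CHARS_PER_TOKEN = 4  # heuristic from the description; not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count at roughly four characters per token (rounded up)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def split_units(text: str, mode: str = "sentence") -> list[str]:
    """Split text into paragraphs or, naively, into sentences."""
    if mode == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:
        # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def chunk_text(text: str, target_tokens: int = 200, overlap_tokens: int = 30,
               mode: str = "sentence") -> list[str]:
    """Greedily pack units into chunks, repeating trailing units as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for unit in split_units(text, mode):
        cost = estimate_tokens(unit + " ")  # include the joining space
        if current and current_tokens + cost > target_tokens:
            # Close the chunk, then carry whole trailing units that fit the overlap.
            chunks.append(" ".join(current))
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_cost = estimate_tokens(prev + " ")
                if carried_tokens + prev_cost > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_cost
            current, current_tokens = carried, carried_tokens
            # Drop overlap from the front if it would push the new unit over target.
            while current and current_tokens + cost > target_tokens:
                current_tokens -= estimate_tokens(current.pop(0) + " ")
        # A single unit larger than the target still becomes its own oversized chunk.
        current.append(unit)
        current_tokens += cost

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Alpha one. Beta two. Gamma three. Delta four. Epsilon five."
for piece in chunk_text(sample, target_tokens=8, overlap_tokens=3):
    print(piece)
```

Because the sketch only carries whole units, the real overlap can be smaller than the requested value, and can be zero when the last unit of a chunk is larger than the overlap budget. A unit that exceeds the target on its own is emitted as an oversized chunk rather than cut, which is where a hard character split would take over.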

Tips and notes

  • Default to sentence-aware. It rarely breaks meaning and works for most prose and documentation.
  • Use 10 to 20 percent overlap. Enough to bridge boundaries without ballooning your token spend on duplicated text.
  • Estimates are approximate. Four-chars-per-token is a heuristic; verify against your model’s real tokenizer for tight budgets.
  • Smaller chunks, sharper retrieval. If RAG answers feel vague, shrink the chunk size before reaching for a bigger model.
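
The overlap and headroom tips come down to simple arithmetic, sketched here in Python. The helper names (suggest_overlap, fits_with_headroom) and the 10 percent default headroom are hypothetical choices for illustration; only the 10 to 20 percent overlap range and the four-characters-per-token heuristic come from the notes above.

```python
CHARS_PER_TOKEN = 4  # same heuristic as the tool; verify with a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count at roughly four characters per token (rounded up)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def suggest_overlap(chunk_tokens: int, overlap_percent: int = 15) -> int:
    """Overlap in tokens as a whole-number percentage of the chunk size."""
    return chunk_tokens * overlap_percent // 100


def fits_with_headroom(text: str, hard_limit: int, headroom_percent: int = 10) -> bool:
    """True if the estimated size stays below the limit minus a safety margin.

    The 10 percent default margin is an assumption, not a recommendation
    from the tool.
    """
    budget = hard_limit * (100 - headroom_percent) // 100
    return estimate_tokens(text) <= budget


chunk_size = 500
print(suggest_overlap(chunk_size, 10), suggest_overlap(chunk_size, 20))
print(fits_with_headroom("a" * 400, hard_limit=100))
```

For tight budgets, swapping estimate_tokens for a count from the model's real tokenizer is the safer choice, since the character heuristic can be off in either direction.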