How are tokens estimated?

The tool uses the common English heuristic of roughly four characters per token. It is an estimate, not an exact tokenizer count, so leave a little headroom below your model's real limit.
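
As a rough illustration of the heuristic, the sketch below estimates tokens from character count and checks the result against a limit with a safety margin. The function names, the round-up behaviour, and the 10% default margin are assumptions made for this example, not settings taken from the tool.

```python
import math

CHARS_PER_TOKEN = 4  # the rough English heuristic described above


def estimate_tokens(text: str) -> int:
    """Estimate tokens as characters divided by four, rounded up (assumed rounding)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_with_headroom(text: str, model_limit: int, headroom: float = 0.10) -> bool:
    """Return True if the estimate stays under the limit minus a safety margin.

    The 10% default margin is an illustrative assumption.
    """
    return estimate_tokens(text) <= model_limit * (1 - headroom)
```

For a tight budget, a real tokenizer for the target model would replace `estimate_tokens`.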

Why use overlap between chunks?

Overlap repeats the tail of one chunk at the start of the next so a sentence or idea that straddles a boundary still appears whole in at least one chunk. This improves retrieval recall in RAG pipelines.

Which boundary strategy should I pick?

Sentence-aware is the safest default for prose. Paragraph-aware keeps logical sections together for structured docs. Hard character splitting is only for data with no natural boundaries.

Does chunking happen in my browser?

Yes. All splitting runs locally in JavaScript and nothing is uploaded. Your document never leaves the page.

What chunk size should I use?

For RAG, 200 to 500 tokens with 10 to 20 percent overlap is a common sweet spot. Smaller chunks improve precision; larger chunks preserve more context per retrieval. Tune against your own eval set.

What is the Long Context Chunker?

The Long Context Chunker takes any long text and splits it using sentence-boundary-aware chunking with a configurable chunk size and overlap. It outputs a numbered chunk list ready to embed in prompts or load into a vector store. It runs free in your browser on Gera Tools, with nothing uploaded.

Long Context Chunker

Name: Long Context Chunker
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

When a document is larger than a model’s context window, or larger than a sensible retrieval unit, it has to be split. Cutting blindly at a character count slices sentences in half and degrades both embeddings and prompt quality. This chunker splits on sentence or paragraph boundaries, packs units up to your target chunk size, and carries an overlap between adjacent chunks so context that straddles a boundary survives. Everything runs in your browser.

How it works

You set a target chunk size in tokens and an overlap in tokens. The tool splits the text into units (sentences or paragraphs), then greedily fills each chunk until adding the next unit would exceed the target. When a chunk closes, the last few units are repeated at the start of the next chunk to provide overlap. Token counts are estimated at roughly four characters per token, so they are approximate — keep some headroom below your model’s hard limit. A hard character-split mode is available for data with no natural boundaries.
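
The greedy fill and overlap carry-over described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the names `pack_units` and `trailing_overlap`, joining units with a single space, and trimming carried units when they would push a chunk past the target are all assumptions.

```python
import math


def estimate_tokens(text: str) -> int:
    # Rough heuristic from the text: about four characters per token.
    return math.ceil(len(text) / 4)


def trailing_overlap(units: list[str], overlap_tokens: int) -> list[str]:
    """Return the longest run of trailing units that fits the overlap budget."""
    carried: list[str] = []
    for prev in reversed(units):
        if estimate_tokens(" ".join([prev] + carried)) > overlap_tokens:
            break
        carried.insert(0, prev)
    return carried


def pack_units(units: list[str], target_tokens: int, overlap_tokens: int) -> list[str]:
    """Greedily pack pre-split units into chunks, repeating trailing units as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # units in `current` that are new, not carried-over overlap

    for unit in units:
        if fresh and estimate_tokens(" ".join(current + [unit])) > target_tokens:
            chunks.append(" ".join(current))
            current = trailing_overlap(current, overlap_tokens)
            fresh = 0
            # Assumed policy: drop carried units that would overflow the new chunk.
            while current and estimate_tokens(" ".join(current + [unit])) > target_tokens:
                current.pop(0)
        current.append(unit)
        fresh += 1

    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

A single unit that is longer than the target still becomes its own oversized chunk in this sketch; a fuller version might fall back to character splitting for that case.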

Choosing chunk size and overlap

The right chunk size depends on what you are doing with the chunks:

Use case	Typical chunk size	Overlap
RAG retrieval (precise Q&A)	200–400 tokens	40–80 tokens (20%)
RAG retrieval (longer context)	400–800 tokens	80–160 tokens (20%)
Summarization pass	1,500–2,000 tokens	100–200 tokens
Full-document embedding	512 tokens	50–100 tokens

Smaller chunks improve retrieval precision because each chunk covers a narrower topic, making it less likely to rank for irrelevant queries. Larger chunks preserve more surrounding context per retrieval hit, which helps when answers require multi-sentence explanations. For most RAG applications, start at 300–500 tokens with 15–20% overlap and tune from there.

Why overlap matters

Without overlap, a sentence split across a chunk boundary appears incomplete in both chunks. A question about that sentence may find neither chunk useful. Overlap repeats the tail of one chunk at the head of the next, ensuring that boundary-straddling content appears whole in at least one chunk. The trade-off is a small increase in total stored tokens (typically 15–25% more), which is worthwhile for the retrieval quality gain.

For example, a 400-token chunk with an 80-token overlap means the first 80 tokens of each chunk are a copy of the last 80 tokens of the previous chunk, so each chunk after the first contributes only 320 new tokens. A 10,000-token document therefore splits into roughly 31 such chunks (1 + 9,600 / 320), compared with 25 without overlap.
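
The chunk count for a given size and overlap can be checked with a short calculation. The sketch below assumes idealized fixed-size chunks that advance by the chunk size minus the overlap; real sentence-aware chunks vary in length, so the result is only an approximation. The function name `count_chunks` is hypothetical.

```python
import math


def count_chunks(total_tokens: int, chunk_tokens: int, overlap_tokens: int) -> int:
    """Number of fixed-size chunks needed to cover a document with a sliding window."""
    if not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    if total_tokens <= chunk_tokens:
        return 1
    stride = chunk_tokens - overlap_tokens  # new tokens added by each later chunk
    return 1 + math.ceil((total_tokens - chunk_tokens) / stride)


without_overlap = count_chunks(10_000, 400, 0)  # 25
with_overlap = count_chunks(10_000, 400, 80)
```

Multiplying the chunk count by the chunk size gives a rough figure for the total tokens stored, which is where the extra cost of overlap shows up.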

Boundary strategy guide

Sentence-aware splitting keeps complete sentences together. It is the safest default for prose, articles, and documentation where meaning lives at the sentence level. The chunker finds sentence boundaries using punctuation and whitespace heuristics — it handles most English prose reliably but may misidentify abbreviations (e.g., “Dr. Smith”) as sentence ends.
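
A punctuation-and-whitespace heuristic of this kind might look like the sketch below. The regular expression is an assumption chosen for illustration, not the tool's actual rule, and it reproduces the abbreviation problem mentioned above.

```python
import re

# Naive rule: a sentence ends at ., ! or ? when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using the naive end-of-sentence rule."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


print(split_sentences("It works. Does it? Yes!"))
# ['It works.', 'Does it?', 'Yes!']

print(split_sentences("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']  <- "Dr." is wrongly treated as a sentence
```

A list of known abbreviations is one common way to reduce such false splits, at the cost of more rules to maintain.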

Paragraph-aware splitting keeps entire paragraphs together. This is better for structured documents (reports, legal text, academic papers) where a paragraph is the atomic unit of reasoning. Paragraphs vary in length, so chunks may be less uniform.
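
As a minimal sketch, paragraph units can be produced by splitting on blank lines. This assumes paragraphs are separated by at least one empty line, which may not hold for every input; the function name is hypothetical.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on one or more blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]
```

Single line breaks inside a paragraph are preserved, so hard-wrapped text stays together as one unit.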

Hard character splitting divides text purely by character count with no regard for meaning. Use it only for data with no natural prose boundaries, such as CSV, code, or JSON Lines. It will cut identifiers and values mid-string, and on prose it cuts sentences mid-word, which degrades embedding quality.
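
A fixed-width split with optional overlap can be sketched as below. The function name and parameters are hypothetical; to work in tokens, one could multiply the token target by four under the estimate used elsewhere on this page.

```python
def split_by_chars(text: str, chunk_chars: int, overlap_chars: int = 0) -> list[str]:
    """Cut text into fixed-width windows, ignoring word and sentence boundaries."""
    if chunk_chars <= 0 or not 0 <= overlap_chars < chunk_chars:
        raise ValueError("need chunk_chars > 0 and 0 <= overlap_chars < chunk_chars")
    stride = chunk_chars - overlap_chars
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the window already reached the end of the text
        start += stride
    return chunks


print(split_by_chars("abcdefghij", 4, 1))
# ['abcd', 'defg', 'ghij']
```

The output shows why this mode suits only boundary-free data: the cut points fall wherever the count lands.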

Tips and notes

Default to sentence-aware. It rarely breaks meaning and works for most prose and documentation.
Use 10 to 20 percent overlap. Enough to bridge boundaries without ballooning your token spend on duplicated text.
Estimates are approximate. Four-chars-per-token is a heuristic; verify against your model’s real tokenizer for tight budgets.
Smaller chunks, sharper retrieval. If RAG answers feel vague, shrink the chunk size before reaching for a bigger model.
Label chunks with metadata. Before sending chunks to a vector store, prepend each with the document title and chunk index. Retrieval results are much easier to trace back to source when context is embedded in the chunk text itself.
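
The last tip can be sketched as a small helper that prefixes each chunk with its source and position. The bracketed label format and the name `label_chunks` are assumptions for illustration; any consistent format that survives embedding would serve.

```python
def label_chunks(title: str, chunks: list[str]) -> list[str]:
    """Prefix each chunk with the document title and a 1-based chunk index."""
    total = len(chunks)
    return [
        f"[{title} | chunk {i} of {total}]\n{chunk}"
        for i, chunk in enumerate(chunks, start=1)
    ]


print(label_chunks("Handbook", ["alpha", "beta"])[0])
# [Handbook | chunk 1 of 2]
# alpha
```

The label adds a few tokens to every chunk, so it is worth counting it toward the chunk size budget.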