Why not just split on every period?

A naive period split breaks on abbreviations like "Dr." or "Inc.", decimals like "3.14", and ellipses. This tool uses heuristics to keep those intact so each output is a real sentence.

What is sentence-level chunking in RAG?

Instead of fixed-size character windows, sentence-level chunking groups whole sentences so embeddings capture complete thoughts. It improves retrieval precision and reduces mid-sentence truncation.

Does it handle multiple languages?

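The heuristics are tuned for English and other Latin-script languages that end sentences with . ! or ? followed by whitespace and a capital letter. They will not reliably segment text that lacks those conventions, for example Chinese or Japanese, which use a different full stop (。) and no space between sentences.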

Is my text uploaded anywhere?

No. All processing happens locally in your browser with JavaScript. Nothing is sent to a server or stored.

What is the Sentence-Boundary Splitter?

The Sentence-Boundary Splitter is a sentence boundary detection tool that handles abbreviations (Dr., Inc.), decimal numbers, ellipses, and quotes. It splits paragraphs into clean sentence units for RAG sentence-level chunking, dataset prep, and NLP evaluation pipelines. It runs free in your browser on Gera Tools, with nothing uploaded.

Sentence-Boundary Splitter

Name: Sentence-Boundary Splitter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Split text into clean sentences

The Sentence-Boundary Splitter segments a block of text into individual sentences using boundary-detection heuristics that respect common edge cases. Unlike a naive split on the period character, it does not break on abbreviations (Dr., Inc., e.g.), decimal numbers (3.14), ellipses (...), or trailing quotes. The result is a clean, numbered list of sentences ready for indexing, labelling, or evaluation.

How it works

A sentence boundary is detected when a terminal punctuation mark (., !, or ?), optionally followed by a closing quote or bracket, is followed by whitespace and the start of a new sentence. Before splitting, the tool masks known abbreviations and decimal numbers so their periods are not treated as boundaries. After splitting, each fragment is trimmed and empty fragments are discarded. Everything runs in your browser, so even large documents stay private and are processed without an upload step.
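The mask, split, unmask sequence described above can be sketched in Python. This is a minimal illustration and not the tool's actual implementation: the abbreviation list, the placeholder character, and the function name split_sentences are all assumptions, and the sketch requires a capital letter at the start of the next sentence.

```python
import re

# Hypothetical abbreviation list; the tool's real list is not documented here.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "St.", "Inc.", "est.", "e.g.", "i.e.")

# Placeholder for protected periods (U+2024 ONE DOT LEADER).
# Assumption: this character does not occur in the input text.
MASK = "\u2024"

# Terminal punctuation, optional closing quote/bracket, whitespace,
# then an optional opening quote/bracket and a capital letter.
BOUNDARY = re.compile(r"""([.!?]+["')\]]*)\s+(?=["'(\[]?[A-Z])""")


def split_sentences(text: str) -> list[str]:
    masked = text
    # 1. Mask the periods inside known abbreviations.
    for abbr in ABBREVIATIONS:
        masked = re.sub(r"(?<!\w)" + re.escape(abbr), abbr.replace(".", MASK), masked)
    # 2. Mask decimal points (a period between two digits).
    masked = re.sub(r"(?<=\d)\.(?=\d)", MASK, masked)

    # 3. Cut the masked text at every detected boundary.
    pieces = []
    start = 0
    for match in BOUNDARY.finditer(masked):
        pieces.append(masked[start:match.end(1)])
        start = match.end()
    pieces.append(masked[start:])

    # 4. Restore the periods, trim, and discard empty fragments.
    restored = (piece.replace(MASK, ".").strip() for piece in pieces)
    return [piece for piece in restored if piece]


if __name__ == "__main__":
    sample = "Dr. Smith saw 3.14 million patients at St. John's. He was tired! Wait... really? Yes."
    for number, sentence in enumerate(split_sentences(sample), start=1):
        print(number, sentence)
```

The sketch makes no attempt at quote-aware handling: a boundary inside a quotation is split like any other, so it covers only the abbreviation and decimal masking steps.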

Why sentence-level chunking matters for RAG

When building a retrieval-augmented generation system, the chunk unit determines what gets embedded and what gets retrieved. Two common alternatives are:

Fixed-size character windows: simple to implement, but they frequently split mid-sentence, which corrupts the semantic signal in the embedding.
Paragraph-level chunks: these capture more context but can be too large for models with short context windows, and each retrieved hit carries more irrelevant content.

Sentence-level chunking is a middle ground: each chunk is a complete semantic unit (a full thought), which embeds cleanly and retrieves precisely. The trade-off is that a single sentence can be too short to provide context on its own. A common fix is to store sentences with a one-sentence overlap: when indexing sentence N, also append sentence N+1, so the embedded chunk, and the text returned at retrieval time, carries the following sentence's context.
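The one-sentence overlap can be sketched as follows. The function name chunks_with_overlap and the dictionary keys are illustrative assumptions, and the embedding call itself is left out because it depends on the model you use.

```python
def chunks_with_overlap(sentences: list[str]) -> list[dict]:
    """Pair each sentence N with sentence N+1 for embedding."""
    chunks = []
    for index, sentence in enumerate(sentences):
        # The last sentence has no successor, so it is embedded alone.
        following = sentences[index + 1] if index + 1 < len(sentences) else ""
        chunks.append({
            "index": index,
            "sentence": sentence,
            # Text passed to the embedding model: sentence N plus N+1.
            "text_to_embed": f"{sentence} {following}".strip(),
        })
    return chunks


if __name__ == "__main__":
    for chunk in chunks_with_overlap(["First point.", "Second point.", "Third point."]):
        print(chunk["index"], "->", chunk["text_to_embed"])
```

Keeping the original sentence next to the overlapped text lets you show or de-duplicate the exact sentence at retrieval time.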

The edge cases that break naive splitters

A simple text.split('.') fails on all of these:

"Dr. Smith saw 3.14 million patients at St. John's." — four periods, one sentence boundary.
"She said, 'I'll be there. Promise.'" — the quote’s closing period terminates the outer sentence, not just the quoted one.
"The firm (est. 1997) grew..." — abbreviation inside parentheses.
"Wait... really?" — an ellipsis followed by a question does not produce two sentences.

This tool masks these cases before splitting so the output stays accurate across real-world text.
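To see the failure concretely, here is what a naive period split does to two of the examples above. The helper name naive_split is only for illustration.

```python
def naive_split(text: str) -> list[str]:
    """Split on every period, then trim and drop empty fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]


if __name__ == "__main__":
    # One sentence comes back as four fragments.
    print(naive_split("Dr. Smith saw 3.14 million patients at St. John's."))
    # The ellipsis is swallowed and the text is cut in two.
    print(naive_split("Wait... really?"))
```

The first call returns four fragments ("Dr", "Smith saw 3", "14 million patients at St", "John's") for what is a single sentence, and the periods themselves are lost.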

Using the output

The numbered list format makes it easy to scan for over-splits (two items that are really one sentence) or under-splits (one item that should be two). Fix those manually before committing the sentences to your index — errors at the chunking stage propagate to retrieval quality.

The plain newline-separated export is directly readable by Python’s splitlines() or any line-based file reader for pipeline ingestion.
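A minimal sketch of loading that export, assuming one sentence per line. The function name load_sentences is hypothetical, and blank lines are filtered defensively because the exact export layout is not specified here.

```python
def load_sentences(exported: str) -> list[str]:
    """Turn a newline-separated export into a list of sentences."""
    return [line.strip() for line in exported.splitlines() if line.strip()]


if __name__ == "__main__":
    exported = "Dr. Smith arrived.\nHe was tired!\n\nWait... really?\n"
    print(load_sentences(exported))
```

When reading from disk, the same function can be applied to the result of pathlib.Path(...).read_text(encoding="utf-8").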

Tips and notes

For RAG pipelines, sentence-level chunking pairs well with a small overlap: index each sentence, but store the text of its neighbouring sentences alongside it to preserve context. The numbered output makes it easy to spot over-merged or over-split units before you commit them to an index. If your text uses non-standard punctuation (no space after the period, for example), some boundaries may be missed, so normalise spacing first for best results.
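One possible way to normalise spacing before splitting is sketched below. It is an assumption about what a reasonable pre-processing step looks like, not something the tool does for you, and the function name normalise_spacing is hypothetical.

```python
import re


def normalise_spacing(text: str) -> str:
    """Insert missing spaces after sentence punctuation and collapse runs of spaces."""
    # Add a space where . ! or ? sits between a lowercase letter and a
    # capital letter, e.g. "works.Then" -> "works. Then".
    # Decimals such as 3.14 and initialisms such as U.S.A. are left alone.
    text = re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", " ", text)
    # Collapse repeated spaces and tabs, keeping newlines intact.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    print(normalise_spacing("It works.Then it  stops!Really?Yes."))
```

The rule is deliberately narrow, but it can still misfire on strings such as file names or domains with a capital letter after the dot, so review the result on a sample of your text first.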