Break text into n-grams
This tool extracts n-grams (contiguous sequences of n characters or words) from any text and counts how often each one appears. N-grams are a basic building block in natural language processing, used for language modeling, spelling correction, text classification, and similarity scoring.
How it works
Pick a mode and a window size n:
- Word n-grams: the text is tokenized into Unicode word tokens, then a window of n consecutive tokens slides one position at a time.
- Character n-grams: the window of n consecutive characters slides one position at a time over the raw text (optionally with whitespace collapsed).
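The two modes can be sketched in Python as below. This is a rough illustration, not the tool's own code: the function names are hypothetical, and the tokenizer simply treats each run of Unicode letters, digits, or underscores as a word, which may differ from how the tool splits text (for example around apostrophes or hyphens).

```python
import re


def word_tokens(text):
    # Assumption: a "word token" is a run of Unicode word characters.
    # Python's \w is Unicode-aware for str patterns by default.
    return re.findall(r"\w+", text)


def sliding_windows(items, n):
    # Slide a window of n items one position at a time.
    # Works on a list of tokens or on a string of characters.
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i:i + n] for i in range(len(items) - n + 1)]


def word_ngrams(text, n):
    # Join each window of tokens with a single space for display.
    return [" ".join(window) for window in sliding_windows(word_tokens(text), n)]


def char_ngrams(text, n, collapse_whitespace=False):
    # Optionally collapse each run of whitespace into one space first.
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text)
    return sliding_windows(text, n)
```

For instance, `char_ngrams("abc", 2)` yields `["ab", "bc"]`, and a text with fewer than n items yields an empty list.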
A text of L items (tokens or characters) produces L - n + 1 n-grams, provided L is at least n; a shorter text produces none. Each distinct n-gram is tallied, and the results are listed most-frequent first. Case folding can be enabled to treat "The" and "the" as the same n-gram.
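The tallying step can be illustrated with a short Python sketch. The names here (`count_ngrams`, `fold_case`) are made up for the example, and the order of n-grams with equal counts is an assumption, since the tool's tie-breaking is not described.

```python
from collections import Counter


def count_ngrams(items, n, fold_case=False):
    # items: a list of tokens (or single characters) already split from the text.
    if n < 1:
        raise ValueError("n must be at least 1")
    if fold_case:
        # Case folding makes "The" and "the" compare equal.
        items = [item.casefold() for item in items]
    # A sequence of L items has L - n + 1 windows (none when L < n).
    grams = [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]
    return Counter(grams)


tokens = "The cat saw the cat".split()        # L = 5
counts = count_ngrams(tokens, 2, fold_case=True)
total = sum(counts.values())                  # 5 - 2 + 1 = 4 bigrams in all
ranked = counts.most_common()                 # most frequent first
print(ranked[0])                              # (('the', 'cat'), 2)
```

Without case folding, the same input gives four distinct bigrams, each counted once, because `The cat` and `the cat` are kept apart.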
Example
For the word bigrams (n = 2) of “to be or not to be”:
to be 2
be or 1
or not 1
not to 1
The pair "to be" appears twice, so it is listed first; the other three bigrams appear once each. Everything runs locally, so your text stays private.