Token Frequency Analyzer

Analyze token frequency and vocabulary diversity in LLM output.



Repetitive, low-variety output is one of the clearest symptoms of a weak prompt or a model stuck in a loop. This tool profiles any text by tokenizing it, counting how often each token appears, and computing a vocabulary-diversity score. It then flags the repeated phrases and dominant stopwords that make output feel padded or robotic.

How it works

The text is lowercased and split on word boundaries into tokens. These are word-level tokens, a fast, model-agnostic stand-in for a model's subword tokens; the counts will not match a model tokenizer's exactly, but they are well suited to frequency work. The tool tallies total tokens and unique tokens, then computes the type-to-token ratio (unique divided by total) as a diversity score from 0 to 1. It builds a ranked frequency table and separately highlights the common stopwords that dominate the count. Finally, it counts every bigram and trigram and surfaces the multi-word phrases that repeat, since repeated phrases are a strong fingerprint of templated or degenerate generation.
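The steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the tool's actual source: the function names (tokenize, ngrams, profile), the regular expression used for word splitting, and the short STOPWORDS set are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list. The tool's real list is not
# documented here and is likely much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "it", "that", "this", "i"}


def tokenize(text):
    """Lowercase the text and split it into word-level tokens."""
    # \w+ matches runs of letters, digits, and underscores; the optional
    # group keeps contractions such as "don't" together as one token.
    return re.findall(r"\w+(?:'\w+)*", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def profile(text):
    """Compute totals, type-to-token ratio, ranked counts, and repeated n-grams."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    unique = len(counts)
    ranked = counts.most_common()  # most frequent first
    return {
        "total": total,
        "unique": unique,
        # Guard against empty input, where the ratio is undefined.
        "ratio": unique / total if total else 0.0,
        "ranked": ranked,
        "stopwords": [(tok, c) for tok, c in ranked if tok in STOPWORDS],
        # Keep only bigrams and trigrams that occur more than once.
        "repeated": {
            n: [(p, c) for p, c in Counter(ngrams(tokens, n)).most_common() if c > 1]
            for n in (2, 3)
        },
    }


report = profile("The cat sat on the mat and the cat sat on the rug.")
print(report["total"], report["unique"], round(report["ratio"], 2))
print(report["repeated"][3])
```

Counter does most of the work here: one pass for the frequency table and one pass per n-gram size. A production version would probably want a fuller stopword list and more careful handling of punctuation and Unicode.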

Tips and notes

  • Watch the ratio. Healthy prose usually lands above a type-to-token ratio of about 0.4; a much lower number suggests heavy repetition. The ratio drifts down naturally as text gets longer, so compare outputs of similar length.
  • Repeated trigrams are a red flag. A three-word phrase appearing three or more times in a short answer usually means the model looped or padded.
  • Stopwords are expected to top the list. Focus on the highest non-stopword tokens to understand what the text is actually about.
  • Everything is local. No network calls are made, so confidential text stays on your machine.
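
The first three tips can be turned into simple checks. The sketch below is a hypothetical helper, not part of the tool: the function name review, its parameter names, and the stopword list are assumptions, and the default thresholds simply mirror the rules of thumb above (a ratio of about 0.4, and trigrams seen three or more times).

```python
import re
from collections import Counter

# Assumption: an illustrative stopword list, not the tool's own.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "it", "that", "this", "i"}


def review(text, min_ratio=0.4, min_trigram_repeats=3, top_k=5):
    """Apply the rules of thumb: low ratio, looping trigrams, top content words."""
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    counts = Counter(tokens)
    ratio = len(counts) / len(tokens) if tokens else 0.0

    # Count every run of three consecutive tokens.
    trigrams = Counter(" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2))
    looped = [p for p, c in trigrams.items() if c >= min_trigram_repeats]

    # Skip stopwords so the ranking reflects what the text is about.
    content = [(t, c) for t, c in counts.most_common() if t not in STOPWORDS]

    return {
        "ratio": ratio,
        "low_diversity": bool(tokens) and ratio < min_ratio,
        "looped_trigrams": looped,
        "top_content_tokens": content[:top_k],
    }


result = review("I hope this helps. I hope this helps. I hope this helps.")
print(result["low_diversity"], result["looped_trigrams"])
print(result["top_content_tokens"])
```

Because the ratio depends on length, a single fixed threshold such as min_ratio is only a rough screen; it is more reliable when the texts being compared are about the same size.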