Which format uses the fewest tokens?

Raw CSV is almost always the most token-efficient because it states each column header once and uses single-character delimiters. JSON repeats every key on every row, often costing two to four times more tokens for the same data.

How accurate is the token estimate?

It uses a character-per-token heuristic (~4 characters per token for English-like text) which is close for most tabular data but not exact. For billing-grade numbers, run the chosen representation through your model's real tokenizer.

Why does JSON cost so much more than CSV?

JSON repeats the full key name in quotes for every field in every row, plus braces, colons, and commas. For a 10-column, 100-row table that is 1,000 repeated keys, which dominates the token count versus CSV's single header line.

Is markdown ever worth the extra tokens?

Markdown tables cost more than CSV but can improve a model's accuracy when columns must be reasoned about positionally. For pure data transfer, CSV is cheaper; for human-readable reasoning, the markdown overhead can pay off.

Is my CSV uploaded anywhere?

No. The file is read and tokenized entirely in your browser. Nothing is sent to a server or stored.

What is the CSV / Tabular Data Token Estimator?

Paste or upload CSV data and see the estimated token count for four representation strategies — raw CSV, JSON, markdown table, and key-value — to find the most token-efficient format for your LLM prompt. Runs entirely in your browser. It runs free in your browser on Gera Tools, with nothing uploaded.

CSV / Tabular Data Token Estimator

Name: CSV / Tabular Data Token Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Send tabular data to an LLM for the fewest tokens

The same table can cost wildly different amounts depending on how you serialise it into a prompt. This estimator takes your CSV and shows the token count for four common representations — raw CSV, JSON, markdown table, and key-value — so you can pick the cheapest one your model handles well.

How the formats compare

Token cost is driven by repeated structural characters. CSV states each column header once and separates values with a single comma. JSON repeats every key on every row:

csv:      name,age\nAda,36
json:     [{"name":"Ada","age":36}]
markdown: | name | age |\n| --- | --- |\n| Ada | 36 |

For wide or tall tables, JSON’s repeated keys can multiply the token count several times over versus CSV for identical data. The estimator approximates tokens at roughly four characters per token, the standard rule of thumb for English-like text.

Why format choice matters for cost

LLM APIs charge per token. For a table with 20 columns and 500 rows:

Raw CSV states each header once on the first line, then just values and commas for the remaining 500 rows. The overhead is one header line.
JSON prints every key — "column_name": — once per row across all 500 rows. That is 500 × 20 = 10 000 key repetitions, each with surrounding quotes and colon. For a typical column name of 8 characters, that is roughly 80 000 extra characters, or about 20 000 extra tokens, just for the repeated keys.
Markdown table costs more than CSV because of the pipe characters and the header-separator row, but less than JSON because keys are not repeated per row.
Key-value format (one key: value per line) is the most verbose — it states both key and value on every line for every cell.

For a model that reads CSV well, choosing CSV over JSON on a wide table can cut token cost by a factor of three to five, directly reducing API charges and fitting more rows within a context window.

When to choose each format

Format	Best for
Raw CSV	Pure data transfer; maximum rows within budget
JSON	When the model needs to produce structured output that mirrors the input, or when column names must remain unambiguous
Markdown table	When the prompt is human-readable and the table is small; helps some models reason about column positions
Key-value	Small records where readability matters more than token cost

The token estimation method

The estimator uses a heuristic of approximately four characters per token, which is the widely cited rule of thumb for English prose and ASCII-heavy tabular data. Real tokenizers depend on the model’s vocabulary: common words become single tokens while unusual column names may split into multiple subwords. For billing-critical estimates, run the exact text through your model provider’s tokenizer endpoint.

Tips to cut tabular token cost

Prefer raw CSV for pure data transfer — it is consistently the cheapest.
Send only the columns and rows you need. Dropping unused columns cuts the per-row cost across the whole table.
Abbreviate headers when the model does not need full names; a header that appears once is cheap in CSV but expensive when repeated in JSON.
Cap rows to a relevant slice. If a model needs context from a table of 10 000 rows, send a representative sample of 100 rows rather than the full dataset.
Remove whitespace padding. Aligned columns and trailing spaces add tokens without helping the model.