How are image tokens calculated for GPT-4o?

Low detail uses a fixed base of 85 tokens per image. High detail scales the image to fit within 2048×2048, then to a 768px short side, splits it into 512×512 tiles, and charges 170 tokens per tile plus the 85-token base. More pixels means more tiles and more tokens.

Does adding an output prompt change the cost?

This estimator focuses on the image input tokens, which dominate vision cost. Any text prompt and the model's text answer are billed separately at the usual input and output rates — add those with a standard LLM cost calculator if they are significant.

Why is low detail so much cheaper?

Low detail collapses any image to a single fixed-cost representation (85 tokens for GPT-4o), ignoring resolution. It is ideal for classification or presence checks where fine detail is not needed, and can be 5-20× cheaper per image.

Is my data sent anywhere?

No. Everything is computed in your browser. Nothing you enter is uploaded, stored or logged.

What is the Vision API Batch Cost Estimator?

Enter image dimensions and detail level to get per-image vision token and dollar costs, then scale to batch jobs from 100 to 1 million images. Compare GPT-4o, GPT-4o mini and Gemini vision pricing side by side. It runs free in your browser on Gera Tools, with nothing uploaded.

Vision API Batch Cost Estimator

Name: Vision API Batch Cost Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Estimate vision API cost before you run the batch

Vision models bill images as tokens, and the count depends on resolution and detail level — not a flat per-image fee. A high-detail 4K image can cost many times a low-detail thumbnail. This estimator computes per-image tokens the way GPT-4o does, converts to dollars, and scales to batches from 100 to 1 million images so you can budget a vision job accurately.

How it works

For GPT-4o-style tiling, a low-detail image is a flat 85 tokens. A high-detail image is resized to fit 2048×2048, then its short side is scaled to 768px, split into 512×512 tiles, and billed as:

tiles  = ceil(scaled_w / 512) × ceil(scaled_h / 512)
tokens = 85 + 170 × tiles

The per-image dollar cost is tokens / 1,000,000 × input_price, and the batch cost simply multiplies by your image count. Gemini uses a flat per-image token charge in this estimator, which is how its vision pricing is commonly modelled.

Tips to cut vision cost

Use low detail for classification. If you only need to know what is in the image, low detail is far cheaper and usually accurate enough.
Downscale before sending. There is no benefit to sending a 6000px image if the model resizes it anyway — resize client-side and skip wasted tiles.
Crop to the region of interest. Sending only the relevant part of an image cuts tiles and tokens directly.
Batch with the async API. For large jobs, the batch endpoint is often cheaper than real-time calls and avoids rate-limit churn.

Understanding the tile system for GPT-4o

The tiling logic is what makes high-detail image costs non-linear. A 512×512 image produces a single tile (plus the 85-token base), totalling 255 tokens. A 1024×1024 image in high-detail mode first fits within 2048×2048 (no resize needed), then its short side scales to 768px — producing a 768×768 image, which splits into a 2×2 grid of 512×512 tiles, giving 4 tiles and 765 tokens total. A landscape image of 2048×1024 would produce a different tile count as the scaling and splitting interact.

The practical implication: images that are slightly above a tile boundary cost disproportionately more. An image that splits into 9 tiles costs meaningfully more than one that splits into 4, even if the visual difference is modest. Pre-processing images to stay just under a tile boundary — for example, resizing to 1024px on the long side rather than 1025px — can eliminate an entire row or column of tiles.

Comparing GPT-4o, GPT-4o mini, and Gemini

The three models in this estimator represent different points on the cost-capability curve for vision tasks:

GPT-4o: highest capability, highest cost per image token — best for tasks requiring fine-grained understanding, complex document parsing, or nuanced scene description
GPT-4o mini: significantly cheaper per token while retaining solid vision capability — the right choice for classification, moderation, simple extraction, or any task where you have verified GPT-4o mini performs acceptably
Gemini: Google’s pricing model structures image input differently, often as a flat per-image token count — useful for benchmarking cost against OpenAI at scale

At batch scale of 100,000 images or more, the per-image cost difference between GPT-4o and GPT-4o mini can be hundreds or thousands of dollars, making the model choice as important as prompt engineering.

Batch API pricing and when to use it

OpenAI’s Batch API offers a discount on large asynchronous jobs in exchange for a longer turnaround window (typically 24 hours). For non-latency-sensitive vision workloads — audit pipelines, content moderation queues, bulk product image tagging — the batch endpoint is often the cheapest route. Estimate the standard real-time cost with this tool, then factor in the applicable batch discount to compare total cost against your processing time requirements.