How does GPT-4o count tokens for an image?

On high detail, the image is first scaled to fit within a 2048x2048 box, then scaled so its shortest side is 768px, and the result is divided into 512px square tiles. Each tile costs 170 tokens, and a one-time 85-token base is added on top, so the total is 85 + 170 * tiles. Low detail is a flat 85 tokens regardless of size.

What does 'auto' detail do?

Auto lets the model choose low or high detail based on image size. This tool approximates auto as high detail when either side of the image is larger than 512px and as low detail otherwise, which matches typical behaviour.
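The rule this tool uses for auto can be sketched in a few lines of Python. This is only an illustration of the approximation described above, not the model's actual selection logic; the function name resolve_auto_detail and the threshold parameter are hypothetical.

```python
def resolve_auto_detail(width: int, height: int, threshold: int = 512) -> str:
    """Approximate 'auto' detail: high if either side exceeds the threshold.

    Assumption: this mirrors the tool's heuristic only. The real model
    may choose differently for a given image.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # "Either side larger than 512px" is the same as the longer side exceeding it.
    return "high" if max(width, height) > threshold else "low"
```

For example, a 512x512 image resolves to low detail, while a 513x100 image resolves to high detail.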

Does Gemini charge tokens the same way?

No. Gemini charges a flat amount of about 258 tokens for small images (up to 384px on each side). Larger images are split into tiles, and the charge scales with the number of tiles. This tool applies Gemini's tile-based estimate when you select a Gemini model.
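A rough sketch of a tile-based Gemini estimate is shown below. It assumes 768x768 tiles at about 258 tokens each and a simple ceiling division over the original dimensions. The real service crops and scales images before tiling, so actual counts may differ. The function name and parameters are hypothetical, and this is not necessarily the exact formula the tool uses.

```python
import math


def gemini_image_tokens(
    width: int,
    height: int,
    small_side: int = 384,       # images within this box get the flat charge
    tile_size: int = 768,        # assumed tile edge length
    tokens_per_tile: int = 258,  # approximate per-tile charge
) -> int:
    """Estimate Gemini vision tokens from image dimensions (approximation)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Small images: flat charge.
    if width <= small_side and height <= small_side:
        return tokens_per_tile
    # Larger images: count the tiles needed to cover the image.
    tiles = math.ceil(width / tile_size) * math.ceil(height / tile_size)
    return tiles * tokens_per_tile
```

Under these assumptions a 384x384 image costs 258 tokens and a 1536x1536 image costs 4 tiles, or 1032 tokens.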

Are output tokens included?

No. This estimates the input (vision) token cost only. The model's text reply is billed separately at the output rate, which you can size with an output token estimator.

Is anything uploaded?

No. You only enter dimensions, not the image itself. Everything is computed in your browser.

Image Token Cost Estimator

Multimodal models do not bill images by the megabyte — they convert each image into a fixed number of tokens based on its dimensions and the detail level you request. A single high-detail photo can cost more tokens than several paragraphs of text, so estimating before you batch thousands of vision calls keeps your bill predictable.

How vision tokens are counted

For GPT-4o-style vision, a low detail image is always a flat 85 tokens regardless of size. A high detail image is first scaled to fit inside a 2048x2048 box, then scaled so its shortest side is 768px, and the final image is divided into 512x512 tiles. Each tile costs 170 tokens and there is a one-time 85-token base, so total tokens equal 85 + 170 * tiles. Gemini instead uses a tile model that starts at roughly 258 tokens for small images and scales up with larger dimensions. This tool applies the right formula for the model you pick and multiplies the result by your batch size.
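The GPT-4o-style steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's source code. It assumes both scaling steps only shrink an image and never enlarge it, and that scaled dimensions are rounded to whole pixels. The function names are hypothetical.

```python
import math

BASE_TOKENS = 85   # one-time base charge (also the flat low-detail cost)
TILE_TOKENS = 170  # charge per 512x512 tile


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o-style vision tokens for one image."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return BASE_TOKENS

    # Step 1: fit inside a 2048x2048 box (assumption: downscale only).
    fit = min(1.0, 2048 / max(width, height))
    w = max(1, round(width * fit))
    h = max(1, round(height * fit))

    # Step 2: bring the shortest side to 768px (assumption: downscale only).
    shrink = min(1.0, 768 / min(w, h))
    w = max(1, round(w * shrink))
    h = max(1, round(h * shrink))

    # Step 3: count the 512x512 tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


def batch_image_tokens(width: int, height: int,
                       detail: str = "high", batch_size: int = 1) -> int:
    """Multiply the per-image estimate by the number of images in a batch."""
    return gpt4o_image_tokens(width, height, detail) * batch_size
```

As a check, a 1024x1024 high-detail image scales to 768x768, which needs 4 tiles: 85 + 170 * 4 = 765 tokens. A 2048x4096 image scales to 768x1536, which needs 6 tiles: 85 + 170 * 6 = 1105 tokens.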

Tips and notes

Use low detail for thumbnails, icons, or any image where you only need a coarse description — it is dramatically cheaper and rarely changes the answer.
Resizing a large photo down to around 768px before sending it can cut high-detail tile counts without losing the information the model actually uses.
Remember that every image you resend in a multi-turn conversation is re-billed, just like text context, so vision-heavy chats compound quickly.
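To see how re-billing compounds over a conversation, here is a small Python sketch. It assumes every image stays in context for all later turns and that no caching discount or context truncation applies. The function name is hypothetical.

```python
def conversation_image_tokens(new_image_tokens_per_turn: list[int]) -> int:
    """Total vision tokens billed across a multi-turn conversation.

    Each list entry is the token cost of images newly added on that turn.
    Assumption: all earlier images are resent, and billed again, every turn.
    """
    total = 0
    context = 0  # image tokens currently carried in the conversation
    for added in new_image_tokens_per_turn:
        context += added
        total += context  # this turn pays for every image sent so far
    return total
```

A single 765-token image sent on the first turn of a three-turn chat is billed three times, so conversation_image_tokens([765, 0, 0]) gives 2295 tokens.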