Image token cost estimator
Multimodal models do not bill images by the megabyte — they convert each image into a fixed number of tokens based on its dimensions and the detail level you request. A single high-detail photo can cost more tokens than several paragraphs of text, so estimating before you batch thousands of vision calls keeps your bill predictable.
How vision tokens are counted
For GPT-4o style vision, a low-detail image always costs a flat 85 tokens, regardless of size. A high-detail image is first scaled to fit inside a 2048x2048 box, then the shortest side is scaled to 768px, and the result is divided into 512x512 tiles, with partial tiles counted as whole tiles. Each tile costs 170 tokens and there is an 85-token base, so the total is 85 + 170 * tiles. For example, a 1024x1024 image is scaled to 768x768, which covers 4 tiles and costs 85 + 170 * 4 = 765 tokens. Gemini instead uses a tile model that starts at roughly 258 tokens for small images and scales up with larger dimensions. This tool applies the right formula for the model you pick and multiplies the result by your batch size.
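The GPT-4o style rules above can be sketched in a few lines of Python. This is a rough illustration, not the tool's actual code: the names `image_tokens` and `batch_tokens` are hypothetical, and the sketch assumes that images are only ever scaled down (never enlarged) and that scaled dimensions are rounded to whole pixels.

```python
import math

LOW_DETAIL_TOKENS = 85   # flat cost of a low-detail image
BASE_TOKENS = 85         # base cost added to every high-detail image
TOKENS_PER_TILE = 170    # cost of each 512x512 tile
MAX_BOX = 2048           # high-detail images must fit inside 2048x2048
SHORT_SIDE = 768         # target length of the shortest side
TILE = 512               # tile edge length in pixels


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate tokens for one image under the GPT-4o style rules."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return LOW_DETAIL_TOKENS
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")

    # Step 1: shrink to fit inside the 2048x2048 box, keeping aspect ratio.
    scale = min(1.0, MAX_BOX / max(width, height))
    w = max(1, round(width * scale))
    h = max(1, round(height * scale))

    # Step 2: shrink so the shortest side is 768px.
    # Assumption: smaller images are left as they are, not enlarged.
    scale = min(1.0, SHORT_SIDE / min(w, h))
    w = max(1, round(w * scale))
    h = max(1, round(h * scale))

    # Step 3: count 512x512 tiles, rounding partial tiles up.
    tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def batch_tokens(width: int, height: int, detail: str, batch_size: int) -> int:
    """Estimate tokens for a batch of identically sized images."""
    return image_tokens(width, height, detail) * batch_size


print(image_tokens(1024, 1024))                # 765
print(image_tokens(2048, 4096))                # 1105
print(batch_tokens(1024, 1024, "high", 1000))  # 765000
```

The 2048x4096 case shows both scaling steps at work: the image shrinks to 1024x2048 to fit the box, then to 768x1536, which covers 2 x 3 = 6 tiles.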
Tips and notes
- Use low detail for thumbnails, icons, or any image where you only need a coarse description: it costs a flat 85 tokens instead of several hundred and rarely changes the answer.
- Resizing a large photo so that its longest side is around 768px before sending it can cut high-detail tile counts without losing the information the model actually uses.
- Remember that every image you resend in a multi-turn conversation is re-billed, just like text context, so vision-heavy chats compound quickly.
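
To see how quickly resent images add up, here is a minimal sketch of the multi-turn effect. The function name `cumulative_image_tokens` is hypothetical, and the sketch assumes the full history is resent on every turn with no caching or trimming. The per-image costs used in the example are the flat 85-token low-detail cost and a 4-tile high-detail cost of 85 + 170 * 4 = 765 tokens.

```python
def cumulative_image_tokens(tokens_per_image: int, images_per_turn: list[int]) -> int:
    """Total image tokens billed over a conversation.

    images_per_turn[i] is the number of NEW images added on turn i.
    Assumption: every earlier image stays in context and is billed again
    on each later turn.
    """
    total = 0
    in_context = 0
    for new_images in images_per_turn:
        in_context += new_images                 # images now in the history
        total += in_context * tokens_per_image   # whole history billed this turn
    return total


# One new image per turn for five turns.
turns = [1, 1, 1, 1, 1]
print(cumulative_image_tokens(765, turns))  # 11475 (high detail, 4 tiles each)
print(cumulative_image_tokens(85, turns))   # 1275 (low detail)
```

Sent once each, five 765-token images would cost 3825 tokens. Resent on every turn, they cost 11475, three times as much after only five turns.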