Estimate vision API cost before you run the batch
Vision models bill images as tokens, and the count depends on resolution and detail level rather than a flat per-image fee. Under the GPT-4o rule described below, a high-detail 4K image (3840×2160) comes to 1,105 tokens against 85 for a low-detail image, 13 times as much. This estimator computes per-image tokens the way GPT-4o does, converts them to dollars, and scales to batches from 100 to 1 million images so you can budget a vision job before you run it.
How it works
For GPT-4o-style tiling, a low-detail image is a flat 85 tokens regardless of size. A high-detail image is first resized to fit within 2048×2048 with its aspect ratio preserved, then scaled so its short side is 768px. The result is split into 512×512 tiles and billed as:
tiles = ceil(scaled_w / 512) × ceil(scaled_h / 512)
tokens = 85 + 170 × tiles
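The resize-then-tile rule above can be sketched in Python as follows. This is a minimal illustration, not the provider's implementation: the function name `image_tokens` is hypothetical, and it assumes two things the text does not state, namely that images are only ever scaled down (never upscaled) and that scaled dimensions are rounded to whole pixels.

```python
import math


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate tokens for one image under the GPT-4o-style rule."""
    if detail == "low":
        return 85  # flat charge, independent of resolution

    w, h = width, height

    # Step 1: shrink to fit inside a 2048x2048 square, keeping aspect ratio.
    fit = min(1.0, 2048 / max(w, h))
    w, h = round(w * fit), round(h * fit)  # assumption: whole-pixel rounding

    # Step 2: shrink so the short side is 768px.
    # Assumption: images already smaller than that are left as-is.
    short = min(1.0, 768 / min(w, h))
    w, h = round(w * short), round(h * short)

    # Step 3: count 512x512 tiles, then apply tokens = 85 + 170 * tiles.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(image_tokens(1024, 1024))         # square image, high detail
print(image_tokens(3840, 2160))         # 4K frame, high detail
print(image_tokens(3840, 2160, "low"))  # same frame, low detail
```

Under these assumptions a 1024×1024 image scales to 768×768, needs 4 tiles, and comes to 765 tokens, while a 3840×2160 image scales to 1365×768, needs 6 tiles, and comes to 1,105 tokens.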
The per-image dollar cost is tokens / 1,000,000 × input_price, where input_price
is the model's price in dollars per million input tokens, and the batch cost
multiplies that by your image count. Gemini uses a flat per-image token charge
in this estimator, which is how its vision pricing is commonly modelled.
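As a rough sketch of that arithmetic, the snippet below converts a per-image token count to dollars and scales it across batch sizes. The function names and the price constant are hypothetical placeholders chosen for illustration; substitute the current input price for your model. For a flat-rate model, pass its fixed per-image token count as `tokens_per_image`.

```python
def image_cost_usd(tokens_per_image: int, input_price_per_million: float) -> float:
    """Dollar cost of one image: tokens / 1,000,000 * input_price."""
    return tokens_per_image / 1_000_000 * input_price_per_million


def batch_costs(
    tokens_per_image: int,
    input_price_per_million: float,
    sizes: tuple = (100, 1_000, 10_000, 100_000, 1_000_000),
) -> dict:
    """Map each batch size to its total cost in dollars."""
    per_image = image_cost_usd(tokens_per_image, input_price_per_million)
    return {n: per_image * n for n in sizes}


# Placeholder price in dollars per million input tokens (not a quoted rate).
EXAMPLE_PRICE = 2.50

# 765 tokens is a high-detail image that needs 4 tiles (85 + 170 * 4).
for n, cost in batch_costs(765, EXAMPLE_PRICE).items():
    print(f"{n:>9,} images: ${cost:,.2f}")
```

Keeping the price as an explicit argument makes it easy to rerun the same batch against several models or after a price change.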
Tips to cut vision cost
- Use low detail for classification. If you only need to know what is in the image, low detail is far cheaper and usually accurate enough.
- Downscale before sending. A 6000px image is resized to fit 2048×2048 before tiling anyway, so the extra pixels buy nothing and only inflate the upload. Resize client-side so you control the dimensions that actually get tiled.
- Crop to the region of interest. Sending only the relevant part of an image cuts tiles and tokens directly.
- Batch with the async API. For large jobs, the batch endpoint is often cheaper than real-time calls and avoids rate-limit churn.
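To see how much the first and third tips can save, the sketch below compares token counts directly from tile counts using the formula tokens = 85 + 170 × tiles. The helper names and the example tile counts are illustrative assumptions, not measurements from real images.

```python
def tokens_from_tiles(tiles: int) -> int:
    """High-detail token count for a given number of 512x512 tiles."""
    return 85 + 170 * tiles


def savings_vs_baseline(baseline_tokens: int, candidate_tokens: int) -> float:
    """Fraction of tokens saved relative to the baseline (0.0 to 1.0)."""
    return 1 - candidate_tokens / baseline_tokens


LOW_DETAIL_TOKENS = 85  # flat charge for a low-detail image

baseline = tokens_from_tiles(6)  # e.g. a wide image that tiles 3 x 2
cropped = tokens_from_tiles(2)   # hypothetical crop that tiles 2 x 1

print(f"crop:       {savings_vs_baseline(baseline, cropped):.1%} fewer tokens")
print(f"low detail: {savings_vs_baseline(baseline, LOW_DETAIL_TOKENS):.1%} fewer tokens")
```

In this example the crop drops the image from 1,105 to 425 tokens, and low detail drops it to 85. Whether either option keeps enough visual information is something to check on a sample of your own images.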