What can GPT-4 Vision actually do?

It can describe images, read text in them (OCR), interpret charts and diagrams, classify content, answer questions about a photo, and extract structured data like fields from a receipt. It works best on tasks where the answer is visible in the image and you ask a clear, specific question rather than an open-ended one.

Should I send images as a URL or as base64?

Use a public URL when the image is already hosted and not sensitive — it keeps your request small. Use a base64 data URL when the image is private, local, or generated on the fly, accepting that the request payload grows. Both go in the same image_url content block; the difference is only how the image data reaches the model.

How is image cost calculated?

Images are billed as tokens, and the count depends on the image's dimensions and the detail setting. A low-detail image costs a small fixed number of tokens, while a high-detail image is tiled and each tile adds tokens, so larger high-detail images cost more. The estimator on this page gives a rough figure so you can plan.

Why is the model getting details in my image wrong?

Usually because the detail level is too low, the image is blurry or small, or the question is vague. For fine text or small features switch to high detail, send a sharp image, and ask a precise question. If accuracy still matters, ask the model to quote exactly what it sees rather than to summarise.

Can it read handwriting and documents?

It handles printed text and clear documents well and can often read tidy handwriting, but messy handwriting, low contrast, and skewed scans degrade accuracy. For document workflows, send high detail, crop to the relevant region, and validate the extracted text against expected formats rather than trusting it blindly.

How to Use GPT-4 Vision for Image Analysis

What GPT-4 Vision unlocks

GPT-4 Vision (GPT-4V) lets a language model see. You send an image alongside a text instruction, and the model reasons about both together — describing a scene, reading the text in a photo, interpreting a chart, classifying content, or pulling structured fields out of a receipt or form. It turns image understanding, which used to require specialised computer-vision pipelines, into the same prompt-and-response loop you already use for text. The skill is no longer training a model; it is giving it the right image and a precise question.

How it works

You make a normal chat request, but one of the content blocks is an image_url. That URL can be a public link to a hosted image, or a base64-encoded data URL — data:image/jpeg;base64,... — for private or locally generated images. Alongside it you include a text block with your instruction: describe the image, extract all visible text, classify the product, or answer a specific question. A detail setting controls fidelity: low gives a fast, cheap overview, while high tiles the image so the model can read small text and fine features at a higher token cost. The estimator below shows roughly how many tokens an image will consume at each detail level so you can plan cost before sending anything.

Tips and cost notes

Specificity wins. “What is in this image?” yields vague output; “Extract every line item and price from this receipt as a JSON array” yields something usable. For OCR and fine detail, send a sharp, well-cropped, high-detail image and, when accuracy is critical, ask the model to quote exactly what it sees rather than summarise. Watch cost: high-detail images are tiled, and each tile adds tokens, so a large high-detail image can cost many times a low-detail one. Default to low detail for overviews and reserve high detail for tasks that genuinely need it. Always validate extracted data against an expected format — vision output is impressive but not infallible, especially on handwriting, low contrast, and skewed scans.