What GPT-4 Vision unlocks
GPT-4 Vision (GPT-4V) lets a language model see. You send an image alongside a text instruction, and the model reasons about both together — describing a scene, reading the text in a photo, interpreting a chart, classifying content, or pulling structured fields out of a receipt or form. It turns image understanding, which used to require specialised computer-vision pipelines, into the same prompt-and-response loop you already use for text. The skill is no longer training a model; it is giving it the right image and a precise question.
How it works
You make a normal chat request, but one of the content blocks is an image_url. That URL can be a public link to a hosted image, or a base64-encoded data URL — data:image/jpeg;base64,... — for private or locally generated images. Alongside it you include a text block with your instruction: describe the image, extract all visible text, classify the product, or answer a specific question. A detail setting controls fidelity: low gives a fast, cheap overview, while high tiles the image so the model can read small text and fine features at a higher token cost. The estimator below shows roughly how many tokens an image will consume at each detail level so you can plan cost before sending anything.
Tips and cost notes
Specificity wins. “What is in this image?” yields vague output; “Extract every line item and price from this receipt as a JSON array” yields something usable. For OCR and fine detail, send a sharp, well-cropped, high-detail image and, when accuracy is critical, ask the model to quote exactly what it sees rather than summarise. Watch cost: high-detail images are tiled, and each tile adds tokens, so a large high-detail image can cost many times a low-detail one. Default to low detail for overviews and reserve high detail for tasks that genuinely need it. Always validate extracted data against an expected format — vision output is impressive but not infallible, especially on handwriting, low contrast, and skewed scans.