What Is Computer Vision? How AI Sees and Understands Images

CNNs, ViTs, and the tasks that let machines perceive the visual world

What computer vision is

Computer vision is the branch of AI that gives machines the ability to interpret the visual world. A digital image is just a grid of numbers representing pixel brightness and colour; computer vision is what turns that grid into useful understanding — recognising that a photo contains a dog, locating every car in a street scene, or outlining a tumour in a medical scan. The field has gone from a slow, hand-engineered discipline to one of the great success stories of deep learning, and it now underpins technologies from face unlock on your phone to self-driving cars, medical diagnostics, and automated quality control in factories.
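To make the "grid of numbers" idea concrete, here is a minimal sketch in plain Python. The tiny images and the helper names `shape` and `mean_brightness` are made up for illustration; real images are far larger and are normally held in array libraries rather than nested lists.

```python
# A tiny 4x4 greyscale "image": each number is one pixel's brightness
# (0 = black, 255 = white). The left half is dark, the right half is bright.
grey = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# A colour image stores three numbers per pixel: (red, green, blue).
colour = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def shape(image):
    """Return (height, width) of a nested-list image."""
    return len(image), len(image[0])


def mean_brightness(image):
    """Average pixel value of a greyscale image."""
    height, width = shape(image)
    return sum(sum(row) for row in image) / (height * width)


print(shape(grey))            # (4, 4)
print(mean_brightness(grey))  # 127.5
print(colour[0][0])           # (255, 0, 0), a pure red pixel
```

Everything a vision model does starts from numbers like these; the "understanding" is whatever the model computes on top of them.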

The core tasks

Most computer vision work falls into a few canonical tasks of increasing detail. Image classification answers “what is this a picture of?” by assigning one or more labels to the whole image. Object detection goes further, finding each object and drawing a bounding box around it — essential for counting items or locating people in a scene. Segmentation is the most precise: it labels every individual pixel, distinguishing the exact silhouette of each object (semantic segmentation groups pixels by class, while instance segmentation separates individual objects of the same class). Beyond this trio sit tasks such as pose estimation (locating body joints), depth estimation, object tracking across video, and image captioning, which bridges vision and language by describing an image in words.
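One way to see how the three core tasks differ is to look at the shape of their outputs. The sketch below uses hand-written masks for a toy 3x6 image containing two cars; the masks, the class names, and the helpers `classify` and `bounding_box` are illustrative assumptions, not the output of any real model.

```python
# Semantic segmentation: every pixel holds a class id (0 = background, 1 = car).
semantic = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

# Instance segmentation: every pixel holds an object id, so the two cars
# (ids 1 and 2) stay separate even though they share a class.
instance = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]

CLASS_NAMES = {1: "car"}  # hypothetical label map


def classify(semantic_mask, names):
    """Classification-style output: the labels present anywhere in the image."""
    ids = {value for row in semantic_mask for value in row if value != 0}
    return sorted(names[i] for i in ids)


def bounding_box(instance_mask, object_id):
    """Detection-style output: (x_min, y_min, x_max, y_max) around one object."""
    coords = [
        (x, y)
        for y, row in enumerate(instance_mask)
        for x, value in enumerate(row)
        if value == object_id
    ]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


print(classify(semantic, CLASS_NAMES))  # ['car']
print(bounding_box(instance, 1))        # (1, 0, 2, 1)
print(bounding_box(instance, 2))        # (4, 1, 5, 2)
```

Classification yields one answer per image, detection one box per object, and segmentation one value per pixel, which is why each step up costs more to label and to compute.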

Convolutional neural networks: the breakthrough

The technology that made modern computer vision work is the convolutional neural network (CNN). A CNN scans an image with small learnable filters that slide across it, each detecting a local pattern such as an edge, a corner, or a texture. Stacking these layers builds a hierarchy: early layers find simple features, and deeper layers combine them into shapes, parts, and finally whole objects. Pooling layers shrink the spatial dimensions so that later layers see larger context. Weight sharing, which applies the same filter at every position, keeps the parameter count low and makes the convolution translation-equivariant: a shifted cat produces a correspondingly shifted response, and pooling then makes the final prediction largely insensitive to where the cat appears. The 2012 success of AlexNet on the ImageNet benchmark kicked off the deep-learning era. It was followed by deeper, smarter architectures such as ResNet, whose skip connections made it practical to train networks hundreds of layers deep, and EfficientNet, which scales depth, width, and resolution together for better accuracy per unit of compute.
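The sliding-filter and pooling steps can be sketched in a few lines of plain Python. This is a simplified illustration, not how a real CNN is implemented: the filter is fixed by hand rather than learned, there is a single channel, no padding, and a stride of 1. The functions `conv2d` and `max_pool` and the toy images are assumptions made for the example.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    Like deep-learning libraries, this does not flip the kernel, so strictly
    speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[y + i][x + j]
                for i in range(size) for j in range(size))
            for x in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for y in range(0, len(feature_map) - size + 1, size)
    ]


# A 6x6 image with a dark-to-bright vertical edge, and a copy shifted right by one pixel.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
shifted = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-picked filter that responds where brightness increases from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

response = conv2d(image, vertical_edge)
response_shifted = conv2d(shifted, vertical_edge)

print(response[0])                 # [0, 0, 2, 0, 0]
print(response_shifted[0])         # [0, 0, 0, 2, 0]
print(max_pool(response))          # [[0, 2], [0, 2]]
print(max_pool(response_shifted))  # [[0, 2], [0, 2]]
```

The filter's response moves with the edge, while the pooled map stays the same for this one-pixel shift. That is the mechanism behind a CNN recognising a pattern regardless of small changes in position; larger shifts would still change the pooled map, so the effect is approximate.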

Vision transformers and self-attention

More recently, the vision transformer (ViT) borrowed the architecture behind language models. Instead of sliding filters, a ViT chops the image into a grid of patches, treats each patch like a token, and uses self-attention so every patch can directly relate to every other patch in the image. This captures long-range relationships, such as how distant parts of a scene relate, in a single layer, whereas a CNN reaches them only indirectly by stacking many local layers. ViTs are data-hungry and originally needed enormous training sets to beat CNNs, but with sufficient data they match or surpass them, and hybrid designs that combine convolution with attention are now common. The practical upshot is that no single architecture dominates: the choice depends on data scale and task.
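The patch-and-attend idea can be sketched as follows. This is a deliberately stripped-down illustration: a real ViT also applies learned linear projections for queries, keys, and values, adds position embeddings, and uses multiple heads and layers, all of which are omitted here. The functions `to_patches` and `self_attention` and the toy 4x4 image are assumptions made for the example, and the image size is assumed to be divisible by the patch size.

```python
import math


def to_patches(image, patch):
    """Cut an image into non-overlapping patch x patch tiles, each flattened to a vector."""
    tokens = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Simplification: queries, keys, and values are the tokens themselves
    (identity projections) instead of learned linear maps.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# 4x4 image whose top-left and bottom-right quadrants look the same.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

tokens = to_patches(image, patch=2)  # 4 tokens, each of length 4
outputs, weights = self_attention(tokens)

print(len(tokens), len(tokens[0]))         # 4 4
print([round(w, 3) for w in weights[0]])   # how much patch 0 attends to each patch
```

In the printed row, the top-left patch gives the bottom-right patch the same weight it gives itself and less weight to the two blank patches. The link between opposite corners is made in a single step, with no need to pass information through intermediate neighbours.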

Foundation models and where vision is heading

The newest shift mirrors what happened in language: large foundation vision models pre-trained on massive datasets that adapt to many downstream tasks. CLIP learned a shared embedding space for images and text, enabling zero-shot classification and supplying the text-image alignment that several text-to-image generators build on. Segment Anything can isolate arbitrary objects from a simple prompt, such as a clicked point or a box, without task-specific training. These models change the workflow from “collect data and train a model for each task” to “adapt one capable general model,” dramatically lowering the barrier to building vision applications. Combined with multimodal models that reason jointly over images and text, computer vision is becoming less a standalone discipline and more one sense in a broader, general-purpose AI system.
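Zero-shot classification in a shared image-text space comes down to a similarity comparison, which the sketch below illustrates. It does not call CLIP or any real model: the three-dimensional embeddings are invented numbers standing in for encoder outputs, `zero_shot_classify` is a hypothetical helper, and the temperature value is an arbitrary assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Score each candidate caption against the image and return probabilities.

    `temperature` sharpens the softmax; 0.07 is an assumed value for this sketch.
    """
    labels = list(text_embeddings)
    logits = [cosine(image_embedding, text_embeddings[label]) / temperature
              for label in labels]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Invented embeddings: in practice these would come from an image encoder
# and a text encoder trained so that matching pairs land close together.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.1, 1.0],
}

probabilities = zero_shot_classify(image_embedding, text_embeddings)
best = max(probabilities, key=probabilities.get)
print(best)  # a photo of a dog
```

Because the candidate labels are just text, adding a new class means adding a new caption to the dictionary, with no retraining. That is what "zero-shot" refers to here.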
