What a CNN is
A convolutional neural network (CNN) is a type of deep neural network built specifically for data with a grid structure — most famously images, but also audio spectrograms and other 2D signals. Before CNNs, getting a computer to recognise objects in photographs required painstaking hand-engineering of features. CNNs replaced that with a model that learns its own features from data, and their success on the ImageNet benchmark in 2012 triggered the modern deep-learning revolution in vision. The core idea is simple and powerful: instead of connecting every input pixel to every neuron, a CNN scans the image with small reusable filters that detect local patterns, dramatically reducing the number of parameters while exploiting the spatial structure of images.
Convolutional layers and filters
The heart of a CNN is the convolutional layer. It applies a set of small filters (also called kernels) — typically 3×3 or 5×5 grids of learnable weights — that slide across the input. At each position, the filter multiplies its weights against the local pixels and sums the result, producing one value in an output feature map. A filter that learns to respond to horizontal edges will light up wherever a horizontal edge appears in the image. Because the same filter is reused at every location (weight sharing), the layer is efficient and translation-invariant: it detects a pattern no matter where it sits in the frame. A single layer uses many filters in parallel, each learning a different pattern, so the layer outputs a stack of feature maps.
Pooling, receptive fields, and the feature hierarchy
Two more ideas make CNNs work. Pooling layers downsample feature maps — for example, max pooling keeps only the strongest response in each small region. This shrinks the spatial dimensions, cuts computation, and adds robustness to small shifts. As convolution and pooling layers stack, each neuron’s receptive field — the slice of the original image that affects it — grows steadily. Early layers, with small receptive fields, detect primitives like edges and colours; middle layers combine those into textures and parts like eyes or wheels; deep layers, seeing most of the image, recognise whole objects. This emergent feature hierarchy, learned automatically from data, is exactly what made CNNs so much better than the hand-crafted features that preceded them.
Landmark architectures
A handful of architectures defined the CNN era. AlexNet (2012) was the model that won ImageNet by a huge margin and proved deep CNNs on GPUs could crush classical methods. VGG showed that depth with simple stacked 3×3 convolutions worked well. ResNet (2015) solved the problem that very deep networks became hard to train, introducing residual (skip) connections that let gradients flow through hundreds of layers — enabling networks far deeper than before and becoming a near-universal backbone. EfficientNet later showed how to scale depth, width, and input resolution together for the best accuracy per unit of compute. Each step pushed accuracy higher while teaching general lessons about training very deep networks.
Where CNNs stand today
CNNs remain workhorses of practical computer vision: face recognition, medical imaging, manufacturing inspection, self-driving perception, and countless mobile vision features run on them, often in efficient forms designed for phones. The newer vision transformers (ViTs) challenge CNNs by using self-attention over image patches and can surpass them given enough data, and many modern systems blend convolution with attention. But the convolutional building blocks — local filters, weight sharing, pooling, and a learned feature hierarchy — remain foundational ideas in deep learning, and understanding them is the clearest way to grasp how machines learned to see.