Question 1

What is computer vision in simple terms?

Accepted Answer

Computer vision is the field of AI that lets machines extract meaning from images and video. Instead of just storing pixels, a computer vision system can say what objects are present, where they are, and sometimes what is happening. It turns raw visual data into structured information that software can act on.

Question 2

What are the main computer vision tasks?

Accepted Answer

The core tasks are classification (what is in this image), object detection (what objects are present and where, marked with boxes), and segmentation (which exact pixels belong to each object). Beyond these are tasks like pose estimation, depth prediction, tracking across video frames, and image captioning, which combines vision with language.
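To make the difference between the three core tasks concrete, here is a minimal sketch that starts from a hand-written segmentation mask and derives a detection box and a classification label from it. The mask, the label name "cat", and the helper names are all made up for illustration; a real system would predict each output with a trained model rather than derive one from another.

```python
# Toy 4x6 segmentation mask: 0 = background, 1 = "cat".
# The values and the label are invented for illustration only.
MASK = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
LABELS = {1: "cat"}


def classes_present(mask):
    """Classification-style output: which non-background classes appear."""
    return sorted({value for row in mask for value in row if value != 0})


def bounding_box(mask, class_id):
    """Detection-style output: (x_min, y_min, x_max, y_max), or None if absent."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value == class_id
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


# Segmentation: one class id per pixel (the mask itself).
segmentation = MASK
# Detection: one label and one box per object class.
detection = [(LABELS[c], bounding_box(MASK, c)) for c in classes_present(MASK)]
# Classification: labels only, no location.
classification = [LABELS[c] for c in classes_present(MASK)]

print(classification)
print(detection)
```

Each step down the list throws away spatial detail: the mask knows every pixel, the box only the extent, and the label only that the object is somewhere in the image.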

Question 3

What is the difference between a CNN and a vision transformer?

Accepted Answer

A convolutional neural network (CNN) processes images with small local filters that slide across the image to detect patterns like edges and textures, and stacks layers of them to build up to complex shapes. A vision transformer (ViT) splits the image into patches and uses self-attention so every patch can relate to every other patch, capturing long-range relationships from the first layer. ViTs often match or beat CNNs when pre-trained on very large datasets, while CNNs tend to hold up better when training data is limited.
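The two mechanisms can be sketched side by side in plain Python. This is a simplified illustration, not a real model: the image, the 1x2 edge kernel, and the function names are assumptions, and the attention step omits the learned query/key/value projections and position embeddings that an actual vision transformer uses.

```python
import math


def conv2d_valid(image, kernel):
    """Slide a small kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def split_into_patches(image, size):
    """Cut the image into non-overlapping size x size patches, each flattened.

    Assumes the image height and width are multiples of `size`.
    """
    patches = []
    for i in range(0, len(image), size):
        for j in range(0, len(image[0]), size):
            patches.append(
                [image[i + a][j + b] for a in range(size) for b in range(size)]
            )
    return patches


def self_attention(tokens):
    """Scaled dot-product self-attention using the tokens as Q, K, and V."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Toy 4x4 image: dark left half, bright right half.
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
EDGE_KERNEL = [[-1, 1]]  # responds where brightness rises left to right

edges = conv2d_valid(IMAGE, EDGE_KERNEL)    # CNN-style local filter
patches = split_into_patches(IMAGE, 2)      # ViT-style patch tokens
attended, weights = self_attention(patches) # every patch attends to every patch

print(edges)
print(len(patches), len(weights[0]))
```

The convolution output at each position depends only on the two pixels under the kernel, whereas each row of `weights` has one entry per patch, so a patch in one corner can draw directly on a patch in the opposite corner.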

Question 4

What are foundation vision models?

Accepted Answer

Foundation vision models are large models pre-trained on huge image datasets that can be adapted to many tasks with little extra data. Examples include CLIP, which links images and text in a shared embedding space, and Segment Anything, which segments arbitrary objects from a prompt such as a point or a box. They shift the workflow from training a model per task to adapting one powerful general model.
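As a rough sketch of how a model that links images and text can be reused without task-specific training, the example below classifies an image by comparing its embedding with the embeddings of candidate captions. The three-dimensional vectors, the captions, and the function names are invented for illustration; a real model such as CLIP would produce much larger embeddings from its own image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the caption whose embedding is closest to the image embedding."""
    scores = {
        caption: cosine_similarity(image_embedding, vector)
        for caption, vector in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings, hand-written for this sketch.
TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
IMAGE_EMBEDDING = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(IMAGE_EMBEDDING, TEXT_EMBEDDINGS)
print(label)
```

Changing the task here means changing the list of captions, not retraining anything, which is the workflow shift the answer describes.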

What Is Computer Vision? How AI Sees and Understands Images

What computer vision is

The core tasks

Convolutional neural networks: the breakthrough

Vision transformers and self-attention

Foundation models and where vision is heading