Question 1

What is CLIP?

Accepted Answer

CLIP, short for Contrastive Language-Image Pre-training, is a model from OpenAI that learns a shared embedding space for images and text. It was trained on roughly 400 million image-caption pairs collected from the web to recognise which captions belong to which images.

Question 2

How does CLIP do zero-shot classification?

Accepted Answer

Instead of a fixed list of labels, you give CLIP candidate text descriptions such as "a photo of a cat". It embeds the image and each description, then picks the description whose embedding is most similar to the image embedding, typically measured by cosine similarity. This lets it classify into categories it was never explicitly trained on.
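The selection step can be sketched in a few lines of Python. This is a minimal illustration, not CLIP's actual code: the 3-dimensional vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` and its `logit_scale` value are names and settings assumed here for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Return (best_label, probabilities) for one image embedding.

    logit_scale is an assumed temperature factor that sharpens the softmax.
    """
    sims = [cosine(image_emb, t) for t in text_embs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical embeddings; a real system would get these from the
# image encoder and the text encoder.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(label)
```

Changing the label set only means changing the list of descriptions; nothing is retrained, which is the sense in which the classification is zero-shot.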

Question 3

What is contrastive learning in CLIP?

Accepted Answer

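Contrastive learning trains the model to pull matching image-caption pairs close together in embedding space while pushing mismatched pairs apart. Within each training batch, every image is scored against every caption, and only the true pairs count as matches. Over hundreds of millions of pairs this teaches CLIP a shared space where related images and text land near each other.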
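A rough sketch of this objective, assuming the symmetric cross-entropy form commonly described for CLIP: build a similarity matrix between all images and all captions in a batch, then reward high scores on the diagonal, where the true pairs sit. The function name `clip_contrastive_loss`, the fixed `temperature` of 0.07, and the 2-dimensional toy vectors are assumptions for illustration; in a real model the temperature is learned and the embeddings come from the encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Image i and caption i form the only positive pair in row i and column i.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should favour its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same idea applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: captions that line up with their images versus captions
# that have been swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = clip_contrastive_loss(images, aligned_texts)
loss_swapped = clip_contrastive_loss(images, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is low when each image scores highest against its own caption and high when the pairs are mismatched, so minimising it pulls true pairs together and pushes the rest apart.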

Question 4

How is CLIP used in diffusion models?

Accepted Answer

Many text-to-image diffusion models, the original Stable Diffusion among them, use CLIP's text encoder to turn a prompt into embeddings that condition the image generation. Because CLIP already aligns text with visual concepts, its embeddings give the generator a strong signal of what to draw.
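One common way to feed text embeddings into the generator is cross-attention, where each position in the image representation looks up the prompt tokens most relevant to it. The sketch below assumes that mechanism and strips it to the core: real models apply learned projections to the queries, keys, and values, which are omitted here, and all vectors and names (`cross_attention`, `prompt_tokens`, `latents`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query (an image position) receives a weighted average of the
    values (prompt-token embeddings), weighted by query-key similarity.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical per-token embeddings standing in for text-encoder output.
prompt_tokens = [[4.0, 0.0], [0.0, 4.0]]
# Hypothetical image positions inside the generator.
latents = [[3.0, 0.0], [0.0, 3.0]]

conditioned = cross_attention(latents, prompt_tokens, prompt_tokens)
print(conditioned)
```

Each image position ends up dominated by the prompt token it matches best, which is how the text steers what appears where.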

CLIP (AI Glossary)
