Definition
CLIP, short for Contrastive Language-Image Pre-training, is a model released by OpenAI in 2021 that learns to map images and text into a single shared embedding space. Trained on roughly 400 million image-caption pairs collected from the web, CLIP learns which captions go with which images, which lets it recognize a broad range of visual concepts described in natural language.
How contrastive pre-training works
CLIP has two encoders: one for images, one for text. During training, each batch of image-caption pairs is passed through both encoders, and the model is optimized with a contrastive objective that compares every image in the batch with every caption:
- Pull together the embeddings of each image and its true caption.
- Push apart the embeddings of each image and the other captions in the batch, which do not belong to it.
Over hundreds of millions of pairs, this forces matching images and text to land close together in the embedding space and mismatches to land far apart — without ever needing fixed category labels.
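The pull-together, push-apart objective can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: it assumes the embeddings have already been produced by the two encoders, uses a symmetric cross-entropy over the batch similarity matrix, and fixes the temperature at an illustrative value. The function names and the random stand-in embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) arrays where row i of each forms a true pair.
    # temperature is an illustrative fixed value (an assumption of this sketch).
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    diag = np.arange(logits.shape[0])              # true pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # caption -> image
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings: random vectors, not real encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))  # each image near its caption
shuffled_images = aligned_images[[1, 2, 3, 0]]              # every pairing is wrong

loss_aligned = contrastive_loss(aligned_images, text_emb)
loss_shuffled = contrastive_loss(shuffled_images, text_emb)
print(loss_aligned < loss_shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairings are scrambled, which is the signal that training uses to move matching pairs together.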
Zero-shot classification
CLIP’s signature trick is zero-shot image classification. Rather than training a classifier for a fixed label set, you supply candidate text prompts such as “a photo of a dog” or “a photo of a car.” CLIP embeds the image and each prompt, then chooses the prompt whose embedding is most similar to the image embedding, typically measured by cosine similarity. Because the categories are just text, you can classify into novel classes the model was never explicitly trained on simply by writing new prompts.
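The selection step can be sketched as a nearest-prompt lookup. The example below is a toy under stated assumptions: the three-dimensional vectors are hand-written stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, prompts):
    # Normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = prompt_embs @ image_emb   # one similarity score per prompt
    best = int(np.argmax(sims))      # index of the most similar prompt
    return prompts[best], sims

prompts = ["a photo of a dog", "a photo of a car", "a photo of a cat"]

# Hand-written stand-in embeddings (assumption: a real system would get
# these from the text encoder and the image encoder respectively).
prompt_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.7, 0.0, 0.7],
])
image_emb = np.array([0.1, 0.8, 0.2])

label, sims = zero_shot_classify(image_emb, prompt_embs, prompts)
print(label)
```

Adding a new class only requires appending another prompt and its embedding; nothing is retrained.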
CLIP inside diffusion models
CLIP’s text encoder became a standard component of text-to-image diffusion models such as Stable Diffusion. The encoder converts a prompt into an embedding that already captures visual meaning, and that embedding conditions the diffusion process so the generated image matches the prompt. Because CLIP has already aligned language with imagery, it gives the generator a strong signal about what to draw.
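The text above does not say how the embedding conditions the diffusion process. One common mechanism, used in latent diffusion models, is cross-attention, where image features attend over the text encoder's per-token outputs. The sketch below illustrates that idea only; the shapes, the random weights, and the `cross_attention` function are assumptions made for the example, not any particular model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_tokens, w_q, w_k, w_v):
    # Queries come from the image side; keys and values come from the text side.
    q = image_feats @ w_q   # (P, d)
    k = text_tokens @ w_k   # (T, d)
    v = text_tokens @ w_v   # (T, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (P, T)
    return weights @ v, weights  # each image position mixes in text information

# Hypothetical sizes: 16 image positions, 5 prompt tokens.
rng = np.random.default_rng(0)
P, T, d_img, d_txt, d = 16, 5, 32, 24, 8
image_feats = rng.normal(size=(P, d_img))  # stand-in for denoiser features
text_tokens = rng.normal(size=(T, d_txt))  # stand-in for text encoder outputs
w_q = rng.normal(size=(d_img, d))
w_k = rng.normal(size=(d_txt, d))
w_v = rng.normal(size=(d_txt, d))

conditioned, weights = cross_attention(image_feats, text_tokens, w_q, w_k, w_v)
print(conditioned.shape, weights.shape)
```

Each row of `weights` sums to one, so every image position receives a weighted blend of the prompt's token representations at each denoising step.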
Why it matters
CLIP demonstrated that a single model could learn open-ended visual concepts directly from captioned web images, without hand-labelled datasets. That idea, aligning modalities in a shared embedding space through contrastive learning, underpins much of today’s multimodal AI, from image search and content moderation to the text-to-image systems that brought generative imagery to the mainstream.