Definition
Transfer learning is the technique of reusing knowledge a model has already acquired on one task to improve or speed up learning on a different, usually related, task. Rather than training a fresh model from random weights, you start from a model that has already learned useful, general-purpose representations and adapt it to your specific problem. It is the central idea behind today’s foundation models: train once on a massive general corpus, then specialise many times.
The pre-train then fine-tune paradigm
Most modern deep-learning systems follow a two-stage recipe:
- Pre-training — a model learns broad patterns from an enormous, often unlabelled dataset. For language models this means predicting the next token across the internet; for vision models it means learning edges, textures, and shapes from millions of images.
- Fine-tuning — that pre-trained model is then trained further on a much smaller, task-specific dataset (legal documents, medical images, your support tickets) so it specialises in the target task.
The pre-trained model carries general competence; fine-tuning steers it toward a particular job.
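The two stages can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the model is a one-variable linear fit, the "source" and "target" datasets are synthetic, and the step counts and learning rate are arbitrary. It only shows the shape of the recipe, namely that starting from pre-trained weights gets closer to the target in a short training run than starting from scratch.

```python
def train(w, b, data, steps, lr=0.05):
    """Plain gradient descent on mean squared error for the model y = w*x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Stage 1, "pre-training": many examples from a synthetic source task (y = 2x + 1).
source = [(i / 100, 2 * (i / 100) + 1) for i in range(-100, 101)]
w_pre, b_pre = train(0.0, 0.0, source, steps=500)

# Stage 2, "fine-tuning": five examples from a related target task (y = 2x + 1.5),
# with a short budget of 20 steps, starting from the pre-trained weights.
target = [(x, 2 * x + 1.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w_ft, b_ft = train(w_pre, b_pre, target, steps=20)

# Baseline: the same small dataset and the same 20 steps, but from zero weights.
w_scratch, b_scratch = train(0.0, 0.0, target, steps=20)

error_fine_tuned = mse(w_ft, b_ft, target)
error_from_scratch = mse(w_scratch, b_scratch, target)
print(f"fine-tuned error:   {error_fine_tuned:.4f}")
print(f"from-scratch error: {error_from_scratch:.4f}")
```

With the same data and the same step budget, the fine-tuned model ends with a much lower error because it only has to correct the offset, while the from-scratch model still has to learn the slope as well.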
Why it works — and why it saves resources
The costly part of building an AI system is learning the general representations, such as how language is structured and which visual features matter. Once a model has those, the remaining task is mostly adaptation, which is far cheaper. As a result, transfer learning can greatly reduce the labelled data and compute a downstream task needs: a few thousand examples and a short training run can often rival what would otherwise require millions of examples trained from scratch.
Approaches: feature extraction vs fine-tuning
There is a spectrum of how much of the original model you change:
- Feature extraction — freeze the pre-trained network and train only a small new “head” on top. Fast, cheap, and resistant to overfitting on small datasets.
- Full or partial fine-tuning — unfreeze some or all original weights and update them too. More expensive and data-hungry, but capable of higher accuracy when the new task differs more from the original.
- Parameter-efficient methods (e.g. LoRA) — update only a small number of added parameters, capturing much of fine-tuning’s benefit at a fraction of the cost.
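One way to make the three options concrete is to count how many parameters each one actually trains. The sketch below is a rough illustration under stated assumptions: the network shape (three 512x512 backbone layers and a 10-class head), the LoRA rank of 8, and the helper names are all hypothetical, and it assumes the new head is trained in every strategy. The LoRA count uses the standard low-rank form, where a frozen weight matrix of shape d_out x d_in gains two small trainable matrices of shapes d_out x r and r x d_in.

```python
# Hypothetical network: (inputs, outputs) for each dense layer.
BACKBONE = [(512, 512), (512, 512), (512, 512)]  # pre-trained layers
HEAD = (512, 10)                                 # new task-specific layer


def dense_params(shape):
    """Weights plus biases of one dense layer."""
    d_in, d_out = shape
    return d_in * d_out + d_out


def lora_params(shape, rank):
    """Trainable parameters added by a rank-`rank` LoRA adapter on one layer."""
    d_in, d_out = shape
    return rank * (d_in + d_out)


def trainable_params(strategy, rank=8):
    """Count trainable parameters; the head is assumed trainable in all cases."""
    head = dense_params(HEAD)
    if strategy == "feature_extraction":
        return head  # backbone frozen, only the head is trained
    if strategy == "full_fine_tuning":
        return sum(dense_params(s) for s in BACKBONE) + head
    if strategy == "lora":
        # backbone weights stay frozen; only the small adapters are trained
        return sum(lora_params(s, rank) for s in BACKBONE) + head
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("feature_extraction", "lora", "full_fine_tuning"):
    print(f"{name:20s} {trainable_params(name):>8d} trainable parameters")
```

In this toy setup, feature extraction trains only the head, LoRA trains a few tens of thousands of parameters, and full fine-tuning trains several hundred thousand. The ratios in a real model depend on its architecture and on which layers receive adapters.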
Where you see it
Transfer learning is everywhere in practice. Every fine-tuned LLM, every image classifier built on a pre-trained backbone, and every embedding model adapted to a domain is an instance of it. It is also why zero-shot and few-shot prompting work at all: a sufficiently pre-trained model already carries enough transferable knowledge to handle new tasks from instructions or a handful of examples supplied in the prompt, with no weight updates at all.
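The difference between zero-shot and few-shot prompting comes down to what goes into the prompt text. The sketch below only builds the prompt string and does not call any model. The function name, the "Input:/Output:" layout, and the sentiment examples are all made up for illustration, since real systems use whatever prompt format suits their model.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: the instruction, any worked examples, then the new input."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment as positive or negative."
query = "The battery died in a day."

# Zero-shot: instruction and query only, no worked examples.
zero_shot = build_prompt(instruction, [], query)

# Few-shot: the same instruction plus a handful of labelled examples.
few_shot = build_prompt(
    instruction,
    [
        ("I love this phone.", "positive"),
        ("The screen cracked immediately.", "negative"),
    ],
    query,
)

print(few_shot)
```

In both cases the model's weights stay fixed. The examples influence the answer only through the context the model reads.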