The core idea
LoRA (Low-Rank Adaptation) is a way to fine-tune a large model without updating any of its original weights. Instead of updating the full, enormous weight matrices, LoRA freezes the original model and learns a small, low-rank correction that is added on top. The insight is that the change needed to adapt a model to a new task usually lives in a low-dimensional space, so it can be captured by two small matrices whose product approximates the full weight update. You get most of the benefit of fine-tuning while training a tiny fraction of the parameters.
How the maths works (gently)
A weight matrix W in a transformer might be, say, 4096×4096, which is about 16.8 million
numbers. Full fine-tuning would learn an update ΔW of the same size. LoRA instead
represents that update as the scaled product of two thin matrices, ΔW = (alpha/r) · B · A,
where A is r × 4096 and B is 4096 × r, with the rank r small (often 8 or 16) and alpha a
scaling constant. The frozen W is used as-is, and only A and B are trained. At inference,
W + (alpha/r)·B·A stands in for the fully fine-tuned matrix. With r = 8, A and B together
hold 65,536 numbers, 256 times fewer than ΔW would.
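The forward pass and the parameter count described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (matvec, lora_forward, lora_param_count) and the toy sizes are made up for the example, and starting B at zero is an assumed (though common) initialisation that makes the adapter a no-op before training.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without ever forming B A."""
    base = matvec(W, x)        # frozen path through the original weights
    low = matvec(A, x)         # project down to r dimensions
    update = matvec(B, low)    # project back up to the output dimension
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_out, d_in, r):
    """Trainable values in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

# Toy sizes so the example runs instantly; real layers are far larger.
d, r, alpha = 6, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r (assumed zero init)
x = [rng.gauss(0, 1) for _ in range(d)]

y = lora_forward(W, A, B, x, alpha, r)

# Parameter count for the 4096 x 4096 example with r = 8.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(full, lora, full // lora)  # prints 16777216 65536 256
```

Because B is all zeros here, y equals the output of the frozen W alone; training would move A and B away from that starting point.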
The key hyperparameters: r and alpha
Two settings control LoRA. Rank r sets the adapter’s capacity: larger r can capture
more complex adaptations but uses more memory and can overfit small datasets. Alpha is a
scaling factor; in standard LoRA the update is multiplied by alpha / r, so alpha and rank are
tuned as a pair rather than independently. A common, safe starting point is r = 8–16 with
alpha = 16–32. You also choose which layers get adapters — attention query/value
projections are the classic targets, though adapting more projection matrices can help on
harder tasks.
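To see why r and alpha are tuned as a pair, it can help to compute the effective multiplier for a few settings. The sketch below assumes the standard alpha / r scaling; the function name lora_scale, the example_config dictionary and the module names in it are hypothetical, chosen only for illustration.

```python
def lora_scale(alpha, r):
    """Multiplier applied to B A under the standard alpha / r scaling."""
    return alpha / r

# Hypothetical settings; the module names are placeholders, since the real
# names depend on the model being adapted.
example_config = {
    "r": 8,
    "alpha": 16,
    "target_modules": ["query_projection", "value_projection"],
}

# Doubling r while doubling alpha keeps the scale fixed; raising r alone shrinks it.
for r, alpha in [(8, 16), (16, 32), (16, 16), (64, 16)]:
    print(f"r={r:>2} alpha={alpha:>2} scale={lora_scale(alpha, r)}")
```

The first two settings give the same scale of 2.0, while r = 64 with alpha = 16 gives 0.25, so changing rank without revisiting alpha also changes how strongly the adapter is applied.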
Why it is so efficient
Full fine-tuning has to store gradients and optimiser state for every weight, which is the
real memory hog. LoRA only optimises the small A and B matrices, cutting trainable
parameters — and the associated optimiser memory — by 99% or more. That makes it
possible to fine-tune large models on a single GPU. QLoRA pushes this further by
quantising the frozen base model to 4-bit precision and training LoRA adapters on top, so
even very large models fit on modest hardware with little measurable quality loss. The
frozen base also means you can ship just the small adapter file (typically megabytes, versus gigabytes for the base model) and
load it onto the shared base on demand.
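A rough back-of-the-envelope calculation makes the savings concrete. Everything in this sketch is an assumption for illustration: a hypothetical model with 7 billion parameters, 32 layers and hidden size 4096, adapters with r = 8 on two 4096×4096 projections per layer, and an Adam-style optimiser that keeps two 32-bit state values per trainable parameter alongside a 32-bit gradient. It counts only gradients and optimiser state, not the weights themselves or activations.

```python
def training_state_bytes(trainable_params, bytes_per_value=4, optimiser_states=2):
    """Bytes for one gradient plus the optimiser state values per trainable parameter."""
    return trainable_params * bytes_per_value * (1 + optimiser_states)

# Hypothetical model shape (assumed, not taken from any specific model).
total_params = 7_000_000_000
layers, hidden, r = 32, 4096, 8
adapted_matrices_per_layer = 2  # e.g. the query and value projections

# Each adapted matrix gains A (r x hidden) and B (hidden x r).
lora_params = layers * adapted_matrices_per_layer * (r * hidden + hidden * r)

print(f"trainable LoRA parameters: {lora_params:,}")
print(f"share of all parameters:   {lora_params / total_params:.4%}")
print(f"full fine-tuning state:    {training_state_bytes(total_params) / 1e9:.1f} GB")
print(f"LoRA training state:       {training_state_bytes(lora_params) / 1e6:.1f} MB")

# Size of the adapter on disk if stored at 16 bits per value.
adapter_file_bytes = lora_params * 2
```

Under these assumptions the adapters hold about 4.2 million trainable parameters, roughly 0.06% of the model, and the adapter file comes to about 8 MB at 16-bit precision.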
Practical benefits and trade-offs
Beyond memory savings, LoRA’s biggest practical win is modularity: one frozen base model can host many task-specific adapters, swapped in and out cheaply, and adapters can be merged back into the weights for zero-overhead inference when needed. The trade-offs are real but manageable. Because capacity is limited by the rank, LoRA may slightly underperform full fine-tuning on tasks that require sweeping changes to the model’s behaviour, and choosing which layers to adapt and what rank to use takes some experimentation. For the vast majority of instruction-tuning and domain-adaptation jobs, though, LoRA (or QLoRA) gives nearly the quality of full fine-tuning at a tiny fraction of the cost — which is why it has become the default way to fine-tune open models.
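Adapter swapping and merging can be illustrated with a small pure-Python sketch. It assumes the standard merge rule W + (alpha/r)·B·A; the helper names and the two adapters, "task_a" and "task_b", are hypothetical and filled with random numbers rather than trained values. The point is only that the merged matrix gives the same output as the frozen base plus the side path.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def unmerged_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank side path."""
    scale = alpha / r
    return [b + scale * u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

def merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A as a new matrix; W itself is not modified."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

d, r, alpha = 5, 2, 4
rng = random.Random(1)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

W = rand_matrix(d, d)  # one shared, frozen base matrix
adapters = {           # hypothetical task adapters: (A is r x d, B is d x r)
    "task_a": (rand_matrix(r, d), rand_matrix(d, r)),
    "task_b": (rand_matrix(r, d), rand_matrix(d, r)),
}
x = [rng.gauss(0, 1) for _ in range(d)]

diffs = {}
for name, (A, B) in adapters.items():
    y_side = unmerged_forward(W, A, B, x, alpha, r)   # adapter kept separate
    y_merged = matvec(merge(W, A, B, alpha, r), x)    # adapter folded into W
    diffs[name] = max(abs(p - q) for p, q in zip(y_side, y_merged))
    print(name, "outputs match:", diffs[name] < 1e-9)
```

Keeping the adapter separate allows cheap swapping between tasks on one base, while merging removes the extra matrix multiplications at inference at the cost of producing one full-size matrix per task.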