What Is LoRA? Low-Rank Adaptation for Efficient LLM Fine-Tuning

Fine-tune a giant model by training only a tiny set of extra weights

The core idea

LoRA (Low-Rank Adaptation) is a way to fine-tune a large model without changing any of its original weights. Instead of updating the full, enormous weight matrices, LoRA freezes the original model and learns a small, low-rank correction that is added on top. The insight is that the change needed to adapt a model to a new task tends to have low rank, so it can be approximated by the product of two small matrices. You get most of the benefit of fine-tuning while training a tiny fraction of the parameters.

How the maths works (gently)

A weight matrix W in a transformer might be, say, 4096×4096, which is about 16.8 million numbers. Full fine-tuning would learn an update ΔW of the same size. LoRA instead represents that update as the product of two thin matrices, ΔW = B · A, where A is r × 4096 and B is 4096 × r, with the rank r small (often 8 or 16). The frozen W is used as-is, and only A and B are trained. At inference, the effective weight W + (alpha/r)·B·A stands in for a fully fine-tuned matrix. With r = 8, A and B together hold 65,536 numbers, 256 times fewer than ΔW would.
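To make the saving concrete, here is a minimal Python sketch that counts the numbers in a dense update versus a LoRA pair for a single 4096×4096 matrix. The helper name lora_param_counts is made up for illustration and is not part of any library.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple:
    """Return (dense_update_params, lora_params) for one weight matrix."""
    dense = d_out * d_in           # entries in a full-size update
    lora = r * d_in + d_out * r    # entries in A (r x d_in) plus B (d_out x r)
    return dense, lora


for r in (8, 16):
    dense, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r}: dense={dense:,} lora={lora:,} ratio={dense // lora}x")
# r=8: dense=16,777,216 lora=65,536 ratio=256x
# r=16: dense=16,777,216 lora=131,072 ratio=128x
```

Doubling the rank doubles the adapter size, so the ratio halves, but even at r = 16 the adapter is under 1% of the dense update.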

The key hyperparameters: r and alpha

Two settings control LoRA. Rank r sets the adapter’s capacity: larger r can capture more complex adaptations but uses more memory and can overfit small datasets. Alpha is a scaling factor: in the standard formulation the LoRA update is multiplied by alpha / r, so changing the rank also changes the effective scale unless alpha is adjusted with it. That is why the two are tuned as a pair rather than independently. A common, safe starting point is r = 8–16 with alpha = 16–32. You also choose which layers get adapters: the attention query and value projections are the classic targets, though adapting more projection matrices can help on harder tasks.
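The interaction between the two settings is easy to see in a few lines of Python. This is only a sketch: LoraSettings is a hypothetical container, not a real library class, and the layer names q_proj and v_proj are assumed placeholders for the query and value projections, which are named differently from model to model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical bundle of the choices described above."""
    r: int = 8
    alpha: float = 16.0
    target_modules: tuple = ("q_proj", "v_proj")  # assumed layer names

    @property
    def scale(self) -> float:
        # Multiplier applied to B @ A in the standard formulation.
        return self.alpha / self.r


for settings in (
    LoraSettings(r=8, alpha=16.0),
    LoraSettings(r=16, alpha=16.0),   # rank doubled, alpha unchanged
    LoraSettings(r=16, alpha=32.0),   # alpha raised along with the rank
):
    print(f"r={settings.r} alpha={settings.alpha} scale={settings.scale}")
# r=8 alpha=16.0 scale=2.0
# r=16 alpha=16.0 scale=1.0
# r=16 alpha=32.0 scale=2.0
```

Raising the rank alone halves the multiplier on the update, while raising alpha with it keeps the multiplier where it was.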

Why it is so efficient

Full fine-tuning has to store gradients and optimiser state for every weight, which is the real memory hog. LoRA only optimises the small A and B matrices, cutting trainable parameters, and the associated gradient and optimiser memory, by 99% or more. That makes it possible to fine-tune large models on a single GPU. The frozen base weights still have to sit in memory, though, and QLoRA tackles that part by quantising the frozen base model to 4-bit precision and training LoRA adapters on top, so even very large models fit on modest hardware with little measurable quality loss. The frozen base also means you can ship just the small adapter file (often a few megabytes) and load it onto the shared base on demand.
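A rough back-of-envelope calculation shows where the savings come from. The sketch below assumes a hypothetical 7-billion-parameter model with 32 layers and hidden size 4096, rank-8 adapters on the query and value projections, an Adam-style optimiser that keeps two extra values per trainable weight, and 32-bit gradients and optimiser state. Real numbers depend on the model, optimiser and precision, and the base weights and activations are left out.

```python
BYTES_FP32 = 4
BYTES_FP16 = 2


def training_state_bytes(trainable_params: int) -> int:
    """Gradient plus two optimiser buffers per weight, all assumed 32-bit."""
    return trainable_params * 3 * BYTES_FP32


# Hypothetical model and adapter layout (assumptions, not measurements).
base_params = 7_000_000_000
layers, adapted_per_layer, hidden, r = 32, 2, 4096, 8
lora_params = layers * adapted_per_layer * (r * hidden + hidden * r)

full_state = training_state_bytes(base_params)
lora_state = training_state_bytes(lora_params)
adapter_file = lora_params * BYTES_FP16  # adapter saved in 16-bit precision

print(f"LoRA trainable params: {lora_params:,}")
print(f"trainable fraction: {lora_params / base_params:.4%}")
print(f"full fine-tuning state: {full_state / 1e9:.0f} GB")
print(f"LoRA training state: {lora_state / 1e6:.0f} MB")
print(f"adapter file: {adapter_file / 1e6:.1f} MB")
```

Under those assumptions the gradient and optimiser state shrinks from about 84 GB to about 50 MB, and the saved adapter comes to roughly 8.4 MB.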

Practical benefits and trade-offs

Beyond memory savings, LoRA’s biggest practical win is modularity: one frozen base model can host many task-specific adapters, swapped in and out cheaply, and adapters can be merged back into the weights for zero-overhead inference when needed. The trade-offs are real but manageable. Because capacity is limited by the rank, LoRA may slightly underperform full fine-tuning on tasks that require sweeping changes to the model’s behaviour, and choosing which layers to adapt and what rank to use takes some experimentation. For the vast majority of instruction-tuning and domain-adaptation jobs, though, LoRA (or QLoRA) gives nearly the quality of full fine-tuning at a tiny fraction of the cost — which is why it has become the default way to fine-tune open models.
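Merging is just a matrix addition, which a toy example can show. The sketch below uses plain Python lists and made-up 3×3 values (the helper names matmul, add_scaled and merge are illustrative, not from any library) to check that the merged weight gives the same output as running the base and adapter side by side, and that subtracting the same term restores the original base.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add_scaled(W, D, s):
    """Return W + s * D, element by element."""
    return [[w + s * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, D)]


def merge(W, B, A, alpha, r):
    """Fold the adapter into the base: W + (alpha / r) * B @ A."""
    return add_scaled(W, matmul(B, A), alpha / r)


# Toy sizes: a 3x3 frozen base weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]   # d_out x r
A = [[0.5, 0.0, 1.0]]       # r x d_in
alpha, r = 2.0, 1
x = [[1.0], [2.0], [3.0]]   # input as a column vector

# Unmerged: base output plus the scaled adapter output.
h_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), alpha / r)

# Merged: one matrix multiply, so no extra work at inference.
W_merged = merge(W, B, A, alpha, r)
h_merged = matmul(W_merged, x)

# Unmerging subtracts the same term, freeing the base for another adapter.
W_restored = add_scaled(W_merged, matmul(B, A), -alpha / r)

print(h_adapter == h_merged, W_restored == W)
```

The toy values were picked so the floating-point arithmetic is exact. With real weights the two paths agree only up to rounding error, so a tolerance-based comparison would be the safer check.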
