LoRA — Low-Rank Adaptation (AI Glossary)

Fine-tune a 70B model on a laptop by only updating a tiny fraction of weights.

Ad placeholder (leaderboard)

What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for adapting large neural networks — especially LLMs — without retraining all of their weights. Instead of updating the model’s billions of parameters directly, LoRA freezes the original weights and inserts small, trainable “adapter” matrices. Only those adapters are trained, which can be fewer than 1% of the model’s parameters. The result is a practical way to customise a giant model on modest hardware.

The low-rank idea

The insight behind LoRA is that the change you need to apply to a big weight matrix during fine-tuning is usually low-rank — it can be approximated well by much smaller matrices. So rather than learning a full update matrix, LoRA learns two slim matrices whose product approximates that update. If the original matrix is large, the two slim matrices together hold a tiny fraction of the values, yet they capture most of the useful adaptation.

Because the base weights stay frozen, the original model is untouched and the learned adapter is a small, separate file.

QLoRA: going even lighter

QLoRA extends LoRA by first quantising the frozen base model to 4-bit precision, then training LoRA adapters on top. Quantisation slashes the memory needed to hold the base model, so techniques that once required a cluster of high-end GPUs can run on a single consumer or prosumer card. QLoRA made fine-tuning very large open-weight models accessible to hobbyists and small teams.

Why people use LoRA

LoRA and QLoRA are popular because they deliver most of the benefit of full fine-tuning at a fraction of the cost:

  • Low memory — only the small adapters need gradients and optimiser state.
  • Fast and cheap — fewer trainable parameters means quicker training runs.
  • Tiny artefacts — adapter files are megabytes, not gigabytes, so you can keep many task-specific adapters and load them on demand.
  • Composable — you can swap adapters in and out of one shared base model, serving many fine-tuned behaviours without storing many full models.

The main trade-off is that for some demanding or large-scale adaptations, full fine-tuning can still squeeze out slightly higher quality. For the vast majority of customisation tasks, though, LoRA-style methods are the default choice.

Ad placeholder (rectangle)