How to Fine-Tune an LLM with Hugging Face

LoRA and QLoRA fine-tuning on any open-source model

Ad placeholder (leaderboard)

When fine-tuning is the right tool

Fine-tuning adjusts a model’s weights on your own examples so it reliably produces a particular style, format, or skill. It is not the first thing to reach for — prompting and retrieval-augmented generation are cheaper and solve most problems — but when you need consistent behaviour (a fixed JSON shape, a brand voice, a domain-specific task the base model fumbles), fine-tuning is the right tool. The modern Hugging Face stack makes this accessible: with LoRA and QLoRA you can fine-tune a capable open model on a single GPU in an afternoon.

Preparing the dataset

Everything starts with data, and the format matters. Most instruction fine-tuning uses a chat or instruction layout — each example is a prompt and the ideal completion, often as a list of role-tagged messages. You build this as a Hugging Face datasets.Dataset, apply the model’s chat template so special tokens match exactly what the base model expects, and split off a small validation set.

The single biggest lever on quality is the data itself: consistent formatting, correct answers, and coverage of the cases you care about. A few hundred to a few thousand clean examples typically outperform far larger noisy sets. Remove duplicates, fix inconsistent formatting, and read a random sample by hand before training — garbage in really is garbage out here.

Configuring LoRA and QLoRA

Full fine-tuning updates every weight and demands enormous memory. PEFT (Parameter-Efficient Fine-Tuning) avoids that. LoRA freezes the base model and inserts small, trainable low-rank matrices into the attention layers, so you train a tiny fraction of the parameters. QLoRA adds 4-bit quantisation of the frozen base via bitsandbytes, cutting VRAM enough to fine-tune a 7B–8B model on a single 16–24 GB GPU.

In code you load the base model (4-bit for QLoRA), wrap it with a LoraConfig specifying the rank r, lora_alpha, dropout, and which modules to target, then hand everything to the SFTTrainer from the TRL library. Key hyperparameters are the learning rate (small, e.g. 1–2e-4 for LoRA), batch size with gradient accumulation to fit memory, and one to three epochs — more than that usually overfits small datasets.

Training, merging, and shipping

Run the trainer, watch the validation loss, and stop when it stops improving. A rising validation loss while training loss falls is the classic overfitting signal — reduce epochs or add data. Once trained you have a small adapter, not a full model.

From here you have two paths. Merge the adapter into the base weights with merge_and_unload() to produce a single standalone checkpoint — simplest for deployment and for converting to formats like GGUF to run in Ollama. Or keep the adapter separate and load it on top of the base at inference, which lets you hot-swap several task-specific adapters on one base model. Either way, push the result to the Hub with push_to_hub() so it is versioned, shareable, and ready to deploy — see the companion guide on serving it on AWS.

Ad placeholder (rectangle)