When fine-tuning is the right tool
Fine-tuning adjusts a model’s weights on your own examples so it reliably produces a particular style, format, or skill. It is not the first thing to reach for — prompting and retrieval-augmented generation are cheaper and solve most problems — but when you need consistent behaviour (a fixed JSON shape, a brand voice, a domain-specific task the base model fumbles), fine-tuning is the right tool. The modern Hugging Face stack makes this accessible: with LoRA and QLoRA you can fine-tune a capable open model on a single GPU in an afternoon.
Preparing the dataset
Everything starts with data, and the format matters. Most instruction
fine-tuning uses a chat or instruction layout — each example is a prompt and the
ideal completion, often as a list of role-tagged messages. You build this as a
Hugging Face datasets.Dataset, apply the model’s chat template so special
tokens match exactly what the base model expects, and split off a small
validation set.
The single biggest lever on quality is the data itself: consistent formatting, correct answers, and coverage of the cases you care about. A few hundred to a few thousand clean examples typically outperform far larger noisy sets. Remove duplicates, fix inconsistent formatting, and read a random sample by hand before training — garbage in really is garbage out here.
Configuring LoRA and QLoRA
Full fine-tuning updates every weight and demands enormous memory. PEFT
(Parameter-Efficient Fine-Tuning) avoids that. LoRA freezes the base model
and inserts small, trainable low-rank matrices into the attention layers, so you
train a tiny fraction of the parameters. QLoRA adds 4-bit quantisation of the
frozen base via bitsandbytes, cutting VRAM enough to fine-tune a 7B–8B model on
a single 16–24 GB GPU.
In code you load the base model (4-bit for QLoRA), wrap it with a LoraConfig
specifying the rank r, lora_alpha, dropout, and which modules to target, then
hand everything to the SFTTrainer from the TRL library. Key hyperparameters
are the learning rate (small, e.g. 1–2e-4 for LoRA), batch size with gradient
accumulation to fit memory, and one to three epochs — more than that usually
overfits small datasets.
Training, merging, and shipping
Run the trainer, watch the validation loss, and stop when it stops improving. A rising validation loss while training loss falls is the classic overfitting signal — reduce epochs or add data. Once trained you have a small adapter, not a full model.
From here you have two paths. Merge the adapter into the base weights with
merge_and_unload() to produce a single standalone checkpoint — simplest for
deployment and for converting to formats like GGUF to run in Ollama. Or keep
the adapter separate and load it on top of the base at inference, which lets
you hot-swap several task-specific adapters on one base model. Either way, push
the result to the Hub with push_to_hub() so it is versioned, shareable, and
ready to deploy — see the companion guide on serving it on AWS.