Question 1

When should I fine-tune instead of using RAG or prompting?

Accepted Answer

Fine-tune when you need to change the model's behavior, tone, or output format consistently — not when you need it to know new facts. For injecting up-to-date or proprietary knowledge, retrieval-augmented generation (RAG) is usually cheaper, faster to update, and avoids retraining. A common pattern is to fine-tune for style and format, then layer RAG on top for facts.

Question 2

What is the difference between full fine-tuning, LoRA, and QLoRA?

Accepted Answer

Full fine-tuning updates every weight in the model and needs the most memory and compute. LoRA freezes the base weights and trains small low-rank adapter matrices, cutting memory and cost dramatically with little quality loss. QLoRA goes further by quantizing the base model to 4-bit while training the adapters, letting you fine-tune large models on a single consumer GPU.

Question 3

How much training data do I actually need?

Accepted Answer

Far less than people expect. For changing format or tone, 50 to 200 high-quality, consistent examples often outperform thousands of noisy ones. Quality and consistency matter more than volume — a handful of carefully written examples that all follow the same pattern teaches the model the pattern cleanly. Scale up only if evaluation shows the model is still inconsistent.

Question 4

How much does fine-tuning cost?

Accepted Answer

Hosted fine-tuning (OpenAI GPT-3.5) costs a few dollars for small datasets plus a higher per-token inference rate afterward. Open-weight fine-tuning with QLoRA on a rented cloud GPU can cost from a few dollars to tens of dollars per run depending on model size and epochs. The bigger ongoing cost is usually inference hosting, not the training run itself.

Fine-Tuning an LLM: Complete Beginner's Guide 2024

What fine-tuning actually does

Preparing your dataset

LoRA, QLoRA, and parameter-efficient methods

Training, evaluation, and avoiding overfitting

Common mistakes that waste money