Fine-tuning vs prompting decision helper
“Should we fine-tune?” is one of the most expensive questions a team can get wrong. Fine-tuning sounds powerful, but it adds data labelling, training cost, and a maintenance treadmill, and it solves the wrong problem if what you actually needed was retrieval or a better prompt. This helper asks the questions that separate those cases and points you to the cheapest path that closes your gap.
How it works
You answer a short set of questions: what the task is, whether the gap is knowledge (the model lacks your facts) or behaviour (it knows enough but won’t follow your format or tone), how much labelled data you have, how stable the task is, and what your latency and budget constraints are. The helper scores your answers against four outcomes (prompt engineering, RAG, fine-tuning, or a hybrid) using the standard decision rules: knowledge gaps point to RAG, behaviour gaps with ample data and a stable task point to fine-tuning, and everything else starts with prompting. It explains its reasoning so the recommendation is auditable.
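The decision rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the helper's actual implementation: it uses plain if/else rules rather than scoring, it leaves out the latency and budget inputs, and the names `Answers`, `recommend`, and `MIN_EXAMPLES`, the 300-example cut-off, and the treatment of a combined knowledge-and-behaviour gap as the hybrid case are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Answers:
    """Hypothetical container for a subset of the helper's questions."""
    gap: str                 # "knowledge", "behaviour", "both", or "none"
    labelled_examples: int   # clean labelled examples available
    task_is_stable: bool     # True if the task is unlikely to change soon


# Assumed cut-off for "ample" data; the real threshold depends on the task.
MIN_EXAMPLES = 300


def recommend(a: Answers) -> Tuple[str, str]:
    """Return (outcome, reason) so the recommendation can be audited."""
    can_fine_tune = a.labelled_examples >= MIN_EXAMPLES and a.task_is_stable

    # Assumption: a gap in both knowledge and behaviour maps to the hybrid.
    if a.gap == "both" and can_fine_tune:
        return "hybrid", "knowledge gap plus behaviour gap with ample stable data"
    # Knowledge gaps point to RAG, whether or not fine-tuning is feasible.
    if a.gap in ("knowledge", "both"):
        return "RAG", "the model lacks facts, so retrieval closes the gap"
    # Behaviour gaps only justify fine-tuning with ample data and a stable task.
    if a.gap == "behaviour" and can_fine_tune:
        return "fine-tuning", "behaviour gap with ample data and a stable task"
    # Everything else starts with prompting.
    return "prompt engineering", "cheapest reversible option for this gap"


outcome, reason = recommend(Answers("behaviour", 50, True))
print(f"{outcome}: {reason}")
```

Returning the reason alongside the outcome mirrors the auditability point: a reader can see which rule fired, and in this sketch a behaviour gap with only 50 examples falls through to prompting rather than fine-tuning.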
Tips and notes
Default to prompting first; it is reversible and often gets you 80% of the way for a fraction of the effort. Reach for RAG when the model needs facts that change or that it never saw in training. Fine-tune only when you have a behaviour or format gap, several hundred or more clean labelled examples, and a stable task, and budget for re-training as your data drifts. In production the strongest setup is frequently a hybrid: RAG for fresh knowledge and a light fine-tune for consistent style, both behind a sharp prompt. Once you land on RAG or hybrid, plan the retrieval side with the RAG architecture planner.