Two levers, two layers
There are two fundamentally different ways to change how a language model behaves. Prompting works at inference time: you change the instructions, examples, and context you send, and the model’s underlying weights stay exactly the same. Fine-tuning works at the training layer: you continue training the model on example data so its weights shift permanently toward the behaviour you want. The distinction matters because it determines cost, speed, and how durable the change is. Prompting is a setting you adjust on every request; fine-tuning is a one-time investment that reshapes the model itself.
The trade-offs side by side
Cost. Prompting has no training cost: you pay only for the tokens you send and receive on each request. Fine-tuning adds upfront training and data-preparation costs, but it can lower per-call cost afterwards if the fine-tuned model needs shorter prompts.
Speed to iterate. Prompting is instant: edit the text, re-run, see the result. Fine-tuning requires curating a dataset, running a training job, and evaluating the result, a loop measured in hours or days rather than seconds.
Reproducibility and consistency. Fine-tuning generally produces more consistent behaviour for a narrow task because the pattern is baked into the weights. The behaviour of a long prompt can drift whenever any part of it is edited, whereas a fine-tuned model applies its learned behaviour without being re-instructed on every request.
Skill ceiling. Prompting is bounded by what the base model can already do when well instructed. Fine-tuning can push past that ceiling for a specific task by teaching patterns that no amount of prompting reliably elicits.
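To make the cost trade-off above concrete, here is a minimal sketch of a break-even calculation. The function name and all the figures are hypothetical, chosen only for illustration. Real pricing varies by provider and model. The sketch assumes a fixed upfront cost and a constant cost per request on each side.

```python
import math


def break_even_requests(upfront_cost, prompt_cost_per_request, tuned_cost_per_request):
    """Return the number of requests at which fine-tuning has paid for itself.

    Returns None when the fine-tuned model is not cheaper per request,
    because the upfront cost is then never recovered.
    """
    saving_per_request = prompt_cost_per_request - tuned_cost_per_request
    if saving_per_request <= 0:
        return None
    return math.ceil(upfront_cost / saving_per_request)


# Hypothetical figures (assumptions, not real prices):
# 500 units of upfront cost, 0.004 per request with a long prompt,
# 0.0015 per request with a shorter prompt on a fine-tuned model.
print(break_even_requests(500.0, 0.004, 0.0015))
```

If the expected request volume is well below the number this returns, the upfront cost is unlikely to pay off on cost grounds alone.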
Tasks where prompting wins
Prompting is the right first move for the majority of work: drafting, answering questions, summarising, brainstorming, and one-off transformations. It wins whenever requirements change often, when you are still exploring what good output looks like, or when volume is low enough that prompt length does not hurt your budget. Few-shot prompting (adding a handful of examples directly in the prompt) covers a surprising number of “make it behave like this” needs without any training at all. If a clear, well-structured prompt gets you there, stop: prompting is faster, cheaper, and fully reversible.
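As an illustration of few-shot prompting, the sketch below assembles an instruction, a handful of worked examples, and a new input into one prompt string. The helper name, the "Input:"/"Output:" labels, and the sample reviews are all assumptions made for this example. No particular model or API is implied, and the resulting string would be sent to whichever model you use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    parts = [instruction.strip(), ""]
    for source, target in examples:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")  # blank line between examples
    # End with the new input and an open "Output:" for the model to complete.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical task and examples, for illustration only.
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("Great battery life.", "positive"),
        ("Stopped working in a week.", "negative"),
    ],
    "The screen is gorgeous.",
)
print(prompt)
```

Changing the behaviour means editing the examples list and re-running, which is the fast, reversible iteration loop described above.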
Tasks where fine-tuning wins
Fine-tuning earns its cost when you need reliable, repeated behaviour at scale. Strong cases include enforcing a strict output format (always valid JSON, always a fixed schema), a consistent brand voice across millions of messages, a narrow classification task with idiosyncratic labels, or folding the behaviour of a long, expensive system prompt into the model so each request is cheaper and faster. It also helps when prompting has genuinely plateaued below the quality you need. The decision rule is simple: start with prompting, add retrieval (RAG) when you need fresh or private facts, and fine-tune only once the behaviour is well understood and the volume makes the upfront cost pay off.
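The decision rule can be sketched as a small function. The class, field names, and step labels below are hypothetical, introduced only to make the rule explicit. In practice each flag is a judgment call rather than a simple yes or no.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical summary of a project, used only for this illustration."""
    needs_fresh_or_private_facts: bool
    behaviour_well_understood: bool
    volume_justifies_upfront_cost: bool


def recommend(situation):
    """Return the techniques to use, in the order the decision rule adds them."""
    steps = ["prompting"]  # always the starting point
    if situation.needs_fresh_or_private_facts:
        steps.append("retrieval")
    # Fine-tune only when both conditions hold.
    if situation.behaviour_well_understood and situation.volume_justifies_upfront_cost:
        steps.append("fine-tuning")
    return steps


# A well-understood, high-volume task with no need for fresh facts.
print(recommend(Situation(False, True, True)))
```

The function never drops prompting from the result: a fine-tuned model is still driven by a prompt, only a shorter one.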