Fine-Tuning Dataset Generation Cost Estimator

Estimate the cost of generating a synthetic fine-tuning dataset with a teacher model

Distillation — using a strong teacher model like GPT-4o to generate training examples for a smaller student model — is one of the cheapest ways to build a fine-tuning dataset. But “cheap” still has a number. This estimator turns your target dataset size and per-example token usage into a total generation cost and a clean per-example unit cost.

How it works

Each generated example is billed at the teacher model’s input price for the prompt tokens and at its output price for the completion tokens, with both prices quoted per million tokens. An optional validation pass (scoring or filtering each example) adds a second round of tokens:

gen_cost/example  = prompt_tokens/1e6 × in_price + completion_tokens/1e6 × out_price
val_cost/example  = val_prompt_tokens/1e6 × in_price + val_completion_tokens/1e6 × out_price
total             = (gen_cost/example + val_cost/example) × num_examples

The estimator reports both the all-in total and the cost per surviving example (the total divided by the number of examples that pass validation), so you can compare against the price of human labelling.
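The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the estimator's actual implementation: the names `TeacherPricing`, `estimate`, and `pass_rate` are hypothetical, the prices are placeholder values rather than real quotes, and it assumes `num_examples` counts generated examples, of which a fraction `pass_rate` survives validation.

```python
from dataclasses import dataclass


@dataclass
class TeacherPricing:
    """Teacher prices in dollars per million tokens (placeholder values)."""
    in_price: float
    out_price: float


def call_cost(prompt_tokens, completion_tokens, pricing):
    """Cost of one model call: input tokens at in_price, output tokens at out_price."""
    return (prompt_tokens / 1e6) * pricing.in_price \
        + (completion_tokens / 1e6) * pricing.out_price


def estimate(num_examples, prompt_tokens, completion_tokens, pricing,
             val_prompt_tokens=0, val_completion_tokens=0, pass_rate=1.0):
    """Return per-example costs, the all-in total, and cost per surviving example.

    Assumption: num_examples is the number generated, and pass_rate is the
    fraction that survives the validation filter.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    gen = call_cost(prompt_tokens, completion_tokens, pricing)
    val = call_cost(val_prompt_tokens, val_completion_tokens, pricing)
    total = (gen + val) * num_examples
    surviving = num_examples * pass_rate
    return {
        "gen_cost_per_example": gen,
        "val_cost_per_example": val,
        "total": total,
        "cost_per_surviving_example": total / surviving,
    }


# Hypothetical inputs: 10,000 examples, $2 / $8 per million input / output tokens.
result = estimate(
    num_examples=10_000,
    prompt_tokens=500, completion_tokens=300,
    pricing=TeacherPricing(in_price=2.0, out_price=8.0),
    val_prompt_tokens=900, val_completion_tokens=20,
    pass_rate=0.8,
)
print(f"total: ${result['total']:.2f}")                                    # total: $53.60
print(f"per surviving example: ${result['cost_per_surviving_example']:.4f}")  # $0.0067
```

With these placeholder numbers, validation adds roughly 58% on top of the generation cost, and the 20% rejection rate raises the unit cost from about $0.0054 to $0.0067 per kept example.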

Tips and notes

  • Generate in batches and validate aggressively — discarding low-quality examples early is cheaper than fine-tuning on noise and re-running.
  • Use a cheaper teacher (GPT-4o mini, Haiku) for most examples and reserve the expensive model for the hard cases or the validation judge.
  • Remember this is generation cost only; budget separately for the provider’s per-token training fee, which is a distinct line item.
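
The mixed-teacher tip above comes down to a weighted average of two per-example costs. The sketch below illustrates that arithmetic under stated assumptions: `hard_fraction` (the share of examples routed to the expensive teacher) and both price pairs are hypothetical placeholders, and the token counts are assumed to be the same for either model.

```python
def per_example_cost(prompt_tokens, completion_tokens, in_price, out_price):
    """Per-example cost with prices given in dollars per million tokens."""
    return (prompt_tokens / 1e6) * in_price + (completion_tokens / 1e6) * out_price


def blended_cost(cheap_cost, expensive_cost, hard_fraction):
    """Average per-example cost when hard_fraction of examples use the expensive teacher."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be in [0, 1]")
    return (1.0 - hard_fraction) * cheap_cost + hard_fraction * expensive_cost


# Placeholder prices, not real quotes: the cheap teacher is 10x cheaper per token.
cheap = per_example_cost(500, 300, in_price=0.2, out_price=0.8)
expensive = per_example_cost(500, 300, in_price=2.0, out_price=8.0)

# Route 10% of examples (the hard cases) to the expensive teacher.
blended = blended_cost(cheap, expensive, hard_fraction=0.1)

num_examples = 10_000
print(f"all expensive: ${expensive * num_examples:.2f}")
print(f"blended:       ${blended * num_examples:.2f}")
```

Under these assumptions the blended run costs about a fifth of the all-expensive run, though the saving depends entirely on how many examples actually need the stronger model.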