How accurate is this estimate?

It is a planning heuristic, not a guarantee. Real requirements depend on data quality, task difficulty, and label noise. Use the number to size your effort, then validate with a small pilot run and a held-out eval set.

Why does task type matter so much?

Teaching a model a fixed output format or a consistent tone needs far fewer examples than teaching it new domain knowledge or fine-grained classification. The estimator weights each task type accordingly.

What does the baseline slider change?

If the base model is already decent at your task, you mainly need examples to nudge its behaviour, so the count drops. If it is poor, you need more examples to move the needle, so the count rises.

Should I collect the recommended amount before starting?

No. Start near the minimum, measure on a held-out set, and add data only if the eval shows you have not reached your target. Over-collecting up front wastes labelling budget.

Is more data always better?

Not if it is noisy or redundant. A few hundred clean, diverse, correctly labelled examples usually beat thousands of low-quality ones. Prioritise quality and coverage over raw volume.

What is the Fine-Tuning Dataset Size Estimator?

The estimator gives a minimum and a recommended number of fine-tuning examples based on task type, how strong the baseline model already is, and your target accuracy, along with practical data-collection guidance for each scenario. It runs free in your browser on Gera Tools, and nothing is uploaded.

Fine-Tuning Dataset Size Estimator

Name: Fine-Tuning Dataset Size Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

“How many examples do I need?” is the first question of every fine-tuning project, and the honest answer is “it depends.” This tool turns that into a concrete starting number by weighing the three factors that matter most: what kind of task it is, how good the base model already is, and how high you need accuracy to go.

How it works

Each task type carries a base example count reflecting its difficulty. Teaching a fixed output format or adapting a writing style needs far fewer examples than teaching new classification boundaries, domain-specific reasoning, or novel factual knowledge. The estimator scales that base by two multipliers:

Baseline capability — a strong base model that already handles the task reasonably well needs fewer examples to nudge toward your target. A weak baseline requires more examples to move the needle at all.
Target accuracy — the last few percentage points of accuracy are disproportionately expensive in data terms. Moving from 70% to 80% costs far less data than moving from 92% to 97%.

The output is a minimum (safe starting point for a pilot run) and a recommended ceiling (what to plan toward if the minimum falls short).
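
The scaling described above can be sketched in a few lines of Python. This is an illustration only: the base counts, both multiplier curves, the 2x ratio between minimum and recommended, and the names BASE_EXAMPLES, baseline_multiplier, accuracy_multiplier, and estimate are all assumptions made for this sketch. They are not the tool's actual values or formula, so the printed numbers will not match the tool's output.

```python
# Hypothetical base counts per task type. The tool's real values are not
# stated in this text; these numbers only illustrate the ordering.
BASE_EXAMPLES = {
    "format_style": 100,
    "binary_classification": 200,
    "instruction_following": 300,
    "structured_extraction": 400,
    "multi_class_classification": 500,
    "domain_knowledge": 1000,
}


def baseline_multiplier(baseline_strength: float) -> float:
    """Assumed linear map: 0.0 (weak baseline) -> 2.0x, 1.0 (strong) -> 0.5x."""
    if not 0.0 <= baseline_strength <= 1.0:
        raise ValueError("baseline_strength must be in [0, 1]")
    return 2.0 - 1.5 * baseline_strength


def accuracy_multiplier(target_accuracy: float) -> float:
    """Assumed curve that grows sharply as the target approaches 100%."""
    if not 0.5 <= target_accuracy < 1.0:
        raise ValueError("target_accuracy must be in [0.5, 1.0)")
    return 0.2 / (1.0 - target_accuracy)


def estimate(task_type: str, baseline_strength: float, target_accuracy: float):
    """Return (minimum, recommended) example counts for one scenario."""
    base = BASE_EXAMPLES[task_type]
    scaled = (
        base
        * baseline_multiplier(baseline_strength)
        * accuracy_multiplier(target_accuracy)
    )
    minimum = round(scaled)
    recommended = 2 * minimum  # assumed ratio, not taken from the tool
    return minimum, recommended


minimum, recommended = estimate("structured_extraction", 0.5, 0.90)
print(f"minimum={minimum}, recommended={recommended}")
```

With the assumed accuracy curve, the multiplier rises much more between 92% and 97% than between 70% and 80%, which mirrors the point about the last few percentage points being the expensive ones.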

Task-type guidance

Different task families have genuinely different data requirements:

Task type	Why data needs differ
Format / style adaptation	The model already understands the content; you are just changing the surface pattern
Binary or small-set classification	Clear boundaries; consistent labelling is the main challenge
Multi-class classification	Every added class adds decision boundaries, and each boundary needs its own examples
Structured extraction (JSON, tables)	Needs many input formats represented to generalise
Domain knowledge injection	Hardest — the base model lacks the underlying facts
Instruction following	Moderate; usually a few hundred diverse examples work

Worked example

Suppose you want a model to extract JSON-structured entities from legal contracts. The base model is decent at general extraction but unfamiliar with legal language, and your target accuracy is 90%. The estimator would return something like: minimum 400 examples, recommended 800–1,200. You would collect 400 varied contracts, label them carefully, fine-tune, evaluate on a separate 100-example held-out set, and add more data only if accuracy falls short of the target.
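
When deciding whether accuracy "lags" on a 100-example held-out set, it helps to know how noisy that measurement is. The sketch below computes a Wilson score interval for an observed accuracy. The function name wilson_interval and the choice of a 95% level (z = 1.96) are assumptions for illustration. The estimator itself is not described as doing this.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# 90 correct out of 100 held-out contracts
low, high = wilson_interval(90, 100)
print(f"observed 0.90, plausible range {low:.3f} to {high:.3f}")
```

For 90 correct out of 100, the interval runs from roughly 0.83 to 0.94, so a pilot that scores 88% is not clearly below a 90% target. A larger held-out set narrows the range.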

Practical guidance

Start at the minimum, not the maximum. Collect the smaller number, run a pilot fine-tune, and measure on a held-out set before labelling more. Over-collecting wastes labelling budget.
Quality beats quantity. A few hundred clean, diverse, correctly labelled examples usually outperform thousands of noisy ones. Deduplicate inputs and check that labels are consistent.
Match the eval to the goal. Your held-out evaluation set should look like real production inputs, even where that differs from the training distribution. An unrepresentative eval produces a misleading accuracy number.
Reserve a test split before you start labelling. Setting aside test data after seeing what the model struggles with can inflate reported accuracy.
Revisit the baseline slider as the model improves. After a successful pilot run, the fine-tuned model becomes your new baseline. Update the estimate before planning the next wave of data collection.
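
Two of the points above, deduplicating inputs and reserving a test split before labelling, can be combined in one small step. The sketch below is a minimal example, assuming the raw inputs are plain strings that were sampled from production-like data. The function name dedupe_and_split, the whitespace and case normalisation, and the 20% test fraction are illustrative choices, not part of the tool.

```python
import random


def dedupe_and_split(inputs, test_fraction=0.2, seed=0):
    """Drop near-identical inputs, then reserve a test split before labelling.

    Returns (labelling_pool, test_set). Inputs that differ only in case or
    whitespace are treated as duplicates (an assumed, deliberately simple rule).
    """
    seen = set()
    unique = []
    for text in inputs:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(text)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)

    # Reserve at least one test item whenever any input is available.
    n_test = max(1, round(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


raw = ["A  contract", "a contract", "Lease", "NDA", "Will", "Deed"]
pool, test = dedupe_and_split(raw, test_fraction=0.2, seed=1)
print(len(pool), "to label for training;", len(test), "held out")
```

Because the split happens before any labelling or training, the test set cannot be influenced by what the model later turns out to struggle with.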