How does Replicate bill?

Replicate charges per second of active compute on the hardware a model runs on, including cold-boot time when an instance spins up. The per-second rate depends on the GPU tier (a T4 is cheap, an H100 is expensive), so the total cost is the rate multiplied by the run time.

Why are the run times only estimates?

Actual run time depends on input size, the number of steps or output frames, and whether the model is warm or cold-starting. The presets give typical figures; measure a few real predictions and enter your average for an accurate projection.
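
As a rough illustration of "measure a few real predictions and enter your average", the sketch below averages a handful of timings. The function name and the sample values are hypothetical; substitute the run times you actually observe, including any cold starts you expect to pay for.

```python
from statistics import mean


def average_run_seconds(measured_seconds: list[float]) -> float:
    """Return the mean of measured run times, in seconds.

    The samples should be real prediction timings; mixing warm and
    cold-start runs in realistic proportions gives a fairer average.
    """
    if not measured_seconds:
        raise ValueError("need at least one measured run time")
    return mean(measured_seconds)


# Hypothetical timings: three warm runs and one slower cold start.
samples = [4.0, 4.5, 4.2, 12.0]
print(f"average run time: {average_run_seconds(samples):.2f} s")
```

Enter the resulting figure in the run-time field instead of the preset value.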

Do the presets reflect real model defaults?

The presets pair each model with the hardware tier it commonly runs on and a representative run time. Many Replicate models let you pick hardware, so confirm the tier on the specific model page you intend to use.

Is anything sent to Replicate by this tool?

No. The calculator runs entirely in your browser using published per-second rates. It makes no API calls and stores nothing.

Replicate Model Cost Estimator

Replicate bills by the second of compute, not per image or per token, so two runs of the same model can differ widely in cost depending on the GPU tier and how long the prediction takes. That makes budgeting tricky: a Flux generation on an H100 and an SDXL run on an A40 land at different prices even when both produce one image. This estimator multiplies the per-second hardware rate by your run time and monthly volume to give a concrete cost.

How it works

Choose a model preset and the tool loads a typical hardware tier and run time; change either and the preset switches to Custom. The tool multiplies the tier’s per-second rate by your run time in seconds to get the cost per run, then multiplies that by your monthly volume to get a total. The hardware dropdown shows each tier’s per-second price so you can see where the money goes: moving from an H100 to an A40, for instance, roughly halves the rate.
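
The calculation described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual source: the function names are hypothetical, and the rates in the table are illustrative assumptions that may be out of date, so check Replicate's pricing page for current figures.

```python
# Illustrative per-second rates in USD. These are assumptions for the
# sketch, not authoritative prices; confirm them on the pricing page.
RATES_PER_SECOND = {
    "T4": 0.000225,
    "A40": 0.000725,
    "H100": 0.001525,
}


def cost_per_run(tier: str, run_seconds: float) -> float:
    """Cost of one prediction: per-second rate times run time."""
    return RATES_PER_SECOND[tier] * run_seconds


def monthly_cost(tier: str, run_seconds: float, runs_per_month: int) -> float:
    """Monthly total: cost per run times monthly volume."""
    return cost_per_run(tier, run_seconds) * runs_per_month


# Example: 10,000 runs a month at 8 seconds each on an A40.
print(f"per run: ${cost_per_run('A40', 8.0):.4f}")
print(f"per month: ${monthly_cost('A40', 8.0, 10_000):.2f}")
```

With these assumed rates, the example comes to about $58 a month; the same workload on the H100 rate would be roughly double.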

Tips for accurate estimates

Measure real run times. Cold starts add seconds; run a handful of live predictions and use the average rather than the preset.
Right-size the GPU. Smaller models often run fine on a T4 or L40S at a fraction of an H100’s rate, so don’t pay for hardware the model can’t saturate.
Account for failures and retries. Failed runs still consume compute; pad your monthly volume slightly if your inputs are unreliable.
Compare to dedicated APIs. For very high volume, a fixed-price endpoint (Stability, OpenAI) can beat per-second pricing, so estimate both before committing.
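
The last two tips can be turned into quick arithmetic. The sketch below makes two simplifying assumptions: a failed run costs as much as a successful one and is retried until it succeeds, and the fixed-price alternative charges a flat fee per call. All function names and example numbers are hypothetical.

```python
import math


def padded_volume(successful_runs: int, failure_rate: float) -> int:
    """Estimate billed runs needed to get the desired successful runs.

    Assumes each failed run is retried until it succeeds, so the
    expected number of attempts per success is 1 / (1 - failure_rate).
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return math.ceil(successful_runs / (1 - failure_rate))


def break_even_seconds(fixed_price_per_call: float, rate_per_second: float) -> float:
    """Run time at which per-second billing equals the fixed per-call price.

    Runs shorter than this are cheaper per second; longer runs favor
    the fixed-price endpoint.
    """
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    return fixed_price_per_call / rate_per_second


# Example: 10,000 successful runs wanted, 5% of attempts fail.
print(padded_volume(10_000, 0.05))

# Example: a hypothetical $0.01 flat fee versus an assumed $0.000725/s rate.
print(f"{break_even_seconds(0.01, 0.000725):.1f} s")
```

Use the padded figure as the monthly volume in the estimator, and compare your measured run time against the break-even value before choosing a pricing model.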