Where do these benchmark numbers come from?

They are representative figures aggregated from community benchmarks at 25–30 steps with the Euler/DPM++ samplers. Your real numbers vary with sampler, VRAM, drivers, and whether xFormers or torch.compile is enabled.

Why is SDXL so much slower than SD 1.5?

SDXL has a larger UNet and a native resolution of 1024px versus SD 1.5's 512px, so each step pushes roughly four times the pixels through a bigger model, cutting throughput several-fold.

Does step count change the speed linearly?

Mostly yes. Generation time is roughly proportional to step count, since each step is one UNet pass. Halving steps from 30 to 15 nearly halves the time (text encoding and VAE decoding are fixed costs that do not shrink), at some cost to detail.

Will a cloud A100 always beat my local 4090?

Not for single images. A 4090 often matches or beats an A100 on one SD image because of higher clocks; the A100's advantage is large VRAM and batch throughput for many concurrent jobs.

What is the AI Image Inference Speed Benchmarks tool?

It is a reference table of inference speed benchmarks (images per minute) for SD 1.5, SDXL, and Flux.1 across common GPUs (RTX 3060, 3090, 4090, A100) and resolutions. Pick a model, GPU, and resolution to estimate throughput. It runs free in your browser on Gera Tools, with nothing uploaded.

Name: AI Image Inference Speed Benchmarks
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

AI image inference speed benchmarks

Generation speed for Stable Diffusion and Flux.1 depends on three things: the model (network size and native resolution), the GPU, and the output resolution. This reference estimates images per minute so you can size hardware, compare cloud GPUs, or decide between SD 1.5 and SDXL for a batch job.

Why throughput varies so dramatically between setups

A newcomer running SD 1.5 on a consumer GPU might see 5 images per minute; someone running the same model on a data-centre A100 with batching might hit 100+. The gap is not mainly about raw FLOPS — it is about VRAM capacity, memory bandwidth, and whether the whole model fits in GPU memory without swapping to system RAM.

The three practical bottlenecks:

VRAM ceiling. If the model and its intermediate activations don’t fit in VRAM, the GPU has to page data back and forth to system RAM over the PCIe bus. PCIe bandwidth is roughly a tenth of VRAM bandwidth on a mid-range card and a smaller fraction on high-end cards, which can drop throughput by 5–10x on large models like SDXL running on cards with less than 8 GB.

Memory bandwidth, not compute. Diffusion inference is often limited by memory bandwidth rather than compute, especially at lower resolutions. A GPU with high TFLOPS but narrow memory bandwidth (some mid-range cards) underperforms its theoretical ceiling.

Sampler and step count. Each step is one UNet forward pass, so cutting from 30 to 15 steps nearly halves generation time. An efficient sampler such as DPM++ 2M Karras keeps the quality cost small at lower step counts, and distilled models such as LCM are built to run with far fewer steps still.
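The VRAM ceiling can be made concrete with a little arithmetic. The sketch below compares how long it takes to move the same amount of data over PCIe and within VRAM. The bandwidth figures are assumptions taken from spec sheets (theoretical peak for PCIe 4.0 x16, and the RTX 3060 12 GB memory bandwidth), and the 2 GB payload is a hypothetical amount of offloaded data, not a measurement.

```python
def transfer_time_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time in milliseconds to move size_gb at bandwidth_gb_s (ideal, no overhead)."""
    if size_gb < 0 or bandwidth_gb_s <= 0:
        raise ValueError("size must be >= 0 and bandwidth must be > 0")
    return size_gb / bandwidth_gb_s * 1000.0


# Assumed spec-sheet figures; real sustained bandwidth is lower.
PCIE4_X16_GB_S = 32.0      # approximate theoretical peak, one direction
RTX_3060_VRAM_GB_S = 360.0  # RTX 3060 12 GB memory bandwidth

offloaded_gb = 2.0  # hypothetical amount of data paged per step

pcie_ms = transfer_time_ms(offloaded_gb, PCIE4_X16_GB_S)
vram_ms = transfer_time_ms(offloaded_gb, RTX_3060_VRAM_GB_S)
slowdown = pcie_ms / vram_ms

print(f"PCIe: {pcie_ms:.1f} ms, VRAM: {vram_ms:.1f} ms, ratio: {slowdown:.2f}x")
```

The ratio depends only on the two bandwidths, so a card with faster VRAM loses proportionally more when it has to page over the same bus.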

How it works

Each diffusion step is one forward pass through the UNet, so total time is roughly:

time_per_image ≈ steps × time_per_step(model, gpu, resolution)
images_per_min ≈ 60 / time_per_image

The benchmark table is calibrated at a baseline step count (25–30 steps). When you change the step slider, the estimate scales linearly because the number of UNet passes scales linearly. Resolution scales roughly with pixel count: moving SDXL from 1024px to 1536px means 2.25 times the pixels, which more than doubles the per-image cost.
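The two formulas above translate directly into code. This is a minimal sketch, not the tool's actual implementation: the function names are made up for illustration, the 0.1 s per-step time is a hypothetical value rather than a benchmark, and resolution scaling assumes square images whose cost grows with pixel count.

```python
def time_per_image(steps: int, time_per_step_s: float) -> float:
    """Seconds per image, assuming one denoising pass per step."""
    if steps <= 0 or time_per_step_s <= 0:
        raise ValueError("steps and time_per_step_s must be positive")
    return steps * time_per_step_s


def images_per_min(steps: int, time_per_step_s: float) -> float:
    """Throughput in images per minute."""
    return 60.0 / time_per_image(steps, time_per_step_s)


def scale_step_time(time_per_step_s: float, base_side_px: int, new_side_px: int) -> float:
    """Scale per-step time by the pixel-count ratio of two square resolutions."""
    return time_per_step_s * (new_side_px / base_side_px) ** 2


STEP_TIME_S = 0.1  # hypothetical per-step time at 1024px

baseline = images_per_min(30, STEP_TIME_S)    # 30 steps at 1024px
fewer_steps = images_per_min(15, STEP_TIME_S)  # half the steps
larger = images_per_min(30, scale_step_time(STEP_TIME_S, 1024, 1536))

print(f"{baseline:.1f} / {fewer_steps:.1f} / {larger:.1f} images per minute")
```

Halving the steps doubles the estimate in this model because it ignores fixed costs, so treat the result as an upper bound on the real gain.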

Model comparison at a glance

Model	Native resolution	Relative speed	Min VRAM (fp16)
SD 1.5	512×512	Fastest	~4 GB
SDXL	1024×1024	~3–5x slower than SD 1.5	~8 GB
Flux.1 (dev)	1024×1024	Slowest of the three	~12–16 GB

SD 1.5 processes about a quarter of the pixels per step that SDXL does (512×512 versus 1024×1024), which accounts for most of the speed difference. Flux.1 uses a different transformer-based architecture (DiT) with no UNet, and at current optimisation levels it is slower step-for-step than SDXL but produces competitive quality with fewer steps.
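As a quick way to read the table's last column, the sketch below lists which models meet the minimum VRAM figure for a given card. The values are copied from the table, with the lower bound of the ~12–16 GB range used for Flux.1 (dev); the function name is hypothetical, and meeting the minimum does not guarantee that everything stays resident in VRAM.

```python
# Approximate fp16 minimums from the comparison table, in GB.
# Flux.1 (dev) uses the lower bound of the table's ~12-16 GB range.
MIN_VRAM_GB = {
    "SD 1.5": 4,
    "SDXL": 8,
    "Flux.1 (dev)": 12,
}


def models_that_fit(gpu_vram_gb: float) -> list[str]:
    """Return the models whose table minimum is within the card's VRAM."""
    return [name for name, need in MIN_VRAM_GB.items() if gpu_vram_gb >= need]


for vram in (6, 8, 12, 24):
    print(f"{vram} GB: {models_that_fit(vram)}")
```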

Notes and caveats

Enable xFormers or SDPA (PyTorch's scaled dot-product attention). Both cut memory use and improve speed by 20–40% on most GPUs with minimal code changes.
Batch for throughput, not latency. A100/H100 cards shine when generating many images at once; for one image at a time, a high-clock RTX 4090 is often faster than a server GPU.
VRAM gates the model, not the speed. A 3060 (12 GB) can run SDXL but offloads to system RAM if weights and activations don’t fit, which is far slower than a 24 GB card holding everything resident.
Samplers matter. DPM++ 2M Karras can reach good quality in around 20 steps; older Euler-a needs 30+, directly changing images-per-minute. LCM and Turbo distillations push this further, hitting plausible results in 4–8 steps.
These are representative figures aggregated from community benchmarks at 25–30 steps. Your real numbers vary with driver version, torch.compile status, operating system overhead, and thermal throttling on consumer cards in sustained batch jobs.
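The sampler note can be turned into a rough speedup figure. The sketch below compares the step counts quoted above against a 30-step baseline. It assumes the time per step stays the same across samplers, which is a simplification, and the labels are only shorthand for the step counts mentioned in the notes.

```python
def relative_throughput(baseline_steps: int, steps: int) -> float:
    """Throughput multiplier versus the baseline, assuming equal time per step."""
    if baseline_steps <= 0 or steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / steps


BASELINE_STEPS = 30  # the Euler a figure quoted in the notes

# Step counts quoted in the notes; labels are shorthand only.
step_counts = {
    "DPM++ 2M Karras (20 steps)": 20,
    "LCM/Turbo (8 steps)": 8,
    "LCM/Turbo (4 steps)": 4,
}

for label, steps in step_counts.items():
    print(f"{label}: {relative_throughput(BASELINE_STEPS, steps):.2f}x")
```

Fixed per-image costs are not included, so real gains at very low step counts will be smaller than these ratios suggest.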