Question 1

What is Flash Attention?

Accepted Answer

Flash Attention is an IO-aware algorithm for computing the standard attention operation in transformers. It produces exactly the same result as ordinary attention but reorganises the computation to minimise slow reads and writes to GPU high-bandwidth memory (HBM). The payoff is attention that runs 2 to 4 times faster and whose memory use grows linearly, rather than quadratically, with sequence length.

Question 2

Is Flash Attention an approximation?

Accepted Answer

No. Unlike sparse or low-rank attention methods that trade accuracy for speed, Flash Attention computes exact attention: its outputs match a standard implementation up to floating-point rounding. Its gains come purely from a smarter memory-access pattern, so you can adopt it without worrying about changing model quality.
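To see why tiling does not force an approximation, here is a minimal sketch of the running-max, running-sum ("online") softmax that tiled attention relies on. It is plain Python for a single query vector, not the actual GPU kernel. The function names, the tile size, and the sample data are all illustrative assumptions.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Visit keys/values one tile at a time, never holding all scores."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalised output, relative to running_max
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf), which is 0.0.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Illustrative data: 5 key/value pairs, query dimension 4, value dimension 3.
q = [0.5, -1.0, 2.0, 0.1]
keys = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0], [0.0, 0.3, -0.7, 2.0],
        [-0.5, 0.5, 1.5, -1.0], [0.9, 0.9, -0.2, 0.3]]
values = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [3.0, 1.0, 1.0],
          [-2.0, 0.5, 0.5], [0.0, 4.0, -1.0]]
out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values, tile=2)
```

The two outputs agree up to floating-point rounding, because the correction factor rescales earlier partial sums to whatever maximum is current. Nothing is dropped or estimated along the way.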

Question 3

Why is standard attention slow if it is not compute-bound?

Accepted Answer

For a sequence of N tokens, standard attention writes the full N × N attention matrix to GPU high-bandwidth memory and reads it back, and that memory traffic, not the arithmetic, is the bottleneck. The operation is memory-bandwidth-bound. Flash Attention avoids materialising the large matrix by processing it in tiles inside fast on-chip memory, eliminating most of the slow traffic.
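A rough back-of-the-envelope calculation shows how quickly the materialised matrix grows. The sketch below assumes 2-byte (half-precision) elements, one attention head, and one sequence. The sequence length of 8192 and the 128 × 128 tile are illustrative choices, not figures taken from any particular kernel.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Size of the full seq_len x seq_len score matrix for one head."""
    return seq_len * seq_len * bytes_per_element

def tile_bytes(tile_rows, tile_cols, bytes_per_element=2):
    """Size of a single score tile held in on-chip memory (assumed shape)."""
    return tile_rows * tile_cols * bytes_per_element

full = attention_matrix_bytes(8192)   # 134217728 bytes = 128 MiB
tile = tile_bytes(128, 128)           # 32768 bytes = 32 KiB
full_mib = full / 2**20               # 128.0
tile_kib = tile / 2**10               # 32.0
```

Under these assumptions, the full matrix for one head is thousands of times larger than one tile, and it would have to be written and read back for every head in every layer. Doubling the sequence length quadruples the full-matrix size, while the tile size stays fixed.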

Question 4

What is the difference between Flash Attention 1 and 2?

Accepted Answer

Flash Attention 2 keeps the same exact-attention idea but improves the work partitioning across GPU threads and warps, reduces non-matmul operations, and parallelises better over the sequence dimension. The result is roughly a further 2x speedup over the original, getting closer to peak GPU throughput.

What Is Flash Attention? Faster Transformers Without Approximation

The problem it solves

The key idea: be IO-aware

Why it is exact, not approximate

The payoff

Versions and where it fits