Question 1

What is sparse attention?

Accepted Answer

Sparse attention is a family of techniques that let each token attend to only a subset of the other tokens instead of all of them. By skipping most token pairs, it brings attention's cost down from growing with the square of the sequence length to something closer to linear growth, which makes much longer sequences feasible.
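As a rough illustration of attending to only a subset, the sketch below computes the attention output for a single query while scoring only the key positions listed in `allowed`. The function name `sparse_attend`, the toy vectors, and the choice of scaled dot-product scoring are assumptions made for this example, not taken from any particular model.

```python
import math

def sparse_attend(query, keys, values, allowed):
    """Attention output for one query, scoring only the key positions in `allowed`."""
    scale = 1.0 / math.sqrt(len(query))
    # Score only the allowed keys; every other position is skipped entirely.
    scores = [scale * sum(q * k for q, k in zip(query, keys[j])) for j in allowed]
    # Numerically stable softmax over the allowed subset.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the matching value vectors.
    dim = len(values[0])
    return [sum(w * values[j][d] for w, j in zip(weights, allowed)) for d in range(dim)]

# Toy data (hypothetical): four tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
query = [1.0, 0.0]

out_sparse = sparse_attend(query, keys, values, allowed=[0, 1])       # 2 pairs scored
out_dense = sparse_attend(query, keys, values, allowed=[0, 1, 2, 3])  # all 4 pairs scored
print(out_sparse, out_dense)
```

Passing every position in `allowed` recovers ordinary dense attention for that query, so the only thing the sparse variant changes is which pairs are scored.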

Question 2

Why is full attention expensive for long sequences?

Accepted Answer

Standard dense attention compares every token with every other token, so for a sequence of length n it does on the order of n-squared comparisons and stores an n-by-n matrix. Doubling the sequence length quadruples the cost. This quadratic scaling is what makes very long contexts impractical with dense attention alone.
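The arithmetic behind that scaling can be checked with a few lines. This is a back-of-the-envelope sketch: the helper names are made up for the example, and the default of 2 bytes per score assumes 16-bit values for a single head in a single layer.

```python
def dense_pairs(n: int) -> int:
    """Number of query-key pairs that dense attention scores: n * n."""
    return n * n

def score_matrix_bytes(n: int, bytes_per_score: int = 2) -> int:
    """Memory for one n-by-n score matrix (assumed: one head, one layer, 16-bit scores)."""
    return n * n * bytes_per_score

for n in (1024, 2048, 4096):
    print(n, dense_pairs(n), score_matrix_bytes(n))
```

Each doubling of `n` multiplies both numbers by four, which is the quadratic growth described above.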

Question 3

What is the difference between sliding-window and global attention?

Accepted Answer

Sliding-window (local) attention lets each token attend only to a fixed number of nearby tokens, which captures local context cheaply. Global attention designates a few special tokens that attend to, and are attended by, every other token. Models like Longformer combine the two so that local detail and long-range signals are both preserved.
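The two patterns can be pictured as a boolean mask in which entry (i, j) says whether token i may attend to token j. The sketch below builds such a mask for a small sequence; the function name `local_global_mask` and its parameters are hypothetical, and the explicit n-by-n mask is for illustration only, since materializing it at scale would bring back the quadratic memory cost.

```python
def local_global_mask(n, window, global_idx):
    """Return mask[i][j] == True when token i may attend to token j.

    window:     how many neighbours on each side a token can see (local attention)
    global_idx: positions that attend to, and are attended by, every token
    """
    global_set = set(global_idx)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            is_local = abs(i - j) <= window
            is_global = i in global_set or j in global_set
            mask[i][j] = is_local or is_global
    return mask

# Hypothetical setup: 8 tokens, one neighbour on each side, token 0 is global.
mask = local_global_mask(n=8, window=1, global_idx=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

kept = sum(sum(row) for row in mask)
print(kept, "of", 8 * 8, "pairs kept")
```

The printed grid shows a diagonal band for the sliding window plus a full first row and first column for the global token.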

Question 4

Does sparse attention reduce accuracy?

Accepted Answer

It can. Sparse attention deliberately drops some token interactions, so it approximates full attention rather than computing it exactly. Well-designed patterns like those in Longformer and BigBird are built to retain the connections that matter most, so the quality loss is often small. It is still a genuine trade-off, unlike Flash Attention, which computes exact dense attention and gets its savings from a more memory-efficient implementation.
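One informal way to see the trade-off is to ask how much of a query's dense attention weight lands on the positions a sparse pattern keeps. The sketch below computes that share for a single query; `retained_mass` and the scores are invented for illustration, and this is only a rough proxy, not a metric used by Longformer or BigBird.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retained_mass(scores, allowed):
    """Share of the dense attention weight that falls on the kept positions."""
    weights = softmax(scores)
    return sum(weights[j] for j in allowed)

# Hypothetical scores for one query at position 3 over six keys.
scores = [0.1, 0.2, 3.0, 4.0, 2.5, 0.0]
window = [2, 3, 4]  # the query's immediate neighbourhood

print(round(retained_mass(scores, window), 3))
print(round(retained_mass(scores, [0, 1, 5]), 3))
```

When most of the weight already sits inside the kept positions, dropping the rest changes the output little; when it does not, the approximation costs more.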

What Is Sparse Attention? Scaling Transformers to Longer Sequences

The quadratic problem

The core idea

Common sparse patterns

Two influential models

Trade-offs and how it relates to Flash Attention