What Is Sparse Attention? Scaling Transformers to Longer Sequences

When full quadratic attention is too slow, sparse patterns cut the cost

The quadratic problem

Standard transformer attention is dense: every token attends to every other token. For a sequence of length n, that means roughly n² pairwise comparisons and an n × n attention matrix to store. As a result, doubling the sequence length quadruples the work and memory. This quadratic scaling is the central obstacle to long-context transformers: processing a document of tens of thousands of tokens with dense attention is prohibitively expensive in both compute and memory. Sparse attention is the family of techniques designed to break this scaling.
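
To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. The function name dense_attention_cost is hypothetical, and the 2 bytes per score assumes 16-bit floats; it counts a single attention matrix for one head in one layer and ignores everything else a real model stores.

```python
def dense_attention_cost(n, bytes_per_score=2):
    """Return (score_count, bytes) for one n x n attention matrix.

    bytes_per_score=2 assumes 16-bit floats; this is an illustrative choice.
    """
    scores = n * n
    return scores, scores * bytes_per_score


for n in (4096, 8192, 32768):
    scores, size = dense_attention_cost(n)
    print(f"n={n:>6}: {scores:>13,} scores, {size / 2**20:,.0f} MiB per head")
```

Going from 4096 to 8192 tokens multiplies both numbers by four, which is the doubling-quadruples relationship described above.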

The core idea

Sparse attention rests on a simple observation: most tokens do not need to attend to all other tokens. Nearby words usually matter most, and only a few distant tokens carry long-range signal. So instead of computing the full attention matrix, sparse methods let each token attend to a carefully chosen subset of positions. Because most entries of the matrix are never computed or stored, the cost drops from quadratic toward linear or near-linear in sequence length, which is what makes very long inputs practical.
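
As a minimal sketch of the idea, the function below computes scaled dot-product attention for a single query over only the key positions listed in allowed. The name attend, the toy vectors, and the chosen subset are all illustrative assumptions; real implementations work on batched tensors with optimised kernels.

```python
import math


def attend(query, keys, values, allowed):
    """Scaled dot-product attention for one query over the key indices in `allowed`."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[j])) for j in allowed]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[j][d] for w, j in zip(weights, allowed)) / total
        for d in range(dim)
    ]


# Toy data: four key/value pairs of dimension 2 (made up for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
query = [0.3, 0.7]

dense_out = attend(query, keys, values, allowed=[0, 1, 2, 3])
sparse_out = attend(query, keys, values, allowed=[1, 2])  # hypothetical subset
print(dense_out, sparse_out)
```

Passing every index reproduces dense attention for that query; passing a subset does proportionally less work, and the output differs to the extent that the skipped positions mattered.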

Common sparse patterns

Several patterns are widely used, often in combination:

  • Sliding-window (local) attention: each token attends only to a fixed window of nearby tokens. This is cheap and captures local context well, and stacking layers lets information propagate gradually across the whole sequence. Mistral 7B uses sliding-window attention.
  • Global attention: a handful of designated tokens (for example a classification or summary token) attend to everything and are attended to by everything, providing a route for long-range information to flow.
  • Dilated or strided patterns: tokens attend to positions at regular gaps, covering a wide span without attending to every position in between.
  • Random attention: each token attends to a few randomly chosen tokens, which helps approximate the connectivity of full attention.
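
The four patterns above can be sketched as small functions that return, for each query position, the set of key positions it may attend to. All function names and parameters here (sliding_window, add_global, dilated, add_random, w, gap, r) are hypothetical, and the symmetric (non-causal) window is an assumption; this is an illustration of the patterns, not any particular model's implementation.

```python
import random


def sliding_window(n, w):
    """Each position i may attend to positions within w steps of i."""
    return [set(range(max(0, i - w), min(n, i + w + 1))) for i in range(n)]


def add_global(pattern, global_positions):
    """Global tokens attend to everything and are attended to by everything."""
    n = len(pattern)
    out = [set(row) for row in pattern]
    for g in global_positions:
        out[g] = set(range(n))
        for row in out:
            row.add(g)
    return out


def dilated(n, w, gap):
    """Each position attends to itself and w positions per side, spaced `gap` apart."""
    pattern = []
    for i in range(n):
        candidates = [i + k * gap for k in range(-w, w + 1)]
        pattern.append({j for j in candidates if 0 <= j < n})
    return pattern


def add_random(pattern, r, seed=0):
    """Add up to r randomly chosen extra positions to every row."""
    rng = random.Random(seed)
    n = len(pattern)
    return [row | set(rng.sample(range(n), r)) for row in pattern]


n = 16
pattern = add_random(add_global(sliding_window(n, w=2), [0]), r=2)
kept = sum(len(row) for row in pattern)
print(f"{kept} of {n * n} query-key pairs kept")
```

Representing the pattern as sets of allowed positions keeps the sketch readable; practical implementations usually encode the same information as block-structured masks so the hardware can skip whole tiles.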

Two influential models

Longformer combines sliding-window local attention with global attention on a few task-specific tokens, giving it linear scaling while preserving both local detail and document-level signal — well suited to long documents and QA. BigBird blends three ingredients — local windows, a set of global tokens, and random connections — and showed theoretically that this mixture can match the expressive power of full attention while scaling linearly. Both demonstrated that thoughtfully designed sparsity, rather than a single fixed rule, is the path to handling long sequences.
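
To see why such mixtures scale linearly, the sketch below counts an upper bound on attended query-key pairs for a window plus global plus random pattern. The function attended_pairs and the parameter values are hypothetical and are not taken from either model's published configuration; the count ignores sequence edges and overlaps between the ingredients.

```python
def attended_pairs(n, window, n_global, n_random=0):
    """Upper bound on query-key pairs for a window + global + random pattern.

    window is the number of neighbours on each side; n_random > 0 gives the
    three-ingredient mixture, n_random = 0 the window-plus-global one.
    """
    per_token = (2 * window + 1) + n_global + n_random
    global_rows = n_global * n  # each global token attends to all n positions
    return n * per_token + global_rows


for n in (4096, 8192, 16384):
    sparse = attended_pairs(n, window=256, n_global=2, n_random=3)
    print(f"n={n:>6}: sparse <= {sparse:,}  dense = {n * n:,}")
```

With the window size and the number of global and random connections held fixed, doubling n doubles the sparse count, while the dense count quadruples.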

Trade-offs and how it relates to Flash Attention

The key trade-off is exactness. Sparse attention deliberately drops some token interactions, so it is an approximation of full attention; if an important long-range dependency falls outside the chosen pattern, the model can miss it. Good patterns minimise this risk, but it is a real difference from exact methods. This is precisely where Flash Attention complements rather than competes with sparse attention: Flash Attention makes dense attention as fast and memory-efficient as possible without any approximation, while sparse attention reduces how much attention you compute at all. Modern long-context systems often use both — an efficient exact kernel for the attention they do compute, and a sparse pattern to keep the amount of computation manageable.
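
One rough way to picture the approximation is to ask how much of a query's dense softmax weight lands on positions the sparse pattern excludes. The sketch below does this for a single query; the function dropped_mass and the score values are made up for illustration, and this is only an informal diagnostic, not a standard metric.

```python
import math


def dropped_mass(scores, allowed):
    """Fraction of dense softmax weight that falls on positions outside `allowed`."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    kept = sum(weights[j] for j in allowed)
    return 1.0 - kept / sum(weights)


# Hypothetical pre-softmax scores for the query at position 9 of a 10-token
# sequence; position 0 carries a strong long-range signal.
scores = [4.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.5, 1.0, 1.2]
local_window = [7, 8, 9]
print(f"window only:       {dropped_mass(scores, local_window):.2f} of the weight is lost")
print(f"window + global 0: {dropped_mass(scores, [0] + local_window):.2f} of the weight is lost")
```

In this toy case the local window alone misses the important distant token, and adding position 0 as a global token recovers most of the lost weight. A sparse model renormalises over the positions it keeps, so the larger the dropped fraction, the further its output can drift from what dense attention would produce.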