What Is Beam Search? Better Text Generation Through Exploration

Keeping the K best partial sequences at each step to avoid greedy mistakes

Ad placeholder (leaderboard)

Definition

Beam search is a decoding strategy for sequence generation models that, at every step, keeps the K most probable partial sequences rather than committing to only the single best next token. K is called the beam width. By carrying several promising candidates forward in parallel, beam search can back away from a token that looked best in isolation but leads to a poor overall sentence — a mistake that simpler greedy decoding cannot undo. It is a staple of machine translation and summarisation systems.

Why greedy decoding falls short

Greedy decoding picks the single highest-probability token at each step and never reconsiders. The problem is that the locally optimal token is not always part of the globally optimal sequence: a slightly less likely first word might open the door to a much more probable continuation. Once greedy decoding commits, it has no way to recover. Beam search hedges against this by exploring multiple paths, so a strong continuation can rescue a weaker opening.

How beam search works

At each generation step, beam search expands every surviving beam by all possible next tokens, computes the cumulative log-probability of each resulting sequence, and then prunes back to only the top-K highest-scoring sequences. This repeats until beams hit an end-of-sequence token or a maximum length. The final answer is the complete beam with the best score. Larger K explores more thoroughly but multiplies the compute and memory required at every step.

The length-penalty problem

Each appended token contributes a probability below 1.0, so multiplying them together means longer sequences always score lower. Left uncorrected, beam search systematically favours short, truncated outputs. The fix is a length penalty: the cumulative score is divided by the sequence length raised to a tunable exponent (often around 0.6–1.0), normalising across lengths so a complete, well-formed answer competes fairly against a terse one.

When beam search wins — and when it loses

Beam search shines on tasks with a relatively constrained, high-fidelity target: translation, summarisation, speech recognition, and grammatical error correction, where you want the most probable faithful output. It is a poor fit for open-ended generation. With wide beams, modern LLMs tend to produce dull, repetitive, and oddly generic text because the highest-probability sequence is often the most predictable one. For chat, storytelling, and brainstorming, stochastic methods such as top-p (nucleus) sampling and temperature sampling produce far more natural, varied results. Choosing a decoding strategy is therefore task-driven: beam search for fidelity, sampling for creativity.

Ad placeholder (rectangle)