Definition
The softmax function converts a vector of raw, unbounded scores — called logits — into a proper probability distribution: each output lies between 0 and 1, and all the outputs sum to exactly 1. It is the standard way neural networks express “how likely is each option” at the end of a classification or generation step. The name reflects that it is a smooth, differentiable approximation to the hard “argmax” of simply picking the single highest score.
The formula
For each element, softmax exponentiates the logit and divides by the sum of all the exponentials: softmax(z)_i = e^(z_i) / Σ_j e^(z_j). Exponentiating does two jobs at once: it guarantees every value is positive (a requirement for probabilities), and it amplifies gaps, so a logit that is clearly larger ends up with a disproportionately bigger share. The division by the sum is the normalisation step that forces the outputs to add up to one. Subtracting the same constant from every logit leaves the result unchanged, because the common factor cancels between numerator and denominator. Implementations therefore usually subtract the maximum logit first, which keeps the largest exponent at zero and prevents overflow.
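The formula and the max-subtraction trick can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation; the function name softmax and the example logits are chosen here for demonstration only.

```python
import math

def softmax(logits):
    """Return softmax probabilities for a list of floats."""
    # Subtract the maximum logit so the largest exponent is exactly 0.
    # This does not change the result but avoids overflow in math.exp.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    # Normalise so the outputs sum to one.
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs is roughly [0.659, 0.242, 0.099]

# Very large logits are handled safely; a naive math.exp(1000.0)
# would raise OverflowError.
big = softmax([1000.0, 1001.0, 1002.0])
```

Note how the logit 2.0 is only twice the logit 1.0, yet it receives well over twice the probability: that is the gap-amplifying effect of the exponential.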
Softmax in language models
LLMs apply softmax at the output layer to turn their logits over the entire vocabulary into a probability for each possible next token. Sampling that distribution (or taking its most likely token) is how the model decides what to generate. Softmax also appears inside attention: the raw attention scores between a query and all the keys are passed through softmax to produce attention weights that sum to one, so each position’s output is a clean weighted average of the values.
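Both uses can be illustrated with a toy sketch. Everything concrete here is an assumption made for the example: the three-word vocabulary, the logits, the two-dimensional vectors, and the helper names. The division by the square root of the vector dimension follows the common scaled dot-product formulation, which the text above does not spell out.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# --- Output layer: logits over a (hypothetical) vocabulary ---
vocab = ["cat", "dog", "fish"]
logits = [3.0, 1.5, 0.2]
probs = softmax(logits)

# Greedy decoding: take the most likely token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]

# --- Attention: softmax over query-key scores ---
def attend(query, keys, values):
    """Return (weights, output) for one query over all keys and values."""
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(d)
    # (an assumption: the usual scaled dot-product variant).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # non-negative, sums to one
    dim = len(values[0])
    # Weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = attend([1.0, 0.0], keys, values)
```

Because the weights are non-negative and sum to one, each component of output stays within the range spanned by the corresponding components of the value vectors.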
Temperature scaling
A temperature parameter T modifies softmax by dividing every logit by T before exponentiation, giving softmax(z / T). Low temperature (below 1) widens the gaps between the scaled logits, which sharpens the distribution, pushes probability mass onto the top candidates, and makes output more deterministic. High temperature (above 1) shrinks those gaps and flattens the distribution toward uniform, making output more random and varied. A temperature of exactly 1 leaves the distribution unchanged, and as T approaches 0 the result approaches a hard argmax (T must stay positive, since the formula divides by it). This single knob is the main lever for trading off reliability against creativity in generation.
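The effect of the temperature can be seen by running the same logits through softmax at several settings. The sketch below is illustrative only: the function name softmax_with_temperature, the example logits, and the choice to reject non-positive temperatures are assumptions made for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        # Assumption for this sketch: reject T <= 0 rather than
        # special-casing it as greedy selection.
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)   # more peaked
plain = softmax_with_temperature(logits, temperature=1.0)   # unchanged softmax
flat = softmax_with_temperature(logits, temperature=2.0)    # closer to uniform
near_greedy = softmax_with_temperature(logits, temperature=0.01)  # almost one-hot
```

Comparing the first entry of each result shows the top candidate's share shrinking as the temperature rises, while near_greedy puts almost all of the mass on the highest logit.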
Why softmax matters
Softmax is the bridge between a network’s internal scores and an interpretable, sampleable distribution. It pairs naturally with cross-entropy loss during training: when the two are combined, the gradient of the loss with respect to each logit reduces to the predicted probability minus the target probability, which is simple to compute and numerically well behaved. Softmax also underpins both the final token choice and the attention mechanism in standard transformer architectures. Understanding softmax (and its temperature) demystifies both how models pick their next word and why turning one dial changes their whole personality.
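One way to see why softmax and cross-entropy fit together is to check the gradient numerically. For a single correct class, the derivative of the cross-entropy loss with respect to the logits works out to the softmax output minus a one-hot target vector. The sketch below compares that closed form against a finite-difference estimate; the helper names, the example logits, and the target index are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """Closed form: softmax probabilities minus the one-hot target."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * eps))
    return grads

logits = [2.0, 1.0, 0.1]
target = 1  # index of the (hypothetical) correct class
g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
```

The two gradients agree to within finite-difference error, and the entries of g_analytic sum to zero: probability mass is pushed toward the target class and away from the others by the same total amount.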