Question 1

What is the softmax function?

Accepted Answer

Softmax takes a vector of real-valued scores (logits) and converts it into a probability distribution: every output lies strictly between 0 and 1, and the outputs sum to 1. It does this by exponentiating each score and dividing by the sum of all the exponentials.
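
As a minimal sketch of the description above, softmax(z)_i = exp(z_i) / sum_j exp(z_j) can be written in plain Python. The function name and the sample logits are illustrative assumptions, and subtracting the maximum is a common numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Convert a list of real-valued scores into a probability distribution."""
    # Subtracting the largest score does not change the result,
    # but it keeps exp() from overflowing on large logits.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for illustration only
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # three values, each between 0 and 1
print(sum(probs))  # approximately 1.0
```

The highest score receives the largest probability, and the ordering of the scores is preserved in the output.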

Question 2

Where is softmax used in language models?

Accepted Answer

Softmax is applied at the output layer to turn the model's raw logits over the vocabulary into a probability for each possible next token. It is also used inside attention to turn raw attention scores into normalised weights that sum to one.
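
Both uses can be sketched with the same function. The four-token vocabulary, the logits, and the attention scores below are invented for illustration; a real model works over tens of thousands of tokens and computes these values itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Output layer: hypothetical vocabulary and made-up next-token logits
vocab = ["cat", "dog", "mat", "the"]
logits = [1.2, 0.3, 3.1, -0.5]
next_token_probs = dict(zip(vocab, softmax(logits)))
most_likely = max(next_token_probs, key=next_token_probs.get)
print(most_likely)  # "mat", the token with the largest logit

# Attention: made-up raw scores of one query against three keys
attention_scores = [0.5, 2.0, -1.0]
attention_weights = softmax(attention_scores)
print(attention_weights)  # non-negative weights that sum to about 1.0
```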

Question 3

What is temperature in softmax?

Accepted Answer

Temperature is a positive value that divides the logits before the exponential. A low temperature (below 1) sharpens the distribution toward the highest-scoring option, making the output more deterministic, while a high temperature (above 1) flattens it toward uniform, making the output more random and, in text generation, more creative. A temperature of 1 leaves the distribution unchanged.
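
The effect can be seen in a small sketch. The function name `softmax_with_temperature`, the sample logits, and the three temperature values are assumptions chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax after dividing every logit by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # stability shift; does not change the result
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
plain = softmax_with_temperature(logits, temperature=1.0)  # ordinary softmax
flat = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform

# Probability given to the top-scoring option shrinks as temperature rises
print(sharp[0], plain[0], flat[0])
```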

Question 4

Why exponentiate the scores?

Accepted Answer

Exponentiation makes every value positive (so they can be probabilities) and amplifies differences between scores, so a clearly higher logit gets a much larger share. Dividing by the sum then normalises the results so they add up to exactly one.
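
To illustrate why the exponential is needed, the sketch below compares it with simply dividing each score by the sum of the scores. The logits are made up, and the "naive" normalisation is shown only as a counterexample, not as something any model uses.

```python
import math

logits = [3.0, 1.0, -2.0]  # made-up scores, one of them negative

# Naive attempt: divide each raw score by the sum of the scores.
# The negative logit produces a negative "probability".
naive = [x / sum(logits) for x in logits]
print(naive)  # [1.5, 0.5, -1.0], not a valid distribution

# Softmax: exponentiate first, so every term is positive, then normalise.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
print(probs)  # roughly [0.876, 0.118, 0.006]
```

A gap of 2 between the top two logits becomes a ratio of about e^2 (roughly 7.4) between their probabilities, which is the amplification the answer describes.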

Softmax (AI Glossary)

Definition

The formula

Softmax in language models

Temperature scaling

Why softmax matters