Definition
An activation function is a mathematical transformation applied to the output of a neuron — typically after the weighted sum of its inputs. Its job is to introduce non-linearity into the network. Without it, a deep stack of layers would collapse into a single linear transformation, capable only of modelling straight-line relationships. Activation functions are what give neural networks the power to approximate the complex, curved functions found in real data.
Why non-linearity matters
A neuron without an activation function computes a simple weighted sum. Chaining many such linear operations together still yields a single linear operation, so adding depth alone would not help. By inserting a non-linear function between layers, each layer can bend and reshape the representation, allowing the full network to learn intricate patterns such as the structure of language or images. This non-linearity is what makes “deep” learning meaningful.
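The collapse argument can be checked numerically. The following is a minimal sketch in plain Python; the 2x2 weights `W1` and `W2`, the input `x`, and the helper names are made up for illustration, and biases are left out for brevity. It shows that two stacked linear layers match a single layer with weight `W2 · W1`, while a network with a ReLU in between fails a basic linearity check (`f(-x) = -f(x)`).

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, vi) for vi in v]

def two_layer_relu(x, W1, W2):
    """Linear layer, ReLU, then a second linear layer."""
    return matvec(W2, relu(matvec(W1, x)))

# Illustrative weights and input (assumptions, not taken from any real model).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Without an activation, two layers equal one layer with weight W2 * W1.
stacked_linear = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)

# With ReLU in between, the result differs and the map is no longer linear:
# a linear map f would satisfy f(-x) == -f(x).
with_relu = two_layer_relu(x, W1, W2)
negated_input = two_layer_relu([-xi for xi in x], W1, W2)
negated_output = [-yi for yi in with_relu]

print(stacked_linear == collapsed)      # the linear stack collapses
print(negated_input == negated_output)  # linearity check for the ReLU network
```

The first comparison holds for any choice of weights; the second fails here because ReLU zeroes out a different hidden unit for `x` than for `-x`.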
Common activation functions
- ReLU (Rectified Linear Unit): outputs the input if positive, else zero (max(0, x)). Cheap, simple, and effective; the long-time default for deep networks. Its main weakness is “dead” neurons that get stuck outputting zero.
- GELU (Gaussian Error Linear Unit): a smooth, probabilistic variant of ReLU widely used in transformers such as BERT and GPT.
- SiLU / Swish: x * sigmoid(x), another smooth alternative to ReLU that often improves training and is used in many modern models.
- Softmax: converts a vector of raw scores into a probability distribution that sums to one. It is used in attention to produce weights and in the output layer for classification.
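The four functions above can be sketched in a few lines of standard-library Python. This is an illustrative reference, not how a deep learning framework implements them. GELU is written in its exact form, x * Φ(x) with Φ the standard normal CDF; many libraries use a tanh-based approximation instead. Subtracting the maximum score inside softmax is a common numerical-stability step and an addition to the definition given above.

```python
import math

def relu(x):
    """max(0, x): pass positive inputs through, zero out the rest."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x * sigmoid(x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to one."""
    m = max(scores)                        # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, relu(value), silu(value), gelu(value))
    print(softmax([1.0, 2.0, 3.0]))
```

ReLU, SiLU, and GELU act on one number at a time, whereas softmax needs the whole score vector because each output depends on the sum over all entries.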
How they fit into transformers
In a transformer, softmax turns attention scores into normalised weights, while the position-wise feed-forward network typically uses GELU or SiLU to add non-linear processing capacity between attention layers. The choice of activation is a small but real architectural decision that affects training stability and final quality.
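To make the two roles concrete, here is a rough sketch of both places an activation appears, in plain Python. The function names, toy vectors, and weights are hypothetical. Dividing the scores by the square root of the vector length follows the usual scaled dot-product formulation, which is an assumption beyond what the text above states. Real transformer layers also include learned projections, multiple heads, residual connections, and normalisation, all omitted here.

```python
import math

def softmax(scores):
    """Normalise raw scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def attention_weights(query, keys):
    """Scaled dot-product scores for one query, normalised with softmax."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, key) / scale for key in keys])

def feed_forward(x, W_in, b_in, W_out, b_out):
    """Position-wise feed-forward block: linear, GELU, linear."""
    hidden = [gelu(dot(row, x) + b) for row, b in zip(W_in, b_in)]
    return [dot(row, hidden) + b for row, b in zip(W_out, b_out)]

# Toy inputs (illustrative values only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(query, keys)

W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # expand 2 -> 3
b_in = [0.0, 0.0, 0.0]
W_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # project 3 -> 2
b_out = [0.0, 0.0]
output = feed_forward([1.0, -1.0], W_in, b_in, W_out, b_out)

print(weights)
print(output)
```

Swapping `gelu` for a SiLU function in `feed_forward` is the kind of small architectural choice the paragraph above refers to.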
Why it matters
Activation functions are deceptively small components with outsized impact. A well-behaved activation helps gradients flow during backpropagation, reduces the risk of vanishing or exploding signals, and can decide whether a deep network trains at all. They are a foundational building block of every modern AI model.
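A back-of-the-envelope sketch can illustrate the gradient-flow point. By the chain rule, the gradient that reaches an early layer includes one activation derivative per layer it passes through. The code below multiplies that derivative over a hypothetical 20-layer stack while ignoring the weights entirely, which is a strong simplification. A plain sigmoid activation is used as the contrast case; it is chosen here for illustration and is not one of the functions recommended above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid; its maximum is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def gradient_factor(grad_fn, pre_activation, depth):
    """Product of `depth` identical activation derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth

depth = 20  # hypothetical number of layers

# Best case for sigmoid (x = 0) still shrinks the gradient by 0.25 per layer.
sigmoid_factor = gradient_factor(sigmoid_grad, 0.0, depth)

# ReLU passes the gradient unchanged while the unit is active...
relu_active = gradient_factor(relu_grad, 1.0, depth)

# ...and blocks it completely when the unit is inactive (a "dead" neuron).
relu_dead = gradient_factor(relu_grad, -1.0, depth)

print(sigmoid_factor, relu_active, relu_dead)
```

Even in its best case the sigmoid factor falls below one part in a trillion after 20 layers, while an active ReLU path keeps a factor of exactly one. The inactive case shows the trade-off: a unit that always receives negative input passes no gradient and cannot recover through training.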