Question 1

What is an activation function?

Accepted Answer

An activation function is a mathematical function applied to a neuron's weighted sum of inputs (plus bias) to produce the neuron's output. It introduces non-linearity, which lets the network learn relationships that a purely linear model, one that only computes weighted sums of its inputs, cannot represent.
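As a rough illustration, a single neuron can be sketched as a weighted sum followed by an activation. The function name `neuron`, the example weights, and the choice of tanh as the activation are assumptions made for this sketch, not part of any particular library.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 1.0

# Identity "activation": the neuron is just a weighted sum.
linear_out = neuron(inputs, weights, bias, lambda z: z)

# Non-linear activation (tanh chosen here purely as an example).
tanh_out = neuron(inputs, weights, bias, math.tanh)

print(linear_out, tanh_out)
```

With the identity function the neuron returns the raw weighted sum. With tanh the same sum is squashed into the range (-1, 1).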

Question 2

Why do neural networks need non-linearity?

Accepted Answer

Without a non-linear activation, a stack of layers collapses into a single linear layer no matter how deep the network is, because a composition of linear maps is itself a linear map. Such a network could only model linear relationships. Non-linearity is what lets deep networks approximate complex functions.
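The collapse can be checked numerically. The sketch below uses made-up 2x2 weight matrices and omits biases for brevity. The helper names `matvec` and `matmul` are hypothetical, written here only to keep the example free of dependencies.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, t) for t in v]

# Illustrative weights and input (assumed values).
W1 = [[1.0, 2.0], [0.0, -1.0]]   # first layer
W2 = [[3.0, 1.0], [-2.0, 0.5]]   # second layer
x = [0.5, -1.5]

# Two linear layers applied one after the other.
two_layers = matvec(W2, matvec(W1, x))

# One layer whose weights are the product W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# The same two layers with a ReLU in between.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers)  # [-6.0, 5.75]
print(one_layer)   # [-6.0, 5.75]
print(with_relu)   # [1.5, 0.75]
```

The first two results match, so the second layer added no expressive power. Inserting ReLU between the layers changes the output, and no single weight matrix reproduces that behavior for every input.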

Question 3

What is ReLU and why is it popular?

Accepted Answer

ReLU (Rectified Linear Unit) outputs the input if it is positive and zero otherwise, that is, max(0, x). It is popular because it is very cheap to compute and because its gradient is exactly 1 for positive inputs, so it does not contribute to vanishing gradients there. In practice this tends to make deep networks train faster than with saturating activations.
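A minimal sketch of ReLU and its gradient follows. The sigmoid function is included only as a contrast to show what a vanishing gradient looks like. It is an addition for this example, and the gradient value at exactly zero is a convention chosen here.

```python
import math

def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    """Gradient of the logistic sigmoid, shown only for comparison."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

for x in (-2.0, 0.5, 10.0):
    print(x, relu(x), relu_grad(x), sigmoid_grad(x))
```

For a large positive input such as 10.0, the ReLU gradient stays at 1 while the sigmoid gradient shrinks to a tiny value. That shrinking is the effect the answer above refers to.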

Question 4

Which activation functions do transformers use?

Accepted Answer

Modern transformers commonly use GELU or SiLU (also called Swish) in their feed-forward layers. Both are smooth functions that behave like ReLU for large positive or negative inputs, and they tend to improve training. Softmax is used in the attention mechanism and at the output layer to turn scores into probabilities.
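The three functions named above can be sketched in a few lines of standard-library Python. This uses the exact erf-based form of GELU (many implementations use a tanh approximation instead), and the example scores are made up.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU / Swish: x times the logistic sigmoid of x."""
    return x / (1.0 + math.exp(-x))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Both smooth activations track ReLU away from zero.
for x in (-5.0, 0.0, 5.0):
    print(x, max(0.0, x), gelu(x), silu(x))

# Illustrative attention-style scores.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

For inputs far from zero, GELU and SiLU return values close to what ReLU would give, while near zero they curve smoothly instead of bending sharply. The softmax output preserves the ordering of the scores and sums to 1.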

Activation Function (AI Glossary)

Definition

Why non-linearity matters

Common activation functions

How they fit into transformers

Why it matters