Dropout (AI Glossary)

Randomly disabling neurons during training to prevent overfitting.


What is dropout?

Dropout is a regularisation technique for neural networks that randomly “drops” — sets to zero — a fraction of neuron activations during each training step. The fraction dropped is controlled by a single hyperparameter, the dropout probability (often written p). With p = 0.5, roughly half of the units in a layer are ignored on any given forward and backward pass, and a different random subset is dropped each step.
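As a rough illustration of the masking step described above, the sketch below zeroes each activation independently with probability p and checks that a fresh mask is drawn on every call. The function name drop_units and the toy layer of ones are hypothetical, chosen only for this example; real frameworks operate on tensors and also rescale the survivors.

```python
import random


def drop_units(activations, p, rng):
    """Zero each activation independently with probability p (no rescaling)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # One keep/drop decision per unit; a new mask is drawn on every call.
    mask = [0.0 if rng.random() < p else 1.0 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)  # seeded only so the sketch is repeatable
layer = [1.0] * 10_000  # toy activations (assumption for illustration)

step1, mask1 = drop_units(layer, p=0.5, rng=rng)
step2, mask2 = drop_units(layer, p=0.5, rng=rng)

fraction_dropped = mask1.count(0.0) / len(mask1)
print(f"fraction dropped in step 1: {fraction_dropped:.3f}")
print("same mask in both steps:", mask1 == mask2)
```

With p = 0.5 the dropped fraction should land close to one half, and the two steps should almost certainly use different masks.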

By preventing the network from leaning too heavily on any individual neuron, dropout encourages redundant, distributed representations and reduces overfitting — the tendency of a model to memorise training data rather than learn patterns that generalise.

Why dropout works

The most influential way to think about dropout is as implicit ensembling. Each training step trains a slightly different “thinned” network, because a different random subset of neurons is active. Over thousands of steps the procedure trains an enormous number of overlapping subnetworks that share weights. At test time the full network is used with appropriately scaled activations, which approximates averaging the predictions of all those subnetworks. Ensembles typically generalise better than their individual members, and dropout obtains a similar effect at the cost of training one model.
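The ensembling view can be checked exactly in the simplest possible case. The sketch below is an illustration under stated assumptions, not a general proof: it takes a single linear unit with four made-up inputs and weights, enumerates every possible dropout mask, and compares the probability-weighted average of the thinned outputs with the full unit whose inputs are scaled by the keep probability 1 − p. For a linear unit the two agree exactly; with non-linear layers the full network only approximates the average.

```python
from itertools import product


def linear_unit(inputs, weights, bias):
    """Output of one linear neuron."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def average_over_thinned_networks(inputs, weights, bias, p):
    """Probability-weighted mean output over every possible dropout mask."""
    total = 0.0
    for mask in product((0, 1), repeat=len(inputs)):
        kept = sum(mask)
        # Each input is kept with probability 1 - p and dropped with probability p.
        prob = (1 - p) ** kept * p ** (len(mask) - kept)
        thinned = [x * m for x, m in zip(inputs, mask)]
        total += prob * linear_unit(thinned, weights, bias)
    return total


# Toy values, chosen arbitrarily for the illustration.
inputs = [0.5, -1.0, 2.0, 0.25]
weights = [0.8, 0.1, -0.4, 1.5]
bias = 0.2
p = 0.5

ensemble_mean = average_over_thinned_networks(inputs, weights, bias, p)
# Full (undropped) unit with inputs scaled by the keep probability.
full_network = linear_unit([x * (1 - p) for x in inputs], weights, bias)

print("mean over all thinned networks:", ensemble_mean)
print("scaled full network:           ", full_network)
```

Enumerating masks is only feasible for a handful of units (there are 2^n of them), which is why the single scaled forward pass is what gets used in practice.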

A second intuition is that dropout adds noise to the activations. To keep performing well despite this noise, the network must avoid brittle “co-adapted” features where neurons only work in tight combination. It must instead learn features that are individually useful.

The dropout probability and inverted dropout

The dropout probability is a hyperparameter you choose before training:

  • Low values (0.1–0.2) apply mild regularisation, suitable near input layers or for smaller models.
  • Higher values (0.3–0.5) apply stronger regularisation, useful for large fully-connected layers prone to overfitting.

Modern frameworks use inverted dropout: during training the surviving activations are scaled up by 1 / (1 − p). This keeps the expected value of each activation, and therefore of their sum, the same as it would be without dropout, so no rescaling is needed at inference. You simply turn dropout off and the activation scale already matches what the next layer saw during training.
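A minimal sketch of inverted dropout, assuming plain Python lists stand in for a layer's activations. The function name inverted_dropout and its training flag are hypothetical and are not taken from any particular framework. It shows the 1 / (1 − p) scaling during training and the pass-through behaviour at inference.

```python
import random


def inverted_dropout(activations, p, training, rng):
    """Drop units with probability p and scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: dropout is off and the activations pass through unchanged.
        return list(activations)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]


rng = random.Random(0)  # seeded only so the sketch is repeatable
activations = [2.0] * 100_000  # toy layer output (assumption for illustration)

train_out = inverted_dropout(activations, p=0.2, training=True, rng=rng)
eval_out = inverted_dropout(activations, p=0.2, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)
eval_mean = sum(eval_out) / len(eval_out)
print(f"mean activation in training:  {train_mean:.3f}")
print(f"mean activation at inference: {eval_mean:.3f}")
```

The two means should be close to each other even though about 20% of the training activations are zero, which is the point of scaling the survivors during training rather than rescaling at inference.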

When to use dropout

Dropout is most effective in large, fully-connected layers and in settings with limited training data, where overfitting is a real risk. In convolutional networks it is used more sparingly, and in modern transformer architectures it is typically applied at modest rates alongside weight decay, another regulariser, and layer normalisation, which mainly stabilises training rather than regularising.

If a model underfits — high error on both training and validation data — reduce or remove dropout. If it overfits — low training error but high validation error — increasing the dropout rate is one of the first remedies to try.
