What is a hyperparameter?
A hyperparameter is a configuration value you set before training begins and that the learning algorithm cannot adjust by itself. Hyperparameters control how a model learns — the size of each optimisation step, how many examples are processed at once, how deep the network is — rather than what it learns. They are choices made by the practitioner, not discovered from the data.
Picking good hyperparameters is one of the most important and time-consuming parts of building a machine-learning model, because the same architecture can train beautifully or fail completely depending on these settings.
Hyperparameters vs parameters
The distinction is easy to confuse but fundamental:
- Parameters are the weights and biases inside the model. There can be billions of them, and the training algorithm (typically gradient descent or a variant of it) learns their values from data. You do not set them by hand.
- Hyperparameters are the settings around training. There are usually a handful, and you choose them. The model does not optimise them as part of normal training.
In short: you tune hyperparameters; the model fits parameters.
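To make the distinction concrete, here is a minimal sketch of a toy linear regression trained by gradient descent. The data, the variable names, and the specific values of `learning_rate` and `num_epochs` are illustrative assumptions, not recommendations. The two hyperparameters are fixed by hand before the loop starts; the two parameters `w` and `b` are only ever changed by the training loop.

```python
# Hyperparameters: chosen by the practitioner before training starts.
learning_rate = 0.05
num_epochs = 500

# Toy training data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Parameters: initialised arbitrarily, then learned from the data.
w, b = 0.0, 0.0

n = len(xs)
for _ in range(num_epochs):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # The update rule changes the parameters; the hyperparameters stay fixed.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.3f}, b={b:.3f}")  # should land close to w=2, b=1
```

Nothing inside the loop ever modifies `learning_rate` or `num_epochs`; changing them means editing the script and training again.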
Common hyperparameters
The most influential hyperparameters in deep learning include:
- Learning rate — how big a step the optimiser takes; usually the single most important setting.
- Batch size — how many examples are processed before each weight update.
- Number of epochs — how many full passes are made over the training data.
- Architecture choices — number of layers, hidden dimension, number of attention heads.
- Regularisation — dropout rate, weight decay.
- Optimiser choice — for example SGD, Adam, or AdamW.
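As a rough illustration of why the learning rate matters so much, the sketch below runs plain gradient descent on the one-dimensional function f(x) = x**2 with three different step sizes. The function name `minimise`, the objective, and the chosen values are assumptions made for this example only; real models behave less tidily.

```python
def minimise(learning_rate, steps=50, x0=5.0):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x              # derivative of x**2
        x -= learning_rate * grad
    return x

# Too small: slow progress. Moderate: converges towards 0. Too large: diverges.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final x = {minimise(lr):.4g}")
```

Each step multiplies x by (1 - 2 * learning_rate), so a rate of 0.1 shrinks x quickly, a rate of 0.01 shrinks it slowly, and a rate of 1.1 makes its magnitude grow on every step.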
At inference time, generation settings such as temperature and top-p are sometimes informally called hyperparameters, since they too are configured rather than learned.
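The effect of temperature can be sketched with a softmax over a few made-up token scores. The helper name `softmax_with_temperature` and the logit values are hypothetical, and the sketch covers temperature only, not top-p.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution. The model's weights are the same in every case, which is why the setting is configured rather than learned.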
Tuning strategies
Because hyperparameters interact, finding a good combination usually means searching. The main approaches are:
- Grid search — define a grid of values for each hyperparameter and try every combination. Thorough but expensive, and it scales badly as the number of hyperparameters grows.
- Random search — sample combinations at random within sensible ranges. It is often more efficient than grid search because it explores more distinct values of the hyperparameters that actually matter.
- Bayesian optimisation — build a model of how settings affect performance and use it to propose the most promising next combination, focusing effort where it is likely to pay off.
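The difference between the first two strategies shows up in how candidates are generated. The following is a minimal sketch using only the standard library; the search space, the ranges, and the helper `sample_candidate` are hypothetical. Bayesian optimisation is not sketched here because it normally relies on a dedicated library.

```python
import itertools
import random

# Hypothetical search space: three values for each of three hyperparameters.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.1, 0.3],
}

# Grid search: every combination, so 3 * 3 * 3 = 27 candidates.
names = list(grid)
grid_candidates = [
    dict(zip(names, values)) for values in itertools.product(*grid.values())
]

# Random search: the same budget of 27 trials, sampled from ranges instead.
rng = random.Random(0)  # seeded so the sketch is reproducible

def sample_candidate():
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),  # log-uniform in [1e-4, 1e-2]
        "batch_size": rng.choice([32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.3),
    }

random_candidates = [sample_candidate() for _ in range(len(grid_candidates))]

# Count how many distinct learning rates each strategy actually tries.
distinct_lr_grid = len({c["learning_rate"] for c in grid_candidates})
distinct_lr_random = len({c["learning_rate"] for c in random_candidates})
print(len(grid_candidates), distinct_lr_grid, distinct_lr_random)
```

With the same number of trials, the grid tries only three distinct learning rates, while random sampling tries a different one on almost every trial. Adding a fourth hyperparameter with three values would triple the grid to 81 candidates, whereas the random-search budget can stay wherever you set it.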
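Whichever strategy you use, each candidate is scored on a held-out validation set rather than on the training data, so the chosen settings reflect genuine generalisation rather than memorisation. Because the validation set is itself used to pick the settings, a separate test set is needed for an unbiased final estimate of performance.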
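Selection on a validation set can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic, the model is a one-dimensional ridge regression with a closed-form solution, and the names `make_data`, `fit`, and `mse` are hypothetical helpers. The single hyperparameter being tuned is the regularisation strength `weight_decay`.

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible

def make_data(n):
    """Noisy samples from y = 3x (synthetic, for illustration only)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + rng.gauss(0, 0.5)) for x in xs]

train, val = make_data(20), make_data(20)

def fit(data, weight_decay):
    """Closed-form 1-D ridge regression: the parameter w is learned from data."""
    numerator = sum(x * y for x, y in data)
    denominator = sum(x * x for x, _ in data) + weight_decay
    return numerator / denominator

def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]  # values of the hyperparameter to try
results = []
for wd in candidates:
    w = fit(train, wd)                   # fit the parameter on training data only
    results.append((mse(w, val), wd))    # score on the held-out validation data

best_val_mse, best_wd = min(results)
print(f"best weight_decay={best_wd}, validation MSE={best_val_mse:.3f}")
```

The training data is used only inside `fit`, and the validation data only inside the scoring step, so the comparison between candidates is never based on examples the model was fitted to.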