Question 1

What is gradient descent?

Accepted Answer

Gradient descent is the optimisation algorithm used to train most neural networks. It computes the gradient (the slope) of the loss function with respect to each parameter, then moves every parameter a small step in the opposite direction, which is the direction that reduces the loss. Repeated over many iterations, this walks the model downhill towards a minimum of the loss, though not necessarily the lowest one.
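The update rule can be sketched on a toy problem. This is a minimal illustration, not how a real training loop is written: the one-parameter loss (w - 3)^2, the function names, and the learning rate of 0.1 are all assumptions chosen so the result is easy to check by hand.

```python
def loss(w):
    # Toy loss with a single minimum at w = 3 (an assumption for illustration).
    return (w - 3.0) ** 2


def grad(w):
    # Analytic derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient, i.e. in the direction that lowers the loss.
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


w_final = gradient_descent(w0=0.0)
print(w_final, loss(w_final))
```

With these settings the distance to the minimum shrinks by a constant factor on every step, so w_final ends up very close to 3. A learning rate that is too large for this loss (above 1.0 here) would make the iterates diverge instead.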

Question 2

What is the difference between SGD and Adam?

Accepted Answer

Plain SGD updates each parameter by subtracting the raw gradient multiplied by a fixed learning rate that is the same for every parameter. Adam adapts the step size per parameter using running averages of past gradients and of their squares, which often makes it converge faster and makes it less sensitive to the choice of learning rate. AdamW is Adam with weight decay decoupled from the gradient update, and it is a common default for many modern models.
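The difference is easiest to see with the two update rules side by side. The sketch below is a plain-Python illustration over lists of floats, not any library's API: the function names, argument layout, and default hyperparameters are assumptions (the defaults follow commonly used Adam values). Setting weight_decay above zero gives the decoupled, AdamW-style decay.

```python
import math


def sgd_step(params, grads, lr=0.01):
    # Plain SGD: the same learning rate scales every raw gradient.
    return [p - lr * g for p, g in zip(params, grads)]


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    # One Adam update; t is the 1-based step count used for bias correction.
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running average of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running average of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # correct the bias from zero initialisation
        v_hat = v_i / (1 - beta2 ** t)
        p = p - lr * weight_decay * p              # decoupled (AdamW-style) weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


params = [1.0, 1.0]
grads = [100.0, 0.01]  # deliberately very different gradient scales
print(sgd_step(params, grads))
print(adam_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)[0])
```

With gradients of such different magnitude, the SGD steps differ by a factor of 10,000, while the first Adam step moves both parameters by roughly lr. That per-parameter normalisation is what the answer above means by adapting the step size.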

Question 3

What is stochastic gradient descent?

Accepted Answer

Stochastic gradient descent (SGD) estimates the gradient from a small random batch of data rather than the whole dataset. Each step is therefore far cheaper, and the noise in the estimate can help the model escape poor minima. The trade-off is that individual updates are less accurate, so the loss falls along a jagged path rather than a smooth one.
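As a rough sketch of the idea, the code below fits a line to synthetic data using mini-batches. Everything specific here is an assumption made for illustration: the function name minibatch_sgd, the noise-free data drawn from y = 2x + 1, the batch size of 8, and the learning rate.

```python
import random


def minibatch_sgd(xs, ys, batch_size=8, lr=0.05, epochs=200, seed=0):
    # Fit y = w * x + b by mean squared error, one random mini-batch at a time.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            # The batch average is a noisy estimate of the full-dataset gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


xs = [i / 32 for i in range(64)]
ys = [2.0 * x + 1.0 for x in xs]  # synthetic targets, assumed for the example
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

Each update touches only 8 of the 64 points, yet the fit still approaches w = 2 and b = 1. On real, noisy data the estimates would keep fluctuating around the minimum unless the learning rate is reduced over time.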

Question 4

What is a learning rate schedule?

Accepted Answer

A learning rate schedule changes the step size over the course of training. A common example is a short warmup that ramps the rate up, followed by cosine decay that gradually shrinks it. The larger rate early on lets the model make fast progress, and the smaller rate later lets it settle into a good minimum without overshooting.
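A warmup-then-cosine schedule can be written as a small function of the step number. This is one possible formulation, assuming a linear warmup and a decay down to min_lr; the function name lr_at_step and all parameter names are hypothetical, and frameworks differ in details such as whether warmup starts at exactly zero.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: progress runs from 0 at the end of warmup to 1 at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3))
```

In a training loop the returned value would be set as the optimiser's learning rate before each update. The rate peaks when warmup ends, is about half the peak midway through the decay, and reaches min_lr at the final step.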

Gradient Descent (AI Glossary)
