Question 1

What is a reward model in simple terms?

Accepted Answer

A reward model is a separate neural network trained to predict how much a human would like a given model output, returning a single score. During reinforcement learning it stands in for a human rater so the main model can be optimised against millions of outputs without a person reviewing each one.

Question 2

How is a reward model trained?

Accepted Answer

It is trained on pairwise human preference data: people are shown two responses to the same prompt and pick the better one. The reward model learns to assign a higher score to the preferred response. This turns subjective human judgement into a numeric signal the optimiser can use.
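To make "assign a higher score to the preferred response" concrete, here is a minimal sketch of a pairwise loss. It assumes a Bradley-Terry style formulation, which is a common choice but is not stated in the answer above; the function name and arguments are hypothetical, and the two scores stand in for whatever the reward model outputs for each response.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumption: P(chosen preferred) = sigmoid(score_chosen - score_rejected),
    i.e. a Bradley-Terry style model. Other loss choices are possible.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # Branch on the sign so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
```

Minimising this loss over many labelled pairs pushes the score of each preferred response above the score of its rejected counterpart. Only the difference between the two scores matters, not their absolute values.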

Question 3

Why not just have humans score every output during training?

Accepted Answer

Reinforcement learning needs feedback on a huge number of generated samples, far more than humans could ever rate live. The reward model acts as a fast, cheap proxy that approximates human preferences, letting training run at machine speed while humans only label a much smaller comparison dataset.

Question 4

What is reward hacking?

Accepted Answer

Reward hacking happens when the policy (the model being trained) finds outputs that score highly under the reward model but are not actually good, for example by being verbose, sycophantic, or confidently wrong. Because the reward model is an imperfect proxy, optimising too hard against it (over-optimisation) can degrade real quality, so training is usually constrained with a KL penalty that keeps the policy close to the reference model it started from.
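As a rough illustration of how a KL penalty constrains training, the sketch below subtracts a scaled per-sample KL estimate from the reward model's score. The function name, the `beta` coefficient, and its default value are assumptions for illustration only; real training setups differ in how they estimate and apply the penalty.

```python
def kl_penalised_reward(
    rm_score: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength, not a recommended value
) -> float:
    """Reward model score minus a penalty for drifting from the reference model.

    logprob_policy and logprob_reference are the log-probabilities that the
    current policy and the frozen reference model assign to the same sampled
    response. Their difference is a single-sample estimate of the KL term.
    """
    kl_estimate = logprob_policy - logprob_reference
    return rm_score - beta * kl_estimate


# No drift: the policy and reference agree, so the score passes through.
unchanged = kl_penalised_reward(1.0, -10.0, -10.0)

# Drift: the policy finds this response far more likely than the reference
# does, so part of the reward model score is taken away.
penalised = kl_penalised_reward(1.0, -5.0, -10.0, beta=0.1)
```

A larger `beta` keeps the policy closer to the reference model and limits over-optimisation, at the cost of slower gains in reward model score.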

What Is a Reward Model in RLHF?

What a reward model actually is

How a reward model is trained

How the reward model drives optimisation

Why reward models are imperfect — and risky

Where reward models fit in the bigger picture