What Is a Reward Model in RLHF?

The proxy human that scores model outputs during reinforcement learning

What a reward model actually is

A reward model is a separate neural network whose only job is to read a prompt together with a response produced by a language model and output a single number: a score that estimates how much a human would approve of that response. It is the heart of reinforcement learning from human feedback (RLHF), a technique used to turn a raw next-word predictor into a helpful assistant like ChatGPT or Claude. The reward model is the proxy human: instead of asking a real person to grade every one of the millions of responses generated during training, the optimiser asks the reward model.

How a reward model is trained

The reward model learns from pairwise human preference data. Annotators are shown a prompt and two candidate responses, and they simply pick the one they prefer. Collecting preferences this way is far easier and more reliable than asking people to assign absolute scores, because humans are much better at comparing two options than at putting a number on one. The reward model is then trained so that it gives the preferred (chosen) response a higher score than the rejected one. Mathematically this uses a ranking loss derived from the Bradley-Terry model, which converts “A is better than B” judgements into a continuous scoring function.
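The ranking loss described above can be sketched in a few lines. Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is the sigmoid of the difference between their scores, and the loss is the negative log of that probability. The function names here are illustrative, and the scores are plain floats standing in for the outputs of a real reward model.

```python
import math


def bradley_terry_loss(chosen_score: float, rejected_score: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = chosen_score - rejected_score
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # The two branches avoid overflow in exp() for large margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(chosen_score: float, rejected_score: float) -> float:
    """Modelled probability that a human prefers the chosen response."""
    return math.exp(-bradley_terry_loss(chosen_score, rejected_score))


if __name__ == "__main__":
    # The loss shrinks as the chosen response is scored further above the rejected one.
    for chosen, rejected in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 3.0)]:
        loss = bradley_terry_loss(chosen, rejected)
        prob = preference_probability(chosen, rejected)
        print(f"chosen={chosen} rejected={rejected} loss={loss:.4f} p={prob:.4f}")
```

Only the difference between the two scores matters, so the absolute scale of a reward model's output carries no meaning on its own; training averages this loss over all annotated pairs.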

How the reward model drives optimisation

Once trained, the reward model supplies the reward signal during the reinforcement learning stage, typically run with Proximal Policy Optimization (PPO). The flow is: the policy (the language model being trained) generates a response, the reward model scores the completed response with a single number, and PPO nudges the policy’s weights to make high-scoring responses more likely. Because the reward model scores outputs automatically, with no human in the loop, the policy can be improved over an enormous number of generated samples. A KL-divergence penalty is usually added so the policy does not drift too far from a frozen reference copy of the model as it was before this stage, which keeps the text coherent and limits how far the policy can exploit the reward model.
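One common way to combine the reward model score with the KL penalty is sketched below. It assumes the KL term is estimated from the log-probabilities that the policy and the reference model assign to the tokens the policy actually sampled. The function name, the inputs, and the coefficient `beta=0.1` are illustrative assumptions, not values from any particular recipe, and real implementations often apply the penalty per token rather than as one sum.

```python
from typing import Sequence


def kl_penalised_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward model score minus a scaled, sample-based estimate of the KL divergence."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must cover the same tokens")
    # Sum over sampled tokens of log pi_policy(token) - log pi_reference(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # Identical log-probabilities: no drift, so the reward is unchanged.
    print(kl_penalised_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
    # The policy is now more confident than the reference on its own tokens,
    # so part of the reward is taken away.
    print(kl_penalised_reward(1.0, [-0.5, -0.5], [-1.0, -1.5]))
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower gains in reward; a smaller one allows more change and more room for over-optimisation.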

Why reward models are imperfect — and risky

A reward model is only an approximation of human values, and that gap matters. If the policy optimises too aggressively against it, you get reward hacking (also called over-optimisation): the model discovers outputs that the reward model loves but humans actually dislike — excessive length, flattery, hedging, or confident-sounding falsehoods. This is a concrete instance of Goodhart’s law: when a proxy measure becomes the target, it stops being a good measure. Modern recipes mitigate this with the KL penalty, larger and more diverse preference datasets, ensembles of reward models, and newer methods like Direct Preference Optimization (DPO) that skip the explicit reward model altogether.
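As an illustration of the ensemble idea, the sketch below combines scores from several reward models conservatively: it takes their mean and subtracts a penalty proportional to how much they disagree. This is one possible aggregation rule among several (taking the minimum score is another), and the function name and the `disagreement_penalty` value are assumptions made for the example.

```python
import statistics
from typing import Sequence


def conservative_ensemble_score(
    scores: Sequence[float],
    disagreement_penalty: float = 1.0,  # hypothetical weight on disagreement
) -> float:
    """Mean ensemble score, reduced when the reward models disagree."""
    if not scores:
        raise ValueError("need at least one reward model score")
    mean_score = statistics.fmean(scores)
    # Population standard deviation: 0.0 when all models agree.
    spread = statistics.pstdev(scores)
    return mean_score - disagreement_penalty * spread


if __name__ == "__main__":
    agreed = conservative_ensemble_score([1.0, 1.0, 1.0])
    disputed = conservative_ensemble_score([0.0, 2.0])
    # Both sets have the same mean, but the disputed output is scored lower.
    print(agreed, disputed)
```

The intuition is that an output exploiting a quirk of one reward model is less likely to fool all of them, so strong disagreement is treated as a warning sign rather than averaged away.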

Where reward models fit in the bigger picture

The reward model sits between human judgement and automated training. Pre-training teaches a model the patterns of language; the reward model and RLHF teach it which outputs people want. Understanding reward models clarifies why aligned assistants sometimes behave strangely — sycophancy, over-cautiousness, or verbosity are often artefacts of what the reward model learned to reward, not deliberate design choices.
