What RLHF is and why it exists
RLHF — Reinforcement Learning from Human Feedback — is the alignment technique that turned raw language models into the helpful assistants people use today. A model that has only been pre-trained is extremely knowledgeable but not well-behaved: it predicts likely text rather than following instructions, and it has no built-in sense of what is helpful or harmful. RLHF closes that gap by incorporating human judgement directly into training. People rate and rank the model’s outputs, and those preferences are used to steer the model toward responses humans actually want. It is the reason ChatGPT, Claude, and similar systems answer your question instead of, say, continuing it as if it were a passage in a document.
Stage one: supervised fine-tuning
The pipeline begins with supervised fine-tuning (SFT). Human contributors write or curate high-quality examples of the desired behaviour — prompts paired with ideal responses — and the pre-trained model is fine-tuned on this dataset. SFT teaches the model the basic format of being an assistant: respond directly, follow the instruction, adopt a helpful tone. This stage alone produces a noticeably better model, but it is limited by how many examples humans can realistically write and by the fact that there is often more than one good answer to a prompt.
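As a rough illustration of the SFT objective, the sketch below computes the average negative log-likelihood of the response tokens in a single example. The function name `sft_loss`, the toy probabilities, and the choice to exclude prompt tokens from the loss are assumptions made for illustration, not details of any specific training pipeline.

```python
import math


def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigns to each target token.
    is_response_token: True where the token belongs to the ideal response
    (assumption: prompt tokens are masked out of the loss).
    """
    losses = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not losses:
        raise ValueError("example has no response tokens")
    return sum(losses) / len(losses)


# Hypothetical per-token probabilities for one prompt + response example.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.9)]
mask = [False, False, True, True]  # the first two tokens belong to the prompt

loss = sft_loss(log_probs, mask)
```

Fine-tuning lowers this number by making the model assign higher probability to the tokens of the human-written response.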
Stage two: training a reward model
To scale beyond hand-written examples, RLHF introduces a reward model. Humans are shown several different responses to the same prompt and asked to rank them from best to worst. Ranking is far easier and more consistent than writing perfect answers from scratch. These comparisons are used to train a separate model that takes any prompt-response pair and outputs a scalar score predicting how much a human would prefer it. The reward model becomes an automated stand-in for human judgement, able to score the vast number of responses that reinforcement learning will generate.
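One common way to train on such comparisons, assumed here rather than taken from any particular system, is a pairwise (Bradley-Terry style) loss: each ranking is expanded into preferred/rejected pairs, and the reward model is penalised whenever it scores the rejected response above the preferred one. The helper names below are hypothetical, and the scores stand in for the outputs of a real reward model.

```python
import math
from itertools import combinations


def pairs_from_ranking(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    # Equivalent to log(1 + exp(-margin)) but safe for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A human ranked three responses to one prompt from best to worst.
pairs = pairs_from_ranking(["a", "b", "c"])

# Hypothetical scalar scores from the reward model being trained.
scores = {"a": 1.2, "b": 0.3, "c": -0.8}
total_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs)
```

The loss is small when the preferred response receives a clearly higher score and grows as the ordering is violated, which is what pushes the scalar score towards agreement with the human ranking.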
Stage three: PPO policy optimisation
In the final stage, the language model, now called the policy, is improved with reinforcement learning, most commonly Proximal Policy Optimisation (PPO). The policy generates responses, the reward model scores them, and the policy is updated to produce higher-scoring outputs. A critical safeguard is a KL-divergence penalty that keeps the policy from straying too far from the supervised fine-tuned model; without it, the policy can “hack” the reward model by producing strange text that scores well but reads poorly. When the loop works as intended, the result is a model that human raters judge to be more helpful, more honest, and less likely to produce harmful content.
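A minimal sketch of how the KL penalty can enter the training signal, under stated assumptions: the reward model's score for a sampled response is reduced in proportion to how much more likely the policy finds that response than the fine-tuned reference model does. The function name, the `beta` value of 0.1, and the use of summed per-token log-probability differences as a sample-based KL estimate are illustrative choices, and the PPO update itself is not shown.

```python
def kl_penalised_reward(reward_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward model score minus a KL penalty for one sampled response.

    policy_log_probs / reference_log_probs: per-token log-probabilities of the
    sampled response under the current policy and the frozen reference model.
    beta: hypothetical penalty strength; real systems tune this value.
    """
    # Sample-based estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward_score - beta * kl_estimate


# The policy has drifted: it finds this response more likely than the reference does.
drifted = kl_penalised_reward(2.0, [-1.0, -0.5], [-1.5, -1.0])

# No drift: the policy and reference agree, so the score passes through unchanged.
unchanged = kl_penalised_reward(2.0, [-1.5, -1.0], [-1.5, -1.0])
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower gains in reward; a smaller one allows more movement and more room for reward hacking.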
Limitations and alternatives
RLHF is powerful but imperfect. It is operationally complex: a PPO setup typically keeps the policy, a frozen reference copy, the reward model and a value model in play at once, all inside a delicate RL loop. It can also inherit the biases of the human raters or over-optimise for what sounds good rather than what is true. These costs have driven simpler alternatives. Direct Preference Optimisation (DPO) reuses the same kind of human preference data but skips the separate reward model and the RL step, training the policy directly with a classification-style loss, which tends to make training simpler and more stable. Constitutional AI replaces much of the human feedback with a written set of principles and AI-generated critiques. All of these share RLHF’s central goal: taking a knowledgeable but unaligned pre-trained model and shaping it into something genuinely useful and safe.
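To make the contrast with DPO concrete, here is a hedged sketch of the DPO loss for a single preference pair. It needs only the log-probabilities of the preferred and rejected responses under the policy and under a frozen reference model. The function name, argument names, and `beta` value are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the total log-probability of a whole response under
    either the policy or the frozen reference model. beta is a hypothetical
    strength setting, not a recommended value.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference has been learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability towards the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls as the policy raises the preferred response's probability relative to the reference and lowers the rejected one's, so no explicit reward model or sampling loop is needed.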