RLHF — Reinforcement Learning From Human Feedback (AI Glossary)

The training technique that turns a capable LLM into a helpful AI assistant


Definition

RLHF (Reinforcement Learning from Human Feedback) is the training technique that turns a raw, capable language model into a helpful, well-behaved AI assistant. A pretrained LLM is excellent at predicting plausible text, but prediction alone does not make it follow instructions, stay safe, or give answers people find useful. RLHF aligns the model with human preferences by collecting human judgements about which responses are better and then optimising the model to produce more of what humans prefer. It is widely credited as the breakthrough that made ChatGPT, Claude, and their peers feel genuinely useful.

The three-phase pipeline

Classic RLHF proceeds in three stages:

  1. Supervised fine-tuning (SFT): the pretrained model is fine-tuned on high-quality example conversations written or curated by humans, teaching it the basic shape of helpful, instruction-following responses.
  2. Reward model training: humans are shown pairs (or rankings) of model outputs for the same prompt and asked which is better. A separate reward model is trained on these comparisons to predict a scalar “human preference” score for any prompt-response pair.
  3. Reinforcement learning: the LLM (the “policy”) generates responses, the reward model scores them, and an RL algorithm, most commonly PPO (Proximal Policy Optimization), nudges the model toward higher-scoring outputs, while a penalty (typically on KL divergence) keeps it from drifting too far from the SFT model.
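
The trade-off in stage 3 between chasing reward and staying close to the SFT model can be sketched in a few lines of Python. This is a minimal illustration, not any lab's actual implementation: the function name `kl_penalised_reward`, the coefficient `beta`, and the use of summed per-token log-probability differences as a KL estimate are all assumptions made for the example.

```python
def kl_penalised_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Return the reward-model score minus a penalty for drifting from the reference.

    reward_score    -- scalar score from the reward model for one response
    policy_logprobs -- per-token log-probabilities under the current policy
    ref_logprobs    -- per-token log-probabilities under the frozen SFT model
    beta            -- hypothetical penalty strength (an assumption, not a standard value)
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")

    # Single-sample estimate of KL(policy || reference) for this response:
    # the sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

    # The RL algorithm maximises this penalised quantity instead of the raw score.
    return reward_score - beta * kl_estimate


if __name__ == "__main__":
    # The policy assigns higher probability to its own tokens than the SFT model
    # does, so part of the reward is taken back as a penalty.
    print(kl_penalised_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; a smaller one allows more drift.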

The role of the reward model

The reward model is the heart of RLHF. Because it is impractical for humans to rate every one of the millions of samples generated during reinforcement learning, the reward model acts as a fast, automated stand-in for human judgement. Its quality sets a ceiling on the whole process: if it misjudges what humans want, the policy will optimise for the wrong thing — a failure mode known as reward hacking, where the model games the score without actually being better.
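
One common way to train a reward model on pairwise comparisons is a Bradley-Terry-style loss, which pushes the score of the preferred response above the score of the rejected one. The sketch below assumes that formulation (the text above does not specify a loss), and the function name `pairwise_preference_loss` is hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)) for one comparison.

    The loss is small when the reward model ranks the human-preferred response
    well above the rejected one, and large when it ranks them the wrong way round.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(pairwise_preference_loss(2.0, 0.0))  # correct ranking, lower loss
    print(pairwise_preference_loss(0.0, 2.0))  # wrong ranking, higher loss
```

In practice the two scores would come from a neural network evaluated on the same prompt with each response, and the loss would be averaged over a batch of human comparisons.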

Alternatives: DPO and RLAIF

RLHF’s reinforcement-learning loop is complex and can be unstable to train, so simpler alternatives have emerged. DPO (Direct Preference Optimization) aims for similar alignment by optimising directly on preference pairs with a single loss function, skipping the explicit reward model and RL loop entirely. RLAIF (RL from AI Feedback) replaces human labellers with an AI model that judges responses against a set of principles, which makes feedback far cheaper to collect at scale. This is the approach behind Anthropic’s Constitutional AI.
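
To make the "single loss function" concrete, here is a minimal sketch of the DPO loss for one preference pair, following the commonly cited formulation. The function names, argument names, and the default `beta` are assumptions for illustration; each log-probability is the total log-probability a model assigns to a whole response.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Return the DPO loss for a single (chosen, rejected) response pair.

    The loss falls as the policy raises the probability of the chosen response
    relative to a frozen reference model more than it raises the rejected one.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return _softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is zero.
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy has shifted probability toward the chosen response: loss drops.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Note that no reward model is evaluated here, though a frozen reference model is still needed to supply the `ref_logp_*` values.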

Why it matters

RLHF is why modern AI assistants are helpful rather than merely fluent. It is the primary lever labs use to make models follow instructions, refuse harmful requests, and match a desired tone. Understanding it clarifies both the strengths of today’s assistants and their limits — including sycophancy, over-refusal, and reward hacking, all of which trace back to how preference data is collected and optimised.
