Question 1

What is RLHF?

Accepted Answer

RLHF (Reinforcement Learning from Human Feedback) is a training technique that aligns a language model with human preferences. People rank model outputs, those rankings train a reward model, and the LLM is then optimised to produce responses the reward model scores highly.
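As a rough illustration of the first step, a single human ranking can be expanded into the (chosen, rejected) pairs that a reward model is typically trained on. This is a minimal sketch under an assumption the answer does not state: that the ranking arrives as a best-first list. The function name `ranking_to_pairs` is hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand one best-first ranking into (chosen, rejected) pairs.

    Assumption for illustration: ranked_responses[0] is the response the
    labeller liked most and the last element is the one they liked least.
    """
    # combinations() keeps input order, so the first item of every pair
    # is always ranked above the second.
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["response A", "response B", "response C"])
for chosen, rejected in pairs:
    print(f"prefer {chosen!r} over {rejected!r}")
```

A ranking of n responses yields n * (n - 1) / 2 pairs, so a handful of ranked outputs per prompt can supply many training comparisons.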

Question 2

Why is RLHF needed if a model is already pretrained?

Accepted Answer

Pretraining teaches a model to predict text, not to be helpful, honest, or safe. RLHF shapes that raw capability toward what users actually want (following instructions, declining harmful requests, and giving useful answers), which is a large part of what made assistants like ChatGPT and Claude feel usable.

Question 3

What is the reward model in RLHF?

Accepted Answer

The reward model is a separate network trained on human preference data to predict how much a human would like a given response. During reinforcement learning it acts as an automated judge, scoring the LLM's outputs so the policy can be optimised without a human rating every sample.
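One common way to train such a judge is a pairwise (Bradley-Terry style) loss: the reward model is penalised whenever it scores the human-preferred response below the rejected one. The sketch below is an assumption for illustration, not something the answer specifies. It works on plain floats standing in for the scores a real network would produce, and `pairwise_preference_loss` is a hypothetical name.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the preferred response scores well above the
    rejected one and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Scores here are made-up numbers, not output from a trained model.
print(pairwise_preference_loss(2.0, 0.0))  # judge agrees with the human label
print(pairwise_preference_loss(0.0, 2.0))  # judge disagrees, so the loss is larger
```

Averaging this loss over many labelled pairs and minimising it pushes the reward model's scores toward the ordering humans gave.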

Question 4

What are DPO and RLAIF?

Accepted Answer

DPO (Direct Preference Optimization) skips the explicit reward model and reinforcement learning loop and optimises the model directly on preference pairs, which is simpler to implement and generally more stable to train than the full RLHF pipeline. RLAIF (RL from AI Feedback) replaces human labellers with an AI judge to generate preference data at scale, as used in Constitutional AI.
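To make "directly from preference pairs" concrete, here is a minimal sketch of the per-pair DPO loss as it is usually written. It assumes you already have the summed log-probabilities of each response under the model being trained and under a frozen reference copy. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not values from the answer.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss computed from sequence log-probabilities.

    beta controls how strongly the policy is pushed away from the
    reference model; 0.1 is an illustrative default only.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), branching on sign for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Made-up log-probabilities: the policy has shifted toward the chosen response.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

No reward model appears anywhere in the computation: the log-ratio against the reference model plays the role of an implicit reward, which is why the separate judge and the reinforcement learning loop can be dropped.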

RLHF — Reinforcement Learning From Human Feedback (AI Glossary)

Definition

The three-phase pipeline

The role of the reward model

Alternatives: DPO and RLAIF

Why it matters