Question 1

What does RLHF actually do to a model?

Accepted Answer

RLHF (reinforcement learning from human feedback) takes a model that can already produce fluent text and teaches it to produce text that humans prefer: responses that are more helpful, honest, and harmless. Its main effect is on behaviour and tone rather than on what the model knows, which is why a base model and its RLHF-tuned version can know the same facts but behave very differently.

Question 2

Why use a reward model instead of having humans score every output?

Accepted Answer

Humans cannot feasibly label the millions of outputs needed during reinforcement learning. Instead, humans rank a smaller set of responses, and a reward model is trained to predict those rankings. The reward model then provides a cheap, automatic preference score for every output the policy (the model being trained) generates during training.
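
To make the reward-model idea concrete, here is a minimal sketch of a pairwise preference loss of the Bradley-Terry kind, which is one common way to turn rankings into a training signal. The answer above does not say which loss is used, so treat this as an assumption; the function name and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    order wrong.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores a reward model might assign to two ranked responses.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees with the human ranking:    {agrees:.3f}")
print(f"disagrees with the human ranking: {disagrees:.3f}")
```

Minimising this loss over many ranked pairs pushes the reward model to score preferred responses higher, which is what lets it stand in for a human rater later on.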

Question 3

What is PPO and why is it used in RLHF?

Accepted Answer

PPO (Proximal Policy Optimization) is a reinforcement learning algorithm commonly used in RLHF. It updates the model's weights so that its outputs score higher on the reward model, while a penalty (typically a KL-divergence term) keeps the model from drifting too far from the original. Without that penalty, the model can find degenerate outputs that game the reward model while no longer being coherent.
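
The two pieces described above can be sketched for a single sampled response. The first function subtracts a drift penalty from the reward-model score, and the second is PPO's clipped surrogate objective. This is a simplified illustration, not a training loop: the per-sample log-probability difference used as the penalty, the coefficient `beta`, the clip range `clip_eps`, and all function names are assumptions chosen for the example.

```python
import math


def kl_penalized_reward(reward_score: float, logp_policy: float,
                        logp_reference: float, beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the original model.

    logp_policy - logp_reference is a single-sample estimate of the KL
    divergence between the tuned policy and the original (reference) model.
    """
    return reward_score - beta * (logp_policy - logp_reference)


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y|x) / pi_old(y|x)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio outside
    # the clip range in the direction that would increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical numbers: the policy now assigns this response a higher
# log-probability than the original model did, so part of the reward is lost.
shaped = kl_penalized_reward(reward_score=1.0, logp_policy=-2.0, logp_reference=-3.0)
objective = ppo_clipped_objective(logp_new=math.log(2.0), logp_old=0.0, advantage=1.0)
print(f"shaped reward: {shaped:.3f}")
print(f"clipped objective: {objective:.3f}")
```

In this sketch the penalty grows as the policy's probabilities move away from the original model's, so a response that games the reward model only pays off if its score outweighs the drift it required.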

Question 4

Is RLHF still the standard, or has it been replaced?

Accepted Answer

RLHF is still widely used, but simpler alternatives such as DPO (Direct Preference Optimization) train directly on preference data without a separate reward model or RL loop. Many labs now use DPO or hybrid approaches because they tend to be easier to implement and more stable to train, but the underlying idea is the same: learn from human preferences.
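
For comparison, here is a minimal sketch of the DPO loss for a single preference pair. It needs only log-probabilities from the model being trained and from a frozen reference copy, with no reward model in between. The function name, argument names, and the value of `beta` are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities. At the start of training the policy
# equals the reference model, so the loss is log(2).
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Later, the policy has raised the chosen response and lowered the rejected one.
after_training = dpo_loss(-8.0, -15.0, -10.0, -12.0)
print(f"loss at start: {at_start:.4f}")
print(f"loss after training: {after_training:.4f}")
```

The loss falls as the policy raises the preferred response relative to the reference model more than it raises the rejected one, which plays the role that the reward model and the drift penalty play together in the RL setup.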

RLHF Explained: How AI Models Learn from Human Feedback

The starting point: a pretrained model

Step one: collect human preferences

Step two: train a reward model

Step three: optimise with reinforcement learning

Limits and what comes next