Question 1

What is Direct Preference Optimization (DPO)?

Accepted Answer

DPO is a method for aligning a language model to human preferences using pairs of responses to the same prompt, where one response is preferred over the other. Instead of training a separate reward model and then doing reinforcement learning, DPO derives a single classification-style loss that adjusts the model directly. The reward is implicit: it is defined by how much more (or less) likely the model makes a response than a frozen reference model does.
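To make the data format concrete, here is a minimal sketch of one preference pair. The field names (prompt, chosen, rejected) are a common convention rather than something DPO requires, and the example text and the is_usable helper are hypothetical, included only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt plus a preferred and a dispreferred response."""
    prompt: str
    chosen: str    # the response annotators preferred
    rejected: str  # the response annotators did not prefer


def is_usable(pair: PreferencePair) -> bool:
    """Hypothetical sanity check: a pair carries no signal if both responses are identical."""
    return bool(pair.prompt.strip()) and pair.chosen != pair.rejected


# Illustrative data only; real datasets contain many such pairs.
pairs = [
    PreferencePair(
        prompt="Explain what a hash map is in one sentence.",
        chosen="A hash map stores key-value pairs and uses a hash of the key to find values quickly.",
        rejected="It is a map.",
    ),
]

usable_pairs = [p for p in pairs if is_usable(p)]
```

Both responses answer the same prompt, which is what lets the loss compare them directly.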

Question 2

How is DPO different from RLHF with PPO?

Accepted Answer

Classic RLHF has three stages: supervised fine-tuning (instruction tuning), training a reward model on preference data, then optimising the policy against that reward model with reinforcement learning (usually PPO). DPO collapses the last two stages into one supervised-style step. There is no separate reward model and no sampling from the policy during training, which removes several major sources of RLHF instability and tuning difficulty.

Question 3

What does the DPO loss actually do?

Accepted Answer

The DPO loss increases the model's relative log-probability of the chosen response over the rejected response, while a reference model (usually the frozen instruction-tuned checkpoint) anchors the update so the model does not drift too far. A beta hyperparameter sets that trade-off: larger values keep the model closer to the reference, while smaller values let the preference data pull it further away.
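As a rough illustration, the per-pair loss can be sketched in plain Python. This assumes you already have the summed log-probability of each full response under the policy and under the reference model. The function names, argument names, and the default beta of 0.1 are illustrative choices, not a specific library's API. A real training loop would compute this on batched tensors in an autodiff framework.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """Implicit reward: beta times the policy-vs-reference log-probability ratio."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, chosen only for illustration
) -> float:
    """DPO loss for one preference pair: -log(sigmoid(reward margin))."""
    margin = (
        implicit_reward(policy_chosen_logp, ref_chosen_logp, beta)
        - implicit_reward(policy_rejected_logp, ref_rejected_logp, beta)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Raising the chosen response's log-probability relative to the rejected one lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss depends only on the difference between the two implicit rewards, so the model is rewarded for widening the gap between chosen and rejected responses relative to the reference, not for raising absolute probabilities.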

Question 4

When should I use DPO instead of full RLHF?

Accepted Answer

DPO is a strong default when you have paired preference data and want a simpler, cheaper, more stable training run. Full RLHF with PPO can still help when you need online exploration, reward shaping beyond pairwise preferences, or very large-scale alignment. For many teams, DPO can reach comparable quality with far less engineering.

What Is DPO? Direct Preference Optimization Explained

The core idea

How DPO compares to RLHF

What the loss does

Data and practical workflow

Strengths, limits, and variants