What Is DPO? Direct Preference Optimization Explained

Fine-tuning on human preferences without a reward model — simpler than RLHF

The core idea

Direct Preference Optimization (DPO) is a way to teach a language model what people prefer without the moving parts of classic reinforcement learning. You collect pairs of responses to the same prompt, a chosen answer that humans liked and a rejected answer they liked less, and DPO adjusts the model so that chosen-style answers become more likely relative to rejected ones. The key insight from the 2023 DPO paper is that, under the KL-constrained objective that RLHF optimises, the reward can be written analytically in terms of the policy and a fixed reference model. That lets you skip training a separate reward model entirely and optimise the language model directly with a simple classification-style loss.
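
The derivation behind that insight can be sketched in a few lines. The notation here is an assumption introduced for illustration and does not appear in the text above: r is the reward, \pi_{\mathrm{ref}} the frozen reference model, \beta the strength of the KL constraint, Z(x) a normalising constant, \sigma the logistic function, and y_w, y_l the chosen and rejected responses.

```latex
% KL-constrained objective that RLHF optimises
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]

% Its optimum has a closed form
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( r(x, y) / \beta \big)

% Rearranged: the reward expressed through the policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x)

% In a Bradley-Terry preference model only reward differences matter,
% so the intractable Z(x) term cancels
p(y_w \succ y_l \mid x)
  = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big(
      \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big)
```

Maximising the likelihood of the observed preferences under the last expression, with the trainable policy in place of \pi^{*}, gives a loss that needs no explicit reward model.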

How DPO compares to RLHF

Traditional RLHF (reinforcement learning from human feedback) runs in three phases: supervised instruction tuning, then training a reward model on preference data, then optimising the model with an RL algorithm such as PPO that samples outputs, scores them with the reward model, and nudges the policy. That pipeline is powerful but fiddly: PPO needs careful reward shaping, a value network, KL penalties, and lots of tuning, and it can be unstable. DPO removes the reward-model and RL stages. It treats alignment as a single supervised objective over preference pairs, which typically makes runs cheaper and easier to reproduce because there is no sampling loop and no second model to train.

What the loss does

DPO compares two models on each preference pair: the model being trained (the policy) and a frozen reference model, usually the instruction-tuned checkpoint you started from. The loss raises the policy’s log-probability of the chosen response relative to the rejected response, with both measured against the reference. A temperature-like hyperparameter called beta controls the trade-off: a lower beta lets the policy move further from the reference to fit the preferences, which risks drift and degraded general capability; a higher beta keeps the model conservative and close to the reference. Because the reference model appears inside the loss, the KL constraint is implicit, and DPO does not need a separate KL-penalty term the way PPO does.
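
As a minimal sketch of the per-pair loss described above, assuming the summed log-probability of each response has already been computed under both models: the function and argument names are hypothetical, and a real trainer would operate on batched tensors rather than plain floats.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not a recommendation
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the total log-probability of a full response
    given the prompt, under either the policy or the frozen reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The margin is positive when the policy favours the chosen response
    # more strongly than the reference does.
    margin = chosen_logratio - rejected_logratio

    # Negative log-likelihood of the observed preference.
    return -log_sigmoid(beta * margin)


# When the policy equals the reference the margin is zero.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy shifts probability towards the chosen response.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

At the start of training the policy and reference are identical, so every pair has zero margin and a loss of log 2. The loss falls only as the policy widens the gap between chosen and rejected beyond what the reference already had.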

Data and practical workflow

The input to DPO is a dataset of (prompt, chosen, rejected) triples. These can come from human labellers ranking two model outputs, or from automated sources such as a stronger model judging a weaker one. A typical workflow is: start from a base model, run instruction tuning to produce an assistant-style checkpoint, freeze a copy of that checkpoint as the reference, then run DPO on your preference pairs. Training looks like ordinary fine-tuning — forward and backward passes over batches — with no sampling loop, so it slots cleanly into standard trainers and works well with parameter-efficient methods.
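
One way to represent those triples in code is sketched below. The `PreferencePair` class and the `to_pair` helper are hypothetical names for illustration, and the labeller's verdict is assumed to arrive as a simple boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example: a prompt plus two ranked responses."""
    prompt: str
    chosen: str
    rejected: str


def to_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    a_is_preferred: bool,
) -> PreferencePair:
    """Turn a labeller's A/B judgement into a (prompt, chosen, rejected) triple."""
    if response_a == response_b:
        # Identical responses carry no preference signal.
        raise ValueError("responses must differ to form a preference pair")
    if a_is_preferred:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# Example: the labeller preferred the second response.
pair = to_pair("What is 2 + 2?", "5", "4", a_is_preferred=False)
```

Keeping the judgement step separate from the stored triple means the same structure works whether the verdict comes from a human labeller or from an automated judge.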

Strengths, limits, and variants

DPO’s strengths are simplicity and stability: fewer hyperparameters, no reward model to maintain, and results that are easier to reproduce. Its main limitation is that it learns only from the fixed set of pairwise preferences you give it. It cannot explore new behaviours the way online RL can, and the quality of alignment is capped by the quality of your preference data. Several variants have appeared to address specific weaknesses, including IPO (which changes the objective to reduce overfitting to the preference data), KTO (which works from unpaired good/bad labels), and ORPO (which folds preference optimisation into instruction tuning in a single step, with no reference model). For many teams aligning an open model today, DPO or one of these relatives is a pragmatic choice over full PPO-based RLHF.
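
To make the difference between DPO and IPO concrete, here is a rough sketch comparing the two per-pair losses as a function of the same log-ratio margin (the policy's chosen-versus-rejected log-ratio minus the reference's). It assumes the squared-error form of the IPO objective with a regularisation parameter called tau; the function names are hypothetical.

```python
import math


def dpo_pair_loss(margin: float, beta: float) -> float:
    """Logistic DPO loss: -log(sigmoid(beta * margin)), computed stably."""
    x = beta * margin
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def ipo_pair_loss(margin: float, tau: float) -> float:
    """Squared-error IPO loss: regress the margin towards 1 / (2 * tau)."""
    target = 1.0 / (2.0 * tau)
    return (margin - target) ** 2


# Evaluate both losses at a moderate margin and at a very large one.
dpo_small, dpo_large = dpo_pair_loss(5.0, 0.1), dpo_pair_loss(50.0, 0.1)
ipo_small, ipo_large = ipo_pair_loss(5.0, 0.1), ipo_pair_loss(50.0, 0.1)
```

The logistic loss keeps decreasing as the margin grows, so the policy is always rewarded for pushing chosen and rejected further apart. The squared loss has a finite target margin and penalises overshooting it, which is the mechanism this sketch attributes to IPO's resistance to overfitting.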
