Question 1

What is the alignment problem?

Accepted Answer

The alignment problem is the challenge of getting an AI system to reliably pursue the goals its designers actually intend, rather than what they literally specified or what is easiest to optimise. As systems become more capable, small mismatches between intended and actual objectives can produce harmful or unintended behaviour.

Question 2

What is the difference between inner and outer alignment?

Accepted Answer

Outer alignment is choosing a training objective that genuinely reflects what we want. Inner alignment is making sure the model's learned internal goals actually match that training objective, rather than a correlated proxy that breaks down in new situations. Both must hold for a system to be aligned.
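As a rough illustration of the inner alignment failure described above, here is a minimal toy sketch. Everything in it is hypothetical: the one-dimensional corridor, the two hand-written policies, and the episode lists are invented for illustration and do not come from any real training setup. The training objective (reach the coin) is assumed to be the right one, so outer alignment holds. The "go right" policy stands in for a model that learned a correlated proxy goal instead.

```python
# Toy sketch (all names and numbers are hypothetical).
# An episode is (corridor_length, coin_position); reward is 1 if the
# agent finishes on the coin, else 0.

def go_right_policy(corridor_length: int, coin_position: int) -> int:
    """Proxy goal: always walk to the right end and ignore the coin."""
    return corridor_length - 1

def seek_coin_policy(corridor_length: int, coin_position: int) -> int:
    """Intended goal: walk to wherever the coin is."""
    return coin_position

def reward(final_position: int, coin_position: int) -> int:
    return 1 if final_position == coin_position else 0

def average_reward(policy, episodes) -> float:
    total = sum(reward(policy(n, c), c) for n, c in episodes)
    return total / len(episodes)

# In training the coin always sits at the right end, so the two goals
# are indistinguishable from reward alone.
train_episodes = [(5, 4), (8, 7), (10, 9)]
# In new situations the coin is somewhere else and the goals come apart.
test_episodes = [(5, 2), (8, 0), (10, 5)]

for name, policy in [("go_right", go_right_policy), ("seek_coin", seek_coin_policy)]:
    print(name,
          "train:", average_reward(policy, train_episodes),
          "test:", average_reward(policy, test_episodes))
```

Both policies earn full reward on the training episodes, so the training signal cannot tell them apart. Only the shifted test episodes reveal which goal was actually learned.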

Question 3

How does Goodhart's law apply to AI?

Accepted Answer

Goodhart's law states that when a measure becomes a target, it stops being a good measure. In AI, optimising hard against a proxy reward (like a reward model or a benchmark score) leads systems to exploit loopholes that score well but miss the real intent. In chat assistants, for example, this can show up as sycophancy or excessive verbosity.
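The effect can be sketched with a small toy example. The functions and numbers below are assumptions made up for illustration, not a model of any real reward model: true quality is assumed to peak at answers of about 100 words, while the proxy simply rewards longer answers. Over a narrow range the proxy tracks quality well, but optimising it over a wider range drives quality down.

```python
# Toy sketch of Goodhart's law (all functions and numbers are hypothetical).

def true_quality(length: int) -> float:
    """What we actually want: answers are most useful at about 100 words."""
    return -abs(length - 100)

def proxy_reward(length: int) -> float:
    """Flawed measure: longer answers always score higher."""
    return float(length)

def optimise(reward, candidates):
    """Pick the candidate that maximises the given reward."""
    return max(candidates, key=reward)

# Mild optimisation: only lengths up to 100 words are considered, where
# the proxy and the true quality still rise together.
mild = optimise(proxy_reward, range(10, 101, 10))
# Hard optimisation: the search space is much larger, so the proxy can be
# pushed far beyond the region where it was a good measure.
hard = optimise(proxy_reward, range(10, 1001, 10))

print("mild:", mild, "true quality:", true_quality(mild))
print("hard:", hard, "true quality:", true_quality(hard))
```

In this sketch the proxy score keeps rising under harder optimisation while the true quality falls, which is the verbosity failure mentioned above in miniature.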

Question 4

Is alignment a solved problem?

Accepted Answer

No. Techniques like RLHF and constitutional AI improve alignment in practice for today's models, but they are partial and imperfect. Open problems include scalable oversight of systems smarter than their supervisors, robust interpretability, and avoiding deceptive or power-seeking behaviour in future, more capable systems.

What Is AI Alignment? Making AI Do What Humans Actually Want

What AI alignment means

Why specifying goals is so hard

Goodhart’s law and reward hacking

Inner vs outer alignment

Partial solutions and open problems