Question 1

What is AI interpretability?

Accepted Answer

AI interpretability is the effort to understand what is happening inside a model: why it produces a particular output and what its internal components are doing. Because modern neural networks are very large and opaque, interpretability methods aim to make their internal computation visible and auditable, so that trust in a model can rest on evidence rather than on its outputs alone.

Question 2

What is mechanistic interpretability?

Accepted Answer

Mechanistic interpretability tries to reverse-engineer the actual internal algorithms a network has learned, identifying circuits, features, and the roles of specific neurons and attention heads. The goal is a causal, gears-level understanding of how a behaviour is computed, rather than just correlating inputs with outputs.
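
To make the idea of a causal test concrete, here is a minimal sketch of zero-ablation on a toy network. Everything in it is an assumption for illustration: the 2-2-1 ReLU network, its hand-set weights, and the unit names `h1` and `h2` are hypothetical, and real work targets learned weights in far larger models. The sketch knocks out one hidden unit at a time and records how the output changes on each input.

```python
def relu(v):
    return max(0.0, v)

def forward(x1, x2, ablate=None):
    """Toy 2-2-1 ReLU network with hand-set (hypothetical) weights computing XOR.

    `ablate` optionally names a hidden unit whose activation is forced to zero.
    """
    hidden = {
        "h1": relu(x1 + x2),        # fires when at least one input is on
        "h2": relu(x1 + x2 - 1.0),  # fires only when both inputs are on
    }
    if ablate is not None:
        hidden[ablate] = 0.0        # zero-ablation: remove this unit's contribution
    return hidden["h1"] - 2.0 * hidden["h2"]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Behaviour of the intact network.
baseline = {x: forward(*x) for x in inputs}

# Causal effect of each unit: output with the unit ablated minus intact output.
effect_h1 = {x: forward(*x, ablate="h1") - baseline[x] for x in inputs}
effect_h2 = {x: forward(*x, ablate="h2") - baseline[x] for x in inputs}

for x in inputs:
    print(x, "intact:", baseline[x],
          "| effect of ablating h1:", effect_h1[x],
          "| effect of ablating h2:", effect_h2[x])
```

Ablating `h2` changes the output only on the input (1, 1), which supports reading it as a "both inputs on" detector that cancels `h1`. That is a claim about the mechanism, and the intervention tests it directly instead of inferring it from input-output correlations.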

Question 3

What is the difference between mechanistic and post-hoc interpretability?

Accepted Answer

Mechanistic interpretability studies the model's internal mechanisms directly. Post-hoc methods such as SHAP and LIME instead treat the trained model as a black box and explain its predictions from the outside, attributing an output to input features by measuring how changing the inputs changes the output. Post-hoc methods are easier to apply, but they yield feature attributions rather than an account of the internal computation.
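
The "change the inputs, watch the output" idea can be sketched with a simple occlusion test. This is not SHAP or LIME themselves, only a simpler perturbation method in the same spirit. The scoring function `model`, its coefficients, and the all-zeros baseline are hypothetical choices made for the example.

```python
def model(features):
    """Hypothetical black-box scorer; the explainer below never looks inside it."""
    a, b, c = features
    return 2 * a - 3 * b + a * c  # includes an interaction between a and c

def occlusion_attribution(predict, x, baseline):
    """Score each feature by how much the output drops when that feature
    alone is replaced with its baseline value."""
    full = predict(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline[i]  # occlude one feature, keep the rest
        scores.append(full - predict(perturbed))
    return scores

x = [2, 1, 3]
baseline = [0, 0, 0]  # assumed "feature absent" reference point

prediction = model(x)
scores = occlusion_attribution(model, x, baseline)

print("prediction:", prediction)
print("attributions:", scores)
```

Here the scores do not sum to the prediction, because the interaction term `a * c` is credited in full to both `a` and `c`. Shapley-value methods such as SHAP are designed to split such shared credit consistently, but in either case the result describes input sensitivity, not the computation the model performs internally.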

Question 4

Why does interpretability matter for AI safety?

Accepted Answer

If we can see what a model is actually doing internally, we have a better chance of detecting deception, hidden capabilities, or unsafe reasoning before they cause harm, and we can debug and correct behaviour with more confidence. Without interpretability, we are forced to judge powerful systems only by their outputs, which can hide dangerous internal failures.

What Is AI Interpretability? Opening the Black Box

What interpretability is

Mechanistic interpretability: reverse-engineering the model

Post-hoc methods: explaining one prediction at a time

Probing classifiers and representation analysis

Why interpretability is a safety priority