What interpretability is
AI interpretability is the field devoted to understanding what is actually happening inside a neural network: not just what it predicts, but why and how. Modern systems such as large language models consist of billions of learned numerical parameters, and even the engineers who build them often cannot fully explain why a model produced a specific output. Interpretability, which overlaps heavily with explainable AI (XAI), tries to crack open that black box, turning an opaque statistical process into something humans can inspect, audit, and trust. As AI is deployed in high-stakes settings, being able to explain a model’s behaviour becomes both a practical and a safety necessity.
Mechanistic interpretability: reverse-engineering the model
The most ambitious approach is mechanistic interpretability, which aims to reverse-engineer the internal algorithms a network has learned. Researchers study circuits: small groups of neurons and attention heads that work together to implement a specific behaviour, such as detecting that a pronoun refers to a particular noun, or copying a token seen earlier in the context. A well-known finding is “induction heads”, attention heads that learn to continue repeated patterns: having seen the sequence A B earlier in the context, they predict B when A appears again. The dream is a gears-level, causal account of how a model computes a behaviour, which would let us predict and edit its reasoning rather than just observe outputs.
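The induction behaviour described above can be written down as a toy rule. The sketch below is illustrative only and is not how any particular model or library implements it. The function names `induction_predict` and `induction_attention_score` are hypothetical, and the attention matrices are hand-built stand-ins rather than values taken from a real network. The score is a simple diagnostic in the spirit of the prefix-matching checks used when hunting for induction heads: it measures how much attention each repeated token places on the token that followed its previous occurrence.

```python
def induction_predict(tokens):
    """Toy version of the induction rule: find the most recent earlier
    occurrence of the current (last) token and return the token that
    followed it. Returns None if the current token has not appeared before."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


def induction_attention_score(tokens, attention):
    """Hypothetical diagnostic: for every position whose token appeared
    earlier, take the attention weight it puts on the position right after
    that earlier occurrence, and average those weights.

    attention[q][k] is the weight query position q puts on key position k.
    """
    weights = []
    for q in range(len(tokens)):
        for k in range(q - 1, -1, -1):
            if tokens[k] == tokens[q]:
                weights.append(attention[q][k + 1])
                break
    return sum(weights) / len(weights) if weights else 0.0


tokens = ["A", "B", "C", "A", "B", "C"]
n = len(tokens)

# Baseline: causal attention spread evenly over all visible positions.
uniform = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]

# Idealised induction head: each repeated token attends only to the token
# that followed its earlier occurrence (hand-built for illustration).
ideal = [row[:] for row in uniform]
for q, target in [(3, 1), (4, 2), (5, 3)]:
    ideal[q] = [1.0 if k == target else 0.0 for k in range(n)]

next_token = induction_predict(["A", "B", "C", "A"])  # "B"
ideal_score = induction_attention_score(tokens, ideal)
uniform_score = induction_attention_score(tokens, uniform)
print(next_token, ideal_score, round(uniform_score, 3))
```

In practice a score like this would be computed over attention patterns extracted from a real model on sequences of repeated random tokens. A high score is only a hint that a head behaves like an induction head, and causal tests such as ablating the head would still be needed to confirm it.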
Post-hoc methods: explaining one prediction at a time
A more practical and widely used family is post-hoc interpretability, which explains an individual prediction from the outside without dissecting the model’s internals. SHAP (based on Shapley values from game theory) and LIME (which fits a simple, local surrogate model around a single prediction) both estimate how much each input feature contributed to the output. LIME and the kernel-based variant of SHAP are model-agnostic, needing only the ability to query the model, while SHAP also offers faster model-specific variants for tree ensembles and deep networks. All of these are easy to apply, but they produce attributions (which inputs mattered) rather than a true description of the internal computation, and the results can be unstable or misleading if used carelessly, because they depend on choices such as the baseline input and how perturbed samples are drawn.
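To make the idea of a Shapley attribution concrete, here is a minimal sketch that computes exact Shapley values for a tiny hand-written model. It is not the SHAP library's API: `shapley_values` and `toy_model` are hypothetical names, and "removing" a feature by swapping in a baseline value is one common assumption among several. Exact enumeration is exponential in the number of features, which is why practical tools approximate it.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for a single input x.

    A feature that is 'absent' from a coalition is replaced by its
    baseline value (an assumption; other removal schemes exist).
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


def toy_model(features):
    """Hypothetical model: one additive term and one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [2.0, 3.0, 3.0]
```

The attributions sum to the difference between the prediction and the baseline prediction (8.0 here), and the interaction term `b * c` is split evenly between `b` and `c`. That even split is a property of the attribution scheme, not a statement about how the model computes its output, which illustrates the gap between attribution and mechanism. Choosing a different baseline would change the numbers.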
Probing classifiers and representation analysis
Between the two sits probing: training a small, simple classifier on a model’s internal activations to test whether a particular piece of information is represented there. If a probe can read, say, part-of-speech or sentiment off a hidden layer, that suggests the information is recoverable from the model’s representation at that point. On its own this does not show that the model actually uses the information, so probe results are often compared against a control, such as a probe trained on randomly assigned labels. Probing helps map what information lives where inside a network, complementing mechanistic work that asks how that information is used.
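A probing experiment can be sketched in a few lines. The example below is a minimal illustration under loud assumptions: the "activations" are synthetic Gaussian vectors in which one dimension is shifted according to a binary label, standing in for hidden states that would normally be extracted from a real model. The probe is plain logistic regression trained by gradient descent, and all function names are hypothetical. A control probe trained and scored on randomly shuffled labels gives a chance-level reference point.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(activations, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(activations[0]), len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(activations, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for k in range(dim):
                grad_w[k] += err * h[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, activations, labels):
    """Fraction of examples where the probe's sign matches the label."""
    correct = 0
    for h, y in zip(activations, labels):
        z = sum(wi * hi for wi, hi in zip(w, h)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def make_fake_activations(n, dim, concept_dim, rng):
    """Synthetic stand-in for hidden states: unit Gaussian noise, shifted
    by +2 or -2 along one dimension depending on the label."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[concept_dim] += 2.0 if y == 1 else -2.0
        acts.append(h)
        labels.append(y)
    return acts, labels


rng = random.Random(0)
train_h, train_y = make_fake_activations(400, 8, 2, rng)
test_h, test_y = make_fake_activations(200, 8, 2, rng)

w, b = train_linear_probe(train_h, train_y)
probe_acc = probe_accuracy(w, b, test_h, test_y)

# Control: labels that carry no information about the activations.
control_train_y, control_test_y = train_y[:], test_y[:]
rng.shuffle(control_train_y)
rng.shuffle(control_test_y)
w_c, b_c = train_linear_probe(train_h, control_train_y)
control_acc = probe_accuracy(w_c, b_c, test_h, control_test_y)

print(f"probe accuracy: {probe_acc:.2f}, control accuracy: {control_acc:.2f}")
```

A large gap between the probe and the control suggests the label is linearly decodable from these vectors. With real activations the same comparison would typically be repeated layer by layer, and a strong probe would still say nothing about whether the model relies on that information.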
Why interpretability is a safety priority
Interpretability is widely regarded as one of the load-bearing pillars of AI safety. If we can see a model’s internal reasoning, we have a much better chance of catching deception, hidden capabilities, or unsafe behaviour before they reach users, and we can debug and correct models with more confidence and less guesswork. The alternative, judging ever more powerful systems only by their outputs, risks missing failures that are invisible from the outside. That is why leading labs invest heavily in interpretability research, even though a complete understanding of large models remains a hard, unsolved problem.