What “emergent” means in AI
In large language models, an emergent capability is a skill that smaller models simply cannot do, but that appears — often quite suddenly — once a model crosses a certain size. The defining feature is that performance does not improve smoothly as the model grows; instead it stays near random for small models and then jumps to genuinely useful levels past a threshold. Crucially, nobody explicitly trained the model for these tasks. The abilities seem to fall out of scaling up parameters, training data, and compute, which is why they are described as emergent rather than designed.
Examples that put emergence on the map
The term gained traction from large-scale evaluations like BIG-Bench and from work at labs scaling models from millions to hundreds of billions of parameters. Frequently cited examples include multi-digit arithmetic, answering university-level multiple-choice exam questions, in-context learning (picking up a new task from a few examples in the prompt), and the ability to benefit from chain-of-thought prompting. In each case, plotting accuracy against model scale shows a flat, near-random line for small models that suddenly bends upward once the model is large enough.
The phase-transition phenomenon
The sharpness of these jumps is what makes emergence striking — researchers liken it to a phase transition, like water turning to ice at a specific temperature. The leading intuition is compositional: a task may require several sub-skills to all function at once, and small models manage only some of them, so they score near zero. Only when a model is large enough to hold all the necessary sub-skills together does the full task suddenly become solvable. This is also why emergence is hard to predict in advance: you often cannot see a capability coming by extrapolating from smaller models.
The “mirage” debate
Emergence is genuinely contested. A widely discussed 2023 paper argued that many apparent emergent abilities are a mirage — an artefact of the metric rather than the model. Harsh, all-or-nothing measures like exact-match accuracy hide steady underlying improvement: a model getting 4 of 5 digits right scores zero on exact match even though it is clearly learning. Swap in a smoother, partial-credit metric and the same capability improves gradually and predictably with scale, with no sudden jump. The current consensus is nuanced: some emergence reflects real qualitative change, while a meaningful share is an illusion created by how we score the tests.
Why it matters
Emergence is central to debates about AI safety and forecasting. If important capabilities — including potentially dangerous ones — can appear abruptly and unpredictably as models scale, then evaluating a small model tells you little about what its larger successor will do. That unpredictability is a key reason labs invest in rigorous evaluation and red-teaming before and after scaling up, and why understanding emergence remains an active research priority.