How Do Large Language Models Work? Inside the Black Box

Tokens, attention, and next-word prediction: the core mechanics of LLMs

The whole pipeline in one sentence

A large language model turns your text into numbers, passes those numbers through many layers that model how the parts of the input relate to one another, and then assigns a probability to each possible next token and picks one, over and over, to build a response. That is the entire loop. The demo below walks your own sentence through the major stages so the abstract pipeline becomes concrete. Everything that feels like “understanding” emerges from this mechanical sequence run at enormous scale.
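The loop described above can be sketched in a few lines. This is only an illustration under loud assumptions: the lookup table `TOY_NEXT_TOKEN_PROBS` and the `generate` function are hypothetical stand-ins, and the toy conditions on the previous token alone, whereas a real model computes its probabilities from the whole context with a neural network.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
# A real LLM conditions on the entire context, not just the last token.
TOY_NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(max_tokens=10, seed=0):
    """Repeat: get next-token probabilities, sample one token, append it."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = TOY_NEXT_TOKEN_PROBS[tokens[-1]]
        candidates = list(probs)
        weights = list(probs.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the chosen token becomes part of the input
    return tokens[1:]  # drop the "<start>" marker


print(" ".join(generate()))
```

Changing `seed` can change which sentence comes out, which mirrors why the same prompt can yield different responses when sampling is enabled.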

How it works, stage by stage

Tokenization comes first: your text is split into tokens (words, word-pieces, punctuation) drawn from a fixed vocabulary, and each token is replaced by its integer ID, because the model works on numbers rather than raw text. Each ID is then mapped to an embedding: a long list of numbers (a vector) that places the token in a meaning-space where, after training, related words sit near each other. Next, stacked attention layers let each token look at the other tokens in the context and weigh which ones matter for interpreting it (in the decoder-only models typically used for text generation, a token attends only to itself and the tokens before it); this is how the model connects pronouns to nouns, verbs to subjects, and clauses to context. Finally, the model scores every token in the vocabulary, converts those scores into a probability distribution over the next token with a softmax, and samples one. The chosen token is appended to the input and the whole process repeats to generate the next token, and the next.
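The first three stages can be sketched with toy numbers. Everything here is an assumption made for illustration: the three-word `VOCAB`, the two-dimensional `EMBEDDINGS`, and the whitespace `tokenize` function are hypothetical, and the attention step uses the embeddings directly as queries and keys, whereas a real transformer first applies learned projections and uses many heads and layers. The sketch also assumes a decoder-style causal mask, in which each token attends only to itself and earlier tokens.

```python
import math

# Hypothetical toy vocabulary and 2-dimensional embeddings. Real models learn
# these values; vocabularies are far larger and vectors far longer.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
EMBEDDINGS = [
    [1.0, 0.0],  # "the"
    [0.0, 1.0],  # "cat"
    [1.0, 1.0],  # "sat"
]


def tokenize(text):
    """Toy tokenizer: split on whitespace and look up each word's integer ID."""
    return [VOCAB[word] for word in text.lower().split()]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(vectors):
    """For each position, weights over itself and all earlier positions."""
    dim = len(vectors[0])
    rows = []
    for i, query in enumerate(vectors):
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
            for j in range(i + 1)  # causal mask: never look ahead
        ]
        rows.append(softmax(scores))
    return rows


ids = tokenize("The cat sat")
vectors = [EMBEDDINGS[i] for i in ids]
weights = causal_attention_weights(vectors)

for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

Each printed row sums to 1 and is longer than the one before it, because later tokens have more context to attend to. The weights within a row are generally unequal, which is the non-uniform "focus" that attention provides.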

Tips for reading the demo

Use the interactive panel to build intuition rather than to simulate a real model exactly: it uses a simplified, illustrative tokenizer and toy numbers. Watch how short common words and rare long words tokenize differently, since this directly affects how many tokens a prompt consumes and therefore how much it costs. Pay attention to the attention weights: notice that the model’s “focus” is not uniform, which is the heart of how transformers work. And experiment with the temperature control on the next-token distribution. Turning it down concentrates probability on the top choice (predictable output), while turning it up spreads it out (more varied output), exactly the trade-off you tune when calling a real model’s API.
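The temperature effect can be reproduced by hand. The sketch below assumes the common formulation in which raw scores (logits) are divided by the temperature before the softmax; the function name `softmax_with_temperature` and the three example logits are made up for illustration and are not taken from any particular model or API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)

for name, probs in [("0.5", cold), ("1.0", neutral), ("2.0", hot)]:
    print("temperature", name, [round(p, 3) for p in probs])
```

At the lower temperature the top candidate takes a larger share of the probability, and at the higher temperature the three candidates move closer together. The ranking of the candidates never changes; only how sharply the distribution favors the leader.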
