What is the single core thing an LLM does?

It predicts the next token. Given the text so far, the model outputs a probability for every possible next token and samples one. Repeating this token by token produces a full response. Everything else — embeddings, attention, the many layers — exists to make that one prediction accurate.
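
The predict-sample-append loop described above can be sketched in a few lines. This is only an illustration under loud assumptions: TOY_MODEL is a hypothetical hand-written lookup table that conditions on the previous token alone, whereas a real LLM computes its probabilities from the entire context with a neural network.

```python
import random

# Hypothetical toy "model": previous token -> probabilities for the next token.
# A real LLM conditions on the whole context, not just the last token.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}


def generate(rng, max_tokens=10):
    """Repeatedly sample a next token and append it until <end> is drawn."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = TOY_MODEL[tokens[-1]]
        # Sample one token in proportion to its probability.
        next_token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker


print(" ".join(generate(random.Random(0))))
```

Running it with different seeds gives different sentences, which is the same effect that makes real responses vary between runs.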

A token is a chunk of text the model treats as a unit, often a word, a word-piece, or a punctuation mark. Text is split into tokens before processing because models operate on a fixed vocabulary of tokens, not on raw characters. On average, one token is roughly four characters, or about three-quarters of a word, in English; the exact ratio depends on the tokenizer and the text.
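
As a rough illustration of the four-characters-per-token rule of thumb, the sketch below estimates a token count from character length. The function name estimate_tokens and its chars_per_token parameter are made up for this example; the result is only an approximation for English text, and an exact count requires running the model's actual tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate based on the ~4 characters per token rule of thumb."""
    if not text:
        return 0
    # Any non-empty text is at least one token.
    return max(1, round(len(text) / chars_per_token))


prompt = "Large language models predict the next token."
print(len(prompt), "characters, roughly", estimate_tokens(prompt), "tokens")
```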

What does attention actually do?

Attention lets each token look at the other tokens in the context and decide which are relevant to it, then blend their information accordingly. It is how the model resolves what 'it' refers to or connects a verb to its subject across a sentence. Stacking many attention layers builds up the model's understanding of the whole input.
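
A minimal sketch of that "look and blend" step, assuming single-head scaled dot-product attention: each token's vector is scored against the others, the scores become weights via a softmax, and the weights mix the value vectors. The 2-d vectors are invented for illustration, and the same vectors are reused as queries, keys, and values, whereas real models derive each from learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Return (weights, blended vector) for a single query token."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended


# Made-up vectors: "it" is deliberately placed close to "cat".
tokens = ["cat", "mat", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

weights, blended = attend(vectors[2], vectors, vectors)
for tok, w in zip(tokens, weights):
    print(f"'it' attends to {tok!r} with weight {w:.2f}")
```

In this toy setup the query for "it" puts more weight on "cat" than on "mat", mirroring how attention can resolve a pronoun to its referent.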

Why does the same prompt give different answers?

Because the model samples from a probability distribution rather than always taking the single most likely token. The temperature setting controls how much randomness is allowed: low temperature makes output nearly deterministic, high temperature makes it more varied and creative. This sampling step is why responses differ between runs.
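
The effect of temperature can be seen by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below uses three invented logits; the function name softmax_with_temperature is chosen for this example and is not any particular library's API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the low temperature almost all probability lands on the top-scoring token, while at the high temperature the three candidates are much closer together, so sampling picks the alternatives more often.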

How Do Large Language Models Work? Inside the Black Box

The whole pipeline in one sentence

A large language model turns your text into numbers, processes those numbers through many layers to model the relationships between the parts of the input, and then predicts a probability distribution over the next token and picks one, over and over, to build a response. That is the entire loop. The demo below walks your own sentence through the major stages so the abstract pipeline becomes concrete. Everything that feels like “understanding” emerges from this mechanical sequence run at enormous scale.

How it works, stage by stage

Tokenization comes first: your text is split into tokens (words, word-pieces, punctuation) drawn from a fixed vocabulary, because the model works on token IDs from that vocabulary rather than on raw characters. Each token is then mapped to an embedding, a long list of numbers (a vector) that places the token in a meaning-space so that related words sit near each other. Next, stacked attention layers let each token look at the other tokens in the context (in most generative models, the ones that precede it) and weigh which ones matter for interpreting it; this is how the model connects pronouns to nouns, verbs to subjects, and clauses to context. Finally, the model scores every token in the vocabulary, converts those scores into a probability distribution over the next token via a softmax, and samples one. The chosen token is appended to the input and the whole process repeats to generate the next token, and the next.
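
The first two stages can be made concrete with a small sketch. Everything here is an assumption for illustration: VOCAB is a hand-picked vocabulary, tokenize uses a simple greedy longest-match rule rather than a trained algorithm such as BPE, and EMBEDDINGS holds invented 3-d vectors where real models learn hundreds or thousands of dimensions.

```python
import math

# Hypothetical vocabulary; real tokenizers learn theirs from data.
VOCAB = ["un", "believ", "able", "cat", "s", " ", "the"]


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    pieces = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((p for p in pieces if text.startswith(p, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


# Made-up embeddings: "cat" and "dog" are placed near each other, "car" far away.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(tokenize("the unbelievable cats"))
print("cat vs dog:", round(cosine(EMBEDDINGS["cat"], EMBEDDINGS["dog"]), 3))
print("cat vs car:", round(cosine(EMBEDDINGS["cat"], EMBEDDINGS["car"]), 3))
```

The short common word "the" stays whole while "unbelievable" breaks into three pieces, and the similarity scores show what "related words sit near each other" means numerically.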

Tips for reading the demo

Use the interactive panel to build intuition rather than to simulate a real model exactly: it uses a simplified, illustrative tokenizer and toy numbers. Watch how short common words and rare long words tokenize differently, since this directly affects how many tokens (and how much cost) a prompt consumes. Look closely at the attention weights: the model’s “focus” is not uniform, which is the heart of how transformers work. And experiment with the temperature control on the next-token distribution: turning it down concentrates probability on the top choice (predictable output) while turning it up spreads it out (more varied output), exactly the trade-off you tune when calling a real model’s API.