Context Window vs Long-Term Memory in AI Agents

What the model sees right now vs what it can recall across sessions

Ad placeholder (leaderboard)

Two different kinds of “memory”

People often say an AI assistant “remembers” a conversation, but two very different mechanisms hide behind that word. The context window is the model’s short-term working memory: the full block of text — system prompt, instructions, conversation history, and retrieved data — that the model can see in a single request. Long-term memory is information stored outside the model, in a database or file, that an application chooses to pull back into the context window when it becomes relevant. The model itself is stateless: between two API calls it remembers nothing at all. Any continuity you experience is engineered by the system wrapped around it.

The context window: fast but finite

Everything inside the context window is available to the model instantly and is reasoned over directly — this is what makes in-context information so powerful. But the window is finite, measured in tokens, and every token costs money and adds latency. When a conversation or document exceeds the limit, something has to give: older messages are truncated or compressed. Models also tend to attend most reliably to the start and end of a long input, so simply cramming in more text is not a guaranteed win. The context window is best thought of as a desk: spacious enough for the task at hand, but not a filing cabinet.

Long-term memory: external stores

To remember across sessions or beyond the window, agents lean on external stores. Three common patterns dominate. Vector databases hold embeddings of past messages and documents, so the system can retrieve the most semantically similar chunks for a new query — the core of retrieval-augmented generation (RAG). Rolling summaries condense old turns into a short paragraph that travels in the prompt, trading detail for compactness. Key-value records store structured facts — a user’s name, preferences, or prior decisions — that are looked up and injected deterministically. These stores are the agent’s filing cabinet: large, persistent, and selectively consulted.

How the two work together

A capable agent constantly moves information between these layers. New input arrives in the context window; important facts are written out to long-term storage; and when a future query needs them, they are retrieved and placed back into the window. A well-designed loop keeps the window small and focused while drawing on an effectively unlimited memory store. The art is deciding what to persist, what to summarize, and what to retrieve — too little and the agent feels forgetful, too much and it becomes slow, expensive, and distracted by irrelevant context.

Practical takeaways

Treat the context window as precious and the external store as cheap. Push raw history into long-term memory and retrieve only what each step needs. Prefer targeted retrieval over dumping everything into a giant prompt, even when a huge context window is available, because focused inputs are cheaper, faster, and often more accurate. Summarize aging conversation turns rather than dropping them silently, and store stable facts as structured records you can look up reliably. Done well, the combination gives an agent the responsiveness of a small window with the recall of a vast archive.

Ad placeholder (rectangle)