A normal chatbot forgets everything the moment a conversation ends — its only memory is the message history you resend, and that is bounded by the context window. Long-term memory breaks that ceiling: instead of carrying every past word, you store distilled facts about the user, then retrieve only the few relevant ones when they matter. The pattern is RAG applied to per-user memories. This tutorial walks the three stages — summarise and store, retrieve, inject — and the assembler below shows exactly how retrieved memories land in the prompt.
Step 1 — Summarise and store memories
Do not store raw transcripts. After a conversation (or periodically during one), ask the model to extract a handful of durable facts: preferences, goals, constraints, and decisions. Things like “prefers Python over JavaScript,” “is migrating off AWS,” “timezone is GMT.”
Embed each memory statement into a vector and store the text plus the vector in a database, keyed to the user. Storing distilled facts rather than transcripts keeps the memory store small, relevant, and cheap to search — and it is the difference between a memory system that scales and one that collapses under its own weight.
Step 2 — Retrieve relevant memories
When a new message arrives, embed it with the same model and run a similarity search against that user’s memories, returning the top few closest matches. This is identical to document retrieval in RAG, just scoped per user.
The key discipline is selectivity. You do not inject every memory — you inject only the handful relevant to the current message. A question about deployment pulls the “migrating off AWS” memory; a question about syntax pulls the “prefers Python” memory. Irrelevant memories are noise that wastes context and can mislead the model.
Step 3 — Inject memories into the prompt
Place the retrieved memories in the system prompt, before the live conversation, framed clearly:
You are a helpful assistant. Here is what you know about this user:
- Prefers Python over JavaScript
- Is migrating off AWS
Use this context when relevant. Then continue the conversation.
Mind the token budget — memories, the system prompt, and the live history all share the context window, so cap how many memories you inject. And because you are now persisting personal data, scope memories strictly per user, let users view and delete them, encrypt the store, and never mix one user’s memories into another’s context.
Use the assembler below to see how memories, system prompt, and live turns fit into a single request within a token budget, then review the chat API basics and semantic search for the retrieval mechanics underneath.