Why not just keep the whole conversation history?

Conversation history is short-term memory and it has a hard ceiling — the context window. Resending the entire history of every past chat is impossible and expensive long before you hit the limit. Long-term memory solves this by storing distilled facts and retrieving only the few relevant ones per message, so the model can recall something from weeks ago without carrying every word along.

What should I actually store as a memory?

Durable, useful facts about the user and their context — preferences, goals, constraints, decisions made, and recurring topics — not the raw transcript. After a conversation, ask the model to extract a handful of concise memory statements like "prefers Python over JavaScript" or "is migrating off AWS." Storing distilled facts keeps the memory store small, relevant, and cheap to search.

How does the bot decide which memories are relevant?

By embedding the incoming message and running a similarity search against the user's stored memory vectors, returning only the top few closest matches. This is the same retrieval pattern as RAG, applied to per-user memories instead of documents. You inject only the relevant handful into the prompt rather than every memory, which keeps the context focused and within budget.

How do I stop the memory from growing forever?

Cap it and curate it. Store distilled facts rather than transcripts, de-duplicate near-identical memories, and let newer facts supersede stale ones — for example, a new "uses Postgres" memory should retire an old "uses MySQL." Some systems add a recency or importance score and prune the lowest-value memories once a per-user cap is reached.

Is storing user memories a privacy concern?

Yes, and you must treat it as one. You are persisting personal data, so scope memories per user with strict access control, let users view and delete their memories, encrypt the store, and follow the data-protection rules that apply to you. Never mix one user's memories into another's context, and be transparent that the assistant remembers things across sessions.

How to Build a Chatbot with Long-Term Memory

A normal chatbot forgets everything the moment a conversation ends — its only memory is the message history you resend, and that is bounded by the context window. Long-term memory breaks that ceiling: instead of carrying every past word, you store distilled facts about the user, then retrieve only the few relevant ones when they matter. The pattern is RAG applied to per-user memories. This tutorial walks the three stages — summarise and store, retrieve, inject — and the assembler below shows exactly how retrieved memories land in the prompt.

Step 1 — Summarise and store memories

Do not store raw transcripts. After a conversation (or periodically during one), ask the model to extract a handful of durable facts: preferences, goals, constraints, and decisions. Things like “prefers Python over JavaScript,” “is migrating off AWS,” “timezone is GMT.”

Embed each memory statement into a vector and store the text plus the vector in a database, keyed to the user. Storing distilled facts rather than transcripts keeps the memory store small, relevant, and cheap to search — and it is the difference between a memory system that scales and one that collapses under its own weight.

Step 2 — Retrieve relevant memories

When a new message arrives, embed it with the same model and run a similarity search against that user’s memories, returning the top few closest matches. This is identical to document retrieval in RAG, just scoped per user.

The key discipline is selectivity. You do not inject every memory — you inject only the handful relevant to the current message. A question about deployment pulls the “migrating off AWS” memory; a question about syntax pulls the “prefers Python” memory. Irrelevant memories are noise that wastes context and can mislead the model.

Step 3 — Inject memories into the prompt

Place the retrieved memories in the system prompt, before the live conversation, framed clearly:

You are a helpful assistant. Here is what you know about this user:
- Prefers Python over JavaScript
- Is migrating off AWS

Use this context when relevant. Then continue the conversation.

Mind the token budget — memories, the system prompt, and the live history all share the context window, so cap how many memories you inject. And because you are now persisting personal data, scope memories strictly per user, let users view and delete them, encrypt the store, and never mix one user’s memories into another’s context.

Use the assembler below to see how memories, system prompt, and live turns fit into a single request within a token budget, then review the chat API basics and semantic search for the retrieval mechanics underneath.