What is retrieval-augmented generation (RAG)?

RAG is the standard pattern for an LLM knowledge base. You split documents into chunks, store an embedding of each in a vector index, and at query time retrieve the most relevant chunks and pass them to the model as context. The model answers from those chunks rather than its training data, which keeps answers grounded in your actual content and citable.

How big should my chunks be?

A common range is 300 to 800 tokens with some overlap between adjacent chunks. Smaller chunks give more precise retrieval but can lose context; larger chunks keep context but dilute relevance and cost more per query. Start around 500 tokens with 50 to 100 tokens of overlap, then tune against real questions: shrink the chunks if retrieved passages bury the relevant sentence in unrelated text, and grow them if answers keep missing surrounding context.
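As a rough illustration of overlapping chunks, here is a minimal sketch in Python. It assumes the text is already tokenised into a list (a real pipeline would use the tokenizer that matches its embedding model), and the function name `chunk_tokens` is made up for this example.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Illustrative input: 1,200 stand-in tokens
chunks = chunk_tokens(list(range(1200)), chunk_size=500, overlap=50)
```

With these settings each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next.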

How do I keep the knowledge base fresh?

Re-index on a schedule and on change. Webhooks from Notion, Confluence, or Google Drive can trigger re-embedding of just the documents that changed, while a nightly or weekly full sweep catches anything missed. Only re-embed changed chunks, not the whole corpus, to keep recurring cost low — the planner on this page estimates that cost.
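One way to find the chunks that actually changed is to store a content hash next to each vector and compare it on every sync. The sketch below assumes chunk IDs of the form `doc#position` and an in-memory dict of stored hashes; both are illustrative choices, not the behaviour of any particular vector store.

```python
import hashlib


def chunk_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(stored_hashes, current_chunks):
    """Compare the last indexed state with the current chunks.

    stored_hashes:  dict of chunk_id -> hash saved at the last index run
    current_chunks: dict of chunk_id -> chunk text as it exists now
    Returns (to_embed, to_delete) as lists of chunk IDs.
    """
    to_embed = [
        cid for cid, text in current_chunks.items()
        if stored_hashes.get(cid) != chunk_hash(text)  # new or edited chunk
    ]
    to_delete = [cid for cid in stored_hashes if cid not in current_chunks]
    return to_embed, to_delete


stored = {"a#0": chunk_hash("hello"), "a#1": chunk_hash("world"), "b#0": chunk_hash("old")}
current = {"a#0": "hello", "a#1": "world!", "c#0": "new"}
to_embed, to_delete = plan_reembedding(stored, current)
```

Here only `a#1` (edited) and `c#0` (new) would be sent to the embedding model, and `b#0` would be removed from the index.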

Do I need a dedicated vector database?

Not always. For a few thousand documents, the pgvector extension on a database you already run is usually enough. Reach for a dedicated store like Pinecone, Qdrant, or Weaviate when you scale into the millions of chunks, need managed sharding, or want very low latency under heavy concurrent load.

How do I stop it answering from outside the docs?

Instruct the model to answer only from the retrieved context and to say it does not know when the answer is not present, then show the source chunks alongside every answer so users can verify. Grounding plus visible citations is what makes a knowledge base trustworthy rather than a confident guesser.
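A grounding instruction of this kind might be assembled as follows. This is a sketch only: the prompt wording, the `source` and `text` keys, and the function name are assumptions for illustration, and the right phrasing depends on the model you use.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved chunks.

    chunks: list of dicts with 'source' and 'text' keys (hypothetical shape).
    """
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed numbers of the chunks you used. "
        "If the answer is not in the context, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    [{"source": "policies.md", "text": "Refunds are accepted within 30 days."}],
)
```

Numbering the chunks in the prompt lets the interface map each cited number back to its source document when it displays the answer.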

How to Build a Knowledge Base Powered by an LLM

What an LLM-powered knowledge base is

A knowledge base powered by a large language model lets your team ask questions in plain English and get answers drawn from your own documents — your Notion wiki, Confluence space, or Google Docs — instead of hunting through pages. The standard architecture is retrieval-augmented generation (RAG): you split documents into chunks, store a vector embedding of each chunk in an index, and at query time retrieve the most relevant chunks and hand them to the model as context. The model answers from your content, with citations, rather than from its training data. The planner below sizes the corpus and estimates the one-time and recurring cost so you can budget before you build.

How it works

There are four stages. Ingest: pull documents from your sources via their APIs, normalise to clean text, and strip boilerplate. Index: split each document into chunks (typically 300–800 tokens with overlap), embed every chunk with an embedding model, and store the vectors plus metadata in a vector index. Query: embed the user’s question, retrieve the top matching chunks, and pass them to the LLM with an instruction to answer only from the provided context. Refresh: re-embed documents when they change, driven by source webhooks for immediacy plus a periodic full sweep, so the index stays in step with its sources. Re-embed only what changed to keep the recurring bill small.
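The retrieval part of the Query stage can be sketched as a brute-force cosine-similarity search. This assumes the question and the chunks have already been embedded, and it uses toy two-dimensional vectors. A real vector index replaces the linear scan with an approximate nearest-neighbour search, but the ranking idea is the same.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=3):
    """Return the IDs of the k chunks most similar to the query.

    index: list of (chunk_id, vector) pairs (hypothetical in-memory index).
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]


index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
best = top_k([1.0, 0.1], index, k=2)  # "a" ranks first, then "c"
```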

Sizing, cost, and freshness

The two numbers that drive cost are the one-time embedding of your whole corpus and the recurring re-embedding of what changes. The planner turns your document count, average length, chunk size, and change rate into chunk counts, token volume, an estimated embedding cost, and a storage footprint, then projects the monthly cost of keeping it fresh at your chosen cadence. Two practical notes. First, embedding is cheap relative to generation, so do not over-optimise it: the per-query LLM cost usually dominates at scale. Second, ground every answer in retrieved chunks with visible citations and an explicit “I don’t know” path, which is what separates a trustworthy knowledge base from a confident guesser. For keeping the running system honest after launch, see how to monitor AI apps in production.
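For a back-of-the-envelope version of this arithmetic, the sketch below shows one plausible way to derive those figures. It is not the planner's actual formula. The price per million tokens and the 1,536-dimension float32 vectors are placeholder assumptions, so substitute your provider's current pricing and your model's real dimensions.

```python
import math


def estimate(doc_count, avg_tokens_per_doc, chunk_size=500, overlap=50,
             monthly_change_rate=0.10, price_per_million_tokens=0.02, dims=1536):
    """Rough sizing for an embedding index. All defaults are illustrative."""
    step = chunk_size - overlap
    # Sliding windows needed to cover one average-length document
    chunks_per_doc = max(1, math.ceil(max(avg_tokens_per_doc - overlap, 1) / step))
    total_chunks = doc_count * chunks_per_doc
    # Upper bound: the last chunk of each document is usually shorter
    embedded_tokens = total_chunks * chunk_size
    one_time_cost = embedded_tokens / 1_000_000 * price_per_million_tokens
    return {
        "total_chunks": total_chunks,
        "embedded_tokens": embedded_tokens,
        "one_time_cost": one_time_cost,
        # Assumes only the changed share of the corpus is re-embedded each month
        "monthly_refresh_cost": one_time_cost * monthly_change_rate,
        # float32 vectors: 4 bytes per dimension, metadata not included
        "storage_mb": total_chunks * dims * 4 / 1_000_000,
    }


plan = estimate(doc_count=2000, avg_tokens_per_doc=1200)
```

With these placeholder inputs, 2,000 documents of about 1,200 tokens each become 6,000 chunks and roughly 3 million embedded tokens. Plugging in real prices shows quickly how small the embedding bill tends to be next to per-query generation cost.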