What problem does RAG actually solve?

RAG lets a general-purpose LLM answer questions about private or recent data it was never trained on. Instead of relying on parametric memory, the model reads relevant passages you retrieve at query time, which reduces hallucination and lets you cite sources.

How big should my chunks be?

A common starting point is 200-500 tokens per chunk with 10-20 percent overlap. Smaller chunks give more precise retrieval but more fragments to manage; larger chunks preserve context but can dilute relevance. Tune against your own questions.
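As a rough illustration of fixed-size chunking with overlap, here is a minimal sketch. The function name chunk_tokens and its defaults are assumptions chosen for the example, and whitespace splitting stands in for a real tokenizer, so the counts will not match your model's token counts.

```python
def chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.15):
    """Split a list of tokens into fixed-size chunks that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Assumption: whitespace-separated words approximate tokens.
words = ("word " * 1000).split()
chunks = chunk_tokens(words, chunk_size=300, overlap_ratio=0.15)
print(len(chunks), [len(c) for c in chunks])
```

Treat chunk_size and overlap_ratio as knobs to tune against your own questions rather than as recommended values.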

Do I need a dedicated vector database?

Not always. For small corpora you can keep vectors in memory or use pgvector inside Postgres. Dedicated stores like Pinecone, Weaviate, or Qdrant offer managed scaling, metadata filtering, and hybrid search, which tend to become worthwhile once the corpus grows past tens of thousands of chunks or query latency starts to matter.
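To show how little is needed for a small corpus, here is a sketch of brute-force in-memory search using cosine similarity. The store layout, the helper names, and the three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=3):
    """Score every stored vector against the query and keep the k best.

    store is a list of (chunk_text, vector) pairs held in memory.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy vectors; a real system would get these from an embedding model.
store = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("warranty terms", [0.7, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this costs one similarity computation per stored chunk on every query, which is the main reason to move to an indexed store as the corpus grows.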

How many chunks should I retrieve per query?

Start with the top 3-5 most similar chunks. Retrieving too many wastes context window and adds noise; too few risks missing the answer. Re-ranking the retrieved set with a cross-encoder often improves quality more than simply raising k.
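The retrieve-then-re-rank idea can be sketched as follows: fetch a wider candidate set, score each candidate against the question with a second scorer, and keep only the best few. The overlap_score function here is a deliberately crude placeholder, not a cross-encoder; in practice you would pass a function that calls your re-ranking model as score_fn.

```python
import re


def overlap_score(question, passage):
    """Placeholder scorer: fraction of question words that appear in the passage."""
    q_words = set(re.findall(r"\w+", question.lower()))
    p_words = set(re.findall(r"\w+", passage.lower()))
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(question, candidates, score_fn=overlap_score, keep=3):
    """Re-order retrieved candidates by score_fn and keep the top few."""
    ranked = sorted(candidates, key=lambda p: score_fn(question, p), reverse=True)
    return ranked[:keep]


# Assume these four passages came back from a wider similarity search.
candidates = [
    "Shipping usually takes five business days.",
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "A return request must include the order number.",
]
best = rerank("how many days until refunds are issued", candidates, keep=2)
print(best)
```

Keeping score_fn as a parameter means the retrieval code does not change when the placeholder is swapped for a real model.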

Why does my RAG system still hallucinate sometimes?

Usually the retrieval missed the right passage, the chunk was too large to be specific, or the prompt did not firmly instruct the model to answer only from context. Add an instruction to say "I don't know" when the context lacks the answer, and inspect what was retrieved.

How to Set Up RAG (Retrieval-Augmented Generation)

What RAG is and why it matters

Retrieval-Augmented Generation (RAG) is a pattern that gives a large language model access to external knowledge at query time. Instead of fine-tuning a model on your documents, you split those documents into chunks, store the chunks as vectors, retrieve the most relevant passages for each question, and paste them into the prompt as context. The model then answers from the supplied text rather than from memory, which keeps answers current and citable and makes hallucination much less likely.

How the pipeline works

A RAG system has two phases. The indexing phase runs once (and on updates): documents are split into chunks, each chunk is embedded into a vector, and the vectors plus their source text are stored in a vector database. The query phase runs per request: the user’s question is embedded with the same model, a similarity search returns the top-k closest chunks, those chunks are assembled into a context block, and the model is prompted to answer using only that block.
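The two phases described above can be sketched end to end in a few lines. Everything here is a toy under stated assumptions: embed is a hashed bag-of-words stand-in for a real embedding model, the index is a plain Python list rather than a vector database, and the function names are invented for the example.

```python
import math
import re
import zlib

DIM = 256  # toy vector size; real embedding models define their own


def embed(text):
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(chunks):
    """Indexing phase: embed each chunk and store the vector with its source text."""
    return [{"text": chunk, "vector": embed(chunk)} for chunk in chunks]


def retrieve(index, question, k=2):
    """Query phase: embed the question with the same function and return the top-k chunks."""
    q = embed(question)

    def similarity(entry):
        # Vectors are normalised, so the dot product equals cosine similarity.
        return sum(a * b for a, b in zip(q, entry["vector"]))

    ranked = sorted(index, key=similarity, reverse=True)
    return [entry["text"] for entry in ranked[:k]]


def build_context(chunks):
    """Assemble the retrieved chunks into one numbered context block."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


index = build_index([
    "The warranty covers parts for two years.",
    "Shipping takes five business days.",
    "Support is available by email.",
])
top_chunks = retrieve(index, "How long does the warranty cover parts?", k=2)
print(build_context(top_chunks))
```

The point of the sketch is the shape of the pipeline: the same embed function is used in both phases, and the stored text travels with its vector so the context block can be built without a second lookup.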

The previewer below lets you paste a few document chunks and a question, then shows the exact grounded prompt a RAG system would send to the model — including the system instruction that forces the answer to stay inside the retrieved context.
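For readers who want to reproduce that kind of grounded prompt in their own code, a minimal sketch might look like this. The instruction wording, the layout, and the name build_grounded_prompt are assumptions for illustration and are not the exact text any particular tool or model requires.

```python
# Hypothetical instruction text; adjust the wording for your model.
SYSTEM_INSTRUCTION = (
    "Answer using only the context below. "
    "If the context does not contain the answer, reply \"I don't know\"."
)


def build_grounded_prompt(question, chunks):
    """Combine the instruction, numbered context chunks, and the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    " What is the refund window? ",
    ["Refunds are issued within 14 days.", "Shipping takes five days."],
)
print(prompt)
```

Numbering the chunks gives the model something to cite, and printing the assembled prompt is a quick way to check what the model actually sees.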

Tips for production RAG

Always store source metadata (title, URL, section) alongside each chunk so you can show citations. Add an explicit instruction telling the model to reply “I don’t know” when the context does not contain the answer; this is one of the most effective guards against hallucination. Measure retrieval quality separately from generation quality: if the right chunk never makes the top-k, no prompt tuning will fix the answer. Finally, re-embed your corpus whenever you change embedding models, since vectors from different models are not comparable.
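One simple way to measure retrieval on its own is a hit rate at k: for a set of test questions where you know which chunk holds the answer, count how often that chunk appears in the top k results. The sketch below assumes you have such labelled pairs; the chunk IDs and the function name hit_rate_at_k are hypothetical.

```python
def hit_rate_at_k(retrieved_per_query, relevant_per_query, k=5):
    """Fraction of queries whose known-relevant chunk ID is in the top-k results."""
    if len(retrieved_per_query) != len(relevant_per_query):
        raise ValueError("need one relevant ID per query")
    if not relevant_per_query:
        return 0.0
    hits = sum(
        1
        for retrieved, relevant in zip(retrieved_per_query, relevant_per_query)
        if relevant in retrieved[:k]
    )
    return hits / len(relevant_per_query)


# Hypothetical evaluation data: ranked chunk IDs returned for three test questions,
# and the ID of the chunk that actually contains each answer.
retrieved = [
    ["c1", "c7", "c3"],
    ["c4", "c2", "c9"],
    ["c8", "c5", "c6"],
]
relevant = ["c3", "c9", "c1"]

print(hit_rate_at_k(retrieved, relevant, k=3))
print(hit_rate_at_k(retrieved, relevant, k=2))
```

If this number is low, the problem is upstream of the prompt: look at chunking, the embedding model, or k before touching the generation step.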