Retrieval-Augmented Generation injects relevant documents into the model's prompt at query time, so the LLM answers from your private, current knowledge without retraining. It is the standard way to ground an assistant in facts that change or that the model never saw in training.

How does corpus size change the recommendation?

Small corpora can use a lightweight or in-process vector store and simple chunking, while large corpora need a scalable managed vector database, metadata filtering, and often hybrid (keyword + vector) search to keep recall high and latency acceptable.

Which embedding model should I pick?

For most use cases a strong general embedding model (such as a current OpenAI or open-source model) is the right default. Choose larger-dimension models for nuanced semantic recall, smaller/cheaper ones when cost and latency dominate, and a multilingual model if your corpus spans languages.

Why does the planner stress evaluation?

RAG quality lives or dies on retrieval. Without measuring recall and answer faithfulness on a labelled question set, you cannot tell whether a bad answer came from retrieval missing the chunk or the model ignoring it. The planner always recommends a small golden eval set.

What is the RAG Architecture Planner?

A guided planner that recommends a chunking strategy, embedding model, vector store, and retrieval evaluation approach for your corpus size, document types, latency target, and budget, then summarises the full RAG stack in a form you can copy. It runs free and locally in your browser on Gera Tools, with nothing uploaded.

RAG Architecture Planner

Name: RAG Architecture Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

A Retrieval-Augmented Generation system has four decisions that determine whether it works: how you chunk documents, which embedding model you use, which vector store holds them, and how you evaluate retrieval. Get them wrong and the assistant confidently answers from the wrong context. This planner turns a few facts about your corpus into a coherent recommendation for all four.

How it works

You describe your document types, corpus size, latency target, and budget. The planner applies the standard trade-offs: small corpora get simple fixed-size chunking and a lightweight or in-process vector store; large or mixed corpora get semantic/structure-aware chunking, a scalable managed vector database, metadata filtering, and hybrid keyword-plus-vector search. It recommends an embedding model sized to your quality-versus-cost posture and always includes a retrieval evaluation step — a golden question set scored on recall and faithfulness. The result is a copy-ready stack summary.
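The trade-offs described above can be sketched as a small rule table. This is an illustrative sketch only, not the planner's actual logic: the function name `recommend_stack`, the `Stack` type, the 10,000-document cut-off, and the output labels are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cut-off between "small" and "large" corpora. The text only
# distinguishes thousands of documents from tens of thousands or more.
SMALL_CORPUS_MAX_DOCS = 10_000


@dataclass
class Stack:
    chunking: str
    vector_store: str
    hybrid_search: bool
    evaluation: str


def recommend_stack(num_docs: int, mixed_types: bool) -> Stack:
    """Map two corpus facts to a stack using the trade-offs described above."""
    # Large or mixed corpora get the heavier options.
    large = num_docs > SMALL_CORPUS_MAX_DOCS or mixed_types
    return Stack(
        chunking="semantic/structure-aware" if large else "fixed-size",
        vector_store="managed vector database" if large else "in-process store",
        hybrid_search=large,
        # An evaluation step is recommended regardless of corpus size.
        evaluation="golden question set scored on recall and faithfulness",
    )
```

A real planner would also weigh latency target and budget; they are left out here to keep the rule visible.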

The four RAG decisions

1. Chunking strategy

Chunking is how you break source documents into pieces small enough to embed and retrieve individually. The choice matters because a retrieved chunk that contains the answer but also a lot of irrelevant text dilutes the signal the LLM receives.

Fixed-size chunking splits at a fixed character or token count with an overlap. Simple and fast, but can cut across sentences and paragraphs, losing context. Works well for uniform, prose-heavy documents.
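A minimal sketch of fixed-size chunking with overlap, counting characters rather than tokens for simplicity. The function name `chunk_fixed` and the default sizes are assumptions for illustration.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_fixed("abcdefghij", size=4, overlap=1)` yields windows that each repeat the last character of the previous one. Note that the split points ignore sentence boundaries, which is the weakness described above.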

Structure-aware chunking respects document boundaries: headings, paragraphs, list items, code blocks. Produces more semantically coherent chunks. Works best for Markdown, HTML, or documents with a predictable structure.
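As one illustration of structure-aware chunking, the sketch below splits a Markdown document at heading lines so each chunk holds one section. The function name `chunk_by_heading` is hypothetical, and the approach is deliberately naive: it would also split on a `# ` line inside a fenced code block.

```python
import re


def chunk_by_heading(markdown: str) -> list[str]:
    """Split Markdown into sections, one chunk per heading and its body."""
    # Zero-width split just before any line that starts with 1-6 '#' and a space.
    parts = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    # Drop empty pieces (for example, when the document starts with a heading).
    return [part.strip() for part in parts if part.strip()]
```

Text that appears before the first heading is kept as its own chunk.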

Semantic chunking groups adjacent sentences that are semantically similar by embedding each sentence and splitting where similarity drops. Produces the most coherent chunks but adds compute overhead and is harder to implement.
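The splitting rule can be sketched as follows. The `embed` argument stands in for whatever embedding model you use; it, the function names, and the 0.5 threshold are assumptions for illustration, and a real threshold would need tuning on your own data.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    threshold: float = 0.5,                   # assumed value, tune per corpus
) -> list[str]:
    """Group adjacent sentences; start a new chunk where similarity drops."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])    # similarity dropped: new chunk
        else:
            groups[-1].append(sentence)  # still on topic: extend current chunk
    return [" ".join(group) for group in groups]
```

The extra compute mentioned above is visible here: every sentence is embedded once before any chunk exists.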

A practical starting point for most projects: paragraph-level chunks with a 10–15% overlap to preserve context at boundaries.

2. Embedding model

The embedding model converts a chunk of text into a dense vector. Retrieval quality depends heavily on the model. Considerations:

Dimension — higher-dimensional embeddings capture more nuance but cost more to store and search. Common choices range from 768 to 3,072 dimensions.
Context window — some embedding models accept up to 8,192 tokens; others cap at 512. If your chunks are long, the model needs a large enough window to read the whole chunk.
Multilingual support — if your corpus spans multiple languages, choose a model trained on multilingual data.
Cost — proprietary embedding APIs charge per token. For large corpora or high query volumes, an open-source model deployed on your own infrastructure may be more economical.
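To make the storage side of the dimension trade-off concrete, here is a back-of-the-envelope calculation. It assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores index overhead and metadata; the function name is hypothetical.

```python
def embedding_storage_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors, excluding any index or metadata overhead."""
    return num_chunks * dims * bytes_per_value


# One million chunks at the two ends of the range mentioned above.
small = embedding_storage_bytes(1_000_000, 768)    # 3,072,000,000 bytes, about 3.07 GB
large = embedding_storage_bytes(1_000_000, 3072)   # 12,288,000,000 bytes, about 12.3 GB
```

Moving from 768 to 3,072 dimensions multiplies raw vector storage by four, before any index structures are added.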

3. Vector store

The vector store holds the embeddings and performs nearest-neighbour search to find the chunks most similar to a query embedding.
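Conceptually, nearest-neighbour search can be sketched as a brute-force scan using cosine similarity. Real vector stores typically use approximate indexes instead of scanning every vector; the names `top_k` and `index` below are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float], index: dict[str, Sequence[float]], k: int = 3) -> list[str]:
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query, index[chunk_id]),
        reverse=True,  # most similar first
    )
    return ranked[:k]
```

This scan is linear in the number of chunks, which is acceptable for a small in-process store and is the reason larger corpora need a dedicated index.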

For small corpora (thousands of documents), an in-process library is practical — no server to run, minimal latency, easy to set up. Suitable for prototyping, internal tools, and small-scale applications.

For medium to large corpora (tens of thousands to millions of documents), a dedicated managed vector database is the right choice. It handles indexing, sharding, replication, and filtering. Most support hybrid search, which combines vector similarity with keyword (BM25) scoring. Hybrid search improves recall most for queries that include exact terms, product codes, or named entities, because keyword matching catches literal strings that an embedding can blur.
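The text does not say how the two result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as what any particular database does. It merges the ranked ids from a vector search and a keyword search using only their positions; `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; ids ranked high in more lists score higher."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF uses ranks rather than raw scores, it avoids having to normalise BM25 scores against cosine similarities.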

4. Evaluation

Retrieval evaluation is the step most teams skip and later regret. Without measuring how often the correct chunk is actually returned (recall), you cannot distinguish a retrieval problem from a generation problem when the system gives a wrong answer.

The standard approach is a golden question set: a small collection of questions with known correct source chunks. Run your pipeline against it and measure:

Recall@k — is the correct chunk in the top k retrieved results?
Faithfulness — does the generated answer stay grounded in the retrieved chunks, or does it hallucinate?
Answer relevance — does the answer actually address the question?
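Of the three metrics, Recall@k is the simplest to compute directly. The sketch below assumes each golden question has exactly one correct chunk id; the function name and the data layout (dictionaries keyed by question id) are assumptions for illustration.

```python
def recall_at_k(retrieved: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Fraction of golden questions whose correct chunk appears in the top k results."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for question_id, correct_chunk in gold.items()
        if correct_chunk in retrieved.get(question_id, [])[:k]
    )
    return hits / len(gold)


# Hypothetical golden set and pipeline output.
gold = {"q1": "c1", "q2": "c7"}
retrieved = {"q1": ["c3", "c1"], "q2": ["c2", "c4"]}
score = recall_at_k(retrieved, gold, k=2)  # q1 is a hit, q2 is a miss
```

Faithfulness and answer relevance usually need a human or model grader, so they are not sketched here.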

Build this eval set on day one and re-run it on every change to chunking, embedding, or retrieval parameters.

Tips and notes

Start with retrieval quality, not the LLM: most “the model is dumb” complaints in RAG are actually retrieval misses. Chunk on natural boundaries (headings, paragraphs) with a small overlap rather than blind fixed windows, and store rich metadata so you can filter before you rank. Add hybrid search early if your queries include exact terms, codes, or names that pure vector search tends to miss. Build a small labelled eval set on day one and re-run it on every change. When your gap is behaviour or format rather than knowledge, RAG is the wrong tool; check the fine-tuning decision helper first.