What are the core stages of a RAG pipeline?

Load documents, split them into chunks, embed each chunk, store vectors, retrieve the top matches for a query, optionally rerank, then generate an answer with the retrieved context. The first four run at ingest time; retrieval and generation run per query.

When do I need a reranker?

Add a reranker when the top-k vector results contain near-misses: chunks that look similar to the query but do not actually answer it. A cross-encoder reranker re-scores each retrieved chunk against the query for higher precision, at the cost of extra latency and compute. That trade is usually worth it for high-stakes answers.

How do I choose chunk size and top-k?

Smaller chunks (200–400 tokens) improve precision; larger chunks preserve context. Start with a top-k of about 5 and a chunk size of about 400 tokens, then tune against your eval set. The tool reflects your choices in the exported pseudocode.

Is the Mermaid output ready to paste?

Yes. The diagram uses standard Mermaid flowchart syntax that renders in GitHub, Notion, and most docs tools. Copy it straight in.

What is the RAG Pipeline Designer?

Toggle and configure the stages of a RAG pipeline — loader, splitter, embedder, vector store, retriever, reranker, generator — then export a copy-ready Mermaid diagram and Python-style pseudocode for your stack. It runs free in your browser on Gera Tools, with nothing uploaded.

RAG Pipeline Designer

Name: RAG Pipeline Designer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Design your RAG pipeline visually

Retrieval-augmented generation has a standard backbone but many tunable stages. This designer lets you toggle each stage — loader, splitter, embedder, vector store, retriever, optional reranker, and generator — set its key parameters, and export both a Mermaid diagram for your documentation and a pseudocode skeleton to start building.

How a RAG pipeline fits together

A RAG system has two phases. At ingest time you load documents, split them into chunks, embed each chunk, and store the vectors. At query time you embed the user’s question, retrieve the most similar chunks, optionally rerank them for precision, and pass them as context to the generator. Optional stages — query rewriting before retrieval and reranking after — trade extra latency and cost for better answer quality.
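The two phases can be sketched in a few lines of Python. This is a toy illustration, not the designer's exported pseudocode: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedder: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents: list[str], chunk_words: int = 8) -> list[tuple[str, Counter]]:
    """Ingest phase: split each document, embed each chunk, store the vectors."""
    store = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            store.append((chunk, embed(chunk)))
    return store


def retrieve(store: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Query phase, step 1: embed the question and return the top-k chunks."""
    query_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Query phase, step 2: assemble the context that would go to the generator."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "The reranker re-scores retrieved chunks against the query.",
    "Mermaid diagrams render in many documentation tools.",
]
store = ingest(docs)                      # run once at ingest time
question = "What does the reranker do?"
hits = retrieve(store, question, top_k=1)  # run per query
prompt = build_prompt(question, hits)
```

In a real system the final `prompt` would be sent to an LLM; that call is left out here so the sketch stays self-contained.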

What each stage does

Stage	When it runs	Role
Loader	Ingest	Reads raw files (PDF, HTML, DOCX) and normalises them to plain text
Splitter	Ingest	Breaks text into overlapping chunks of a fixed token size
Embedder	Ingest + Query	Converts text to a dense vector representation
Vector store	Ingest	Persists vectors for approximate nearest-neighbour search
Retriever	Query	Finds the top-k most similar chunks for the user query
Reranker	Query (optional)	Cross-encodes query+chunk pairs to re-score for precision
Generator	Query	Feeds retrieved context into an LLM and returns the answer

The query-rewrite stage is an optional pre-retrieval step that rewrites a conversational question into a standalone search query, which is especially useful in multi-turn chat pipelines.
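As a rough sketch of that rewrite step, the function below builds a rewrite prompt from the chat history and hands it to an LLM. The `llm` parameter is a hypothetical callable (prompt in, text out) that you would wire to your own model client, and the prompt wording is only an assumption.

```python
from typing import Callable


def build_rewrite_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Format prior (role, text) turns plus the new question into one prompt."""
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question as a standalone search query.\n"
        "Resolve pronouns and references using the conversation.\n\n"
        f"Conversation:\n{turns}\n\n"
        f"Final question: {question}\n"
        "Standalone query:"
    )


def rewrite_query(
    history: list[tuple[str, str]],
    question: str,
    llm: Callable[[str], str],
) -> str:
    """Return a standalone query, falling back to the original question."""
    if not history:
        return question  # first turn: nothing to resolve
    rewritten = llm(build_rewrite_prompt(history, question)).strip()
    return rewritten or question  # empty model output: keep the original


# Usage with a fake model, so the example runs without any API access.
history = [("user", "Tell me about rerankers."), ("assistant", "They re-score chunks.")]
fake_llm = lambda prompt: " How much latency does a reranker add? "
standalone = rewrite_query(history, "How much latency do they add?", fake_llm)
```

The rewritten query, not the raw question, is what gets embedded and sent to the retriever.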

Choosing chunk size and top-k

Chunk size controls the trade-off between precision and context continuity. Small chunks (200–400 tokens) pinpoint specific sentences but may lack surrounding context. Large chunks (600–1,000 tokens) carry more context but crowd other relevant material out of the prompt. A common starting point is 400 tokens with a 50-token overlap and a top-k of 5; tune from there using an eval set.
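To make the overlap concrete, here is a minimal sliding-window splitter. It assumes the text has already been tokenized into a list; the function name and the plain-string tokens are illustrative, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def split_with_overlap(
    tokens: list[str],
    chunk_size: int = 400,
    overlap: int = 50,
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Usage: 1,000 placeholder tokens with the starting-point settings.
tokens = [f"t{i}" for i in range(1000)]
chunks = split_with_overlap(tokens, chunk_size=400, overlap=50)
```

With these settings each window advances by 350 tokens, so consecutive chunks share 50 tokens. At a top-k of 5, retrieval adds at most 5 × 400 = 2,000 tokens of context to the prompt.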

The reranker is most valuable when top-k is large (10–20) and you need to compress down to 3–5 high-precision results before sending to the generator.
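The retrieve-wide, then narrow pattern might look like the sketch below. The `score` argument is a placeholder for a cross-encoder; the `overlap_score` function is only a toy stand-in based on word overlap, used so the example runs without a model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair and keep the highest-scoring chunks."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [chunk for _, chunk in scored[:keep]]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


# Usage: the retriever returned three candidates; keep the best two.
candidates = [
    "chunk size tuning guide",
    "mermaid syntax",
    "how to tune chunk size and top-k",
]
best = rerank("tune chunk size", candidates, overlap_score, keep=2)
```

Because the scorer runs once per candidate, reranking cost grows with the retrieval top-k, which is why it pays off mainly when that first-pass list is large.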

Reading the exported outputs

The Mermaid diagram uses flowchart syntax and renders in GitHub, Notion, and most documentation tools; Confluence typically needs a Mermaid app or macro. Paste it into a code fence labelled mermaid and it renders as a diagram. The pseudocode skeleton shows function calls in Python-style notation so you can map it to LangChain, LlamaIndex, or a custom implementation.
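For a sense of what Mermaid flowchart text looks like, the sketch below turns a list of enabled stage names into a left-to-right flowchart. It illustrates the syntax only: the node ids, the single linear chain, and the `to_mermaid` name are assumptions, not the designer's actual export format.

```python
def to_mermaid(stages: list[str]) -> str:
    """Render stage names as a linear Mermaid flowchart (left to right)."""
    lines = ["flowchart LR"]
    for index, name in enumerate(stages):
        lines.append(f'    s{index}["{name}"]')        # declare one node per stage
    for index in range(len(stages) - 1):
        lines.append(f"    s{index} --> s{index + 1}")  # link each stage to the next
    return "\n".join(lines)


ALL_STAGES = [
    "Loader", "Splitter", "Embedder", "Vector store",
    "Retriever", "Reranker", "Generator",
]

# Usage: toggle the optional reranker off before exporting.
enabled = [stage for stage in ALL_STAGES if stage != "Reranker"]
diagram = to_mermaid(enabled)
```

Printing `diagram` gives text that can be pasted into a mermaid code fence.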

Tips

Ingest once, query many. Keep expensive embedding work in the ingest phase; the query path should be fast.
Add the reranker only if you need it. It can noticeably improve precision, but it adds a scoring pass over every retrieved candidate on each query.
Log retrieved chunks. Most RAG quality problems are retrieval problems — inspect what was fetched before blaming the generator.
Start simple. A loader → splitter → embedder → vector store → retriever → generator pipeline with no optional stages is fast to build and often gets you most of the way to a good system. Add stages only when eval scores plateau.
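The logging tip can be as small as the helper below, which uses Python's standard `logging` module. The logger name, the `(chunk, score)` tuple shape, and the line format are assumptions to adapt to your own retriever.

```python
import logging

logger = logging.getLogger("rag.retrieval")


def log_retrieval(
    query: str,
    hits: list[tuple[str, float]],
    preview_chars: int = 80,
) -> list[str]:
    """Log each retrieved chunk with its rank and score; return the lines."""
    lines = []
    for rank, (chunk, score) in enumerate(hits, start=1):
        preview = chunk[:preview_chars].replace("\n", " ")  # keep one line per hit
        lines.append(f"rank={rank} score={score:.3f} text={preview!r}")

    logger.info("query=%r retrieved=%d", query, len(hits))
    for line in lines:
        logger.info("%s", line)
    return lines


# Usage: call this between retrieval and generation.
logged = log_retrieval("what is top-k?", [("Top-k is the number of chunks kept.", 0.91234)])
```

Reviewing these lines for a handful of failing queries usually shows whether the right chunk was never retrieved or was retrieved and then ignored by the generator.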