Why rewrite a question before vector search?

Conversational questions are full of pronouns and shorthand ("what about the second one?") that give an embedding model nothing specific to match. Rewriting them into self-contained, keyword-rich queries improves which chunks the retriever returns.

What does adding conversation history do?

It lets the rewriter resolve references like "it," "that," or "the previous step" into explicit terms, producing a query that stands on its own without the chat context the vector store does not have.

How does specifying the domain help?

Naming the domain steers the rewrite toward the vocabulary your documents actually use — medical, legal, or product-specific terms — so the query embeds near the right chunks instead of generic phrasing.

Does it generate multiple query variants?

The generated prompt can ask for several phrasings of the same intent, which you can embed and search with in parallel (multi-query retrieval) to improve recall before re-ranking the combined results.

Where does this fit in a RAG pipeline?

It sits between the user and the retriever. The user's raw turn goes in, a clean standalone query comes out, you embed that query, search the vector store, then pass the retrieved chunks plus the original question to your generation step.

What is the RAG Query Rewriter?

The RAG Query Rewriter builds a prompt that rewrites a conversational user question into a standalone, semantically rich search query optimized for retrieval from a vector database, resolving pronouns and folding in conversation context. It runs free in your browser on Gera Tools, with nothing uploaded.

RAG Query Rewriter — Gera Tools

Name: RAG Query Rewriter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

RAG query rewriter

In a retrieval-augmented generation pipeline, the quality of your answer is capped by the quality of your retrieval — and retrieval is capped by the query you embed. A raw user turn like “and what about the cheaper one?” embeds to noise. This tool builds a prompt that rewrites conversational questions into standalone, semantically rich queries that resolve references and use the right domain vocabulary, so your vector store returns the chunks that actually answer the question.

The problem this solves

Embedding models convert text to dense vectors based on semantic meaning. When a user asks “what about the second approach?” in a multi-turn chat, the embedding sees only those words — it has no access to the preceding conversation that defines “the second approach.” The embedded vector sits near generic phrases about approaches and comparisons rather than near the specific technical content your corpus contains.

A query rewriter uses an LLM that does have the conversation history to translate the ambiguous turn into something like “compare the performance of gradient boosting versus neural networks on tabular data,” which embeds accurately and retrieves the right chunks.

How it works

You provide the user’s question, optionally the recent conversation history, and the domain of your documents. The tool composes a prompt that instructs an LLM to do three things: resolve every pronoun and implicit reference using the history, expand the question with the domain terminology your corpus is likely to use, and output a self-contained query that needs no chat context to make sense. It can also request several alternative phrasings for multi-query retrieval. You run this prompt as the first step of your pipeline, embed the rewritten query, search, and then pass the retrieved chunks plus the original question to your generation model.
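The three instructions described above can be sketched as a small prompt builder. This is a minimal illustration, not the prompt the tool actually generates: the function name `build_rewrite_prompt`, its parameters, the instruction wording, and the sample history and domain are all assumptions made for the example.

```python
def build_rewrite_prompt(question, history=None, domain=None, n_variants=1):
    """Compose an instruction prompt for a query-rewriting LLM call.

    Hypothetical helper: the wording below is illustrative only.
    history is an optional list of (role, text) pairs.
    """
    lines = [
        "Rewrite the user's latest question as a standalone search query "
        "for retrieval from a vector database.",
        "1. Resolve every pronoun and implicit reference using the conversation history.",
        "2. Expand the question with terminology the documents are likely to use.",
        "3. Output a self-contained query that makes sense without the chat context.",
    ]
    if domain:
        lines.append(f"Document domain: {domain}")
    if history:
        lines.append("Conversation history:")
        lines.extend(f"{role}: {text}" for role, text in history)
    lines.append(f"Latest question: {question}")
    if n_variants > 1:
        # Optional multi-query mode: ask for several phrasings of the same intent.
        lines.append(f"Return {n_variants} alternative phrasings, one per line.")
    else:
        lines.append("Return only the rewritten query.")
    return "\n".join(lines)


prompt = build_rewrite_prompt(
    question="Can you show me how to do that in Python?",
    history=[("user", "How do I parse JSON from an API response?")],  # sample data
    domain="software development documentation",  # sample domain
    n_variants=3,
)
print(prompt)
```

The resulting string is what you would send to your LLM as the first step of the pipeline; the history and domain lines are simply omitted when those inputs are not provided.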

Where in the pipeline this sits

User turn → [Query Rewriter] → standalone query → Embedder → Vector Store → Chunks → Generator

The rewriter only touches the retrieval path. The generator still receives the original user question alongside the retrieved chunks so the answer addresses what the user actually asked.
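As a rough sketch of that flow, the function below wires the stages together using plain callables. Everything here is hypothetical: `answer_turn` and the four stage functions stand in for whatever LLM client, embedder, and vector store you actually use, and the stubs exist only to show which text reaches which stage.

```python
from typing import Callable, Sequence


def answer_turn(
    question: str,
    history: Sequence[str],
    rewrite: Callable[[str, Sequence[str]], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], list],
    generate: Callable[[str, list], str],
) -> str:
    # Retrieval path: only the rewritten query is embedded and searched.
    standalone = rewrite(question, history)
    chunks = search(embed(standalone))
    # Generation path: the model sees the ORIGINAL question plus the chunks.
    return generate(question, chunks)


# Stub stages (placeholders, not real services) that record what they receive.
calls = {}

def fake_rewrite(question, history):
    return "Python code example for parsing JSON API response with error handling"

def fake_embed(text):
    calls["embedded"] = text
    return [float(len(text))]

def fake_search(vector):
    return ["chunk about parsing JSON responses"]

def fake_generate(question, chunks):
    calls["asked"] = question
    return f"answer to {question!r} using {len(chunks)} chunk(s)"


original = "Can you show me how to do that in Python?"
result = answer_turn(
    original,
    ["How do I parse JSON from an API response?"],
    fake_rewrite, fake_embed, fake_search, fake_generate,
)
print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular provider and makes the split between the retrieval path and the generation path easy to test.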

Practical example

Original turn: “Can you show me how to do that in Python?”

Context in history: The user was asking about parsing JSON from an API response.

Rewritten query: “Python code example for parsing JSON API response with error handling”

The rewritten query embeds near code examples and tutorials in your corpus rather than near generic Python introductions.

Tips and notes

Always feed the rewriter the conversation history when the question contains references — that is the entire point, and without it “the second option” resolves to nothing. Name your domain precisely; “internal HR policy documents” produces a sharper rewrite than “documents.” Consider the multi-query variant for high-recall use cases: embedding three phrasings and merging the hits before re-ranking catches relevant chunks a single query misses. Keep the original user question for the generation step, though — the rewrite is for retrieval, while the answer should still address what the user actually asked in their own framing.
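The multi-query tip leaves open how to merge the hits. One common option, chosen here as an assumption rather than something the tool prescribes, is reciprocal rank fusion: each chunk scores 1 / (k + rank) in every result list it appears in, and the scores are summed. The function name `merge_hits`, the chunk IDs, and the constant k = 60 are illustrative.

```python
from collections import defaultdict


def merge_hits(ranked_lists, k=60):
    """Merge several ranked lists of chunk IDs with reciprocal rank fusion.

    ranked_lists: one list of chunk IDs per query phrasing, best hit first.
    k: smoothing constant; 60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for hits in ranked_lists:
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hits for three phrasings of the same question (made-up chunk IDs).
hits_per_phrasing = [
    ["a", "b", "c"],
    ["b", "d"],
    ["b", "a"],
]
merged = merge_hits(hits_per_phrasing)
print(merged)  # ['b', 'a', 'd', 'c']
```

Chunk "d" is returned by only one phrasing, which is the case a single query would have missed; the merged list can then go to a re-ranker before generation.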

The rewrite step costs one fast LLM call before retrieval. For most pipelines this adds roughly 200–500 ms, and in return it can meaningfully improve answer quality in multi-turn conversational systems.
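To check what the rewrite step costs in your own pipeline, you can time the call directly. The sketch below is only an illustration: `timed` is a hypothetical helper, and `fake_rewrite` is a placeholder that sleeps briefly instead of calling a real LLM.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_rewrite(question):
    # Placeholder for the real LLM call; the sleep simulates network latency.
    time.sleep(0.01)
    return question.upper()


rewritten, ms = timed(fake_rewrite, "what about the cheaper one?")
print(f"rewrite took {ms:.1f} ms")
```

Swapping `fake_rewrite` for your actual rewrite call gives a per-turn measurement you can compare against your latency budget.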