What does RAG stand for?

RAG stands for Retrieval-Augmented Generation. It is a pattern where a language model retrieves relevant text from an external knowledge source before generating its answer, instead of relying solely on what it memorised during training.

Why does RAG reduce hallucinations?

RAG grounds the model in real, retrieved passages rather than its parametric memory. When the model is instructed to answer only from the supplied context, it has actual source text to draw on and can cite it, which sharply reduces the rate of confidently made-up facts compared with answering from memory alone.

How is RAG different from fine-tuning?

Fine-tuning changes the model's weights to bake in new behaviour or knowledge, which is costly and slow to update. RAG leaves the model untouched and instead injects fresh, relevant text at query time. RAG is far easier to keep current, while fine-tuning is better for teaching style, format, or skills rather than facts.

What is chunking and why does it matter?

Chunking splits long documents into smaller passages before embedding them. Chunk size matters: too large and retrieval returns noisy, off-topic text; too small and you lose the surrounding context needed to answer. Good chunking — often a few hundred tokens with some overlap — is one of the biggest levers on RAG quality.

RAG Explained: Retrieval-Augmented Generation in Plain English

What RAG actually is

Retrieval-Augmented Generation (RAG) is a way of letting a language model answer using information it was never trained on. Instead of relying purely on the facts baked into its weights, the system first looks things up in an external knowledge source (your documents, a wiki, a product catalogue) and feeds the most relevant passages into the prompt. The model then generates its answer from that supplied context. The payoff is twofold: the model can use up-to-date, private, or domain-specific information, and because it answers from real retrieved text, it hallucinates far less and can cite where each claim came from.

The pipeline, stage by stage

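RAG splits cleanly into an offline indexing phase and an online query phase. Chunking and embedding happen offline, ahead of any question; retrieval, augmentation, and generation happen online, once per query.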

Chunking. Source documents are split into passages — typically a few hundred tokens, often with a little overlap so context is not severed mid-thought. Chunk size is a real quality lever: too big and retrieval drags in irrelevant text, too small and the answer loses surrounding meaning.
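The chunking step can be sketched in a few lines. This is a minimal illustration, not a production splitter: the function name chunk_words and its parameters are hypothetical, and whitespace-separated words stand in for tokens, which a real system would count with the embedding model's own tokenizer.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of words.

    Assumption: words approximate tokens. Each window shares its last
    `overlap` words with the start of the next window.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration with a 10-word "document".
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Raising chunk_size trades precision for context, and raising overlap reduces the chance that a sentence is cut off from the text it depends on, at the cost of storing more duplicated words.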

Embedding. Each chunk is passed through an embedding model that turns it into a vector: a list of numbers capturing its meaning. Similar meanings land near each other in this vector space. These vectors are computed and stored in a vector database ahead of time, and only need recomputing when the underlying documents change.
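To make "a chunk becomes a vector" concrete, here is a deliberately crude stand-in. The function toy_embed is hypothetical and is not a real embedding model: it hashes words into a fixed number of buckets, so it only captures shared vocabulary, whereas a trained embedding model also places paraphrases close together. It does show the shape of the data: text in, fixed-length unit vector out, stored alongside the chunk.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical vector size; real models use hundreds or thousands


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hash each word into one of `dim` buckets and L2-normalise the counts.

    Assumption: this is a teaching stand-in for an embedding model,
    not a semantic embedding.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Offline indexing: embed every chunk and keep (text, vector) pairs.
chunks = [
    "Refunds are issued within 14 days.",
    "Our office is closed on public holidays.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]
print(len(index), "chunks indexed, each with", len(index[0][1]), "dimensions")
```

In a real pipeline the list called index would be a vector database, and toy_embed would be replaced by a call to whichever embedding model the system uses.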

Retrieval. When a user asks a question, the same embedding model converts the question into a vector, and the vector store returns the chunks whose vectors are nearest to it, typically ranked by a measure such as cosine similarity. This is semantic search: it matches on meaning, not keywords.
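Nearest-vector lookup can be sketched as a brute-force scan. The three-dimensional vectors below are invented for illustration, and the helper names cosine and top_k are hypothetical; a vector database performs the same ranking, usually with an approximate index so it scales to millions of chunks.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors: the first axis loosely means "about refunds and returns".
index = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.1, 0.9, 0.1]),
    ("Returns require the original receipt.", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded refund question

for score, text in top_k(query_vec, index, k=2):
    print(f"{score:.3f}  {text}")
```

The value of k is the "number of chunks returned" setting: too low and the right passage may be missed, too high and the prompt fills with marginal text.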

Augmentation and generation. The retrieved passages are inserted into the prompt as context, the model is instructed to answer only from that context, and it generates a grounded response — ideally citing the passages it used.
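The augmentation step is mostly string assembly. The sketch below shows one plausible prompt layout; the function name build_prompt and the exact instruction wording are assumptions, and the right phrasing depends on the model being used. Numbering the passages gives the model something to cite, and the explicit fallback line tells it what to do when the context does not contain the answer.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt from retrieved passages (illustrative layout)."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered context passages below.\n"
        "Cite the passages you used by number, like [1].\n"
        "If the context does not contain the answer, reply: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do refunds take?",
    [
        "Refunds are issued within 14 days.",
        "Returns require the original receipt.",
    ],
)
print(prompt)
```

The resulting string is what gets sent to the language model; the generation call itself is omitted here because it depends on the provider.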

Why teams reach for RAG

RAG is popular because it addresses two major weaknesses of a plain LLM: stale knowledge and made-up facts. Updating a RAG system means re-indexing documents rather than expensive retraining, so a knowledge base can change daily. Grounding answers in retrieved text also makes them auditable: you can show the user exactly which source each statement came from. This is why RAG underpins so much production question-answering over private data, from internal support bots to legal and medical document assistants.

Where RAG goes wrong

RAG is not magic, and most failures trace back to retrieval, not generation. If the right passage is never retrieved, the model cannot use it — so embedding quality, chunk size, and the number of chunks returned all matter enormously. Other common pitfalls include retrieving contradictory passages, stuffing so much context that the model loses the thread, and failing to instruct the model to refuse when the answer is not in the context. The practical discipline is to evaluate retrieval separately from generation: first confirm the right chunks come back, then confirm the model uses them faithfully. Tune chunking and the retrieval count before you blame the model.
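Evaluating retrieval on its own can be as simple as a hit rate over a small labelled set. The sketch below assumes you have, for each test question, the IDs of the chunks that contain the answer; the function name hit_rate_at_k, the question keys, and the chunk IDs are all hypothetical.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    if not relevant:
        return 0.0
    hits = 0
    for question, gold_ids in relevant.items():
        top = retrieved.get(question, [])[:k]
        if gold_ids & set(top):
            hits += 1
    return hits / len(relevant)


# Hand-labelled ground truth: which chunk IDs answer each question.
relevant = {"q1": {"c3"}, "q2": {"c7"}, "q3": {"c1", "c2"}}
# What the retriever returned, best match first.
retrieved = {
    "q1": ["c3", "c9", "c4"],
    "q2": ["c5", "c6", "c7"],
    "q3": ["c8", "c2", "c0"],
}

for k in (1, 2, 3):
    print(f"hit rate @ {k}: {hit_rate_at_k(retrieved, relevant, k):.2f}")
```

If this number is low, no amount of prompt tuning will help, because the model never sees the right passage. If it is high and answers are still wrong, the problem lies in the generation step instead.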