What you are building
Retrieval-augmented generation (RAG) is how you make a language model answer from your own documents rather than from its training data alone. You split your text into chunks, embed each chunk into a vector, and at query time retrieve the chunks most similar to the question and hand them to the model as context. LlamaIndex packages this entire pipeline (loading, chunking, embedding, storing, and retrieving) into a handful of calls, using the OpenAI API by default for embeddings and answer generation. This tutorial takes you from a folder of files to a queryable, streaming service you can deploy.
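To make the chunk, embed, and retrieve idea concrete before any library is involved, here is a minimal pure-Python sketch. It is not how LlamaIndex works internally: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and every name in it (`chunk_text`, `embed`, `cosine`, `retrieve`, `build_prompt`, `DOCUMENT`) is hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical sample corpus, used only for this illustration.
DOCUMENT = (
    "Refunds are issued within 14 days of purchase. "
    "Shipping to Europe takes five business days. "
    "Support is available by email on weekdays."
)


def chunk_text(text: str) -> list[str]:
    """Toy chunker: one chunk per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the chat model."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    question = "How many days do refunds take?"
    best = retrieve(question, chunk_text(DOCUMENT), top_k=1)
    print(build_prompt(question, best))
```

A real pipeline replaces `embed` with a learned embedding model and the sorted list with a vector store, but the shape of the flow stays the same.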
How it works
The flow is four steps.
Load: SimpleDirectoryReader("data").load_data() reads every file in a folder into documents.
Index: VectorStoreIndex.from_documents(documents) chunks the text and calls the OpenAI embeddings API to turn each chunk into a vector, storing the vectors in an in-memory vector store by default, so no separate database is needed to start.
Query: query_engine = index.as_query_engine() creates the engine, and query_engine.query("your question") embeds the question, retrieves the most similar chunks, and sends them with the question to the chat model, which answers grounded in your data.
Stream and serve: pass streaming=True to as_query_engine() so tokens are returned as they are generated, then wrap the query engine in a FastAPI endpoint so a frontend can POST a question and receive the answer over HTTP.
Build the index once at startup, or persist it to disk, so you are not re-embedding on every restart. The estimator below helps you size the embedding and per-query cost before you commit a large corpus.
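The load, index, query, and stream steps described above, plus persistence, might be put together as in the following sketch. It assumes a recent llama-index release (the `llama_index.core` package layout), an `OPENAI_API_KEY` in the environment, and files in a `data` folder. The `storage` directory and the helper names `has_persisted_index` and `load_or_build_index` are hypothetical choices for this example, and the FastAPI wrapper is left out to keep it short.

```python
from pathlib import Path

# Hypothetical location for the saved index; any writable directory works.
PERSIST_DIR = Path("storage")


def has_persisted_index(persist_dir: Path) -> bool:
    """Return True if persist_dir exists and contains at least one entry."""
    return persist_dir.is_dir() and any(persist_dir.iterdir())


def load_or_build_index(data_dir: str = "data", persist_dir: Path = PERSIST_DIR):
    """Reload a saved index if one exists, otherwise embed and save a new one."""
    # Imported inside the function so the helper above can be used
    # even where llama-index is not installed.
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if has_persisted_index(persist_dir):
        # Reuse stored vectors: no embedding calls are made for the corpus.
        storage_context = StorageContext.from_defaults(persist_dir=str(persist_dir))
        return load_index_from_storage(storage_context)

    # First run: load files, chunk and embed them, then save to disk.
    documents = SimpleDirectoryReader(data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=str(persist_dir))
    return index


if __name__ == "__main__":
    index = load_or_build_index()
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query("your question")
    # With streaming=True the response exposes a generator of text pieces.
    for piece in response.response_gen:
        print(piece, end="", flush=True)
    print()
```

Checking for a saved index before building one means the embedding cost is paid on the first run only. If the source files change, delete the persist directory (or rebuild explicitly) so the index does not go stale.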
Tips for a RAG app that actually works
When answers are vague or wrong, the culprit is almost always retrieval, not generation: if the right chunks never reach the model, no model can save you. Inspect the retrieved nodes first (response.source_nodes on a query result). Tune chunk size and overlap so each chunk is a coherent idea; too large dilutes relevance, too small loses context. Raise top-k (similarity_top_k in as_query_engine()) to retrieve more chunks when answers feel incomplete, but watch the cost, since each retrieved chunk inflates the prompt and the chat completion dominates your bill. Keep the API key in an environment variable (OPENAI_API_KEY), never in code. Start with the in-memory index for learning, and move to a persistent vector store like Chroma, pgvector, or Pinecone only when you need scale or to stop re-indexing on restart. Above all, improve the source documents: clean, well-structured input is the cheapest accuracy gain you will ever get.
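The trade-offs around chunk size, overlap, and top-k are easier to reason about with a small experiment. The sketch below is plain Python, not LlamaIndex's splitter: it chunks by whitespace-separated words and uses word counts as a rough proxy for tokens, which is an assumption made only for illustration. The function names `chunk_words` and `prompt_words` are hypothetical.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size words.

    Consecutive windows share `overlap` words, so an idea that straddles
    a boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def prompt_words(question: str, chunks: list[str], top_k: int) -> int:
    """Approximate prompt size in words when top_k chunks are attached.

    Takes the first top_k chunks as a stand-in for the retrieved ones;
    real retrieval would pick them by similarity.
    """
    return len(question.split()) + sum(len(c.split()) for c in chunks[:top_k])


if __name__ == "__main__":
    text = " ".join(f"w{i}" for i in range(10))
    chunks = chunk_words(text, chunk_size=4, overlap=1)
    for k in (1, 2, 3):
        print(k, prompt_words("what is w5", chunks, top_k=k))
```

Prompt size grows roughly linearly with top-k times chunk size, which is why raising either one shows up directly in the per-query cost.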