How is LlamaIndex different from LangChain?

LlamaIndex is purpose-built for the indexing and retrieval side of RAG — connectors, node parsing, indexes, and query engines — and makes building a document Q&A system very concise. LangChain is a broader orchestration framework. Many teams use LlamaIndex for retrieval and LangChain for the surrounding application logic.

What is a node in LlamaIndex?

A node is a chunk of a document plus its metadata and relationships to neighbouring nodes. LlamaIndex splits your Documents into Nodes with a node parser, embeds each node, and retrieves nodes at query time. Tuning the node parser (chunk size and overlap) is the main retrieval-quality lever.
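To make chunk size and overlap concrete, here is a minimal pure-Python sketch of the idea. It is not LlamaIndex's node parser: ToyNode and split_into_nodes are hypothetical names, and the split is word-based rather than token- or sentence-aware. It only shows how overlap makes neighbouring chunks share text and how each chunk can carry metadata and links to its neighbours.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: text, metadata, neighbour links."""
    text: str
    metadata: dict = field(default_factory=dict)
    prev_id: Optional[int] = None
    next_id: Optional[int] = None


def split_into_nodes(text: str, chunk_size: int, overlap: int, source: str) -> list[ToyNode]:
    """Split text into word chunks that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    nodes = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        nodes.append(ToyNode(" ".join(chunk), {"source": source, "start_word": start}))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    # Link each node to its neighbours by list position.
    for i, node in enumerate(nodes):
        node.prev_id = i - 1 if i > 0 else None
        node.next_id = i + 1 if i < len(nodes) - 1 else None
    return nodes


nodes = split_into_nodes("a b c d e f g h i j", chunk_size=4, overlap=1, source="demo.txt")
for n in nodes:
    print(n.text, n.prev_id, n.next_id)
```

With an overlap of 1, the last word of each chunk reappears as the first word of the next, so a sentence that straddles a boundary is less likely to be cut off from its context.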

What is sub-question decomposition?

The SubQuestionQueryEngine breaks a complex question into smaller sub-questions, answers each against the most relevant data source, and combines the results. It shines when a question spans multiple documents or requires comparing several facts that no single chunk contains.
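The control flow can be sketched in plain Python. This is an illustration of the decompose, route, and combine pattern, not the SubQuestionQueryEngine implementation: in LlamaIndex an LLM generates the sub-questions and each tool wraps a real query engine, whereas here answer_with_sub_questions, decompose, and combine are hypothetical stand-ins that return canned strings.

```python
from typing import Callable

# (tool_name, question) pairs; hypothetical shape chosen for this sketch.
SubQuestion = tuple[str, str]


def answer_with_sub_questions(
    question: str,
    decompose: Callable[[str], list[SubQuestion]],
    tools: dict[str, Callable[[str], str]],
    combine: Callable[[str, list[str]], str],
) -> str:
    """Split a question, answer each part with its named tool, then merge."""
    partial_answers = []
    for tool_name, sub_question in decompose(question):
        answer = tools[tool_name](sub_question)  # route to one data source
        partial_answers.append(f"{sub_question} -> {answer}")
    return combine(question, partial_answers)


# Canned stand-ins for what would be LLM and query engine calls.
tools = {
    "doc_a": lambda q: "canned answer from document A",
    "doc_b": lambda q: "canned answer from document B",
}


def decompose(question: str) -> list[SubQuestion]:
    return [
        ("doc_a", "What does document A say?"),
        ("doc_b", "What does document B say?"),
    ]


def combine(question: str, partials: list[str]) -> str:
    return " | ".join(partials)


result = answer_with_sub_questions("How do documents A and B differ?", decompose, tools, combine)
print(result)
```

The point of the pattern is that each sub-question is retrieved against one source only, so the final synthesis step sees one focused answer per source instead of a mixed bag of chunks.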

Do I have to use a separate vector database?

No. By default LlamaIndex keeps the index in memory, which is fine for development and small corpora. For production you swap in a vector store integration such as Qdrant, Pinecone, Chroma, or pgvector by passing a StorageContext, with no change to your query code.

How do I persist an index so I don't re-embed every run?

Call index.storage_context.persist() to write the index to disk. On later runs, build a StorageContext from the same directory with StorageContext.from_defaults and pass it to load_index_from_storage. This avoids re-embedding the whole corpus on every restart, which saves both time and embedding cost.
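A load-or-build helper might look like the sketch below. It assumes a recent llama-index release where the core classes live under llama_index.core, and an embedding model that is already configured (the default needs an OpenAI API key). The function names has_persisted_index and load_or_build_index and the directory arguments are hypothetical.

```python
from pathlib import Path


def has_persisted_index(persist_dir: str) -> bool:
    """True if persist_dir exists and contains at least one entry."""
    path = Path(persist_dir)
    return path.is_dir() and any(path.iterdir())


def load_or_build_index(data_dir: str, persist_dir: str):
    """Load a persisted index if one exists, otherwise build and persist it."""
    # Imported lazily so has_persisted_index works without llama-index installed.
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if has_persisted_index(persist_dir):
        # Reuse stored vectors: no embedding calls are made here.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    documents = SimpleDirectoryReader(data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)  # embeds every node
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

You would then call load_or_build_index("data", "storage").as_query_engine() as usual. The check is deliberately crude: it does not notice when files in the data directory have changed, so delete the persist directory after updating the corpus.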

How to Use LlamaIndex for RAG Applications

What LlamaIndex does

LlamaIndex is a framework focused on the retrieval half of RAG: getting your data into a form an LLM can search, and getting the right pieces out at query time. It handles the unglamorous parts — reading files, splitting them into chunks, embedding them, storing the vectors, and retrieving the best matches — behind a small, consistent API. Where a hand-rolled RAG pipeline is dozens of lines, a basic LlamaIndex query engine is about five.

How it works

The flow has four stages, each with a named LlamaIndex concept. Data connectors (like SimpleDirectoryReader) load files into Document objects. A node parser splits those documents into Node chunks with metadata and relationships to their neighbours. A VectorStoreIndex embeds every node and stores the vectors, in memory by default or in an external store like Qdrant or pgvector via a StorageContext. Finally, a query engine (index.as_query_engine()) embeds the user’s question, retrieves the most similar nodes, and synthesises an answer, returning the retrieved source nodes alongside it so the answer can be traced back to its sources.
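The retrieval step in the last stage can be illustrated without any library. The sketch below is a toy, not what LlamaIndex runs: embed uses word counts where a real pipeline calls an embedding model, and the function names are hypothetical. What carries over is the shape of the operation: embed the question, score every node by cosine similarity, keep the top k.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, nodes: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k (score, node_text) pairs, most similar first."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(node)), node) for node in nodes]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


nodes = [
    "the index stores vectors in memory",
    "a node parser splits documents into chunks",
    "the query engine retrieves similar nodes",
]
best = retrieve("how are documents split into chunks", nodes, top_k=1)
print(best[0][1])
```

A real embedding model would also match "split" with "splits" and paraphrases with no shared words, which is why retrieval quality depends on both the embedding model and how the nodes were chunked.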

For harder questions that span several sources, the SubQuestionQueryEngine decomposes the question into sub-questions, routes each to the right data source, and merges the answers. This tends to be more reliable than a single retrieval, where chunks from different sources compete for the same few top-k slots.

Tips for production use

Persist your index with storage_context.persist() so you embed the corpus once rather than on every run; this is usually the biggest time and cost saving. Tune the node parser’s chunk size and overlap against your real questions before reaching for a bigger model. Add metadata (source, page, section) to nodes so answers can cite where they came from. Use the SubQuestionQueryEngine when a question needs facts from multiple documents. Finally, swap the in-memory store for a real vector database before you scale past a few thousand nodes: the query code stays the same and only the storage context changes.
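As one way to act on the metadata tip, the sketch below turns a response's source nodes into citation strings. It assumes each item exposes .node.metadata and .score, as the items in response.source_nodes do, and that the metadata uses the keys file_name and page_label. Those key names are an assumption about what your reader or your own code attached, and format_citations is a hypothetical helper. The demo uses stand-in objects so it runs without llama-index.

```python
from types import SimpleNamespace


def format_citations(source_nodes) -> list[str]:
    """Build 'file, page N (score S)' strings from retrieved source nodes."""
    citations = []
    for item in source_nodes:
        meta = item.node.metadata
        label = meta.get("file_name", "unknown source")  # assumed metadata key
        if "page_label" in meta:  # assumed metadata key
            label += f", page {meta['page_label']}"
        score = f"{item.score:.2f}" if item.score is not None else "n/a"
        citations.append(f"{label} (score {score})")
    return citations


# Stand-ins shaped like the items in response.source_nodes.
fake_source_nodes = [
    SimpleNamespace(
        node=SimpleNamespace(metadata={"file_name": "guide.pdf", "page_label": "3"}),
        score=0.8123,
    ),
    SimpleNamespace(node=SimpleNamespace(metadata={}), score=None),
]
citations = format_citations(fake_source_nodes)
print(citations)
```

In a real application you would pass response.source_nodes from a query engine call and append the resulting strings to the answer shown to the user.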