What is a serverless index and should I use it?

A Pinecone serverless index decouples storage from compute and scales automatically, billing you for storage plus read and write units rather than a fixed pod size. For most RAG apps it is the simplest and most cost-effective choice — you create it once and never tune pod counts. Pod-based indexes still exist for very specialised high-throughput needs.
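As a rough illustration of "create it once", the sketch below wraps index creation in an idempotent helper. It assumes the current Pinecone Python client, where `list_indexes().names()` and `create_index(...)` are available; the helper name `ensure_serverless_index`, the index name, the dimension, and the cloud and region in the usage comment are hypothetical choices, not values from this guide.

```python
def ensure_serverless_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index only if it is missing. Returns True when it was created.

    `pc` is expected to behave like a pinecone.Pinecone client and `spec`
    like a pinecone.ServerlessSpec; both are passed in by the caller.
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Typical wiring (needs the `pinecone` package and an API key):
#   from pinecone import Pinecone, ServerlessSpec
#   pc = Pinecone(api_key="...")
#   spec = ServerlessSpec(cloud="aws", region="us-east-1")  # assumed cloud/region
#   ensure_serverless_index(pc, "rag-chunks", 1536, spec)   # assumed name/dimension
```

The dimension has to match whatever your embedding model outputs, so treat 1536 as a placeholder.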

How should I batch upserts?

Upsert in batches of around 100 vectors per request rather than one at a time, and keep each request under Pinecone's size limit. Batching cuts round trips dramatically when ingesting thousands of chunks. Use stable, deterministic ids so re-running ingestion updates rather than duplicates vectors.
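A minimal sketch of the batching described above. It assumes `index` is a Pinecone index handle whose `upsert` accepts a `vectors` list of `{"id", "values", "metadata"}` dicts; the helper names `batched` and `upsert_in_batches` are illustrative, and the code does not check the request byte size, which you would still need to keep under Pinecone's limit.

```python
from itertools import islice


def batched(records, size=100):
    """Yield successive lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def upsert_in_batches(index, records, size=100):
    """Send records to the index one batch per request; return how many were sent."""
    total = 0
    for batch in batched(records, size):
        index.upsert(vectors=batch)
        total += len(batch)
    return total
```

Because `batched` consumes an iterator lazily, the records can come from a generator that embeds chunks on the fly rather than from a fully materialised list.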

Do I store the chunk text in Pinecone?

Yes. Put the chunk text in the vector's metadata so that a query which asks for metadata in the response (include_metadata in the Python client) returns both the match and the text to feed the model, avoiding a second lookup. Keep metadata lean (text, source, page, tenant) because Pinecone caps metadata size per vector and large blobs raise cost.
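One way to keep metadata lean is to build every record through a single helper that allows only the four fields named above and rejects oversized payloads. This is a sketch: `build_record` is a hypothetical helper, the 40 KB cap is an assumption to verify against Pinecone's current limits, and measuring the JSON-encoded size is only an approximation of how the service counts metadata bytes.

```python
import json

# Assumed per-record metadata cap; check Pinecone's published limits.
MAX_METADATA_BYTES = 40 * 1024


def build_record(vector_id, values, text, source, page, tenant):
    """Return an upsert-ready dict carrying only the lean metadata fields."""
    metadata = {"text": text, "source": source, "page": page, "tenant": tenant}
    # Approximate the stored size by the UTF-8 length of the JSON encoding.
    size = len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata is {size} bytes, over the assumed cap")
    return {"id": vector_id, "values": list(values), "metadata": metadata}
```

If a chunk trips the check, shortening the chunk is usually a better fix than moving the text elsewhere, since smaller chunks also tend to retrieve more precisely.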

How does metadata filtering work?

You attach a metadata filter to the query, such as restricting results to a tenant id or a document set, and Pinecone applies it during the search. This is essential for multi-tenant safety so one customer's query can never return another's documents. Index only the metadata fields you actually filter on.
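The sketch below shows one way to make the tenant filter impossible to forget: route every query through a wrapper that refuses to run without a tenant. It assumes the Pinecone Python client's `index.query(vector=..., top_k=..., filter=..., include_metadata=...)` call and its `$eq`, `$in`, and `$and` filter operators; the wrapper name `query_for_tenant` and the metadata field name `tenant` are illustrative.

```python
def query_for_tenant(index, query_vector, tenant, top_k=5, extra_filter=None):
    """Query the index with a mandatory tenant filter, optionally narrowed further."""
    if not tenant:
        raise ValueError("tenant is required on every query")
    metadata_filter = {"tenant": {"$eq": tenant}}
    if extra_filter:
        # Combine the tenant boundary with any caller-supplied restriction.
        metadata_filter = {"$and": [metadata_filter, extra_filter]}
    return index.query(
        vector=query_vector,
        top_k=top_k,
        filter=metadata_filter,
        include_metadata=True,  # needed to get the stored chunk text back
    )
```

A document-set restriction would then be passed as, for example, `extra_filter={"source": {"$in": ["handbook.pdf"]}}`, where the file name is a made-up value.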

When is Pinecone worth it over pgvector?

Pinecone shines when you want zero database operations, automatic scaling into the tens of millions of vectors and beyond, and consistent low latency without tuning. If you already run PostgreSQL and have a few million vectors or fewer, pgvector avoids a new dependency. Choose Pinecone to move fast and stay hands-off on infrastructure.

How to Build RAG with Pinecone

What you are building

This tutorial builds a retrieval-augmented generation (RAG) pipeline on Pinecone, a fully managed vector database. Pinecone owns the hard parts — sharding, replication, scaling, low-latency search — so you focus on embedding your content and writing prompts. The result: a system where a user asks a question, the relevant passages are retrieved by similarity, and an LLM answers grounded in those passages with citations. It is the fastest path to production RAG when you do not want to run any database yourself.

How the pipeline works

You start by creating a serverless index whose dimension matches your embedding model's output and whose metric is cosine, a good fit for normalised embeddings. During ingestion you chunk each document, embed every chunk, and upsert vectors in batches of ~100, attaching metadata: the chunk text, source, page, and tenant id. At query time you embed the user’s question, call query with a top_k value, an optional metadata filter, and metadata included in the response, and Pinecone returns the closest chunks with their stored text. You assemble those into a context block and prompt the model to answer using only that context. Because the chunk text lives in metadata, a single query gives you everything the prompt needs.
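The last step of the pipeline, turning matches into a grounded prompt, can be sketched as follows. It assumes each match exposes a `metadata` mapping with the `text`, `source`, and `page` fields described above (Pinecone's query results allow this dict-style access); the function name `build_prompt` and the exact prompt wording are illustrative, not a prescribed template.

```python
def build_prompt(question, matches):
    """Assemble numbered context passages and a grounded-answer instruction."""
    blocks = []
    for number, match in enumerate(matches, start=1):
        meta = match["metadata"]
        # The bracketed number gives the model something to cite.
        blocks.append(f"[{number}] ({meta['source']}, p. {meta['page']})\n{meta['text']}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "Cite passages by their bracketed number. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The returned string is what you would send to whichever LLM you use; numbering the passages lets you map the model's citations back to source and page.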

Tips and the planner below

Use deterministic vector ids (a hash of source plus chunk index) so re-ingesting updates rather than duplicates. Keep metadata lean: Pinecone caps metadata size, and bloated payloads raise storage cost. Always apply a tenant filter on multi-tenant queries, because it is your isolation boundary. Also instruct the model to say so when the retrieved context does not contain the answer, which is one of the most effective guards against hallucination. The planner below estimates your index size, the read and write units a workload consumes, and the rough monthly embedding and storage cost from your own document and traffic profile.
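A deterministic id of the kind described in the first tip might look like the sketch below. The function name `chunk_id`, the use of SHA-256, the `#` separator, and the truncation to 32 hex characters are all illustrative choices; any stable hash of source plus chunk index would serve.

```python
import hashlib


def chunk_id(source, chunk_index):
    """Derive a stable vector id from the document source and the chunk position."""
    key = f"{source}#{chunk_index}".encode("utf-8")
    # Truncating keeps ids short; 32 hex characters is still 128 bits.
    return hashlib.sha256(key).hexdigest()[:32]
```

Re-running ingestion on the same document produces the same ids, so each upsert overwrites the existing vector. Note that if a document shrinks between runs, chunks beyond the new length keep their old ids and would need to be deleted separately.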