Do I need to fine-tune a model to chat with a PDF?

No. Fine-tuning teaches a model a skill or style; answering questions about a specific document is a retrieval problem. You store the PDF as searchable chunks and feed the relevant ones into the prompt at query time, which is faster and far cheaper than fine-tuning.

How do I handle scanned PDFs with no text layer?

A scanned PDF is just images, so a plain text extractor returns nothing. Run the pages through OCR (such as Tesseract or a vision model) to produce text first, then feed that text into the same chunk-embed-retrieve pipeline.
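The fallback described above can be sketched as a small routing function. The two callables, `extract_text` and `run_ocr`, are hypothetical stand-ins for whatever text extractor and OCR engine you use; they are not real library APIs. The `min_chars` threshold is also an assumption you would tune.

```python
from typing import Callable, List


def page_texts_with_ocr_fallback(
    pages: List[object],
    extract_text: Callable[[object], str],  # hypothetical: reads the text layer
    run_ocr: Callable[[object], str],       # hypothetical: OCRs the page image
    min_chars: int = 20,                    # assumed cut-off for "no usable text"
) -> List[str]:
    """Return one string per page, calling OCR only when extraction is (near) empty."""
    texts = []
    for page in pages:
        text = (extract_text(page) or "").strip()
        if len(text) < min_chars:
            # Little or no text layer: treat the page as a scan.
            text = (run_ocr(page) or "").strip()
        texts.append(text)
    return texts
```

Checking per page rather than per file helps with mixed documents, where only some pages are scans.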

What chunk size works best for documents?

Start with roughly 300-500 tokens per chunk and 10-20 percent overlap. Chunking on paragraph or heading boundaries beats fixed character counts because it keeps related sentences together, which improves retrieval precision.
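A minimal sketch of paragraph-aware chunking with overlap, under two stated assumptions: paragraphs are separated by blank lines, and whitespace-separated words stand in for tokens (a real tokenizer will count differently). The function name `chunk_paragraphs` and its defaults are illustrative.

```python
import re
from typing import List


def chunk_paragraphs(text: str, max_tokens: int = 400, overlap_tokens: int = 60) -> List[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Each new chunk starts with the last overlap_tokens words of the previous
    one. A single paragraph longer than max_tokens is kept whole, not split.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []  # words in the chunk being built (overlap included)
    fresh = 0                # words in `current` that are not carried-over overlap

    for para in paragraphs:
        words = para.split()
        if fresh and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_tokens:] if overlap_tokens else []
            fresh = 0
        current.extend(words)
        fresh += len(words)

    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

With `max_tokens=400` and `overlap_tokens=60` the overlap is 15 percent, inside the 10-20 percent range suggested above.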

How many chunks should I send to the model per question?

Usually the top 3-6 most similar chunks. Sending more wastes context budget and adds noise that can confuse the model; sending too few risks missing the passage that holds the answer. Tune k against a handful of real questions.
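Top-k selection itself is only a sort by similarity. The sketch below assumes the chunk embeddings are plain lists of floats already held in memory and uses cosine similarity; a vector database would do the same ranking for you. The names `cosine` and `top_k` are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Returning the scores alongside the indices makes it easier to see, while tuning k, whether the lower-ranked chunks are still relevant or already noise.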

How much does running a PDF bot cost?

Embedding a typical 50-page document costs a fraction of a cent and only happens once. Each question costs the embedding of the query plus the answer generation — usually well under a cent with a small model. The planner below estimates it for your own document.
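The arithmetic behind those figures can be sketched as follows. Every number here is an assumption for illustration: the tokens per page, chunk size, answer length, and especially the per-million-token prices are placeholders, so substitute your provider's current rates. Chunk overlap is ignored for simplicity.

```python
def ingestion_cost(
    pages: int,
    tokens_per_page: int = 500,             # assumed average, varies by layout
    embed_price_per_million: float = 0.02,  # placeholder USD price, not a quote
) -> float:
    """One-off cost in USD of embedding the whole document."""
    return pages * tokens_per_page * embed_price_per_million / 1_000_000


def question_cost(
    question_tokens: int = 20,
    k: int = 5,                             # chunks sent per question
    chunk_tokens: int = 400,
    answer_tokens: int = 300,
    embed_price_per_million: float = 0.02,  # placeholder USD prices below
    input_price_per_million: float = 0.15,
    output_price_per_million: float = 0.60,
) -> float:
    """Per-question cost in USD: query embedding + prompt input + generated answer."""
    embed = question_tokens * embed_price_per_million
    prompt = (question_tokens + k * chunk_tokens) * input_price_per_million
    answer = answer_tokens * output_price_per_million
    return (embed + prompt + answer) / 1_000_000


print(f"ingest 50 pages: ${ingestion_cost(50):.6f}")
print(f"one question:    ${question_cost():.6f}")
```

Under these placeholder prices both figures come out well below one cent, and the retrieved context is the largest share of the per-question token count, which is one more reason to keep k small.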

How to Build a PDF Q&A Bot

What you are building

A PDF Q&A bot lets a user upload a document and ask plain-English questions about it — “What is the cancellation policy?”, “Summarise section 4” — and get answers grounded in the actual text. Under the hood this is a Retrieval-Augmented Generation (RAG) pipeline narrowed to a single document. You never fine-tune a model; instead you make the PDF searchable and hand the model the relevant passages each time someone asks.

How the pipeline works

There are two phases. Ingestion runs once per document: extract the text, split it into overlapping chunks, embed each chunk into a vector, and store the vectors with their text and page numbers in a vector database. Querying runs per question: embed the question, run a similarity search to pull the closest chunks, assemble them into a context block, and prompt the model to answer using only that context — citing page numbers so the user can verify.
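The two phases can be sketched end to end in plain Python. This is a toy under loud assumptions: `toy_embed` is a bag-of-words counter standing in for a real embedding model, an in-memory list stands in for the vector database, each paragraph is one chunk, and the final model call is left out. All names are illustrative.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


def toy_embed(text: str) -> Counter:
    """Placeholder embedding: sparse word counts. Swap in a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class Record:
    page: int        # source page, kept for citations
    text: str        # chunk text
    vector: Counter  # chunk embedding


def ingest(pages: List[str]) -> List[Record]:
    """Ingestion phase (once per document): chunk, embed, store."""
    store = []
    for page_no, page_text in enumerate(pages, start=1):
        for para in page_text.split("\n\n"):
            para = para.strip()
            if para:
                store.append(Record(page_no, para, toy_embed(para)))
    return store


def retrieve(store: List[Record], question: str, k: int = 3) -> List[Record]:
    """Query phase, step 1: embed the question and rank chunks by similarity."""
    q = toy_embed(question)
    return sorted(store, key=lambda r: cosine(q, r.vector), reverse=True)[:k]


def build_prompt(question: str, hits: List[Record]) -> str:
    """Query phase, step 2: assemble the context block with page labels."""
    context = "\n\n".join(f"[page {r.page}] {r.text}" for r in hits)
    return (
        "Answer using only the context below and cite page numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


pages = [
    "Cancellation policy: you may cancel within 30 days for a full refund.",
    "Shipping takes five business days.\n\nReturns require a receipt.",
]
store = ingest(pages)
question = "What is the cancellation policy?"
print(build_prompt(question, retrieve(store, question, k=1)))
```

The string returned by `build_prompt` is what you would send to the model; storing the page number on every record is what makes the citation possible.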

The biggest quality lever is chunking. Split on natural boundaries (paragraphs, headings) rather than blind character counts, keep a small overlap so a sentence that straddles two chunks survives in at least one, and store the source page so you can show citations. The planner below estimates how many chunks your document produces and roughly what ingestion and each question will cost.

Tips and gotchas

Always instruct the model to reply “I couldn’t find that in the document” when the retrieved context lacks the answer; this is the single biggest guard against confident hallucination. For scanned PDFs, OCR the pages first or the extractor returns empty text. Persist the chunk embeddings so the document is embedded once at ingestion rather than again on every question or restart. And evaluate retrieval separately from generation: if the right chunk never reaches the top-k, no amount of prompt tweaking will fix the answer.
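Two of these tips translate directly into code: the refusal instruction and a retrieval-only check. The sketch below assumes you have a small hand-labelled set mapping each test question to the id of the chunk that answers it, and a `retrieve(question, k)` callable returning chunk ids; both are hypothetical, as are the function names.

```python
from typing import Callable, Dict, List

REFUSAL = "I couldn't find that in the document"


def grounded_prompt(question: str, context: str) -> str:
    """Prompt that tells the model to refuse when the context lacks the answer."""
    return (
        "Answer the question using only the context.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def hit_rate_at_k(
    labelled: Dict[str, str],                    # question -> expected chunk id
    retrieve: Callable[[str, int], List[str]],   # hypothetical retriever
    k: int = 4,
) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not labelled:
        return 0.0
    hits = sum(1 for q, expected in labelled.items() if expected in retrieve(q, k))
    return hits / len(labelled)
```

A low hit rate points at chunking or embeddings rather than the prompt, so it is worth measuring before touching the generation step.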