Why send the schema instead of the whole CSV?

The model only needs column names, types, and a few sample rows to write correct code — it does not need every row. Sending the schema keeps prompts small and cheap, avoids hitting context limits on large files, and keeps the bulk of potentially sensitive data out of the LLM request entirely.
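
As a rough illustration of what "the schema" can look like, here is a minimal sketch that reads only the first rows of a CSV and returns column names, guessed types, and a few sample rows. It uses the standard library so it runs anywhere; the names extract_schema and infer_type, the three-way int/float/str guess, and the 1000-row scan limit are assumptions made for this example. With pandas you would more likely read df.dtypes and df.head().

```python
import csv
import io


def infer_type(values):
    """Guess a column type from sample strings: 'int', 'float', or 'str'."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "str"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def extract_schema(csv_file, sample_rows=3, scan_rows=1000):
    """Return column names, inferred types, and a few sample rows.

    Only the first `scan_rows` rows are read, so the cost does not
    grow with the size of the file.
    """
    reader = csv.DictReader(csv_file)
    rows = []
    for i, row in enumerate(reader):
        if i >= scan_rows:
            break
        rows.append(row)
    columns = reader.fieldnames or []
    return {
        "columns": [
            # Missing cells come back as None; treat them as empty.
            {"name": c, "type": infer_type([(r.get(c) or "") for r in rows])}
            for c in columns
        ],
        "sample": rows[:sample_rows],
    }


if __name__ == "__main__":
    demo = io.StringIO("region,revenue,units\nNorth,1200.5,10\nSouth,980.0,7\n")
    print(extract_schema(demo, sample_rows=1))
```

Because the types are guessed from a prefix of the file, a column that turns non-numeric later on would be mislabelled; treat the result as a hint for the model, not a guarantee.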

Is it safe to execute LLM-generated code?

Only inside a locked-down sandbox. Run the code in an isolated process or container with no network access, CPU and memory caps, a timeout, and an allowlist of permitted libraries. Never execute model-written code directly in your main application process or anywhere it can reach secrets.
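
The sketch below shows the process-level part of that advice: a child interpreter with a wall-clock timeout, CPU and memory limits, and an empty environment so no secrets are inherited. It assumes a Linux host; run_untrusted and its default limits are illustrative names and values, not a recommendation. It does not block network access or restrict the filesystem, so in practice you would still wrap it in a container or a similar isolation layer.

```python
import subprocess
import sys

try:
    import resource  # Unix only; absent on Windows
except ImportError:
    resource = None


def _limit_setter(cpu_seconds, memory_bytes):
    """Build a function that applies resource limits inside the child."""
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    return apply_limits


def run_untrusted(code, timeout=5, cpu_seconds=5,
                  memory_bytes=512 * 1024 * 1024):
    """Run `code` in a separate Python process and capture its output."""
    kwargs = {}
    if resource is not None:
        kwargs["preexec_fn"] = _limit_setter(cpu_seconds, memory_bytes)
    try:
        proc = subprocess.run(
            # -I: isolated mode, ignores PYTHON* variables and user site dirs
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # wall-clock limit; the child is killed on expiry
            env={},           # do not pass API keys or other secrets along
            **kwargs,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "error": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "error": proc.stderr,
    }


if __name__ == "__main__":
    print(run_untrusted("print(2 + 2)"))
    print(run_untrusted("while True: pass", timeout=1))
```

Returning a plain dictionary keeps the caller simple: it can show stdout to the user on success and feed the error text back to the model on failure.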

Should the model write pandas or SQL?

Either works. Pandas is natural when the data is already a dataframe and you want charts; SQL suits data that lives in a database and benefits from the engine's query planner. The pattern is identical — give the schema, get code, execute it safely, explain the result.

How do I handle very large files?

Sample for the schema and exploration, but execute the generated code against the full dataset so answers are exact. For files too large for memory, load them into a database or a columnar format and have the model target that, rather than holding everything in a single dataframe.
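
One way to get a large file into a database without holding it all in memory is to stream it in batches. The following is a minimal sketch using SQLite from the standard library; load_csv_into_sqlite, the table name data, and the batch size are assumptions for illustration, and it expects every row to have the same number of fields as the header. Columns are declared NUMERIC so that SQLite stores numeric-looking text as numbers and leaves other text as text.

```python
import csv
import io
import itertools
import sqlite3


def _quote(identifier):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def load_csv_into_sqlite(csv_file, conn, table="data", batch_size=10_000):
    """Stream a CSV into a new SQLite table; return the number of rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f"{_quote(name)} NUMERIC" for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {_quote(table)} ({columns})")
    insert = f"INSERT INTO {_quote(table)} VALUES ({placeholders})"

    total = 0
    while True:
        # Pull at most `batch_size` rows at a time from the file.
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    demo = io.StringIO("region,revenue\nNorth,10\nSouth,5\nNorth,2.5\n")
    print(load_csv_into_sqlite(demo, conn, batch_size=2))
    print(conn.execute(
        "SELECT region, SUM(revenue) FROM data GROUP BY region ORDER BY region"
    ).fetchall())
```

With the data in a table, the model is asked for SQL against that table instead of pandas code, and the query runs over every row rather than a sample.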

How do I make results trustworthy?

Show the generated code alongside the answer so users can audit it, render the actual computed numbers rather than letting the model guess, and have the model explain results strictly from the execution output. Never let the model invent figures that were not produced by the code.
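
A cheap guard against invented figures is to compare the numbers in the model's explanation with the numbers in the execution output and flag any that do not match. This is a heuristic sketch, not a complete check: unsupported_figures is a hypothetical helper, it compares raw values only, and it will also flag legitimate derived or rounded numbers, so treat a hit as a reason to review rather than proof of an error.

```python
import re

# A number not glued to a preceding word character, so "Q3" is ignored.
_NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?")
# A comma used as a thousands separator, as in "1,204".
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}\b)")


def _numbers(text):
    """Return the set of numeric values that appear in `text`."""
    cleaned = _THOUSANDS.sub("", text)
    return {float(match) for match in _NUMBER.findall(cleaned)}


def unsupported_figures(explanation, execution_output):
    """List numbers in the explanation that the code never produced."""
    return sorted(_numbers(explanation) - _numbers(execution_output))


if __name__ == "__main__":
    output = "North 1204.50\nSouth 980"
    print(unsupported_figures("North led with 1,204.5; South had 980.", output))
    print(unsupported_figures("North grew 15% to 1204.5.", output))
```

If the list is non-empty, the backend can ask the model to rewrite the explanation using only values from the output, or fall back to showing the raw result.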

How to Build an AI Data Analyst (Chat with Your CSV)

Building an AI data analyst

A chat-with-your-data tool lets a non-technical user upload a spreadsheet, ask a question in plain English (“which region grew fastest last quarter?”), and get back a real answer computed from the data. The key insight is that the LLM does not analyse the numbers itself; it writes code that a sandbox executes against the real dataframe. The model handles language and logic; the runtime handles arithmetic. This guide covers the pipeline, including how a question plus a schema becomes the analysis prompt your backend would send.

How the pipeline works

There are four stages. Parse reads the upload into a dataframe and extracts a schema — column names, inferred types, and a few sample rows. Plan sends the user’s question and that schema (not the raw data) to the model, which returns pandas or SQL code. Execute runs the generated code in an isolated sandbox with a timeout, memory cap, and no network access. Explain feeds the real execution output back to the model so it can summarise the result in plain English and, if asked, render a chart.
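
The stages after parsing can be wired together in a few lines. In this sketch the model call and the sandbox are passed in as functions, so nothing here depends on a particular LLM provider or runtime; answer_question, ask_model, run_sandboxed, the prompt wording, and the result dictionary shape are all assumptions made for illustration.

```python
def answer_question(question, schema, ask_model, run_sandboxed):
    """Plan, execute, and explain, given a schema from the parse stage.

    ask_model(prompt) -> str            hypothetical LLM call
    run_sandboxed(code) -> dict         hypothetical sandbox returning
                                        {"ok": bool, "stdout": str, "error": str}
    """
    # Plan: the model sees the question and the schema, never the full data.
    plan_prompt = (
        f"Question: {question}\n"
        f"Schema: {schema}\n"
        "Return only code that prints the answer."
    )
    code = ask_model(plan_prompt)

    # Execute: the generated code runs in isolation against the real data.
    result = run_sandboxed(code)
    if not result["ok"]:
        return {
            "code": code,
            "output": "",
            "answer": "The analysis failed to run: " + result["error"],
        }

    # Explain: the model summarises the real output, not its own guess.
    explain_prompt = (
        f"Question: {question}\n"
        f"Execution output:\n{result['stdout']}\n"
        "Explain the result in plain English using only these numbers."
    )
    answer = ask_model(explain_prompt)
    return {"code": code, "output": result["stdout"], "answer": answer}


if __name__ == "__main__":
    replies = iter(["print(42)", "There are 42 rows."])
    print(answer_question(
        "How many rows are there?",
        {"columns": []},
        ask_model=lambda prompt: next(replies),
        run_sandboxed=lambda code: {"ok": True, "stdout": "42\n", "error": ""},
    ))
```

Returning the code and raw output alongside the answer lets the front end display all three, which is what makes the result auditable.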

Keeping the rows out of the prompt is what makes this scale: the model needs only the shape of the data to write correct code, so a file with millions of rows still produces a small, cheap prompt whose size depends on the number of columns, not the number of rows.
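
To make the size argument concrete, here is a sketch of a prompt builder. It assumes the schema is a dictionary with a columns list (each entry holding a name and a type) and a sample list of row dictionaries; that shape, the function name build_analysis_prompt, and the instruction wording are all illustrative choices, not a fixed format.

```python
import json


def build_analysis_prompt(question, schema, frame_name="df"):
    """Turn a question plus a schema into the prompt sent to the model."""
    column_lines = [
        f"- {col['name']} ({col['type']})" for col in schema["columns"]
    ]
    sample = json.dumps(schema["sample"], indent=2)
    return "\n".join([
        "You are a data analyst. Write pandas code that answers the question.",
        f"The data is already loaded in a dataframe named `{frame_name}`.",
        "Use only the columns listed below and print the final result.",
        "",
        "Columns:",
        *column_lines,
        "",
        "Sample rows (illustrative, not the full data):",
        sample,
        "",
        f"Question: {question}",
        "Return only code, with no explanation.",
    ])


if __name__ == "__main__":
    schema = {
        "columns": [
            {"name": "region", "type": "str"},
            {"name": "revenue", "type": "float"},
        ],
        "sample": [{"region": "North", "revenue": "1200.5"}],
    }
    print(build_analysis_prompt("Which region has the highest revenue?", schema))
```

The prompt grows with the number of columns and sample rows only, so the same function serves a hundred-row file and a hundred-million-row file equally well.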

Tips and safety notes

Never execute model-written code in your main process — always sandbox it with resource limits and no secret access. Show the generated code next to the answer so users can verify the logic. Render the numbers the code actually produced; do not let the model freehand figures. Cap the schema sample to a handful of rows to protect privacy and save tokens. Finally, validate that the generated code only touches the uploaded dataframe before you run it.
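
The pre-run validation in the last tip can start with a static pass over the generated code. The sketch below parses the code with Python's ast module and reports disallowed imports, a few dangerous built-ins, dunder attribute access, and pandas-style read_ calls that would load other files. The allowlist, the banned names, and the function find_violations are assumptions for this example. Static checks like this are easy to evade, so they complement the sandbox and never replace it.

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy", "math"}  # illustrative allowlist
BANNED_NAMES = {
    "open", "exec", "eval", "compile", "__import__",
    "input", "getattr", "setattr", "globals", "locals",
}


def find_violations(code):
    """Return a list of reasons the code should not be run (empty if none)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]  # relative imports have no module
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import of {module!r} is not allowed")

        if isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            problems.append(f"use of {node.id!r} is not allowed")

        if isinstance(node, ast.Attribute):
            if node.attr.startswith("__"):
                problems.append(f"dunder access {node.attr!r} is not allowed")
            elif node.attr.startswith("read_"):
                # e.g. pd.read_csv would reach beyond the uploaded dataframe
                problems.append(f"file loader {node.attr!r} is not allowed")
    return problems


if __name__ == "__main__":
    print(find_violations("import pandas as pd\nprint(df['revenue'].sum())"))
    print(find_violations("import os\nos.system('ls')"))
```

Code that fails the check can be rejected outright or sent back to the model with the list of problems so it can try again.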