Why use an LLM instead of CSS selectors?

CSS or XPath selectors are faster, cheaper, and exact — use them when a site has a stable structure. LLM extraction shines when structure varies across pages, when the layout changes often, or when you need to interpret messy free text (like extracting a price and currency from a sentence). The robust approach is selectors where you can, LLM where you must.

Should I send the whole HTML to the model?

No — raw HTML is mostly scripts, styles, and markup that waste tokens and confuse the model. Strip those first and send only the visible text or the relevant content region. This cuts cost dramatically and improves accuracy, since the model is not distracted by boilerplate. For large pages, send only the section likely to contain your target data.
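As a rough illustration of that stripping step, here is a minimal sketch using only Python's standard-library `html.parser`. The class name `VisibleTextExtractor`, the helper `visible_text`, and the set of skipped tags are assumptions made for this example, not part of any particular scraping tool.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside non-visible tags."""

    # Illustrative choice of tags whose content is never shown to a reader.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html: str) -> str:
    """Return the text a reader would see, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `visible_text(page_html)` returns plain text that can go into the prompt in place of the raw markup. A real pipeline would probably also drop navigation and footer regions, which this sketch does not attempt.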

How do I make sure the output is valid JSON?

Use the provider's structured-output mode where it exists, since it constrains the model to emit JSON that matches your schema. A plain JSON mode typically guarantees only that the output parses, not that it has the right fields. Either way, validate the result against the schema in your code before storing it. Telling the model to return null for missing fields rather than guessing also helps keep fabricated values out of your dataset.
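The in-code validation step might look like the following sketch. It uses only the standard library and a hand-rolled field map. The names `PRODUCT_FIELDS` and `parse_extraction` and the three example fields are hypothetical, and a dedicated schema-validation library would cover far more cases than this does.

```python
import json

# Hypothetical target fields: name -> allowed Python type(s).
PRODUCT_FIELDS = {
    "name": str,
    "price": (int, float),
    "currency": str,
}


def parse_extraction(raw: str, fields: dict) -> dict:
    """Parse model output and check it against the expected fields.

    Raises ValueError on malformed JSON, missing or unexpected keys,
    or wrong types. None is accepted for any field ("not found").
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(fields):
        mismatched = sorted(set(data) ^ set(fields))
        raise ValueError(f"missing or unexpected keys: {mismatched}")
    for key, expected in fields.items():
        value = data[key]
        if value is None:
            continue  # null means the model found nothing, which is allowed
        # bool is a subclass of int in Python; this sketch has no boolean
        # fields, so a bool is always treated as a type error.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {key!r} has type {type(value).__name__}")
    return data
```

Anything that raises `ValueError` can be logged and retried or skipped, so it never reaches storage.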

How do I keep costs down at scale?

Use cheap deterministic selectors for the structured parts and reserve the LLM for the genuinely variable fields. Clean and trim HTML aggressively before sending it, cache extractions so you never re-process an unchanged page, batch where the provider supports it, and pick the smallest model that meets your accuracy bar. Extraction is usually a short task a small model handles well.
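One way to sketch the caching idea is to key each extraction on a hash of the cleaned content, so the model is called only when that content actually changes. The `ExtractionCache` class below is a hypothetical in-memory illustration. A production setup would more likely persist the keys in a database or key-value store.

```python
import hashlib


class ExtractionCache:
    """In-memory cache keyed by a SHA-256 hash of the cleaned page content."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key_for(content: str) -> str:
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def get_or_extract(self, content: str, extract):
        """Return the cached result, calling extract(content) only on a miss.

        `extract` stands in for whatever function calls the model.
        """
        key = self.key_for(content)
        if key not in self._store:
            self._store[key] = extract(content)
        return self._store[key]
```

Hashing the cleaned text rather than the raw HTML means that changes confined to scripts or styles do not trigger a fresh extraction.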

How to Scrape the Web with AI

Is AI web scraping legal?

Scraping legality depends on the site's terms, the nature of the data, and your jurisdiction; using an LLM to parse the result changes none of that. Respect robots.txt, do not scrape personal or copyrighted data you have no right to, rate-limit your requests, and check the site's terms of service. The extraction method is a technical detail; the legal and ethical obligations are the same as for any other scraping.

Why combine scrapers with LLMs

Traditional scraping breaks when a site changes its layout, and it cannot interpret free text. Turning “£12.99 incl. VAT” in the middle of a sentence into a number and a currency is beyond what a CSS selector can do. LLMs are the opposite: slower and costlier, but resilient to messy, varying structure and able to interpret free text. The winning pattern is a hybrid: fetch and clean with normal tooling, then hand the messy content to a model to extract clean, structured JSON. This guide covers that pipeline, and the builder below generates the extraction prompt and schema for any target.

How the fetch-clean-extract pipeline works

Fetch. Pull the page with an HTTP client, or a headless browser if the content is rendered by JavaScript. Respect robots.txt and rate-limit your requests.
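The robots.txt check and the rate limit can be sketched with the standard library alone. Here `build_robots` and `RateLimiter` are illustrative names, the robots.txt text is passed in as a string so the example needs no network access, and the HTTP request itself is left out because it depends on the client you choose.

```python
import time
from urllib import robotparser


def build_robots(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforces a minimum interval, in seconds, between requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request

    def wait(self):
        """Sleep just long enough to honour the minimum interval."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Before each request, a fetch loop would call `robots.can_fetch(user_agent, url)` and skip the URL when it returns False, then call `limiter.wait()` and only after that issue the request.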

Clean. Strip scripts, styles, navigation, and footers, leaving only the meaningful content. This is the highest-leverage step: it cuts token cost dramatically and removes the boilerplate that confuses the model. For big pages, isolate the region that holds your target data and send only that.

Extract. Send the cleaned content plus a target JSON schema and instruct the model to fill it, returning only valid JSON and null for anything missing, never a guess. Use the provider’s structured-output mode so the response parses reliably.
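A prompt for this step could be assembled roughly as follows. The schema, its three fields, and the function name `build_extraction_prompt` are assumptions for illustration. The call to the model is deliberately omitted because it depends entirely on the provider's SDK.

```python
import json

# Hypothetical target schema, written in JSON Schema style.
# Every field may be null so the model never has to guess.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": ["string", "null"]},
        "price": {"type": ["number", "null"]},
        "currency": {"type": ["string", "null"]},
    },
    "required": ["name", "price", "currency"],
    "additionalProperties": False,
}


def build_extraction_prompt(cleaned_text: str, schema: dict) -> str:
    """Combine the instructions, the schema, and the cleaned page content."""
    return (
        "Extract the fields described by this JSON schema from the page content.\n"
        "Return only a JSON object that matches the schema.\n"
        "Use null for any field that is not present in the content; do not guess.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Page content:\n{cleaned_text}"
    )
```

Where the provider offers a structured-output mode, the same schema dictionary can usually be supplied to it as well, so the prompt and the output constraint stay in agreement.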

Validate and store. Check each extracted object against your schema in code before saving, so a malformed extraction never pollutes your dataset. For multi-page sources, detect the next-page link and loop until exhausted.
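The pagination loop might be sketched like this. `fetch_page` is a hypothetical callable standing in for the whole fetch, clean, extract, and validate sequence for one page. It is assumed to return the page's items together with the next-page URL, or None on the last page. The `max_pages` cap and the visited set are safety measures added for this example.

```python
def crawl_pages(start_url, fetch_page, max_pages=100):
    """Yield items page by page, following next-page links until none remain.

    fetch_page(url) must return (items, next_url), with next_url set to
    None when there is no further page.
    """
    seen = set()  # guards against pagination links that loop back
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, url = fetch_page(url)
        yield from items
```

Because it is a generator, the caller can store items as they arrive instead of holding every page in memory.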

Tips, costs, and ethics

Use deterministic selectors wherever the structure is stable and reserve the LLM for the genuinely variable or free-text fields — this keeps cost and latency low. Always clean HTML before sending it; raw markup is mostly waste. Cache extractions so an unchanged page is never reprocessed, and pick the smallest model that hits your accuracy bar, since extraction is a short task. On ethics and law: the extraction method changes nothing about your obligations — honour robots.txt and terms of service, rate-limit, and do not collect personal or copyrighted data you have no right to. Use the builder below to scaffold a schema and extraction prompt for your target site.