How to Use AI for Data Cleaning and Preparation

Normalise, deduplicate, and enrich messy data with LLMs

Ad placeholder (leaderboard)

Data cleaning is the unglamorous 80% of most data work — and the part where AI genuinely shines, because so much of the mess is fuzzy in ways traditional code handles badly. “123 Main St” versus “123 Main Street,” “IBM” versus “International Business Machines,” a column called cust_nm that should map to customer_name: these are interpretation problems, and interpretation is what LLMs are good at. The skill is knowing when to use the model and how to keep it honest.

Use code where you can, AI where you must

The first principle is do not reach for an LLM by default. Deterministic transformations — trimming whitespace, parsing dates, applying a known lookup table — should stay in plain code: it is faster, cheaper, perfectly reproducible, and free. Reserve the model for tasks that genuinely resist rules: normalising free-text addresses, deciding whether two messy strings refer to the same real-world entity, or mapping a chaotic source schema onto your target. AI is the expensive specialist, not the everyday workhorse.

Normalisation and schema mapping

For normalisation, give the model a strict target format and a batch of records, with one worked example: “Convert each address to {street, city, postcode, country} as JSON; return null for any part you cannot determine.” For schema mapping, paste the source column names with a few sample values and the target schema, and ask the model to propose a column-to-column mapping with a confidence note. Keep temperature low so identical inputs produce identical outputs, and validate every response against your schema in code — retry or flag anything that does not parse cleanly.

Deduplication and imputation

Entity deduplication is a classic fuzzy-matching win, but you must not ask the model to compare every record with every other — that is quadratic and expensive. First apply cheap blocking: group records that share a postcode, a name prefix, or a phone number, so only plausible candidates are compared. Then ask the model to judge each candidate pair — “Do these two records refer to the same company? Answer yes or no with a one-line reason.” This pairs scalable filtering with the model’s strength at recognising “same thing, written differently.”

For missing values, the rule is strict: the model may infer a value deducible from the row (a country from a city, a category from a description) but must never invent a fact. Instruct it to return “unknown” rather than guess, and tag every imputed value with a provenance flag so downstream users can tell original data from inferred data.

The guardrails that make it production-safe

The defining risk of AI data cleaning is silent, plausible error — the model confidently merges two real customers or normalises an address to the wrong town, and because the output looks tidy, nobody notices until it causes damage. Four guardrails contain this: validate every output against rules and schemas; attach confidence flags and route low-confidence cases to review; keep a provenance column marking which fields AI touched so they can be audited and reversed; and spot-check a human sample of every batch. With these in place, AI turns the worst part of data work into a fast, reviewable first pass.

To handle the upstream problem of getting clean data out of documents in the first place, see how to build a document intelligence app, and read how LLMs work to understand exactly why those confident-but-wrong outputs occur.

Ad placeholder (rectangle)