Why is removing names not enough?

Direct identifiers like names are only part of the risk. Quasi-identifiers such as date of birth, postcode, job title, and rare conditions can combine to single out one person even with no name present. The classic example is Latanya Sweeney's finding that gender, five-digit ZIP code, and full date of birth together uniquely identify an estimated 87% of the US population.

What is a quasi-identifier?

An attribute that is not unique on its own but becomes identifying in combination with others, such as age, location, occupation, or a specific event date. De-identification has to manage these, not just direct identifiers.
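
To make the point concrete, here is a minimal sketch that counts how many records in a toy dataset match each attribute alone and how many match all of them together. The records, field names, and the `countMatches` helper are invented for illustration and are not part of the tester.

```javascript
// Toy records, invented for illustration only.
const records = [
  { age: 34, region: "North", occupation: "nurse" },
  { age: 34, region: "South", occupation: "teacher" },
  { age: 51, region: "North", occupation: "teacher" },
  { age: 34, region: "North", occupation: "teacher" },
  { age: 51, region: "South", occupation: "nurse" },
];

// Count the records whose fields equal every key/value pair in `criteria`.
function countMatches(rows, criteria) {
  return rows.filter((row) =>
    Object.entries(criteria).every(([key, value]) => row[key] === value)
  ).length;
}

const target = { age: 34, region: "North", occupation: "nurse" };

// Each attribute on its own is shared by several people...
const aloneCounts = Object.entries(target).map(([key, value]) =>
  countMatches(records, { [key]: value })
);
// ...but the full combination matches exactly one record.
const combinedCount = countMatches(records, target);

console.log(aloneCounts, combinedCount); // [ 3, 3, 2 ] 1
```

No single field is unique here, yet the three together point at one person, which is exactly what makes them quasi-identifiers.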

Does a clean result mean the data is anonymous?

No. A low score means fewer obvious residual signals, not a guarantee. True anonymisation under GDPR is a high bar that depends on the whole dataset and realistic re-identification means, not a single text snippet.

Is my text uploaded anywhere?

No. All pattern matching runs locally in your browser with JavaScript. Nothing you paste leaves the page, which matters because the text may still contain sensitive information.

How should I reduce re-identification risk?

Generalise (age ranges instead of exact age), bucket locations (region instead of postcode), suppress rare values, and remove unusual event details. Re-test until the risky combinations are gone, and document your reasoning.

What is the NLP De-identification Tester?

Paste de-identified text and run re-identification risk tests that check for quasi-identifiers, unique attribute combinations, and context signals that could let a motivated adversary re-identify individuals. It is free to use on Gera Tools and runs entirely in your browser, with nothing uploaded.

NLP De-identification Tester

Name: NLP De-identification Tester
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/


Stripping names out of text feels like anonymisation, but it usually is not. Quasi-identifiers — a date of birth, a postcode, an unusual job title, a rare medical condition — can combine to pinpoint a single individual even when every obvious identifier is gone. This tester scans de-identified text for the signals that make re-identification possible, so you can find and fix the weak spots before you share or publish.

How it works

The tool runs entirely client-side. It scans your text for categories of residual risk: dates and ages, locations and postcodes, job titles and organisations, contact patterns that slipped through, and rare/specific phrases. It counts how many distinct quasi-identifier categories appear together, because the danger is in the combination — three or four quasi-identifiers in one record is often enough to single someone out. The output is a risk level, the categories found, and the specific spans that triggered each one.
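
The approach described above could be sketched roughly as follows. This is not the tester's actual code: the patterns in `CATEGORY_PATTERNS`, the `scanText` and `riskLevel` helpers, and the risk cut-offs are all assumptions made for illustration, and the patterns are deliberately simple (for example, only UK-style postcodes and a short list of job words).

```javascript
// Hypothetical patterns: simple on purpose, and far from exhaustive.
const MONTHS =
  "January|February|March|April|May|June|July|August|September|October|November|December";
const CATEGORY_PATTERNS = {
  date: new RegExp(`\\b\\d{1,2} (?:${MONTHS}) \\d{4}\\b`, "g"),
  age: /\b\d{1,3}[- ]year[- ]old\b/gi,
  postcode: /\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b/g, // UK-style only
  email: /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/g,
  jobTitle: /\b(?:consultant|surgeon|nurse|teacher|director|engineer)\b/gi,
};

// Assumed cut-offs: the combination matters more than any single hit.
function riskLevel(categoryCount) {
  if (categoryCount >= 3) return "high";
  if (categoryCount === 2) return "medium";
  if (categoryCount === 1) return "low";
  return "none";
}

// Return the matched spans per category plus the resulting risk level.
function scanText(text) {
  const findings = {};
  for (const [category, pattern] of Object.entries(CATEGORY_PATTERNS)) {
    const spans = [...text.matchAll(pattern)].map((match) => ({
      text: match[0],
      start: match.index,
      end: match.index + match[0].length,
    }));
    if (spans.length > 0) findings[category] = spans;
  }
  const categoryCount = Object.keys(findings).length;
  return { risk: riskLevel(categoryCount), categoryCount, findings };
}

const sample =
  "The 47-year-old consultant was admitted on 14 March 2019 and lives in BS1 4DJ.";
const result = scanText(sample);
console.log(result.risk, Object.keys(result.findings));
// high [ 'date', 'age', 'postcode', 'jobTitle' ]
```

The sample sentence contains no name at all, but four categories appear together, so this sketch rates it as high risk. Regular expressions like these only catch obvious surface patterns; rare phrases and context signals would need more than pattern matching.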

The difference between anonymisation and pseudonymisation

Under GDPR, these are legally distinct categories with very different implications:

Pseudonymised data — identifiers are replaced with codes or tokens, but the mapping exists somewhere and re-identification is possible with the right key. This is still personal data and GDPR applies in full.
Anonymised data — it is no longer reasonably possible to re-identify the individuals, even with all means reasonably likely to be used. GDPR does not apply to genuinely anonymous data.
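
A small sketch of pseudonymisation shows why it leaves the data personal. It assumes the names to replace are already known and swaps them for bracketed tokens; `pseudonymise` and `reidentify` are hypothetical helpers written for this example, not functions of the tester.

```javascript
// Replace each known name with a token and keep the mapping (the "key").
// Simplification: assumes no name is a substring of another name.
function pseudonymise(text, names) {
  const keyTable = new Map();
  let output = text;
  names.forEach((name, index) => {
    const token = `[PERSON_${index + 1}]`;
    keyTable.set(token, name);
    output = output.split(name).join(token);
  });
  return { output, keyTable };
}

// Anyone who holds the key table can undo the substitution.
function reidentify(text, keyTable) {
  let output = text;
  for (const [token, name] of keyTable) {
    output = output.split(token).join(name);
  }
  return output;
}

const original = "Jane Doe referred Sam Lee to the clinic. Jane Doe followed up.";
const pseudo = pseudonymise(original, ["Jane Doe", "Sam Lee"]);
console.log(pseudo.output);
// [PERSON_1] referred [PERSON_2] to the clinic. [PERSON_1] followed up.
console.log(reidentify(pseudo.output, pseudo.keyTable) === original); // true
```

As long as the key table exists somewhere, the substitution is reversible, so the text is pseudonymised rather than anonymised.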

The GDPR “reasonably likely” standard is the key challenge. A dataset that looks anonymous in isolation may become re-identifiable when combined with publicly available data — electoral rolls, LinkedIn profiles, court records. This tester can help identify obvious residual signals, but the full anonymisation assessment requires considering the whole context and realistic attack vectors.

Common de-identification mistakes

Removing names but keeping dates — “the patient was admitted on 14 March 2019 following an accident” is far more identifying than it appears if the accident was reported publicly or the individual’s hospital admission is known by people close to them.

Job descriptions that narrow the population — “the only female consultant gastroenterologist at the Bristol hospital” combined with an age range is very identifying, even without a name.

Rare conditions or unusual combinations — if someone has an unusual combination of three chronic conditions, specifying all three may uniquely identify them even in a large dataset.

Event descriptions that are searchable — describing a specific public incident (a court case, a news event, an unusual workplace accident) makes the record effectively identifiable by anyone who can search for the event.

Practical reduction techniques

After running the tester and seeing which categories were flagged, apply these in order of impact:

Generalise dates — replace specific dates with months, quarters, or years
Bucket ages — use 10-year age ranges (35-44) instead of exact ages
Aggregate locations — replace postcodes with regions or broad areas
Suppress rare values — if fewer than (for example) 5 people in your dataset share an attribute combination, suppress or generalise that record
Remove event details — replace specific incident descriptions with general categories
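
The first three techniques in the list above can be sketched as small helper functions. The function names, the ISO date input format, and the choice of year-plus-quarter output are assumptions made for this example. Note that the outward part of a UK postcode is still a fairly small area, so a lookup to a broader region may be needed in practice.

```javascript
// Generalise an ISO date (YYYY-MM-DD) to its year and quarter.
function generaliseDate(isoDate) {
  const [year, month] = isoDate.split("-").map(Number);
  const quarter = Math.ceil(month / 3);
  return `${year} Q${quarter}`;
}

// Bucket an exact age into a 10-year band such as "35-44".
// Bands start at 5, 15, 25, ...; ages under 5 fall into "0-4".
function bucketAge(age) {
  if (age < 5) return "0-4";
  const lower = Math.floor((age - 5) / 10) * 10 + 5;
  return `${lower}-${lower + 9}`;
}

// Keep only the outward part of a UK-style postcode ("BS1 4DJ" -> "BS1").
// The inward part is always the last three characters.
function postcodeToArea(postcode) {
  const compact = postcode.replace(/\s+/g, "").toUpperCase();
  return compact.slice(0, -3);
}

// Apply all three generalisations to one record.
function generaliseRecord(record) {
  return {
    admitted: generaliseDate(record.admitted),
    ageBand: bucketAge(record.age),
    area: postcodeToArea(record.postcode),
  };
}

console.log(generaliseRecord({ admitted: "2019-03-14", age: 37, postcode: "BS1 4DJ" }));
// { admitted: '2019 Q1', ageBand: '35-44', area: 'BS1' }
```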

Re-test after each change, since removing one quasi-identifier sometimes makes remaining ones more distinctive.
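
For structured records, one way to re-test is to count how many records share each quasi-identifier combination and flag the groups that fall below a threshold, as in the rare-value rule above. The sketch below assumes a threshold of 5 and already-generalised fields; `groupSizes`, `splitBySafety`, and the toy rows are hypothetical.

```javascript
// Build a comparable signature from the chosen quasi-identifier fields.
function signatureOf(row, keys) {
  return JSON.stringify(keys.map((key) => row[key]));
}

// Count how many rows share each quasi-identifier combination.
function groupSizes(rows, keys) {
  const sizes = new Map();
  for (const row of rows) {
    const signature = signatureOf(row, keys);
    sizes.set(signature, (sizes.get(signature) || 0) + 1);
  }
  return sizes;
}

// Separate rows in groups of at least `k` from rows that need
// suppression or further generalisation.
function splitBySafety(rows, keys, k = 5) {
  const sizes = groupSizes(rows, keys);
  const keep = [];
  const review = [];
  for (const row of rows) {
    if (sizes.get(signatureOf(row, keys)) >= k) {
      keep.push(row);
    } else {
      review.push(row);
    }
  }
  return { keep, review };
}

// Toy data: five people share one combination, one person stands alone.
const rows = [
  ...Array.from({ length: 5 }, () => ({ ageBand: "35-44", region: "South West" })),
  { ageBand: "65-74", region: "South West" },
];
const split = splitBySafety(rows, ["ageBand", "region"], 5);
console.log(split.keep.length, split.review.length); // 5 1
```

Running a check like this after each generalisation step shows whether the change actually enlarged the small groups or merely moved the problem to another field.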