Sitemap Privacy Auditor

Paste a sitemap.xml and flag URLs that expose user IDs, tokens, or internal paths


Your sitemap.xml is a public list of URLs you are actively asking search engines to crawl and index. If it includes personalised pages, tokenised links, or internal admin paths, you are advertising those resources to the whole web. This auditor parses the sitemap you paste and flags the URLs most likely to over-expose users or infrastructure.

How it works

The tool parses your pasted XML with the browser’s parser and extracts every <loc> URL from the urlset (it also reads sitemapindex entries). For each URL it inspects the path segments and the query string against a set of heuristics:

  • User and account IDs — segments like /users/4821, /account/00219, /orders/8830, or UUID-shaped segments following an identity-style path.
  • Session and access tokens — long, high-entropy hex or base64-like strings, and query parameters named token, sid, session, auth, key, access_token, or signature.
  • Admin and internal prefixes — paths beginning with or containing admin, internal, staging, dashboard, private, debug, or wp-admin.
  • Personal data in the query — parameters named email, phone, or name, and any parameter value that matches an email pattern.
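
The four heuristics above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual code: the names find_reasons and looks_like_token, the reason labels, and the length thresholds (32 hex characters, 40 base64-like characters) are assumptions, and a crude mixed-character check stands in for a real entropy measure. It also matches internal words as whole path segments only, which is narrower than "containing".

```python
import re
from typing import List
from urllib.parse import urlsplit, parse_qsl

# Word lists copied from the bullets above; the set names are hypothetical.
IDENTITY_SEGMENTS = {"users", "account", "orders"}
TOKEN_PARAMS = {"token", "sid", "session", "auth", "key", "access_token", "signature"}
INTERNAL_WORDS = {"admin", "internal", "staging", "dashboard", "private", "debug", "wp-admin"}
PII_PARAMS = {"email", "phone", "name"}

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)
HEX_RE = re.compile(r"[0-9a-fA-F]{32,}")             # assumed threshold
B64_RE = re.compile(r"[A-Za-z0-9+/_-]{40,}={0,2}")   # assumed threshold
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_token(value: str) -> bool:
    """Crude stand-in for an entropy test: long hex, or long mixed base64-like text."""
    if HEX_RE.fullmatch(value):
        return True
    mixed = (any(c.isdigit() for c in value)
             and any(c.isupper() for c in value)
             and any(c.islower() for c in value))
    return bool(B64_RE.fullmatch(value)) and mixed


def find_reasons(url: str) -> List[str]:
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    params = parse_qsl(parts.query, keep_blank_values=True)
    reasons = []

    # User and account IDs: numeric or UUID segment directly after an identity segment.
    for parent, seg in zip(segments, segments[1:]):
        if parent.lower() in IDENTITY_SEGMENTS and (seg.isdigit() or UUID_RE.fullmatch(seg)):
            reasons.append("user-id")
            break

    # Session and access tokens: suspicious parameter name or token-shaped value.
    values = segments + [v for _, v in params]
    if any(k.lower() in TOKEN_PARAMS for k, _ in params) or any(looks_like_token(v) for v in values):
        reasons.append("token")

    # Admin and internal prefixes: whole-segment match only (a simplification).
    if any(seg.lower() in INTERNAL_WORDS for seg in segments):
        reasons.append("internal-path")

    # Personal data in the query: named parameter or email-shaped value.
    if any(k.lower() in PII_PARAMS or EMAIL_RE.fullmatch(v) for k, v in params):
        reasons.append("personal-data")

    return reasons
```

Returning a list of reason labels, rather than a single boolean, keeps each match explainable, which suits results that are meant to be reviewed by a person.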

Each match is reported with the URL and the reason. The classifier weights context: a numeric ID under /users/ is high risk, while the same number under /products/ is not flagged. As with any heuristic, treat results as candidates to confirm, not absolute verdicts.
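
One way to picture the context weighting is a lookup keyed on the path segment that precedes a numeric ID. The sketch below is an assumption about how such weighting could work, not a description of the auditor's internals; CONTEXT_RISK and numeric_id_risk are hypothetical names, and only the /users/ and /products/ behaviour comes from the text above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical table: parent segment -> risk label for a numeric child segment.
# Parents that are absent (for example "products") produce no flag at all.
CONTEXT_RISK = {
    "users": "high",
    "account": "high",
    "orders": "high",
}


def numeric_id_risk(url: str) -> Optional[str]:
    """Return a risk label for the first numeric segment found in a risky context."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    for parent, seg in zip(segments, segments[1:]):
        if seg.isdigit() and parent.lower() in CONTEXT_RISK:
            return CONTEXT_RISK[parent.lower()]
    return None


print(numeric_id_risk("https://example.com/users/4821"))
print(numeric_id_risk("https://example.com/products/4821"))
```

The same number is scored differently depending only on its parent segment, which is why the second URL yields no flag.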

Example

Given a sitemap containing:

<url><loc>https://example.com/users/4821/invoices</loc></url>
<url><loc>https://example.com/reset?token=a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6</loc></url>

The auditor flags the first URL for exposing a numeric user ID under an identity path (anyone can increment the number to probe for other users’ pages), and the second for carrying a long, high-entropy token in a query parameter named token. A URL that carries such a token should never be indexable.
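
As a rough illustration of the whole flow on this example, the Python sketch below wraps the two entries in a urlset, extracts the <loc> values, and applies two narrow checks. The function names extract_locs and explain are hypothetical, the checks cover only these two cases, and the tool itself uses the browser's XML parser rather than Python.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlsplit, parse_qsl

# The two example entries, wrapped in a urlset so the fragment is well-formed XML.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/users/4821/invoices</loc></url>
  <url><loc>https://example.com/reset?token=a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6</loc></url>
</urlset>"""


def extract_locs(xml_text: str) -> List[str]:
    """Collect every <loc> value; this covers urlset and sitemapindex documents alike."""
    root = ET.fromstring(xml_text)
    locs = []
    for el in root.iter():
        # Namespaced tags look like "{namespace}loc"; compare the local name only.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
            locs.append(el.text.strip())
    return locs


def explain(url: str) -> List[str]:
    """Two deliberately narrow checks, just enough to reproduce the example."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    reasons = []
    for parent, seg in zip(segments, segments[1:]):
        if parent == "users" and seg.isdigit():
            reasons.append("numeric user ID under an identity path")
    for name, value in parse_qsl(parts.query):
        if name == "token" and len(value) >= 32:  # assumed length threshold
            reasons.append("long token in query parameter 'token'")
    return reasons


report = {url: explain(url) for url in extract_locs(SITEMAP)}
for url, reasons in report.items():
    print(url, "->", "; ".join(reasons))
```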

Tips and notes

  • Generate sitemaps from your public, canonical route list only — never by crawling authenticated sessions.
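  • Keep admin and internal prefixes out of every sitemap, and disallow them in robots.txt. Because robots.txt is itself public and only advisory, protect those paths with authentication as well.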
  • Never put tokens in URLs you intend to be shareable or indexable; use short-lived, single-use tokens delivered out of band.
  • All parsing is local to your browser; nothing you paste is transmitted.
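
The first tip can be sketched as building the sitemap from an explicit list of public routes and refusing anything that looks internal or carries a query string. This is only an illustration under stated assumptions: PUBLIC_ROUTES, BLOCKED_SEGMENTS, and build_sitemap are hypothetical names, and a real site would source the route list from its own router or CMS.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Hypothetical allowlist of public, canonical routes.
PUBLIC_ROUTES = ["/", "/pricing", "/docs/getting-started"]

# Segments that should never appear in a sitemap (taken from the heuristics above).
BLOCKED_SEGMENTS = {"admin", "internal", "staging", "dashboard", "private", "debug", "wp-admin"}


def build_sitemap(base_url: str, routes: Iterable[str]) -> str:
    """Build a urlset from an allowlist, rejecting internal paths and query strings."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for route in routes:
        segments = [s.lower() for s in route.split("/") if s]
        if "?" in route or any(s in BLOCKED_SEGMENTS for s in segments):
            raise ValueError("route is not safe to publish: " + route)
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + route
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap("https://example.com", PUBLIC_ROUTES))
```

Raising an error instead of silently skipping a bad route makes a misconfigured route list fail loudly at build time, before anything is published.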