AI crawlers discover new content through sitemaps. Without an accessible sitemap, your newest pages may not surface in AI answers for weeks — long enough that competitors get cited instead.

Methodology

Cited fetches your sitemap and tests three properties that determine whether AI crawlers can use it to find your content: existence, scale, and freshness signals.
We try /sitemap.xml first, then fall back to /sitemap_index.xml. Both standard locations are accepted by every major AI crawler. If neither responds with valid XML, the signal scores 0/8 and we report the file as missing. Otherwise we ask three questions:
  • Is there a valid sitemap? A response containing <urlset> (regular sitemap) or <sitemapindex> (index of child sitemaps) counts as valid. We don’t require a specific schema; every major sitemap format works as long as the root element is recognizable. A valid sitemap scores 5/8 on its own.
  • Does it index enough pages? Sitemaps with more than 50 URLs add 2 points to the score. Tiny sitemaps (under 50 URLs) still pass but don’t earn the scale bonus, because they often signal a site that’s missing pages from the index. For sitemap indexes, we sum URLs across up to 5 child sitemaps.
  • Does it carry freshness signals? Sitemaps with <lastmod> dates earn 1 point. These dates tell AI crawlers when to re-fetch a page. Without lastmod, crawlers fall back to crawl-budget heuristics, which often means weekly or monthly re-fetches instead of same-day refresh.
We also run a canonical cross-check: we sample up to 5 URLs from your sitemap and compare each against the <link rel="canonical"> tag on the corresponding page. Mismatches are reported as evidence but don’t change the score. The most common mismatch is a www/non-www variant: your sitemap lists https://www.yoursite.com/page but your canonical points to https://yoursite.com/page. AI crawlers that index the sitemap URL then conflict with the canonical, and citations may be attributed to the wrong URL.
The signal scores out of 8. Sites with a 50+ URL sitemap and lastmod dates score 8/8, whether or not the sampled canonicals match. Sites with a valid sitemap but no lastmod score 5/8 or 7/8, depending on whether they earn the scale bonus. Sites with no sitemap score 0/8 and are flagged as critical.
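The point arithmetic above can be sketched in a few lines. This is an illustration of the published rules, not Cited’s actual code: the function name sitemap_score and its three inputs are hypothetical, and treating exactly 50 URLs as below the threshold is an assumption based on the "more than 50" wording.

```python
def sitemap_score(valid: bool, url_count: int, has_lastmod: bool) -> int:
    """Return a 0-8 score following the rules described in the methodology."""
    if not valid:
        return 0              # missing or unrecognizable sitemap
    score = 5                 # a valid <urlset> or <sitemapindex>
    if url_count > 50:        # scale bonus (assumption: exactly 50 does not qualify)
        score += 2
    if has_lastmod:           # freshness bonus: at least one <lastmod> present
        score += 1
    return score


# Canonical mismatches are evidence only, so they are deliberately not an input.
for case in [(False, 0, False), (True, 12, False), (True, 12, True),
             (True, 300, False), (True, 300, True)]:
    print(case, "->", sitemap_score(*case))
```

The five cases walk through every reachable score: 0, 5, 6, 7, and 8.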

Verification

You can verify our finding yourself in a browser, no tools required.
Step 1: Open your sitemap. Visit https://yoursite.com/sitemap.xml directly. If you see XML starting with <?xml version="1.0" and either <urlset> or <sitemapindex>, you have a valid sitemap. If you get a 404, try https://yoursite.com/sitemap_index.xml. If both return 404, you don’t have an accessible sitemap.
Step 2: Check URL count. If your sitemap is a regular <urlset>, search the source for <url> (use your browser’s view-source, then Cmd+F or Ctrl+F). Each occurrence is one URL. Sitemap indexes instead contain <sitemap> elements pointing to child sitemap URLs; open the children and repeat. Aim for 50+ across all sitemaps to earn the scale bonus.
Step 3: Look for <lastmod> dates. Search the source for <lastmod>. Each URL entry should have one in W3C Datetime format, most simply YYYY-MM-DD. If you find zero <lastmod> entries, you’re missing the freshness signal. If you find some but not all, that’s still OK: AI crawlers use the lastmods that exist and fall back to heuristics for the rest.
Step 4: Check the canonical alignment. Open one URL from your sitemap in a browser. View the page source and search for rel="canonical". Compare the canonical’s href to the URL you opened. They should match exactly, including the www. prefix (or lack of it). Mismatches mean AI crawlers may attribute citations to the wrong URL variant.
If your verification disagrees with Cited’s finding, that’s a bug. Let us know.
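If you would rather script Steps 2 and 3 than search view-source by hand, the sketch below counts URL entries and how many of them carry a <lastmod>, working on sitemap text you have already saved. It uses Python’s standard xml.etree.ElementTree rather than the regex approach Cited uses, and it assumes the sitemap declares the standard sitemaps.org namespace. The name inspect_sitemap and the SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org 0.9 namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def inspect_sitemap(xml_text: str) -> dict:
    """Report the sitemap kind, URL count, and lastmod coverage."""
    root = ET.fromstring(xml_text)
    kind = root.tag.rsplit("}", 1)[-1]          # drop the "{namespace}" prefix
    if kind == "sitemapindex":
        # Step 2: an index lists child sitemaps; open each and repeat.
        children = [loc.text.strip()
                    for loc in root.findall("sm:sitemap/sm:loc", NS) if loc.text]
        return {"kind": kind, "children": children}
    if kind != "urlset":
        return {"kind": "unknown"}
    entries = root.findall("sm:url", NS)
    with_lastmod = sum(1 for e in entries if e.find("sm:lastmod", NS) is not None)
    return {"kind": kind, "urls": len(entries), "with_lastmod": with_lastmod}


SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://yoursite.com/</loc><lastmod>2024-05-01</lastmod></url>"
    "<url><loc>https://yoursite.com/page</loc></url>"
    "</urlset>"
)
print(inspect_sitemap(SAMPLE))
```

A result where with_lastmod is lower than urls corresponds to the "some but not all" case in Step 3.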

Technical detail

Sitemaps are governed by the Sitemaps XML format specification, originally published by Google, Yahoo, and Microsoft in 2006 and now an open standard. The format supports up to 50,000 URLs and 50 MB per sitemap file (uncompressed); larger sites use sitemap indexes to chain multiple sitemaps together.
Parsing logic. Cited’s scanner uses a regex-based XML parser that handles:
  • Both <urlset> regular sitemaps and <sitemapindex> index files at the root
  • Sitemap indexes with up to 5 child sitemaps fetched and parsed in parallel
  • <url> entry counting via case-insensitive matching
  • <lastmod> presence detection (we check for the tag’s existence, not date validity)
  • <loc> URL extraction for the canonical cross-check sample
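A regex-based pass over the sitemap text, covering the items in the list above, might look roughly like this. It is a sketch and not Cited’s parser: the patterns, the names parse_sitemap and total_urls, and the fetch callable (anything that returns sitemap text for a URL) are all hypothetical, and children are read one at a time here rather than in parallel.

```python
import re

ROOT_RE = re.compile(r"<(urlset|sitemapindex)[\s>]", re.IGNORECASE)
URL_RE = re.compile(r"<url[\s>]", re.IGNORECASE)        # does not match <urlset>
LASTMOD_RE = re.compile(r"<lastmod[\s>]", re.IGNORECASE)
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)


def parse_sitemap(text):
    """Return basic facts about a sitemap, or None if no root element is found."""
    root = ROOT_RE.search(text)
    if root is None:
        return None                      # e.g. an HTML page served at /sitemap.xml
    return {
        "kind": root.group(1).lower(),
        "url_count": len(URL_RE.findall(text)),
        "has_lastmod": LASTMOD_RE.search(text) is not None,  # presence only
        "locs": LOC_RE.findall(text),
    }


def total_urls(text, fetch, max_children=5):
    """Count URLs, summing across at most max_children children of an index."""
    info = parse_sitemap(text)
    if info is None:
        return 0
    if info["kind"] == "urlset":
        return info["url_count"]
    total = 0
    for child_url in info["locs"][:max_children]:
        child = parse_sitemap(fetch(child_url))   # fetch is supplied by the caller
        if child is not None:
            total += child["url_count"]
    return total
```

Passing fetch in as a parameter keeps the counting logic testable without network access; a real caller would supply a function that performs the HTTP request.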
Canonical cross-check sampling. From a regular sitemap (or the first child of a sitemap index), the scanner samples up to 5 URLs evenly spread across the file. For each, it fetches the page with a 5-second timeout and extracts the <link rel="canonical"> href. Trailing slashes are normalized before comparison. When sitemap and canonical disagree, we specifically check for a www/non-www mismatch, because that’s the most common cause and warrants its own recommendation.
Edge cases the scanner handles:
  • Sitemap index with more than 5 children — We sample the first 5 and report the total count. URLs in unsampled children aren’t counted toward the 50+ scale threshold, which is a known limitation for very large sites.
  • HTTP 5xx or timeout on sitemap fetch — Treated as “not found.” We don’t retry; a sitemap that takes more than 8 seconds to respond is effectively unusable for AI crawlers anyway.
  • Sitemap returns HTML instead of XML — Some misconfigured servers return the homepage at /sitemap.xml. We detect this via the missing root element (<urlset> / <sitemapindex>) and treat as not found.
  • Compressed sitemaps (gzip) — The fetch accepts gzip encoding but the regex parser operates on the decompressed text. Compressed sitemaps work transparently.
  • URLs with absolute vs relative paths in canonical — The canonical check compares hostnames and paths independently, so <link rel="canonical" href="/page"> matches https://yoursite.com/page correctly.
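The canonical cross-check, including the relative-href edge case, can be approximated as follows. This is a sketch under assumptions: the even-spacing formula, the use of urljoin to resolve relative canonicals (the scanner is described as comparing hostnames and paths independently), and all names below are illustrative choices, not Cited’s implementation. Page fetching is left out, so the functions work on HTML you already have.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def sample_evenly(urls, k=5):
    """Pick up to k items spread across the list, always including both ends."""
    if len(urls) <= k:
        return list(urls)
    return [urls[i * (len(urls) - 1) // (k - 1)] for i in range(k)]


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        if "canonical" in (attr.get("rel") or "").lower().split() and attr.get("href"):
            self.href = attr["href"]


def canonical_url(page_url, html):
    """Return the page's canonical as an absolute URL, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return None if finder.href is None else urljoin(page_url, finder.href)


def _norm(url):
    p = urlsplit(url)
    # Trailing slashes are ignored; the root path stays "/".
    return p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/") or "/", p.query


def _bare(host):
    return host[4:] if host.startswith("www.") else host


def compare(sitemap_url, canonical):
    """Classify a pair as 'match', 'www-mismatch', or 'mismatch'."""
    s, c = _norm(sitemap_url), _norm(canonical)
    if s == c:
        return "match"
    if (s[0], s[2], s[3]) == (c[0], c[2], c[3]) and _bare(s[1]) == _bare(c[1]):
        return "www-mismatch"
    return "mismatch"


page = "https://www.yoursite.com/page"
html = '<head><link rel="canonical" href="https://yoursite.com/page"></head>'
print(compare(page, canonical_url(page, html)))
```

The example pair differs only by the www. prefix, which is the case singled out for a specific recommendation.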
What this signal does not measure:
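  • Whether your sitemap is referenced in robots.txt. The Sitemaps protocol lets you declare sitemap URLs there with a Sitemap: line, and RFC 9309 allows crawlers to interpret such extra records, but absence doesn’t break AI crawler discovery because the standard /sitemap.xml location is universally tried.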
  • Whether <lastmod> dates are accurate. A sitemap can claim every page was updated yesterday; the scanner trusts the declaration. AI crawlers may eventually penalize sites with stale or fake lastmods, but we don’t model that today.
  • Plain-text sitemaps (one URL per line). The format is allowed by the spec but rare in production. We don’t currently parse it.
  • Image, video, or news sitemaps. These XML extensions exist but don’t influence the AI crawler discovery path Cited measures.
For most D2C and B2B brands, the highest-leverage fix is adding <lastmod> dates. The sitemap itself usually exists; the freshness signal is what’s missing. AI crawlers with accurate lastmods can refresh your content the same day it changes.
See also: robots.txt AI Crawler Access, Page Crawlability.