Methodology
Cited fetches your robots.txt file and tests it against 21 AI crawler user agents: the bots that read the web to build AI training data and ground AI search responses. The list covers OpenAI’s GPTBot and ChatGPT-User, Anthropic’s ClaudeBot, Perplexity’s PerplexityBot, Google’s Google-Extended, Meta-ExternalAgent, Bytespider (TikTok/ByteDance), and others.
For each AI crawler, we evaluate three things:
Is the bot allowed? A User-agent: GPTBot block with Disallow: / means GPTBot can’t read any of your pages. We flag this as a critical failure — it removes you from that AI’s index entirely.
Which paths are blocked? Even when crawlers are allowed at the root, specific paths can be disallowed. We categorize each blocked path against your detected platform (Shopify, WordPress, Webflow, or generic):
- Critical paths: content that AI answers cite. For a Shopify store, this includes /products/, /collections/, /pages/, and /blogs/. Blocking these means AI can’t find your products, categories, or content marketing.
- Expected paths: paths that should be blocked, such as cart, checkout, admin, account, and API endpoints. We don’t flag these.
- Other paths: paths we couldn’t categorize. We surface them but don’t score them as problems.
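The three buckets above can be sketched as a small prefix lookup. This is an illustration, not Cited’s implementation: the Shopify and generic critical paths come from this page, while `EXPECTED_PREFIXES` and the function name `categorize` are hypothetical stand-ins for the cart, checkout, admin, account, and API routes. It compares prefixes only and does not interpret wildcards.

```python
# Critical paths per platform. The Shopify and generic entries are taken from
# the text; other platforms would need their own lists.
CRITICAL_PATHS = {
    "shopify": ("/products/", "/collections/", "/pages/", "/blogs/"),
    "generic": ("/blog/", "/articles/", "/about/"),
}

# Assumed route names for the "expected" bucket (illustrative only).
EXPECTED_PREFIXES = ("/cart", "/checkout", "/admin", "/account", "/api")


def categorize(path: str, platform: str) -> str:
    """Return 'critical', 'expected', or 'other' for one disallowed path."""
    # Drop trailing wildcard/anchor characters and normalize to a trailing
    # slash so '/products', '/products/' and '/products/*' compare equal.
    normalized = path.rstrip("*$").rstrip("/") + "/"
    if any(normalized.startswith(p) for p in CRITICAL_PATHS.get(platform, ())):
        return "critical"
    if any(normalized.startswith(p + "/") for p in EXPECTED_PREFIXES):
        return "expected"
    # A bare 'Disallow: /' also lands here; a full-site block is a separate check.
    return "other"
```

For example, `categorize("/blogs/*", "shopify")` returns `"critical"`, while `categorize("/checkout", "shopify")` returns `"expected"`.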
What do the rules actually do? Disallow: /products/*? blocks product URLs that carry a query string but leaves the canonical URLs crawlable. Disallow: /products (no trailing slash, no wildcard) blocks every URL whose path starts with /products. We parse the actual rule patterns and tell you what they really do.
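To see why those two rules behave so differently, here is a minimal sketch of RFC 9309-style pattern matching. The helper names `rule_to_regex` and `rule_matches` are made up for this example and are not part of Cited or any library. It assumes `*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is a literal prefix.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def rule_matches(pattern: str, url_path: str) -> bool:
    """True if the rule pattern covers the given path (including any query string)."""
    return rule_to_regex(pattern).match(url_path) is not None


# The query-string rule leaves the canonical product URL alone.
print(rule_matches("/products/*?", "/products/shoe?variant=1"))  # True
print(rule_matches("/products/*?", "/products/shoe"))            # False
# The bare prefix rule covers everything that starts with /products.
print(rule_matches("/products", "/products/shoe"))               # True
```

Note that the prefix rule also matches paths such as `/products-archive`, because matching is character-based and not segment-based.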
The signal is scored out of 10. Sites with no AI crawler restrictions score 10/10. Sites blocking critical paths drop fast, typically to 5-6/10. Sites with a full User-agent: * Disallow: / block score 0/10 and are flagged as critical.
Verification
You can verify our finding yourself without any tools.
Step 1: Open your robots.txt. Visit https://yoursite.com/robots.txt in any browser. Every public site serves robots.txt at the same path. If you get a 404, you don’t have a robots.txt file. That places no restrictions on AI crawlers, so they will treat every page as fair game.
Step 2: Look for AI crawler blocks. Search the file for these strings:
- GPTBot: OpenAI’s training crawler
- ChatGPT-User: the bot that fetches pages when ChatGPT users ask questions
- ClaudeBot: Anthropic’s training crawler
- PerplexityBot: Perplexity’s search index crawler
- Google-Extended: Google’s AI training opt-out token (a robots.txt control, not a separate crawler)
If you find any of them in a User-agent: block followed by Disallow: /, that bot is blocked from your entire site. If you find them with specific path disallows, those specific paths are blocked.
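If you would rather script Step 2 than read the file by eye, the sketch below lists the Disallow paths that apply to each named AI bot. It is a deliberately simple group reader. The function name `find_ai_groups` and the five-name `AI_BOTS` list are assumptions for this example, and it ignores the `User-agent: *` fallback.

```python
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def find_ai_groups(robots_txt: str) -> dict[str, list[str]]:
    """Map each AI bot named in the file to the Disallow paths in its group."""
    found: dict[str, list[str]] = {}
    current: list[str] = []  # user agents named by the group being read
    in_rules = False         # True once the current group has listed a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                for agent in current:
                    for bot in AI_BOTS:
                        if agent.lower() == bot.lower():
                            found.setdefault(bot, []).append(value)
    return found
```

A bot that maps to `["/"]` is blocked from the whole site. A bot that maps to a list of specific paths is blocked only from those paths.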
Step 3: Check the paths we flagged. For each critical path Cited flagged (e.g., /blogs/), test it directly:
- Visit https://yoursite.com/blogs/ in a browser. If you see content, the path exists.
- Check it against your robots.txt: does any Disallow: rule cover this path? Compare the path prefix to each Disallow: directive.
- If a rule like Disallow: /blogs/* matches, AI crawlers will not fetch any URL under /blogs/.
You can also run the check in a robots.txt tester: enter a user agent (e.g., GPTBot) and a URL from your site, and see whether the rule blocks it. This confirms what your robots.txt actually does, not just what you intend.
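A user-agent-plus-URL check like this can be approximated with Python’s standard-library `urllib.robotparser`. Treat it as a rough cross-check only. That parser does plain prefix matching and takes the first rule that matches, so it is reliable only for files without `*` or `$` patterns. The wrapper name `is_blocked` and the sample rules are made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows this URL for this bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network fetch
    return not parser.can_fetch(user_agent, url)


rules = """\
User-agent: GPTBot
Disallow: /blogs/

User-agent: *
Disallow: /cart
"""

print(is_blocked(rules, "GPTBot", "https://yoursite.com/blogs/launch"))   # True
print(is_blocked(rules, "GPTBot", "https://yoursite.com/products/shoe"))  # False
```

To test your live file, paste the contents of https://yoursite.com/robots.txt into `rules`.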
If your verification disagrees with Cited’s finding, that’s a bug — let us know.
Technical detail
robots.txt is governed by RFC 9309 (the Robots Exclusion Protocol), published in September 2022. AI crawlers generally honor it, though compliance is voluntary and varies by operator.
Parsing logic. Cited’s scanner uses a standards-compliant robots.txt parser that handles:
- Multiple User-agent: blocks per file, with the most-specific match winning for any given bot
- Wildcard matching (* for any sequence of characters, $ to anchor the end of the URL)
- Case-insensitive user-agent matching (the RFC requires this)
- Inheritance from User-agent: * when a bot doesn’t have its own block
- Allow: directives that override Disallow: for specific paths (the longest matching rule wins)
- Crawl-delay: directives (we capture them but don’t score against them)
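The group-selection and precedence rules above can be sketched in a few lines. This is an assumption-laden illustration of RFC 9309 semantics, not Cited’s parser. The names `Rule`, `select_group`, and `is_allowed` are hypothetical, groups are passed in already parsed, and merging of repeated groups for the same bot is left out.

```python
import re

Rule = tuple[str, str]  # ("allow" or "disallow", path pattern)


def _matches(pattern: str, path: str) -> bool:
    """'*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match("^" + regex + ("$" if anchored else ""), path) is not None


def select_group(groups: dict[str, list[Rule]], bot: str) -> list[Rule]:
    """Use the bot's own group if present (case-insensitive), else the '*' group."""
    by_name = {name.lower(): rules for name, rules in groups.items()}
    return by_name.get(bot.lower(), by_name.get("*", []))


def is_allowed(groups: dict[str, list[Rule]], bot: str, path: str) -> bool:
    matching = [
        (kind, pattern)
        for kind, pattern in select_group(groups, bot)
        if pattern and _matches(pattern, path)  # empty patterns match nothing
    ]
    if not matching:
        return True  # no rule applies, so the path is crawlable
    # Longest pattern wins; on equal length, Allow beats Disallow.
    kind, _ = max(matching, key=lambda rule: (len(rule[1]), rule[0] == "allow"))
    return kind == "allow"


groups = {
    "*": [("disallow", "/")],
    "GPTBot": [("disallow", "/blogs/"), ("allow", "/blogs/press/")],
}
print(is_allowed(groups, "gptbot", "/blogs/press/launch"))  # True
print(is_allowed(groups, "GPTBot", "/blogs/news"))          # False
print(is_allowed(groups, "ClaudeBot", "/products/shoe"))    # False
```

Because GPTBot has its own group here, it does not inherit the site-wide `Disallow: /` from the `*` group. ClaudeBot has no group of its own, so it falls back to `*`.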
Platform detection. Cited identifies your platform from these signals:
- Shopify: detected via cdn.shopify.com in assets, myshopify.com redirects, or the X-Shopify-Stage response header
- WordPress: detected via wp-content/ paths, wp-includes/ references, or <meta name="generator" content="WordPress"> tags
- Webflow: detected via webflow.com references or <meta name="generator" content="Webflow"> tags
- Generic: fallback when none of the above match. Critical-path categorization uses a stack-agnostic list (/blog/, /articles/, /about/, etc.)
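A rough sketch of how these detection signals could be checked against a fetched homepage follows. The function name `detect_platform`, the check order, and the simplified generator-tag regex are assumptions for this example. The myshopify.com redirect signal is omitted because it requires inspecting the redirect chain.

```python
import re

# Simplified: assumes name= comes before content= inside the meta tag.
GENERATOR = re.compile(
    r'<meta\s+name=["\']generator["\']\s+content=["\']([^"\']*)', re.IGNORECASE
)


def detect_platform(html: str, headers: dict[str, str]) -> str:
    """Guess the platform from page HTML and response headers."""
    header_names = {name.lower() for name in headers}  # header names are case-insensitive
    match = GENERATOR.search(html)
    generator = match.group(1).lower() if match else ""

    if "cdn.shopify.com" in html or "x-shopify-stage" in header_names:
        return "shopify"
    if "wp-content/" in html or "wp-includes/" in html or "wordpress" in generator:
        return "wordpress"
    if "webflow.com" in html or "webflow" in generator:
        return "webflow"
    return "generic"  # fallback when no signal matches
```

For instance, `detect_platform('<meta name="generator" content="WordPress 6.5">', {})` returns `"wordpress"`, and a page with none of the signals returns `"generic"`.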
Edge cases. Cited handles these situations explicitly:
- HTTP 5xx on robots.txt fetch. The RFC treats this as “crawl forbidden entirely.” Cited flags this as a critical failure because AI crawlers will likely skip your site.
- HTTP 4xx on robots.txt fetch. The RFC treats this as “no restrictions.” Cited treats this as 10/10: no AI crawler is blocked.
- Cloudflare-managed challenges. Some sites return a Cloudflare challenge page to bots. Cited detects this via response headers and reports it separately. It’s a problem, but not a robots.txt problem.
- www vs apex domain mismatch. Cited checks both www.yoursite.com/robots.txt and yoursite.com/robots.txt and flags if they return different rules.
- Injected robots directives. Some CMSes apply robots directives at the application layer, so they never appear in the file you serve at /robots.txt. We detect these via X-Robots-Tag headers on individual pages.
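The two status-code cases above reduce to a small decision function. This sketch only encodes the RFC 9309 behavior described on this page. The function name `interpret_robots_status` and the returned labels are invented for this example, and redirects and other statuses are deliberately left as "other".

```python
def interpret_robots_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to how a crawler should treat it."""
    if 500 <= status <= 599:
        return "disallow_all"  # server error: assume the whole site is off limits
    if 400 <= status <= 499:
        return "allow_all"     # file unavailable: no restrictions apply
    if 200 <= status <= 299:
        return "parse_rules"   # file served: evaluate the rules it contains
    return "other"             # redirects and anything else are not covered here
```

So a 503 on /robots.txt maps to `"disallow_all"`, while a 404 maps to `"allow_all"`.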
What this signal doesn’t measure:
- Whether the AI crawler actually honors your robots.txt. Most do, but compliance is voluntary.
- Whether your content is actually being cited in AI answers. That’s the AI Visibility surface, not Site Health.
- Whether you should block AI crawlers. Some businesses choose to. We measure access; we don’t recommend.