robots.txt AI Crawler Access

If AI crawlers can’t read your site, you don’t show up in AI answers. That’s the whole game.

Methodology

Cited fetches your robots.txt file and tests it against 21 AI crawler user agents: the bots that read the web to build AI training data and ground AI search responses. The list covers OpenAI’s GPTBot and ChatGPT-User, Anthropic’s ClaudeBot, Perplexity’s PerplexityBot, Google’s Google-Extended, Meta-ExternalAgent, Bytespider (TikTok/ByteDance), and others. For each AI crawler, we evaluate three things:

Is the bot allowed? A User-agent: GPTBot block with Disallow: / means GPTBot can’t read any of your pages. We flag this as a critical failure because it removes you from that AI’s index entirely.

Which paths are blocked? Even when crawlers are allowed at the root, specific paths can be disallowed. We categorize each blocked path against your detected platform (Shopify, WordPress, Webflow, or generic):

Critical paths — content that AI answers cite. For a Shopify store, this includes /products/, /collections/, /pages/, and /blogs/. Blocking these means AI can’t find your products, categories, or content marketing.
Expected paths — paths that should be blocked. Cart, checkout, admin, account, and API endpoints. We don’t flag these.
Other paths — paths we couldn’t categorize. We surface them but don’t score them as problems.
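The three categories above can be sketched as a prefix lookup. This is a minimal illustration, not Cited’s implementation: the Shopify critical prefixes come from the text, while the expected-path prefixes, the function name categorize, and the PathCategory enum are assumptions made for the example.

```python
from enum import Enum


class PathCategory(Enum):
    CRITICAL = "critical"
    EXPECTED = "expected"
    OTHER = "other"


# Shopify critical prefixes are taken from the text above.
CRITICAL_PREFIXES = {"shopify": ("/products", "/collections", "/pages", "/blogs")}
# Assumed spellings for "cart, checkout, admin, account, and API endpoints".
EXPECTED_PREFIXES = ("/cart", "/checkout", "/admin", "/account", "/api")


def _under(path: str, prefix: str) -> bool:
    # Compare whole segments so "/pages" does not claim "/pagespeed".
    return path == prefix or path.startswith(prefix + "/")


def categorize(blocked_path: str, platform: str = "shopify") -> PathCategory:
    """Place one Disallow path into a category (hypothetical helper)."""
    # Keep only the literal part before any wildcard: "/blogs/*" -> "/blogs".
    path = blocked_path.split("*", 1)[0].rstrip("$").rstrip("/") or "/"
    if any(_under(path, p) for p in CRITICAL_PREFIXES.get(platform, ())):
        return PathCategory.CRITICAL
    if any(_under(path, p) for p in EXPECTED_PREFIXES):
        return PathCategory.EXPECTED
    return PathCategory.OTHER
```

Here categorize("/blogs/*") lands in the critical category, categorize("/cart") in the expected one, and anything unrecognized falls through to other.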

Does the rule actually do what you think? Wildcards in robots.txt are subtle. Disallow: /products/*? blocks product URLs that carry a query string but allows the canonical URLs. Disallow: /products (no trailing slash, no wildcard) is a prefix match: it blocks /products and everything under it, and also unrelated paths that merely start the same way, such as /products-archive. We parse the actual rule patterns and tell you what they really do.

The signal is scored out of 10. Sites with no AI crawler restrictions score 10/10. Sites blocking critical paths drop fast, typically to 5-6/10. Sites with a full User-agent: * Disallow: / block score 0/10 and are flagged as critical.
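To see why those two rules behave differently, here is a small sketch of robots.txt pattern matching as RFC 9309 describes it: rules match from the start of the path, * stands for any run of characters, and a trailing $ anchors the end. The function names are hypothetical, and the path argument is assumed to include the query string.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; every other character (including '?') is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start only, which gives prefix semantics.
    return rule_to_regex(pattern).match(path) is not None


# Query-string URLs match the wildcard rule; the canonical URL does not.
print(rule_matches("/products/*?", "/products/shirt?variant=1"))
print(rule_matches("/products/*?", "/products/shirt"))
# A bare prefix also catches paths that only share the leading characters.
print(rule_matches("/products", "/products-archive"))
```

The first and third calls match, and the second does not, because the literal ? in the rule never appears in the canonical URL.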

Verification

You can verify our finding yourself without any tools.

Step 1: Open your robots.txt. Visit https://yoursite.com/robots.txt in any browser. Crawlers always look for robots.txt at this path. If you get a 404, you don’t have a robots.txt file. That places no restrictions on AI crawlers: they will treat every URL on the site as open to crawling.

Step 2: Look for AI crawler blocks. Search the file for these strings:

GPTBot: OpenAI’s training crawler
ChatGPT-User: the bot that fetches pages when ChatGPT users ask questions
ClaudeBot: Anthropic’s training crawler
PerplexityBot: Perplexity’s search index crawler
Google-Extended: Google’s opt-out token for AI training. It is a robots.txt control token rather than a separate crawler; Google’s existing crawlers do the fetching.

If any of these appear in a User-agent: block followed by Disallow: /, that bot is blocked from your entire site. If you find them with specific path disallows, those specific paths are blocked.

Step 3: Check the paths we flagged. For each critical path Cited flagged (e.g., /blogs/), test it directly:

Visit https://yoursite.com/blogs/ in a browser. If you see content, the path exists.
Check it against your robots.txt: does any Disallow: rule cover this path? Compare the path prefix to each Disallow: directive in the group that applies to the bot.
If a rule like Disallow: /blogs/* matches, crawlers bound by that group will not fetch any URL under /blogs/.
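The manual check in Steps 2 and 3 can also be scripted with Python’s standard-library urllib.robotparser. This is a rough sketch: the helper blocked_bots and the SAMPLE file are made up for illustration, and the standard-library parser has historically used plain prefix matching, so treat its answers for rules containing * or $ with caution.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]

# Illustrative robots.txt content, not taken from any real site.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /cart
"""


def blocked_bots(robots_txt: str, url: str) -> list[str]:
    """Return the AI bots that may not fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]


print(blocked_bots(SAMPLE, "https://yoursite.com/blogs/"))
print(blocked_bots(SAMPLE, "https://yoursite.com/cart"))
```

To run it against a live site instead of a string, call parser.set_url("https://yoursite.com/robots.txt") followed by parser.read() before asking can_fetch.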

Step 4: Validate with a robots.txt tester. Use a tester that accepts an arbitrary user agent (try GPTBot) and a URL from your site, and shows whether a rule blocks it. Google retired the standalone robots.txt Tester in Search Console in 2023; the robots.txt report that replaced it shows how Google fetched and parsed the file but does not test individual URLs against a user agent such as GPTBot. Testing confirms what your robots.txt actually does, not just what you intend.

If your verification disagrees with Cited’s finding, that’s a bug. Let us know.

Technical detail

robots.txt is governed by RFC 9309 (the Robots Exclusion Protocol), published in September 2022. AI crawlers generally honor it, though compliance varies.

Parsing logic. Cited’s scanner uses a standards-compliant robots.txt parser that handles:

Multiple User-agent: blocks per file, with the most-specific match winning for any given bot
Wildcard matching (* for any sequence of characters, including /, and $ to anchor the end of the URL)
Case-insensitive user-agent matching (the RFC requires this)
Inheritance from User-agent: * when a bot doesn’t have its own block
Allow: directives that override Disallow: when the Allow: pattern is the longer, more specific match
Crawl-delay: directives (we capture them but don’t score against them)
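Two of the behaviors above, group selection and Allow/Disallow precedence, are easy to get wrong, so here is a simplified sketch of both. It is not Cited’s parser: the function names are hypothetical, and it deliberately uses plain prefix matching, leaving out the * and $ wildcards.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    agents: list[str] = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())  # user-agent matching ignores case
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups: dict[str, list[tuple[str, str]]], bot: str, path: str) -> bool:
    # A bot with its own group uses it exclusively; otherwise fall back to '*'.
    rules = groups.get(bot.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive == "allow"
        # The longest matching rule wins; Allow wins a tie.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /blogs/
Allow: /blogs/public/
"""
groups = parse_groups(SAMPLE)
```

With this sample, GPTBot follows only its own group, so it can reach /products/x even though User-agent: * disallows everything, while a bot without its own group, such as ClaudeBot, inherits the blanket block.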

Tech stack detection. Before categorizing paths, the scanner detects your platform from the homepage HTML and response headers:

Shopify — detected via cdn.shopify.com in assets, myshopify.com redirects, or X-Shopify-Stage response header
WordPress — detected via wp-content/ paths, wp-includes/ references, or <meta name="generator" content="WordPress"> tags
Webflow — detected via webflow.com references or <meta name="generator" content="Webflow"> tags
Generic — fallback when none of the above match. Critical-path categorization uses a stack-agnostic list (/blog/, /articles/, /about/, etc.)
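A first-match check over the markers listed above might look like the following. It is a sketch under assumptions: the function name detect_stack and the check order are illustrative, and the myshopify.com redirect signal is omitted because it needs the redirect chain rather than the final response.

```python
import re

# Matches the generator tag in lowercased HTML and captures its content value.
GENERATOR_RE = re.compile(r'<meta\s+name="generator"\s+content="([^"]*)"')


def detect_stack(html: str, headers: dict[str, str]) -> str:
    """Return 'shopify', 'wordpress', 'webflow', or 'generic'."""
    page = html.lower()
    header_names = {name.lower() for name in headers}
    found = GENERATOR_RE.search(page)
    generator = found.group(1) if found else ""

    # One strong marker is enough; the first platform that matches wins.
    if "cdn.shopify.com" in page or "x-shopify-stage" in header_names:
        return "shopify"
    if "wp-content/" in page or "wp-includes/" in page or generator.startswith("wordpress"):
        return "wordpress"
    if "webflow.com" in page or generator.startswith("webflow"):
        return "webflow"
    return "generic"
```

For example, detect_stack("", {"X-Shopify-Stage": "production"}) returns "shopify" from the header alone, and a page with no markers falls back to "generic".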

Stack detection is single-signal: one strong marker is enough. No multi-vote heuristic, no confidence threshold. Sites that don’t match any known stack get the generic categorization, which uses a more conservative critical-path list.

Edge cases the scanner handles:

HTTP 5xx on robots.txt fetch. The RFC treats this as “crawl forbidden entirely.” Cited flags this as a critical failure because AI crawlers will likely skip your site.
HTTP 4xx on robots.txt fetch. The RFC treats this as “no restrictions.” Cited treats this as 10/10 — no AI crawler is blocked.
Cloudflare-managed challenges. Some sites return a Cloudflare challenge page to bots. Cited detects this via response headers and reports it separately — it’s a problem, but not a robots.txt problem.
www vs apex domain mismatch. Cited checks both www.yoursite.com/robots.txt and yoursite.com/robots.txt and flags if they return different rules.
Application-layer robots directives. Some CMSes add crawler directives at the application layer, as X-Robots-Tag response headers on individual pages, so they never appear in the file served at /robots.txt. We detect these by inspecting the response headers of individual pages.
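The two status-code cases above reduce to a small decision function. This sketch assumes redirects have already been followed and the final status code is in hand; FetchOutcome and classify_fetch are hypothetical names.

```python
from enum import Enum


class FetchOutcome(Enum):
    PARSE_RULES = "parse_rules"    # file retrieved: evaluate its rules
    ALLOW_ALL = "allow_all"        # 4xx: treated as no restrictions
    DISALLOW_ALL = "disallow_all"  # 5xx: treated as crawling forbidden


def classify_fetch(status: int) -> FetchOutcome:
    """Map the final HTTP status of a robots.txt fetch to an outcome."""
    if 200 <= status < 300:
        return FetchOutcome.PARSE_RULES
    if 400 <= status < 500:
        return FetchOutcome.ALLOW_ALL
    if 500 <= status < 600:
        return FetchOutcome.DISALLOW_ALL
    raise ValueError(f"unexpected status after redirects: {status}")
```

So a 404 yields ALLOW_ALL and a 503 yields DISALLOW_ALL, which is why a temporarily failing server can look like a full block to a crawler.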

What this signal does not measure:

Whether the AI crawler actually honors your robots.txt. Most do, but compliance is voluntary.
Whether your content is actually being cited in AI answers. That’s the AI Visibility surface, not Site Health.
Whether you should block AI crawlers. Some businesses choose to. We measure access; we don’t recommend.

For most D2C and B2B brands, blocking AI crawlers means losing AI search visibility. But for proprietary content sites (paywalled, gated), blocking is a strategic decision. Cited surfaces the state; the decision is yours.
