Page Crawlability

If individual pages block AI crawlers via noindex tags or non-200 responses, AI answers about your brand cite competitor pages instead. This is the most-missed signal — robots.txt looks clean, but the pages themselves quietly fail.

Methodology

robots.txt is the site-level gate. Page Crawlability is the per-page gate, and it’s where AI visibility often quietly fails. A site can pass robots.txt with a 10/10 and still score 0/7 here if the actual pages return error responses or carry noindex.

Cited samples up to 5 pages from your site and tests each one for two crawlability properties:

Does the page return a successful response? A page that returns HTTP 200-399 (success or redirect) is crawlable. Anything else (404, 410, 500, 503, gateway timeouts) and AI crawlers move on without indexing the content. Each crawlable page contributes 1.4 points to the score (5 pages × 1.4 = max 7).

Does the page allow indexing? Pages with <meta name="robots" content="noindex"> or <meta name="googlebot" content="noindex"> tell crawlers to fetch but not index. AI crawlers honor this directive, which means the content can be read but won’t be cited. Each page carrying noindex deducts 2 points from the score. That is heavier than the 1.4 a crawlable page earns, because noindex is the more common silent failure.

The page sample isn’t random. The scanner selects across content slots: the homepage (always), one blog or article page (preferring deep paths from your sitemap with recent lastmod dates), one product or service page, one about or information page, and one other page. Sitemap URLs are preferred over homepage-link discovery: they’re higher quality and the platform already considers them indexable.

The signal scores out of 7. Sites with 5 crawlable pages and zero noindex tags score 7/7. Sites with all 5 crawlable but one noindex score 5/7. Sites with error responses on multiple pages drop fast, typically to 1-3/7. Sites where every sampled page is broken score 0/7 and are flagged as critical.
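The scoring arithmetic above can be written out as a short sketch. This is not Cited’s implementation: the function names and the input shape are made up for illustration, and deducting for noindex on any sampled page (crawlable or not) is an assumption. The clamp at zero follows the edge case listed under Technical detail.

```python
from typing import Iterable, Tuple

# Work in tenths of a point so 1.4 and 2.0 stay exact integers.
CRAWLABLE_TENTHS = 14   # +1.4 per crawlable page
NOINDEX_TENTHS = 20     # -2.0 per page carrying noindex
MAX_TENTHS = 70         # 5 pages x 1.4 = 7


def is_crawlable(status_code: int) -> bool:
    """HTTP 200-399 counts as crawlable; anything else (including 0) does not."""
    return 200 <= status_code <= 399


def page_crawlability_score(pages: Iterable[Tuple[int, bool]]) -> float:
    """Score up to 5 sampled pages, each given as (status_code, has_noindex)."""
    tenths = 0
    for status_code, has_noindex in list(pages)[:5]:
        if is_crawlable(status_code):
            tenths += CRAWLABLE_TENTHS
        if has_noindex:  # assumption: deducted whenever noindex was detected
            tenths -= NOINDEX_TENTHS
    # Clamp to the 0..7 range before converting back to points.
    return max(0, min(MAX_TENTHS, tenths)) / 10


print(page_crawlability_score([(200, False)] * 5))                  # 7.0
print(page_crawlability_score([(200, False)] * 4 + [(200, True)]))  # 5.0
print(page_crawlability_score([(200, True)] * 4 + [(200, False)]))  # 0.0 (raw -1.0, clamped)
```

Integer tenths are used only to avoid floating-point noise in multiples of 1.4; how the real scanner rounds intermediate values is not stated in this page.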

Verification

You can verify our finding yourself using your browser’s developer tools.

Step 1: Find the pages we sampled. Cited reports the specific URLs we tested. Open each one in a new browser tab. The list usually includes your homepage, a blog post, a product page, an about page, and one other.

Step 2: Check the HTTP response code. Open the browser’s developer tools (Right-click → Inspect, or Cmd+Option+I on Mac, Ctrl+Shift+I on Windows). Click the Network tab, then reload the page. Find the first row, which is the document request for the URL itself. The Status column shows the response code. 200, 301, and 302 are all crawlable. 404, 410, 500, and 503 are not.

Step 3: Search for noindex tags. Right-click anywhere on the page, then choose View Page Source (or use Ctrl+U on Windows; on Mac, Cmd+Option+U in Chrome or Cmd+U in Firefox). Search the source (Cmd+F / Ctrl+F) for the string noindex. If you find it inside a <meta name="robots"> or <meta name="googlebot"> tag, that page is telling AI crawlers not to index its content. The match is case-insensitive: Noindex, NOINDEX, and noindex all count.

Step 4: Spot-check what crawlers actually see. Many sites render content with JavaScript, which means the source code in View Page Source may not show the final content. Use a tool like Browserling SEO Crawl Test or Google Search Console’s URL Inspection tool to fetch the page as a crawler would. If the rendered HTML is empty or missing key content, AI crawlers won’t see it either. Cited’s scanner waits 3 seconds for JavaScript hydration, but heavily client-rendered sites need verification.

If your verification disagrees with Cited’s finding, that’s a bug. Let us know.
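Steps 2 and 3 can also be scripted. The sketch below uses only the Python standard library. The names fetch, has_noindex, and check, along with the User-Agent string, are hypothetical. It reads the raw HTML without running JavaScript, so on client-rendered pages it may disagree with what the scanner sees after hydration.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects content values of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", ""))


def has_noindex(html):
    """Case-insensitive check for noindex in robots/googlebot meta tags only."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in directive.lower() for directive in parser.directives)


def fetch(url, timeout=12.0):
    """Return (status_code, html). urlopen follows redirects, so the status
    is the one for the final URL; network failures are reported as 0."""
    request = urllib.request.Request(url, headers={"User-Agent": "crawlability-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
    except OSError:  # DNS failure, refused connection, timeout
        return 0, ""


def check(url):
    """Summarize both manual checks for one URL."""
    status, html = fetch(url)
    return {"url": url, "status": status,
            "crawlable": 200 <= status <= 399,
            "noindex": has_noindex(html)}
```

Calling check on each URL from your report gives a quick first pass. Because redirects are followed automatically, a 301 or 302 will show up as the status of the redirect target rather than the redirect itself.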

Technical detail

The HTTP status code semantics for crawlability are defined by RFC 9110 (HTTP Semantics, published in June 2022). The noindex meta tag is defined by the Robots Meta Tag spec maintained by Google and honored by most major AI crawlers.

Page sampling. The scanner selects up to 5 pages via hierarchical slots: homepage (always first), blog/article, product/service, about/info, and one other. Sources are weighted: sitemap URLs are preferred over homepage-discovered links because the platform has already declared them indexable. Within each slot:

Blog candidates prefer deep paths (path depth ≥ 3, e.g. /blog/post-title over /blog) and recency from sitemap <lastmod> dates
Product candidates filter against patterns like /products/, /shop/, /items/
About candidates match /about, /team, /company, etc.
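The slot selection above can be sketched roughly as follows. The product and about patterns come from the list; the blog patterns are hypothetical, since this page does not list them. The sketch also ignores path depth and lastmod recency, and assumes the caller passes URLs already ordered by preference (for example, sitemap URLs first).

```python
from urllib.parse import urlsplit

# Patterns named in the list above.
PRODUCT_MARKERS = ("/products/", "/shop/", "/items/")
ABOUT_PREFIXES = ("/about", "/team", "/company")
# Hypothetical: the blog patterns the scanner uses are not documented here.
BLOG_MARKERS = ("/blog/", "/articles/", "/news/")

SLOT_ORDER = ("homepage", "blog", "product", "about", "other")


def classify_slot(url):
    """Map a URL to one of the five sampling slots based on its path."""
    path = urlsplit(url).path.lower() or "/"
    if path == "/":
        return "homepage"
    if any(marker in path for marker in PRODUCT_MARKERS):
        return "product"
    if any(path.startswith(prefix) for prefix in ABOUT_PREFIXES):
        return "about"
    if any(marker in path for marker in BLOG_MARKERS):
        return "blog"
    return "other"


def pick_sample(urls):
    """Keep the first URL seen for each slot, returned in slot order (max 5)."""
    chosen = {}
    for url in urls:
        chosen.setdefault(classify_slot(url), url)
    return [chosen[slot] for slot in SLOT_ORDER if slot in chosen]
```

A slot with no matching URL is simply left out, which is one way to read "up to 5 pages".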

Same-hostname filtering normalizes www vs apex, so www.yoursite.com and yoursite.com are treated as one domain.

Rendering. Each sampled page is fetched with Puppeteer using the domcontentloaded wait condition plus a fixed 3-second delay for JavaScript hydration. This is conservative: 3 seconds covers most Shopify, Webflow, and WordPress hydration paths but may miss heavily client-rendered React/Next.js sites that take longer.

noindex detection. The scanner uses DOM parsing (not regex) to find <meta name="robots"> and <meta name="googlebot"> tags, then case-insensitively checks the content attribute for the string noindex. Both name="robots" and name="googlebot" are checked because a Google-specific tag can carry noindex even when the generic robots tag does not.

Edge cases the scanner handles:

Redirects: HTTP 301 and 302 redirects count as crawlable; AI crawlers follow them. The final URL after redirect is what gets evaluated for noindex.
Cloudflare or WAF challenge pages: Some sites return Cloudflare’s “Just a moment…” or similar bot-check pages instead of content. The scanner detects 7 common challenge indicators in the page title and marks the page as crawlable but reports the challenge separately.
Empty HTML responses: A page returning HTTP 200 with empty or near-empty content (under a threshold) is treated as crawlable for this signal but penalized by content-quality signals downstream.
Timeouts: Navigation is bounded at 12 seconds, with a 15-second default timeout for other operations. Pages exceeding these limits get statusCode 0 and are treated as non-crawlable.
Negative score from multiple noindex tags: Each noindex page deducts more (2) than a crawlable page earns (1.4), so the raw calculation can drop below zero. For example, five crawlable pages with four noindex tags give 7 − 8 = −1. The result is clamped to 0/7.
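The timeout and challenge-page rules above can be illustrated with a small classifier. This is a sketch under assumptions, not the scanner’s code: PageResult and classify_page are hypothetical names, only the “Just a moment” indicator is taken from this page (the other six are not listed), and treating a detected challenge as crawlable regardless of status code is one reading of the rule above.

```python
from dataclasses import dataclass

# Only "Just a moment" is named in the text; the scanner's other six
# indicators are not documented here, so this tuple is deliberately incomplete.
CHALLENGE_TITLE_MARKERS = ("just a moment",)


@dataclass
class PageResult:
    status_code: int   # 0 means the page load timed out
    crawlable: bool
    challenge: bool    # reported separately from crawlability


def classify_page(status_code, title="", timed_out=False):
    """Apply the timeout, redirect, and challenge-page rules to one fetch result."""
    if timed_out:
        # Timed-out pages get statusCode 0 and are non-crawlable.
        return PageResult(status_code=0, crawlable=False, challenge=False)
    challenge = any(marker in title.lower() for marker in CHALLENGE_TITLE_MARKERS)
    # Assumption: a detected challenge page is marked crawlable even if the
    # WAF answered with a 4xx/5xx status.
    crawlable = challenge or 200 <= status_code <= 399
    return PageResult(status_code=status_code, crawlable=crawlable, challenge=challenge)
```

Keeping the challenge flag separate from the crawlable flag mirrors the description above, where the challenge is reported without lowering this signal’s score.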

What this signal does not measure:

X-Robots-Tag HTTP headers. Some sites set indexing directives via response header instead of meta tag. The GEO Score scanner doesn’t check this header; Cited’s Crawl Radar product does, but it’s a separate measurement.
JavaScript-rendered noindex injection. Pages that add a noindex meta tag via JavaScript after page load may evade the 3-second hydration window. This is rare in practice but a known gap.
rel="canonical" pointing elsewhere. A page that’s crawlable but canonicalizes to a different URL is functionally non-indexed — the canonical target gets the citation. Sitemap Accessibility surfaces a related canonical cross-check, but this signal doesn’t model canonical implications.
Pages requiring authentication. The scanner doesn’t log in. If your most important content sits behind a login wall, it scores as a 401/403 — non-crawlable — which is correct for AI visibility purposes.
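Because this signal skips the X-Robots-Tag header, you may want to check it yourself. The sketch below is one way to do that with the Python standard library; the function names and User-Agent string are hypothetical, and it only looks for the noindex directive, including the bot-scoped form such as "googlebot: noindex".

```python
import urllib.error
import urllib.request


def header_has_noindex(values):
    """True if any X-Robots-Tag value carries a noindex directive."""
    for value in values or []:
        for token in value.split(","):
            # "googlebot: noindex" -> "noindex"; plain "noindex" is unchanged.
            directive = token.split(":")[-1].strip().lower()
            if directive == "noindex":
                return True
    return False


def fetch_x_robots_tag(url, timeout=12.0):
    """Return all X-Robots-Tag header values for a URL (empty list if none)."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "crawlability-check/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get_all("X-Robots-Tag") or []
    except urllib.error.HTTPError as err:
        # Error responses can still carry the header.
        return err.headers.get_all("X-Robots-Tag") or []
    except OSError:  # DNS failure, refused connection, timeout
        return []
```

Some servers reject HEAD requests; in that case switching the method to GET is a reasonable fallback.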

For most D2C and B2B brands, the highest-leverage fix here is auditing noindex tags. They’re often added by SEO plugins (Yoast, RankMath, Webflow’s built-in SEO panel) and accumulate over time: staging pages get the tag, then go live without it being removed.

See also: robots.txt AI Crawler Access, Sitemap Accessibility.
