Before optimizing content for AI visibility, your site needs to be technically accessible to AI crawlers. Two files control this: robots.txt (which tells crawlers what they can access) and llms.txt (which tells AI systems what content matters most). Getting these right is the highest-leverage technical fix most brands can make — it takes 15-30 minutes and removes the most common barrier to AI discoverability.

Check your current robots.txt

Every website should have a robots.txt file at its root (e.g., yourdomain.com/robots.txt). Most sites already have one. Open yours and check for three things:
  1. Blanket blocks. A Disallow: / rule under User-agent: * blocks all crawlers from your entire site — including AI crawlers.
  2. AI-specific blocks. Many sites added explicit blocks for AI crawlers in 2023-2024 when the instinct was to protect content from AI training. Look for rules targeting GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, or xAI.
  3. Sitemap directive. A Sitemap: line pointing to your XML sitemap helps crawlers discover your full page inventory.
The common mistake: blanket AI crawler blocks made sense as a default in 2023, but they now actively hurt AI visibility. If your site blocks these crawlers, the platforms behind them cannot fetch your pages, so your content is unlikely to appear in AI-generated answers regardless of how good it is.
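The three checks above can be scripted. The following is a minimal sketch using Python's standard `urllib.robotparser`; the function names, the sample robots.txt, and the yourdomain.com URL are illustrative assumptions, and the crawler list is taken from the tokens named in this guide (xAI is omitted because no specific user-agent token is given for it). Python's parser may differ in edge cases from how a given crawler interprets the file, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this guide; extend as needed.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "PerplexityBot", "Google-Extended",
]


def blocked_crawlers(robots_txt: str, url: str = "https://yourdomain.com/") -> list[str]:
    """Return the AI crawlers that this robots.txt text blocks from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in AI_CRAWLERS if not parser.can_fetch(ua, url)]


def has_sitemap(robots_txt: str) -> bool:
    """Report whether the file contains a Sitemap: directive."""
    return any(
        line.strip().lower().startswith("sitemap:")
        for line in robots_txt.splitlines()
    )


# Hypothetical robots.txt with an AI-specific block and no sitemap line.
sample = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

print(blocked_crawlers(sample))
print(has_sitemap(sample))
```

To run it against a live site, download the text of yourdomain.com/robots.txt and pass it to `blocked_crawlers`.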

Configure robots.txt for AI crawlers

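Add an explicit allow group for each AI crawler you want to be visible to, and place these groups after any general rules in your robots.txt. A crawler follows the most specific User-agent group that names it and ignores the User-agent: * group when such a group exists, so repeat in these groups any Disallow rules you still want that crawler to honor: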
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /
Each crawler serves a different purpose:
  • OpenAI: GPTBot crawls content that may be used for model training. OAI-SearchBot is retrieval-only for ChatGPT Search. ChatGPT-User is the user-agent for real-time page fetching during user queries.
  • Anthropic: ClaudeBot handles training data crawling. Claude-User fetches pages in real-time during user queries. Claude-SearchBot indexes pages for Claude’s search results.
  • Perplexity: PerplexityBot indexes pages for Perplexity answers. Perplexity-User fetches pages in real-time during user queries.
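  • Google: Google-Extended is a robots.txt control token rather than a separate crawler. It tells Google whether pages fetched by its existing crawlers may be used for Gemini training and grounding. It is distinct from Googlebot, which handles search indexing, and it does not affect how your site appears in Google Search. AI Overviews are part of Search and follow Googlebot rules.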
Allowing these crawlers does not mean giving up copyright. It means allowing AI platforms to discover and potentially cite your content. The alternative — blocking crawlers — means being invisible in AI answers.
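If you maintain the crawler list in one place, the allow groups can be generated rather than typed by hand. This sketch is an illustration, not a required workflow: `render_allow_groups`, the `/private/` rule, and `SomeOtherBot` are hypothetical names. It also shows, using Python's standard `urllib.robotparser`, that a crawler with its own group no longer inherits rules from the `User-agent: *` group.

```python
from urllib.robotparser import RobotFileParser

# The nine user-agent tokens listed in the rules above.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
]


def render_allow_groups(agents: list[str]) -> str:
    """Render one 'User-agent / Allow: /' group per crawler token."""
    return "\n\n".join(f"User-agent: {agent}\nAllow: /" for agent in agents) + "\n"


# Hypothetical general rule that the site already had.
general_rules = "User-agent: *\nDisallow: /private/\n\n"
robots_txt = general_rules + render_allow_groups(AI_CRAWLERS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

private_url = "https://yourdomain.com/private/report"
# GPTBot has its own group, so the wildcard Disallow does not apply to it.
print(parser.can_fetch("GPTBot", private_url))
# A crawler without its own group falls back to the wildcard group.
print(parser.can_fetch("SomeOtherBot", private_url))
```

If `/private/` should stay off limits to AI crawlers too, add `Disallow: /private/` inside each generated group.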

Create or improve your llms.txt

The llms.txt file is a plain-text file at your site root that tells AI systems which pages are most important. Think of it as a curated table of contents specifically for AI crawlers — a way to signal priority when your site has hundreds or thousands of pages. The format follows the emerging standard at llmstxt.org:
# Your Brand Name
> One-sentence description of what your site is about

## Key Pages
- [Product Overview](https://yourdomain.com/product): What your product does
- [Pricing](https://yourdomain.com/pricing): Current pricing and plans
- [Documentation](https://yourdomain.com/docs): Technical documentation
- [Blog](https://yourdomain.com/blog): Latest articles and analysis
Best practices for llms.txt:
  • List your 10-20 most important pages, not every page on your site. This is a priority signal, not a sitemap.
  • Put the highest-priority pages first. Order communicates importance.
  • Include a one-line description for each page. This helps AI systems understand page purpose without crawling.
  • Update it when you publish significant new content. A stale llms.txt is better than none, but a current one is better still.
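These practices are easier to keep when the file is generated from an ordered list. This is a minimal sketch: the `Page` class, `render_llms_txt`, and the 20-page cap are assumptions made for illustration, and the output mirrors the example format shown above rather than the full llmstxt.org specification.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    description: str


def render_llms_txt(brand: str, summary: str, pages: list[Page], max_pages: int = 20) -> str:
    """Render llms.txt text; `pages` must already be in priority order."""
    if len(pages) > max_pages:
        raise ValueError(f"{len(pages)} pages listed; keep it to {max_pages} or fewer")
    lines = [f"# {brand}", f"> {summary}", "", "## Key Pages"]
    for page in pages:
        if not page.description.strip():
            raise ValueError(f"missing description for {page.url}")
        lines.append(f"- [{page.title}]({page.url}): {page.description}")
    return "\n".join(lines) + "\n"


# Placeholder pages, highest priority first.
pages = [
    Page("Product Overview", "https://yourdomain.com/product", "What your product does"),
    Page("Pricing", "https://yourdomain.com/pricing", "Current pricing and plans"),
]

print(render_llms_txt("Your Brand Name", "One-sentence description of what your site is about", pages))
```

Regenerating the file as part of your publishing process is one way to keep it from going stale.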

Verify your changes

After updating both files, verify they are working:
  1. Visit yourdomain.com/robots.txt in your browser and confirm the new allow rules are live.
  2. Visit yourdomain.com/llms.txt and confirm it renders as expected.
  3. Test with curl using AI crawler user-agents to verify they receive 200 responses: curl -A "PerplexityBot" https://yourdomain.com/ should return your page content, not a block page or 403 error.
  4. Wait 1-2 weeks for crawlers to discover the changes — they do not check instantly.
  5. Monitor mention rate changes in your tracking over the following 4-8 weeks.
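The curl check in step 3 can be repeated for several crawlers at once. The sketch below uses only Python's standard library; the function names and the short user-agent tokens are assumptions. Real crawlers send longer user-agent strings, and some bot-management layers also look at the source IP, so a 200 here is a useful signal rather than proof that the real crawler gets through.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that identifies itself with the given user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="GET")


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends to this user-agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 403, 429 and similar arrive as exceptions; report the code instead.
        return err.code


def not_ok(statuses: dict[str, int]) -> list[str]:
    """List the user-agents that did not receive a 200 response."""
    return [ua for ua, code in statuses.items() if code != 200]


# Example usage (makes live requests, so it is left commented out):
# agents = ["GPTBot", "ClaudeBot", "PerplexityBot"]
# statuses = {ua: fetch_status("https://yourdomain.com/", ua) for ua in agents}
# print(not_ok(statuses))
```

Any user-agent that `not_ok` returns is worth checking against your CDN or hosting bot settings.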

Common mistakes

  • Blocking crawlers at the CDN or hosting level while allowing them in robots.txt. Cloudflare, Vercel, and similar platforms have their own bot management settings that can block AI crawlers at the infrastructure layer. Both layers need to allow access: robots.txt alone is not sufficient if the hosting provider rejects the request before the crawler ever sees your robots.txt.
  • Creating llms.txt but not making it accessible. Some implementations add a link element with rel="llms-txt" in the HTML head to help crawlers discover the file. While not required by the spec, this improves discoverability on platforms that check for it.
  • Listing too many pages in llms.txt. The value of llms.txt is prioritization. Listing 500 pages defeats the purpose. Focus on the 10-20 pages that are most important for your brand’s AI visibility: product pages, key comparison content, and authoritative reference pages.
  • Forgetting to update llms.txt when site structure changes. An llms.txt pointing to pages that have moved or been deleted sends a negative signal. Treat it as a living document, not a one-time setup.
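Some of the llms.txt mistakes above can be caught with a small offline audit. This sketch assumes the link format from the example earlier in this guide (`- [Title](URL): description`); the regular expression, the function names, and the 20-link limit are illustrative choices, not part of any specification.

```python
import re

# Matches "- [Title](URL)" with an optional ": description" suffix.
LINK_RE = re.compile(r"^- \[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$")


def parse_links(text: str) -> list[tuple[str, str, str]]:
    """Extract (title, url, description) tuples from llms.txt text."""
    links = []
    for line in text.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            title, url, description = match.groups()
            links.append((title, url, (description or "").strip()))
    return links


def audit_llms_txt(text: str, max_links: int = 20) -> list[str]:
    """Return human-readable problems found in the file."""
    links = parse_links(text)
    problems = []
    if len(links) > max_links:
        problems.append(f"{len(links)} links listed; limit is {max_links}")
    seen = set()
    for _title, url, description in links:
        if not description:
            problems.append(f"missing description: {url}")
        if url in seen:
            problems.append(f"duplicate link: {url}")
        seen.add(url)
    return problems


# Hypothetical file with one entry lacking a description.
sample = """\
# Your Brand Name
> One-sentence description of what your site is about

## Key Pages
- [Product Overview](https://yourdomain.com/product): What your product does
- [Pricing](https://yourdomain.com/pricing)
"""

print(audit_llms_txt(sample))
```

To catch moved or deleted pages, each URL returned by `parse_links` could then be requested to confirm it still returns a 200 response.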

Expected timeline

The timeline for seeing results varies by change type and platform:
  • robots.txt changes: Crawlers typically notice within days to 2 weeks. This is the fastest-acting change.
  • llms.txt changes: Varies by platform. Perplexity checks frequently; other platforms check less often or do not yet support the standard.
  • Mention rate impact: Allow 4-8 weeks for training-data-dependent platforms (ChatGPT, Claude, Gemini) to reflect changes. Retrieval-first platforms like Perplexity may show effects within 1-2 weeks.
These timelines assume the content itself is already worth citing. Unblocking crawlers is necessary but not sufficient — if the content behind the newly-accessible pages is thin or poorly structured, crawler access alone will not produce mentions. Pair this playbook with Write content that gets cited for the content layer.

Frequently asked questions

Will allowing AI crawlers let AI providers train on my content?
Potentially. Allowing GPTBot and similar crawlers means AI providers can use your content for training as well as for retrieval. If you want to allow retrieval (being cited in answers) but block training, some platforms support this distinction. OpenAI’s OAI-SearchBot is retrieval-only, meaning it fetches content for real-time answers without using it for model training. In practice, however, blocking training crawlers often also blocks citation, because the distinction is not cleanly enforced across all platforms. Most brands find the visibility benefit outweighs the training concern.
Do I need both robots.txt and llms.txt?
robots.txt is essential: if it contains blocks, crawlers may not access your site at all. llms.txt is recommended but not yet universally supported. Perplexity and some other platforms check for llms.txt, but it is an emerging standard, not a requirement. Start with robots.txt configuration; add llms.txt as a second step once your core crawler access is confirmed.
How can I tell whether AI crawlers are visiting my site?
Check your server access logs for the user-agent strings listed above (GPTBot, ClaudeBot, PerplexityBot, etc.). Most web analytics platforms, such as Google Analytics, filter out bot traffic by default, so standard dashboards will not show these visits. You need raw server logs or a log analysis tool configured to include bot visits. Some hosting platforms (Cloudflare, Vercel) provide bot traffic dashboards that can show AI crawler activity specifically.
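The log check described in this answer could be sketched as follows. The sample log lines are made up for illustration (real crawler user-agent strings are longer), and `count_ai_hits` is a hypothetical helper that does a simple case-insensitive substring match on each line.

```python
from collections import Counter

# User-agent tokens from this guide; Google-Extended is omitted because it is
# a robots.txt token and does not appear as a user-agent in request logs.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
]


def count_ai_hits(log_lines: list[str]) -> Counter:
    """Count access-log lines whose text contains a known AI crawler token."""
    counts: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLERS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to one crawler only
    return counts


# Made-up access-log lines in a combined-log-like format.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:10:01:00 +0000] "GET /docs HTTP/1.1" 200 9000 "-" "PerplexityBot/1.0"',
    '198.51.100.7 - - [01/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 7000 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    '203.0.113.5 - - [01/Jan/2025:10:03:00 +0000] "GET /blog HTTP/1.1" 403 150 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for crawler, hits in count_ai_hits(sample_log).most_common():
    print(crawler, hits)
```

Because user-agent strings can be spoofed, treat these counts as an indication of crawler activity rather than verified traffic.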
Can I configure these files on a hosted platform such as Shopify, WordPress, or Wix?
Yes, but the method varies by platform. Shopify auto-generates robots.txt with limited customization options; to change it, you may need to edit the robots.txt.liquid theme template. WordPress plugins like Yoast SEO provide a robots.txt editor in the admin panel. Wix has a robots.txt editor in the site settings under SEO. For llms.txt, most platforms allow uploading a static file to the site root, though the exact process differs. Check your platform’s documentation for the specific steps.