Before optimizing content for AI visibility, your site needs to be technically accessible to AI crawlers. Two files control this: robots.txt (which tells crawlers what they can access) and llms.txt (which tells AI systems what content matters most). Getting these right is the highest-leverage technical fix most brands can make: it takes 15-30 minutes and removes the most common barrier to AI discoverability.
Check your current robots.txt
Every website should have a robots.txt file at its root (e.g., yourdomain.com/robots.txt). Most sites already have one. Open yours and check for three things:
- Blanket blocks. A Disallow: / rule under User-agent: * blocks all crawlers from your entire site, including AI crawlers.
- AI-specific blocks. Many sites added explicit blocks for AI crawlers in 2023-2024 when the instinct was to protect content from AI training. Look for rules targeting GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, or xAI.
- Sitemap directive. A Sitemap: line pointing to your XML sitemap helps crawlers discover your full page inventory.
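The three checks above can be sketched with Python's standard-library robots.txt parser. This is a minimal sketch, not a full audit: the function name `audit_robots_txt`, the `generic-crawler` placeholder agent, and the `yourdomain.com` default are assumptions for illustration, and it only tests access to the homepage, so partial blocks on subpaths would go unnoticed.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this guide; extend the list as needed.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
               "PerplexityBot", "Google-Extended"]


def audit_robots_txt(robots_txt: str, site: str = "https://yourdomain.com") -> dict:
    """Report blanket blocks, AI-specific blocks, and Sitemap lines."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    homepage = site.rstrip("/") + "/"
    return {
        # An agent with no dedicated group falls back to the "*" rules.
        "blanket_block": not parser.can_fetch("generic-crawler", homepage),
        "blocked_ai_crawlers": [
            bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, homepage)
        ],
        # site_maps() returns None when no Sitemap line is present.
        "sitemaps": parser.site_maps() or [],
    }


# Hypothetical file contents, for illustration only.
sample = "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n"
print(audit_robots_txt(sample))
```

An empty "sitemaps" list in the report means the file has no Sitemap directive to point crawlers at your page inventory.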
Configure robots.txt for AI crawlers
Add explicit allow rules for each AI crawler you want to be visible to. Place these after any general rules in your robots.txt. The main crawlers, by provider:
- OpenAI: GPTBot crawls content that may be used for model training. OAI-SearchBot is retrieval-only for ChatGPT Search. ChatGPT-User is the user-agent for real-time page fetching during user queries.
- Anthropic: ClaudeBot handles training data crawling. Claude-User fetches pages in real-time during user queries. Claude-SearchBot indexes pages for Claude’s search results.
- Perplexity: PerplexityBot indexes pages for Perplexity answers. Perplexity-User fetches pages in real-time during user queries.
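- Google: Google-Extended is a robots.txt control token rather than a separate crawler. It governs whether content fetched by Google's crawlers may be used for Gemini training and grounding, and it is distinct from Googlebot, which handles traditional search indexing.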
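As a rough illustration of what those allow rules could look like, the sketch below generates a robots.txt with one explicit group per crawler from the list above and sanity-checks it with Python's standard-library parser. The helper name `build_robots_txt`, the `/admin/` rule, and the sitemap URL are hypothetical placeholders, not values from your site.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens from the list above, grouped by provider.
AI_CRAWLERS = {
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "Anthropic": ["ClaudeBot", "Claude-User", "Claude-SearchBot"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
    "Google": ["Google-Extended"],
}


def build_robots_txt(general_rules: str, sitemap_url: str) -> str:
    """Append one explicit Allow group per AI crawler after the general rules."""
    lines = [general_rules.strip(), ""]
    for provider, bots in AI_CRAWLERS.items():
        lines.append(f"# {provider}")
        for bot in bots:
            lines += [f"User-agent: {bot}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


# Hypothetical general rules and sitemap URL, for illustration only.
robots_txt = build_robots_txt(
    "User-agent: *\nDisallow: /admin/",
    "https://yourdomain.com/sitemap.xml",
)
print(robots_txt)

# Sanity-check the generated file with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://yourdomain.com/pricing"))
```

One design point to keep in mind: a crawler that matches its own group follows that group instead of the User-agent: * rules, so any Disallow lines you still want applied to AI crawlers (such as the /admin/ example here) would need to be repeated inside each crawler's group.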
Create or improve your llms.txt
The llms.txt file is a plain-text file at your site root that tells AI systems which pages are most important. Think of it as a curated table of contents specifically for AI crawlers: a way to signal priority when your site has hundreds or thousands of pages.
The format follows the emerging standard at llmstxt.org. Guidelines for your llms.txt:
- List your 10-20 most important pages, not every page on your site. This is a priority signal, not a sitemap.
- Put the highest-priority pages first. Order communicates importance.
- Include a one-line description for each page. This helps AI systems understand page purpose without crawling.
- Update it when you publish significant new content. A stale llms.txt is better than none, but a current one is better still.
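To make the format concrete, here is a minimal sketch that renders an llms.txt body in the layout described at llmstxt.org: a top-level title, a one-line blockquote summary, then headed sections of links with descriptions. The `Page` class, the `build_llms_txt` helper, and every page title and URL shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    description: str  # one line, per the guidance above


def build_llms_txt(site_name: str, summary: str,
                   sections: dict[str, list[Page]]) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link lists."""
    out = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        out += [f"## {heading}", ""]
        # Pages are written in the order given, so pass the highest priority first.
        out += [f"- [{p.title}]({p.url}): {p.description}" for p in pages]
        out.append("")
    return "\n".join(out).rstrip() + "\n"


# Hypothetical site and pages, for illustration only.
print(build_llms_txt(
    "Example Co",
    "Example Co sells widgets; these are the pages that matter most.",
    {"Products": [
        Page("Pricing", "https://yourdomain.com/pricing", "Plans and prices."),
        Page("Widget comparison", "https://yourdomain.com/compare",
             "How our widgets differ from alternatives."),
    ]},
))
```

Saving the printed text as llms.txt at your site root would give you a starting point to edit by hand.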
Verify your changes
After updating both files, verify they are working:
- Visit yourdomain.com/robots.txt in your browser and confirm the new allow rules are live.
- Visit yourdomain.com/llms.txt and confirm it renders as expected.
- Test with curl using AI crawler user-agents to verify they receive 200 responses: curl -A "PerplexityBot" https://yourdomain.com/ should return your page content, not a block page or 403 error.
- Wait 1-2 weeks for crawlers to discover the changes; they do not check instantly.
- Monitor mention rate changes in your tracking over the following 4-8 weeks.
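The curl check can also be scripted across several user-agents at once. The sketch below uses only the Python standard library; `fetch_status` and `check_crawler_access` are hypothetical helper names. Like the curl command, it sends just the short crawler token as the User-Agent header, so it approximates rather than reproduces a real crawler request, and platforms that verify crawlers by IP address may treat it differently.

```python
import urllib.error
import urllib.request

AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def fetch_status(url: str, user_agent: str) -> int:
    """Return the HTTP status a request with this User-Agent receives."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions, e.g. 403 from a bot filter.
        return error.code


def check_crawler_access(url: str, fetch=fetch_status) -> dict:
    """Map each AI user-agent to the status code it gets for `url`."""
    return {agent: fetch(url, agent) for agent in AI_USER_AGENTS}
```

Calling check_crawler_access("https://yourdomain.com/") with your own domain returns a status code per crawler; anything other than 200 is worth investigating. The `fetch` parameter exists so the logic can be exercised with a stub instead of live requests.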
Common mistakes
Blocking crawlers at the CDN or hosting level while allowing them in robots.txt. Cloudflare, Vercel, and similar platforms have their own bot management settings that can block AI crawlers at the infrastructure layer. Both layers need to allow access: robots.txt alone is not sufficient if the hosting provider rejects the request before the crawler ever sees your robots.txt.
Creating llms.txt but not making it accessible. Some implementations add a link element with rel="llms-txt" in the HTML head to help crawlers discover the file. While not required by the spec, this improves discoverability on platforms that check for it.
Listing too many pages in llms.txt. The value of llms.txt is prioritization. Listing 500 pages defeats the purpose. Focus on the 10-20 pages that are most important for your brand’s AI visibility — product pages, key comparison content, and authoritative reference pages.
Forgetting to update llms.txt when site structure changes. A llms.txt pointing to pages that have moved or been deleted sends a negative signal. Treat it as a living document, not a one-time setup.
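The last two mistakes lend themselves to a simple automated check. The sketch below pulls the listed URLs out of an llms.txt body so you can count them and compare them against pages you know are live (for example, the URLs in your sitemap). The function names and sample content are hypothetical, and the pattern assumes link items written as "- [title](url)".

```python
import re

# Matches list items of the form "- [title](url)" at the start of a line.
LINK_PATTERN = re.compile(r"^- \[([^\]]+)\]\(([^)\s]+)\)", re.MULTILINE)


def listed_urls(llms_txt: str) -> list[str]:
    """Return the URLs from llms.txt link items, in file order."""
    return [url for _title, url in LINK_PATTERN.findall(llms_txt)]


def stale_entries(llms_txt: str, live_urls: set[str]) -> list[str]:
    """URLs listed in llms.txt that are not in the set of known-live pages."""
    return [url for url in listed_urls(llms_txt) if url not in live_urls]


# Hypothetical llms.txt content and live-page set, for illustration only.
sample = (
    "# Example Co\n\n## Products\n\n"
    "- [Pricing](https://yourdomain.com/pricing): Plans and prices.\n"
    "- [Old launch post](https://yourdomain.com/old-page): Since removed.\n"
)
live = {"https://yourdomain.com/pricing"}
print(len(listed_urls(sample)), "pages listed")
print("stale:", stale_entries(sample, live))
```

Running a check like this whenever the site structure changes would flag moved or deleted pages before crawlers encounter them.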
Expected timeline
The timeline for seeing results varies by change type and platform:
- robots.txt changes: Crawlers typically notice within days to 2 weeks. This is the fastest-acting change.
- llms.txt changes: Varies by platform. Perplexity checks frequently; other platforms check less often or do not yet support the standard.
- Mention rate impact: Allow 4-8 weeks for training-data-dependent platforms (ChatGPT, Claude, Gemini) to reflect changes. Retrieval-first platforms like Perplexity may show effects within 1-2 weeks.
Frequently asked questions
Will allowing AI crawlers let them train on my content?
Potentially — allowing GPTBot and similar crawlers means AI providers can use your content for training as well as for retrieval. If you want to allow retrieval (being cited in answers) but block training, some platforms support this distinction. OpenAI’s OAI-SearchBot is retrieval-only, meaning it fetches content for real-time answers without using it for model training. However, the practical reality is that blocking training crawlers often also blocks citation — the distinction is not cleanly enforced across all platforms. Most brands find the visibility benefit outweighs the training concern.
Do I need both robots.txt and llms.txt?
robots.txt is the essential one: blocks in it can keep crawlers off your site entirely, while a missing file is generally treated as permission to crawl everything. llms.txt is recommended but not yet universally supported. Perplexity and some other platforms check for llms.txt, but it is an emerging standard, not a requirement. Start with robots.txt configuration; add llms.txt as a second step once your core crawler access is confirmed.
How do I know if AI crawlers are actually visiting my site?
Check your server access logs for the user-agent strings listed above (GPTBot, ClaudeBot, PerplexityBot, etc.). Most web analytics platforms like Google Analytics filter out bot traffic by default, so standard dashboards will not show these visits. You need raw server logs or a log analysis tool configured to include bot visits. Some hosting platforms (Cloudflare, Vercel) provide bot traffic dashboards that can show AI crawler activity specifically.
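If you have raw access logs, a few lines of Python are enough for a first look. This is a minimal sketch that counts log lines mentioning a known crawler token; the function name and the sample log lines are hypothetical, and it assumes each log line includes the request's user-agent string. Google-Extended is left out on the assumption that it is a robots.txt token and does not appear as a request user-agent.

```python
from collections import Counter

# Crawler tokens to look for in each log line's user-agent field.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
               "Claude-User", "Claude-SearchBot", "PerplexityBot",
               "Perplexity-User"]


def count_ai_crawler_hits(log_lines) -> Counter:
    """Count log lines whose text mentions a known AI crawler token."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in lowered:
                hits[bot] += 1
                break  # attribute each request to one crawler only
    return hits


# Hypothetical log lines in a common access-log layout, for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.6 - - [01/Jan/2025:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(count_ai_crawler_hits(sample_log))
```

Because the function accepts any iterable of lines, an open log file can be passed to it directly.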
My site is on Shopify, WordPress, or Wix — can I still configure these files?
Yes, but the method varies by platform. Shopify auto-generates robots.txt with limited customization options — you may need to use the Shopify Liquid template for robots.txt. WordPress plugins like Yoast SEO provide a robots.txt editor in the admin panel. Wix has a robots.txt editor in the site settings under SEO. For llms.txt, most platforms allow uploading a static file to the site root, though the exact process differs. Check your platform’s documentation for the specific steps.