Understanding AI Crawlers
AI crawlers are web bots that collect content for training language models and powering real-time AI features. Unlike search engine crawlers that index pages for search results, AI crawlers gather data to improve AI assistants' knowledge and capabilities.
If you want AI assistants to cite your website, you need to allow these crawlers access. This guide covers all major AI crawlers and how to configure your site for them.
Major AI Crawlers
GPTBot (OpenAI)
OpenAI's crawler for collecting training data.
User-agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

ChatGPT-User (OpenAI)
The agent ChatGPT uses to fetch pages in real time while answering a user's request.
User-agent token: ChatGPT-User

ClaudeBot (Anthropic)
Anthropic's crawler for Claude.
User-agent token: ClaudeBot (some sites also list the older anthropic-ai token)
User-agent string includes: ClaudeBot/1.0; +https://www.anthropic.com/claude

PerplexityBot
Perplexity's crawler for its AI answer engine.
User-agent token: PerplexityBot

Google-Extended (Google)
A robots.txt control token that governs whether Google may use your content for its AI models; it does not crawl separately from Googlebot.
User-agent token: Google-Extended

CCBot (Common Crawl)
Common Crawl's crawler; its open datasets are widely used for LLM training.
User-agent token: CCBot
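If you want to spot these crawlers in your own application code, matching the user-agent tokens above is usually enough. Here is a minimal Python sketch; the function name and token list are illustrative, not from any official SDK:

import re

# User-agent tokens for the AI crawlers described above
AI_CRAWLER_TOKENS = (
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "CCBot",
)

# Word-boundary match so "GPTBot/1.0" matches but "MyGPTBotFan" does not
_TOKEN_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in AI_CRAWLER_TOKENS) + r")\b"
)

def detect_ai_crawler(user_agent: str) -> str | None:
    """Return the matching crawler token, or None for ordinary traffic."""
    match = _TOKEN_RE.search(user_agent or "")
    return match.group(1) if match else None

# Example with the GPTBot user-agent string shown above
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
      "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(detect_ai_crawler(ua))  # -> GPTBot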
robots.txt Configuration
To allow AI crawlers, add explicit rules to your robots.txt file:
Allow All AI Crawlers (Recommended)
# AI Crawlers - Allow for LLM visibility
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: Bytespider
Allow: /
Block Specific Sections
If you want AI access but need to keep certain areas out of reach, pair a site-wide Allow with targeted Disallow rules. Under RFC 9309, compliant crawlers resolve conflicts in favor of the most specific (longest) matching path:
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Note: Many hosting platforms and CMSes block AI crawlers by default. Check your robots.txt to ensure you haven't inadvertently blocked them; the sketch below automates that check.
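One way to verify your rules programmatically is Python's standard-library urllib.robotparser, which applies the same Allow/Disallow matching a compliant bot uses. A minimal sketch, assuming your site is at www.yoursite.com (a placeholder):

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (the domain is a placeholder)
rp = RobotFileParser()
rp.set_url("https://www.yoursite.com/robots.txt")
rp.read()

# can_fetch() reports whether each agent may crawl the given URL
for agent in ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
              "Google-Extended", "CCBot"):
    print(agent, rp.can_fetch(agent, "https://www.yoursite.com/"))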
Complete robots.txt Example
Here's a full robots.txt optimized for AI visibility:
# Standard crawlers
User-agent: *
Allow: /
# AI Crawlers - Explicitly allowed
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: Bytespider
Allow: /
# Sitemap
Sitemap: https://www.yoursite.com/sitemap.xml
# LLM Context Files
# See https://llmstxt.org for specification
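For the LLM context files referenced above, a minimal llms.txt served at yoursite.com/llms.txt might look like the following, loosely following the llmstxt.org format: an H1 title, a one-line blockquote summary, then sections of annotated links. The site name and URLs here are placeholders:

# Your Site Name
> One-sentence summary of what the site offers and who it is for.

## Docs
- [Getting started](https://www.yoursite.com/docs/start): Setup and first steps
- [API reference](https://www.yoursite.com/docs/api): Endpoints with examples

## Optional
- [Blog](https://www.yoursite.com/blog): Articles and announcements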
Additional AI Crawlers
Other AI-related crawlers you may encounter:
- Bytespider: ByteDance (TikTok parent company) crawler for AI training
- FacebookBot: Meta's crawler, potentially used for LLaMA training
- Amazonbot: Amazon's crawler for Alexa and AI services
- Applebot-Extended: Apple's AI training crawler (separate from search)
- Cohere-ai: Cohere's enterprise AI crawler
- AI2Bot: Allen Institute for AI crawler
Checking Crawler Access
To verify your robots.txt is configured correctly:
- Visit yoursite.com/robots.txt directly
- Use the robots.txt report in Google Search Console (it replaced the retired robots.txt Tester)
- Check server logs for AI crawler visits (see the sketch after this list)
- Use online robots.txt validators
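For the server-log check, here is a short Python sketch that tallies AI crawler visits. The log path and combined-log format are assumptions; adjust them for your own server:

from collections import Counter

# User-agent tokens covered in this guide
AI_CRAWLERS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
               "PerplexityBot", "Google-Extended", "CCBot", "Bytespider")

hits = Counter()
# The log path is an assumption; point this at your own access log
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
                break

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")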
Beyond robots.txt
To maximize AI crawler effectiveness:
- Create an llms.txt file with a summary of your site (see the example above)
- Add JSON-LD schema markup for better content understanding (a sketch follows this list)
- Keep your sitemap.xml updated with recent lastmod dates
- Ensure fast page load times (slow sites may be partially crawled)
- Use semantic HTML structure
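As an example of the JSON-LD markup suggested above, here is a minimal schema.org Article snippet for a page's <head>; every field value is a placeholder:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Configure robots.txt for AI Crawlers",
  "datePublished": "2025-01-01",
  "dateModified": "2025-01-15",
  "author": { "@type": "Organization", "name": "Your Site Name" }
}
</script>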
Get AI-Optimized Visibility
Cite.sh is an AI Citation Directory with robots.txt optimized for all major AI crawlers. Get listed and earn a DR 26 dofollow backlink.
Submit Your Site — $29
Related Resources
For more on AI optimization, check out: