GPTBot, ClaudeBot, PerplexityBot: The AI Crawler Guide

Understanding AI Crawlers

AI crawlers are web bots that collect content for training language models and powering real-time AI features. Unlike search engine crawlers that index pages for search results, AI crawlers gather data to improve AI assistants' knowledge and capabilities.

If you want AI assistants to cite your website, you need to allow these crawlers access. This guide covers all major AI crawlers and how to configure your site for them.

Major AI Crawlers

GPTBot (OpenAI)

User-Agent: GPTBot
Full User-Agent String: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
Purpose: Crawls websites to gather training data for ChatGPT and other OpenAI models. Real-time lookups use the separate ChatGPT-User agent below.
Documentation: https://openai.com/gptbot

ChatGPT-User (OpenAI)

User-Agent: ChatGPT-User
Purpose: Used when ChatGPT users enable browsing to fetch live web content. This is different from GPTBot: it handles real-time lookups, not training.

ClaudeBot (Anthropic)

User-Agent: ClaudeBot or anthropic-ai
Full User-Agent String: ClaudeBot/1.0; +https://www.anthropic.com/claude
Purpose: Crawls the web to support training and improvement of Anthropic's Claude models.

PerplexityBot

User-Agent: PerplexityBot
Purpose: Powers Perplexity AI's real-time search and answer engine. Perplexity actively cites sources with links, making this crawler especially valuable.

Google-Extended (Google)

User-Agent: Google-Extended
Purpose: Controls whether your content is used to train Google's Gemini (formerly Bard) models. Google-Extended is a robots.txt control token rather than a separate crawler: fetching is still done by Googlebot, and blocking it does not affect search indexing.

CCBot (Common Crawl)

User-Agent: CCBot
Purpose: Common Crawl is a nonprofit that creates open web archives. Many AI models use Common Crawl data for training, including early GPT models.

robots.txt Configuration

To allow AI crawlers, add explicit rules to your robots.txt file:

Allow All AI Crawlers (Recommended)

# AI Crawlers - Allow for LLM visibility
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

Block Specific Sections

If you want AI access but need to protect certain areas:

User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/

Note: Many hosting platforms and CMS systems block AI crawlers by default. Check your robots.txt to ensure you haven't inadvertently blocked them.

Complete robots.txt Example

Here's a full robots.txt optimized for AI visibility:

# Standard crawlers
User-agent: *
Allow: /

# AI Crawlers - Explicitly allowed
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

# Sitemap
Sitemap: https://www.yoursite.com/sitemap.xml

# LLM Context Files
# See https://llmstxt.org for specification
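The llms.txt file referenced in the comment above is a separate markdown file served from your site root (e.g. /llms.txt), not part of robots.txt. A minimal sketch following the llmstxt.org format; the site name, URLs, and descriptions below are placeholders:

```markdown
# Your Site Name

> One-sentence description of what the site covers and who it's for.

## Docs

- [AI Crawler Guide](https://www.yoursite.com/ai-crawler-guide): Configuring robots.txt for AI crawlers

## Optional

- [About](https://www.yoursite.com/about): Background on the project
```

The spec calls for a single H1 title, an optional blockquote summary, and H2 sections containing link lists with short descriptions.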

Additional AI Crawlers

Other AI-related crawlers you may encounter:

Bytespider (ByteDance): ByteDance's crawler, widely believed to collect training data for its AI models. It appears in the robots.txt examples above.
Applebot-Extended (Apple): A robots.txt token controlling whether content crawled by Applebot is used to train Apple's AI models.
Amazonbot (Amazon): Amazon's crawler, used in part to improve Alexa's question answering.

Checking Crawler Access

To verify your robots.txt is configured correctly:

  1. Visit yoursite.com/robots.txt directly
  2. Use the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired)
  3. Check server logs for AI crawler visits
  4. Use online robots.txt validators
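The manual checks above can also be scripted. Here's a minimal sketch using Python's standard-library urllib.robotparser; the rules string and URLs are hypothetical. Note that Python's parser applies the first matching rule for a user agent, so Disallow lines should precede a catch-all Allow:

```python
from urllib import robotparser

# Hypothetical robots.txt rules. Python's parser honors the FIRST
# matching rule, so Disallow comes before the catch-all Allow.
RULES = """\
User-agent: GPTBot
Disallow: /admin/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(RULES)

# Public pages are allowed for GPTBot; /admin/ is blocked.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))
print(rp.can_fetch("GPTBot", "https://example.com/admin/secret"))

# A crawler with no matching group (and no "*" group) defaults to allowed.
print(rp.can_fetch("ClaudeBot", "https://example.com/blog/post"))
```

For a live site, calling rp.set_url("https://www.yoursite.com/robots.txt") followed by rp.read() fetches and parses the real file instead of a string.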

Beyond robots.txt

robots.txt access is only the first step. To maximize AI crawler effectiveness:

  1. Submit an XML sitemap so crawlers can discover all your pages
  2. Consider publishing an llms.txt file (see https://llmstxt.org) to give LLMs a structured overview of your site
  3. Serve key content server-side where possible; many AI crawlers do not execute JavaScript
  4. Monitor server logs to confirm the crawlers are actually visiting

Get AI-Optimized Visibility

Cite.sh is an AI Citation Directory with robots.txt optimized for all major AI crawlers. Get listed and earn a DR 26 dofollow backlink.

Submit Your Site — $29
