robots.txt for AI Crawlers

AI search engines use crawlers that are distinct from their training bots — and many sites block them accidentally.

The Crawlers You Need to Know

Bot	Company	Purpose
`GPTBot`	OpenAI	Model training (not search)
`OAI-SearchBot`	OpenAI	ChatGPT Search indexing
`PerplexityBot`	Perplexity	Search indexing
`ClaudeBot`	Anthropic	Model training
`Google-Extended`	Google	Opt-out token for Gemini training and grounding (not a separate crawler; does not affect Google Search)
`Applebot-Extended`	Apple	Opt-out token for Apple Intelligence training (not a separate crawler)

Critical distinction: GPTBot collects training data for OpenAI's models; OAI-SearchBot indexes pages for ChatGPT Search. robots.txt rules apply per user agent, so blocking GPTBot by itself does not block OAI-SearchBot. But if a broader rule also catches OAI-SearchBot and you don't explicitly allow it, your site will be invisible in ChatGPT Search answers.

The Common Mistake

Many sites use this pattern to block AI training:

User-agent: GPTBot
Disallow: /

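On its own, this blocks only GPTBot: a crawler follows the group that names it and otherwise falls back to the User-agent: * group. The accidental block comes from that fallback. If the file also contains a restrictive wildcard group (for example, one added to keep out unknown bots), OAI-SearchBot inherits it unless it has a group of its own. Result: invisible in ChatGPT Search.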
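One way to see how group matching plays out is to feed a rule set to Python's standard-library `urllib.robotparser`. This is a minimal sketch: the helper name `allowed` and the three sample rule sets are illustrative, and Python's parser only approximates how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Only GPTBot is named, so crawlers without a matching group stay unaffected.
GPTBOT_ONLY = """\
User-agent: GPTBot
Disallow: /
"""

# A restrictive wildcard group catches every crawler that has no group of its own.
WILDCARD_BLOCK = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

# A dedicated group takes OAI-SearchBot out of the wildcard fallback.
WILDCARD_WITH_EXCEPTION = WILDCARD_BLOCK + """
User-agent: OAI-SearchBot
Allow: /
"""

for label, rules in [
    ("GPTBot only", GPTBOT_ONLY),
    ("wildcard block", WILDCARD_BLOCK),
    ("wildcard + exception", WILDCARD_WITH_EXCEPTION),
]:
    verdicts = {bot: allowed(rules, bot) for bot in ("GPTBot", "OAI-SearchBot")}
    print(label, verdicts)
```

In this sketch, OAI-SearchBot is blocked only in the second rule set, where the wildcard group is the only one that applies to it.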

Recommended Configuration

If you want to be indexed by AI search but not used for training:

# Block training crawlers and training opt-out tokens
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow search indexing bots explicitly
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
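If the list of bots changes often, the file can be generated from a small policy table instead of edited by hand. The sketch below assumes a hypothetical `render_robots` helper and a sample `POLICY` mapping; neither comes from any crawler vendor, and the output handles only whole-site allow or block.

```python
def render_robots(policy: dict[str, bool]) -> str:
    """Build robots.txt text with one group per user-agent token.

    `policy` maps a token to True (allow the whole site) or False (block it).
    """
    groups = []
    for agent, allow in policy.items():
        rule = "Allow: /" if allow else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    # A blank line between groups keeps them visually and syntactically separate.
    return "\n".join(groups)


# Hypothetical policy: block a training crawler, allow search indexing bots.
POLICY = {
    "GPTBot": False,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
}

print(render_robots(POLICY))
```

Keeping one group per token mirrors the layout used above and makes it easy to flip a single bot without touching the others.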

If you want maximum AI search visibility (allow both training and search):

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

Blocking Specific Sections

You can allow AI search indexing while keeping crawlers out of private sections (robots.txt is advisory, so sensitive pages still need real access control):

User-agent: OAI-SearchBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /user/
Disallow: /checkout/
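To check how section rules like these resolve for individual URLs, the same rule set can be run through Python's `urllib.robotparser`. This is a sketch under assumptions: the sample paths are made up, and the parser stands in for the real crawler.

```python
from urllib.robotparser import RobotFileParser

SECTION_RULES = """\
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /user/
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(SECTION_RULES.splitlines())

# Hypothetical sample paths; /pricing matches no rule at all.
SAMPLE_PATHS = ["/blog/launch", "/docs/api", "/admin/login", "/checkout/cart", "/pricing"]
results = {path: parser.can_fetch("OAI-SearchBot", path) for path in SAMPLE_PATHS}

for path, ok in results.items():
    print(f"{path:16} {'allowed' if ok else 'blocked'}")
```

A path that matches no rule, such as /pricing here, is allowed by default, so the Allow lines mainly document intent. Python's parser applies the first matching rule in file order, while RFC 9309 specifies the longest matching path, so results can differ when rules overlap.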

Meta Tag Alternative

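For page-level control, some sites add the non-standard noai meta tag:

<meta name="robots" content="noai, noimageai">

The noai and noimageai values are not part of any robots standard, and Google's documented robots meta rules do not include them, so crawlers are free to ignore the tag. Treat it as a stated preference for pages you want to rank in Google but not be used in AI-generated answers, not as an enforced control.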
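To confirm that a page actually ships the tag, the robots meta directives can be read back out of the HTML. The sketch below uses Python's standard-library `html.parser`; the class name `RobotsMetaParser`, the helper `robots_directives`, and the sample `PAGE` are illustrative.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set[str]:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(sorted(robots_directives(PAGE)))  # ['noai', 'noimageai']
```

This only shows that the tag is present in the markup; it says nothing about whether a given crawler respects it.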

Verification

After updating robots.txt:

Check your live robots.txt at your-domain.com/robots.txt
Use Google Search Console → robots.txt report (the successor to the retired robots.txt Tester) to confirm Google can fetch and parse the file
Wait 24-48 hours for crawlers to re-fetch the file
Test visibility in Perplexity by searching for unique phrases from your content
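The first check can also be scripted. The sketch below audits a robots.txt file against the bots listed earlier, using Python's standard library. The function names `audit` and `audit_live` and the `SAMPLE` text are assumptions for illustration, and some servers reject Python's default user agent, so the live fetch may need adjusting.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot",
    "OAI-SearchBot",
    "PerplexityBot",
    "ClaudeBot",
    "Google-Extended",
    "Applebot-Extended",
]


def audit(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Map each AI bot token to whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}


def audit_live(domain: str) -> dict[str, bool]:
    """Fetch https://<domain>/robots.txt and audit it (needs network access)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as response:
        return audit(response.read().decode("utf-8", errors="replace"))


# Offline example with a made-up file: GPTBot blocked, /private/ closed to all others.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

for bot, ok in audit(SAMPLE).items():
    print(f"{bot:18} {'allowed' if ok else 'blocked'}")
```

Calling `audit_live("your-domain.com")` would run the same report against the deployed file. The result reflects Python's reading of the rules, not a confirmation from the crawlers themselves.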